- Bug
- Resolution: Duplicate
- P4
- 8, 11, 12, 13
- Cause Known
- generic
- generic
ADDITIONAL SYSTEM INFORMATION :
OpenJDK 8
Oracle JDK 8
Oracle JDK 11
A DESCRIPTION OF THE PROBLEM :
I find the documentation for Pattern confusing with respect to the effect of $.
By "the docs" I mean https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html, which appears to match the Java 8 and Java 7 versions.
The default behavior of $ is described in the docs as:
(1) $: The end of a line, which is later clarified to:
(2) By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
I have tested this description of $ on OpenJDK 8, Oracle JDK 8, and Oracle JDK 11. All three behave the same way, but that behavior seems inconsistent with the documentation.
In particular, I am creating a Pattern with no flags:
Pattern p = Pattern.compile("$\\s+", 0);
I then match it using p.matcher() against the string "x\r": the letter x followed by a carriage return, which the docs define as a line terminator.
The match is reported as a success. A success is consistent with the MULTILINE behavior of $, in which it matches either (1) at the end of the input, or (2) at the end of a line (i.e. immediately before a line terminator). However, as I understand the docs, a success is inconsistent with the default (non-MULTILINE) mode in which I created the Pattern. In the default mode $ should match only at the end of the input, where no characters remain for \s+ to consume, and so a Pattern like "$\\s+" should be impossible to satisfy.
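To make the observed behavior concrete, the following is a minimal sketch that lists every index at which a bare $ matches. The class name DollarPositions and the helper positions are hypothetical names introduced only for this illustration; the expected values in the comments reflect the behavior described in this report (default-mode $ also matching just before a final line terminator).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DollarPositions {
    // Returns the start index of every match of regex in input, comma-separated.
    public static String positions(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(m.start());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input from this report: x followed by a carriage return.
        System.out.println(positions("$", 0, "x\r"));                    // 1,2
        System.out.println(positions("$", Pattern.MULTILINE, "x\r"));    // 1,2

        // With an interior line terminator the two modes differ.
        System.out.println(positions("$", 0, "a\nb\n"));                 // 3,4
        System.out.println(positions("$", Pattern.MULTILINE, "a\nb\n")); // 1,3,4
    }
}
```

On "x\r" both modes report a match at index 1, immediately before the carriage return, which is why "$\\s+" can succeed in default mode. The modes only diverge when a line terminator appears before the end of the input.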
I am not the first person to be confused by this behavior. See the following OpenJDK bugs, which all describe this behavior. I do not understand the explanation that has been provided thus far.
https://bugs.openjdk.java.net/browse/JDK-8059325
https://bugs.openjdk.java.net/browse/JDK-8058923
https://bugs.openjdk.java.net/browse/JDK-8049849
https://bugs.openjdk.java.net/browse/JDK-8043255
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile the source code below with javac, then run "java ConfusingDollar". It prints "Default: Matched" instead of the expected "Default: Did not match".
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
On the string "x\r", the Pattern /$\s+/ should not match in default mode. It should only match in MULTILINE mode.
ACTUAL -
It matches in default mode.
---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConfusingDollar {
    public static void main(String[] args) {
        // Default mode: flags = 0, so MULTILINE is not set.
        Pattern p_def = Pattern.compile("$\\s+", 0);
        // Input is the letter x followed by a carriage return.
        Matcher m_def = p_def.matcher("x\r");
        if (m_def.find()) {
            System.out.println("Default: Matched");
        } else {
            System.out.println("Default: Did not match");
        }

        // Same pattern and input, this time with MULTILINE enabled.
        Pattern p_mult = Pattern.compile("$\\s+", Pattern.MULTILINE);
        Matcher m_mult = p_mult.matcher("x\r");
        if (m_mult.find()) {
            System.out.println("MULTILINE: Matched");
        } else {
            System.out.println("MULTILINE: Did not match");
        }
    }
}
---------- END SOURCE ----------
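For comparison, the same docs list two further end anchors: \Z ("the end of the input but for the final terminator, if any") and \z ("the end of the input"). The sketch below, which uses the hypothetical class name EndAnchors and helper finds, suggests that \z is the anchor that behaves the way this report expected $ to behave in default mode.

```java
import java.util.regex.Pattern;

class EndAnchors {
    // True if regex, compiled with no flags, is found anywhere in input.
    public static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "x\r";
        // $ and \Z both match just before the final carriage return,
        // so \s+ can still consume it.
        System.out.println("$\\s+  : " + finds("$\\s+", input));   // true
        System.out.println("\\Z\\s+ : " + finds("\\Z\\s+", input)); // true
        // \z matches only at the very end of the input, leaving nothing for \s+.
        System.out.println("\\z\\s+ : " + finds("\\z\\s+", input)); // false
    }
}
```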
FREQUENCY : always
- duplicates
JDK-8237533 replaceAll is asymmetric for regex's '^' and '$' - Closed
JDK-8296292 Document the default behavior of '$' in regular expressions correctly - Resolved
JDK-8251864 Java Matcher.find $ matches newline in single line mode - Closed
JDK-8278742 Erroneous capture possible if string ends with new-line character - Closed