-
Bug
-
Resolution: Not an Issue
-
P3
-
None
-
11, 15
ADDITIONAL SYSTEM INFORMATION :
Tested on Windows and macOS
A DESCRIPTION OF THE PROBLEM :
We use regular expressions for natural language processing (tokenization) to find breaks between non-repeating punctuation and symbol characters. On Java 11 the same pattern now also finds breaks between different whitespace characters.
REGRESSION : Last worked in version 8u251
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
The 1st and 2nd regex patterns produce different results on the Java 8 and Java 11 runtimes.
The 3rd and 4th patterns seem stable but should be equivalent.
The 1st pattern is intended to match a non-alphanumeric, non-whitespace character that is not followed by the same character.
In Java 11, the 1st pattern now also matches whitespace, and the 2nd now also matches alphanumeric and punctuation characters.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Java HotSpot(TM) 64-Bit Server VM
Oracle Corporation
25.221-b11
Running Test
Text: aabcc<space><newline><newline><tab><tab><newline><space><newline><newline>..,;;
Pattern: ([^0-9a-z&&[^\s]])(?!\1)
Index: 15, 16 Group: . Next Char: ,
Index: 16, 17 Group: , Next Char: ;
Index: 18, 19 Group: ; Next Char:
Pattern: ([^0-9a-z&&[\s]])(?!\1)
Index: 5, 6 Group: <space> Next Char: <newline>
Index: 7, 8 Group: <newline> Next Char: <tab>
Index: 9, 10 Group: <tab> Next Char: <newline>
Index: 10, 11 Group: <newline> Next Char: <space>
Index: 11, 12 Group: <space> Next Char: <newline>
Index: 13, 14 Group: <newline> Next Char: .
Pattern: ([\S&&[\w]])(?!\1)
Index: 1, 2 Group: a Next Char: b
Index: 2, 3 Group: b Next Char: c
Index: 4, 5 Group: c Next Char: <space>
Pattern: ([\S&&[\W]])(?!\1)
Index: 15, 16 Group: . Next Char: ,
Index: 16, 17 Group: , Next Char: ;
Index: 18, 19 Group: ; Next Char:
ACTUAL -
Java HotSpot(TM) 64-Bit Server VM
Oracle Corporation
11.0.7+8-LTS
Running Test
Text: aabcc<space><newline><newline><tab><tab><newline><space><newline><newline>..,;;
Pattern: ([^0-9a-z&&[^\s]])(?!\1)
Index: 5, 6 Group: <space> Next Char: <newline>
Index: 7, 8 Group: <newline> Next Char: <tab>
Index: 9, 10 Group: <tab> Next Char: <newline>
Index: 10, 11 Group: <newline> Next Char: <space>
Index: 11, 12 Group: <space> Next Char: <newline>
Index: 13, 14 Group: <newline> Next Char: .
Index: 15, 16 Group: . Next Char: ,
Index: 16, 17 Group: , Next Char: ;
Index: 18, 19 Group: ; Next Char:
Pattern: ([^0-9a-z&&[\s]])(?!\1)
Index: 1, 2 Group: a Next Char: b
Index: 2, 3 Group: b Next Char: c
Index: 4, 5 Group: c Next Char: <space>
Index: 5, 6 Group: <space> Next Char: <newline>
Index: 7, 8 Group: <newline> Next Char: <tab>
Index: 9, 10 Group: <tab> Next Char: <newline>
Index: 10, 11 Group: <newline> Next Char: <space>
Index: 11, 12 Group: <space> Next Char: <newline>
Index: 13, 14 Group: <newline> Next Char: .
Index: 15, 16 Group: . Next Char: ,
Index: 16, 17 Group: , Next Char: ;
Index: 18, 19 Group: ; Next Char:
Pattern: ([\S&&[\w]])(?!\1)
Index: 1, 2 Group: a Next Char: b
Index: 2, 3 Group: b Next Char: c
Index: 4, 5 Group: c Next Char: <space>
Pattern: ([\S&&[\W]])(?!\1)
Index: 15, 16 Group: . Next Char: ,
Index: 16, 17 Group: , Next Char: ;
Index: 18, 19 Group: ; Next Char:
---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex {

    public static void main(String[] args) {
        // Identify the runtime so results from different JVMs can be compared.
        System.out.println(System.getProperty("java.vm.name"));
        System.out.println(System.getProperty("java.vm.vendor"));
        System.out.println(System.getProperty("java.vm.version"));
        System.out.println("Running Test");
        String[] testRegex = new String[] {
            "([^0-9a-z&&[^\\s]])(?!\\1)",
            "([^0-9a-z&&[\\s]])(?!\\1)",
            "([\\S&&[\\w]])(?!\\1)",
            "([\\S&&[\\W]])(?!\\1)"
        };
        String text = "aabcc \n\n\t\t\n \n\n..,;;";
        System.out.println("Text: " + printable(text));
        System.out.println();
        for (String regex : testRegex) {
            Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            System.out.println("Pattern: " + pattern.pattern());
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                // Character immediately after the match, or empty at end of input.
                String next = m.end() < text.length() ? "" + text.charAt(m.end()) : "";
                System.out.println("\tIndex: " + m.start() + ", " + m.end()
                        + " Group: " + printable(m.group())
                        + " Next Char: " + printable(next));
            }
        }
    }

    // Replaces whitespace characters with visible placeholders.
    public static String printable(String text) {
        text = text.replaceAll("\t", "<tab>");
        text = text.replaceAll("\n", "<newline>");
        text = text.replaceAll(" ", "<space>");
        return text;
    }
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
The workaround is to use the predefined \S, \w and \W character classes, which treat the underscore as a word character; this is a loss of precision.
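Another possible workaround, sketched here under the assumption that the intent is exactly what the report describes (a non-alphanumeric, non-whitespace character not followed by itself, and a whitespace character not followed by itself), is to avoid the negated intersection altogether. The class name `PortableTokenBreaks`, the constants and the helper `matchStarts` are hypothetical names introduced for illustration; they are not part of the original test case. A single negated class such as `[^0-9a-z\s]` has no nested class or `&&` operator, so it should not depend on how a runtime applies `^` to an intersection, and it still treats the underscore as a symbol.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortableTokenBreaks {

    // Hypothetical rewrite of the 1st pattern: one negated class, no intersection.
    static final String PUNCT_BREAK = "([^0-9a-z\\s])(?!\\1)";

    // Hypothetical rewrite of the 2nd pattern: whitespace is never alphanumeric,
    // so the intersection with [^0-9a-z] is assumed to be unnecessary.
    static final String SPACE_BREAK = "(\\s)(?!\\1)";

    // Returns the start index of every match of regex in text.
    static List<Integer> matchStarts(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Same input text as the test case in this report.
        String text = "aabcc \n\n\t\t\n \n\n..,;;";
        System.out.println(matchStarts(PUNCT_BREAK, text)); // [15, 16, 18]
        System.out.println(matchStarts(SPACE_BREAK, text)); // [5, 7, 9, 10, 11, 13]
    }
}
```

The printed start indices correspond to the matches listed under EXPECTED for the 1st and 2nd patterns.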
FREQUENCY : always
- relates to
-
JDK-6609854 Regex does not match correctly for negative nested character classes
- Resolved