- Bug
- Resolution: Unresolved
- P4
- 8, 9
- generic
- generic
FULL PRODUCT VERSION :
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Darwin 32770 15.4.0 Darwin Kernel Version 15.4.0: Fri Feb 26 22:08:05 PST 2016; root:xnu-3248.40.184~3/RELEASE_X86_64 x86_64
A DESCRIPTION OF THE PROBLEM :
Regex patterns that do not contain isolated surrogate code points can still match the low (second) surrogate of a complete surrogate pair. Example:
pattern: "[^\\x{10000}]"
target: "\ud800\udc00"
The pattern matches the low surrogate code unit of the pair on its own, when it should treat the surrogate pair as a single code point (and therefore find no match).
Closely related is this bug: https://bugs.openjdk.java.net/browse/JDK-8149446. This report expands on that one by using a regex pattern that contains no isolated surrogate code points, which I would argue makes it higher priority.
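For readers less familiar with UTF-16, the following sketch shows why the target string is a single code point stored as two char values. It uses only standard java.lang.Character and String methods; the class name SurrogatePairDemo and the helper codeUnits are illustrative names, not part of the original report.

```java
public class SurrogatePairDemo {
    /** Returns the UTF-16 code units of a code point as space-separated hex (illustrative helper). */
    public static String codeUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the target from the code point rather than from escapes.
        String text = new String(Character.toChars(0x10000));

        System.out.println(codeUnits(0x10000));                         // d800 dc00
        System.out.println(text.length());                              // 2 (UTF-16 code units)
        System.out.println(text.codePointCount(0, text.length()));      // 1 (code point)
        System.out.println(Character.isHighSurrogate(text.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(text.charAt(1)));   // true
    }
}
```

A code-point-aware matcher should therefore see exactly one character in the target, so a class that excludes U+10000 has nothing left to match.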
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile and run the source code below and observe the unexpected result on the third line of output.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Stdout:
true
true
false
ACTUAL -
Stdout:
true
true
true
REPRODUCIBILITY :
This bug can always be reproduced.
---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;

public class TestCase {
    public static void main(String[] args) {
        String text = "\ud800\udc00"; // U+10000 encoded as a surrogate pair

        // Expected behaviour: both patterns match the whole code point
        System.out.println(Pattern.compile("\\x{10000}").matcher(text).find()); // true
        System.out.println(Pattern.compile("[\\x{10000}]").matcher(text).find()); // true

        // Unexpected behaviour: the negated class matches the low surrogate alone
        System.out.println(Pattern.compile("[^\\x{10000}]").matcher(text).find()); // prints true, expected false
    }
}
---------- END SOURCE ----------
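To confirm which code unit the negated class matches, a diagnostic along the following lines may help. It is a sketch only: MatchOffsetProbe, splitsPair, and probe are hypothetical names introduced here, and the output for the negated class is deliberately not annotated because it depends on whether the JDK in use exhibits the bug.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchOffsetProbe {
    /** True if index points at a low surrogate that directly follows a high surrogate. */
    public static boolean splitsPair(CharSequence text, int index) {
        return index > 0
                && index < text.length()
                && Character.isLowSurrogate(text.charAt(index))
                && Character.isHighSurrogate(text.charAt(index - 1));
    }

    /** Describes the first match of regex in text, flagging a match that starts mid-pair. */
    public static String probe(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return "no match";
        }
        String result = "match at [" + m.start() + ", " + m.end() + ")";
        if (splitsPair(text, m.start())) {
            result += " (starts inside a surrogate pair)";
        }
        return result;
    }

    public static void main(String[] args) {
        String text = new String(Character.toChars(0x10000));

        // The positive patterns from the report should cover both code units.
        System.out.println(probe("\\x{10000}", text));
        System.out.println(probe("[\\x{10000}]", text));

        // On an affected JDK this is expected to report a match starting at index 1.
        System.out.println(probe("[^\\x{10000}]", text));
    }
}
```

A match whose start index satisfies splitsPair means the matcher stepped into the middle of a supplementary character, which is the behaviour this report describes.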