Bug
Resolution: Unresolved
P4
None
7u45
windows_8
FULL PRODUCT VERSION :
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
A DESCRIPTION OF THE PROBLEM :
Regex patterns do not support matching supplementary characters unless the pattern itself contains a supplementary character (this is how the compile() method of the java.util.regex.Pattern class works). However, some patterns are supposed to take supplementary characters into account even though they contain none.
For example, the pattern "(?U)[\W]" is supposed to match every non-word character, taking all Unicode code points into account. It should not match any part of a surrogate pair that represents a word character, but the low surrogate of such a pair is still matched.
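The following is a minimal sketch for inspecting the situation described above. It decodes the surrogate pair used in this report and shows where "(?U)[\W]" first matches. The class name SurrogateMatchProbe and the helper firstNonWordIndex are hypothetical names introduced for illustration only. The index printed on the last line depends on whether the JDK in use exhibits the behavior reported here, so no expected value is given for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SurrogateMatchProbe {

    /** Returns the char index where "(?U)[\\W]" first matches in s, or -1 if it never matches. */
    static int firstNonWordIndex(String s) {
        Matcher m = Pattern.compile("(?U)[\\W]").matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String str = "\ud84c\udfb4";

        // The two chars form one supplementary code point.
        int cp = str.codePointAt(0);
        System.out.println(Integer.toHexString(cp));                  // prints "233b4"
        System.out.println(Character.isLetter(cp));                   // prints "true": a word character
        System.out.println(Character.isLowSurrogate(str.charAt(1)));  // prints "true"

        // An index of 1 means the match started at the low surrogate, which is
        // the behavior this report describes; -1 means nothing was matched.
        System.out.println(firstNonWordIndex(str));
    }
}
```

An index of 1 would indicate that the matcher tried a starting position in the middle of the surrogate pair, where the lone low surrogate is treated as a non-word character.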
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the following code:
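// Requires: import java.util.regex.Pattern;
// The surrogate pair below encodes U+233B4, a supplementary word character.
String str = "\ud84c\udfb4";
System.out.println(Pattern.compile("(?U)[\\W]").matcher(str).find());
// Same pattern, with a supplementary character added inside a negative lookahead.
System.out.println(Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]").matcher(str).find());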
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The program should print:
false
false
ACTUAL -
The program prints:
true
false
REPRODUCIBILITY :
This bug can be reproduced always.
CUSTOMER SUBMITTED WORKAROUND :
As shown above, you can include a supplementary character in the pattern in a way that does not affect matching in practice. For example, you could add (?!\uDB80\uDC00), a negative lookahead on a supplementary character from a private-use area (U+F0000). It only rejects matches that start at that code point.
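The workaround could be wrapped in a small helper, sketched below. The class name SupplementaryAwarePatterns, the constant GUARD, and the method compile are hypothetical and are not part of the JDK. The sketch assumes that U+F0000 never occurs in the input, because the lookahead rejects any match that starts at that code point.

```java
import java.util.regex.Pattern;

final class SupplementaryAwarePatterns {

    // Negative lookahead on U+F0000 (a private-use supplementary code point).
    // Its only purpose is to put a supplementary character into the pattern.
    // Assumption: the input never contains U+F0000.
    private static final String GUARD =
            "(?!" + new String(Character.toChars(0xF0000)) + ")";

    /** Compiles regex with the guard prepended, as in the workaround above. */
    static Pattern compile(String regex) {
        return Pattern.compile(GUARD + regex);
    }

    public static void main(String[] args) {
        String str = "\ud84c\udfb4"; // U+233B4, a word character
        System.out.println(compile("(?U)[\\W]").matcher(str).find());   // prints "false"
        System.out.println(compile("(?U)[\\W]").matcher("a b").find()); // prints "true" (the space)
    }
}
```

Prepending the guard keeps any inline flags such as (?U) in the original expression intact, since they still apply to everything that follows them.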