- Bug
- Resolution: Won't Fix
- P4
- None
- 8u72, 9
- generic
- generic
FULL PRODUCT VERSION :
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Linux tigermilk 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
A DESCRIPTION OF THE PROBLEM :
Regex patterns or partial patterns representing low surrogate codes match the latter half of complete surrogate pairs.
Example patterns that cause the problem (written as Java string literals):
- "\\udc00"
- "\\x{dc00}"
- "[\\udc00-\\udfff]"
- "[\\x{dc00}-\\x{dfff}]"
- "[\\p{blk=Low Surrogates}]"
The above patterns match the latter half of complete surrogate pairs such as "\ud800\udc00", which represents the single code point U+10000.
This behavior violates the requirement of "RL1.7 Supplementary Code Points" in "Unicode Technical Standard #18" (http://www.unicode.org/reports/tr18/), which says:
"A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units."
and:
"a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching."
The standard permits a match against "an isolated surrogate code point," but does not permit a match against part of a complete surrogate pair.
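To make the code point versus code unit distinction concrete, the following minimal sketch inspects the example string using only standard `String` and `java.lang.Character` methods. The class name `CodePointView` and the helper `splitsSurrogatePair` are hypothetical names introduced for illustration; they are not part of this report or of the JDK.

```java
// Hypothetical illustration; the class and method names are not from the report.
final class CodePointView {

    // Returns true if `index` falls between the two halves of a
    // well-formed surrogate pair in `text`.
    static boolean splitsSurrogatePair(CharSequence text, int index) {
        return index > 0
                && index < text.length()
                && Character.isHighSurrogate(text.charAt(index - 1))
                && Character.isLowSurrogate(text.charAt(index));
    }

    public static void main(String[] args) {
        String text = "\ud800\udc00";
        // Two UTF-16 code units ...
        System.out.println(text.length());                             // 2
        // ... but a single code point.
        System.out.println(text.codePointCount(0, text.length()));     // 1
        System.out.println(Integer.toHexString(text.codePointAt(0)));  // 10000
        // Index 1 is in the middle of the pair, so a match starting there
        // begins inside a code point rather than on a code point boundary.
        System.out.println(splitsSurrogatePair(text, 1));              // true
    }
}
```

A match that begins at an index for which `splitsSurrogatePair` returns true is the kind of match the quoted requirement rules out.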
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile and execute the source code below.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Standard output:
false
false
false
false
false
false
false
ACTUAL -
Standard output:
true
true
true
true
true
false
false
REPRODUCIBILITY :
This bug can always be reproduced.
---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {
    public static void main(String[] args) {
        String text = "\ud800\udc00"; // U+10000

        // Patterns which wrongly match the latter half of the surrogate pair.
        System.out.println(Pattern.compile("\\udc00").matcher(text).find());
        System.out.println(Pattern.compile("\\x{dc00}").matcher(text).find());
        System.out.println(Pattern.compile("[\\udc00-\\udfff]").matcher(text).find());
        System.out.println(Pattern.compile("[\\x{dc00}-\\x{dfff}]").matcher(text).find());
        System.out.println(Pattern.compile("[\\p{blk=Low Surrogates}]").matcher(text).find());

        // These patterns do not cause the problem.
        System.out.println(Pattern.compile("\udc00").matcher(text).find());
        System.out.println(Pattern.compile("[\udc00-\udfff]").matcher(text).find());
    }
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Pass unescaped surrogate codes to Pattern.compile, such as "\udc00" or "[\udc00-\udfff]".
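Where the escaped forms cannot be avoided (for example, when the pattern string comes from configuration), one possible alternative is to post-filter matches and discard any that start or end inside a surrogate pair. This is a sketch under that assumption, not something proposed in the report: `BoundaryAwareFind`, `insidePair`, and `findOnCodePointBoundary` are hypothetical names, and only standard `java.util.regex` and `java.lang.Character` calls are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the report or of java.util.regex.
final class BoundaryAwareFind {

    // True if `index` lies between a high surrogate and the low surrogate
    // that immediately follows it.
    static boolean insidePair(CharSequence text, int index) {
        return index > 0
                && index < text.length()
                && Character.isHighSurrogate(text.charAt(index - 1))
                && Character.isLowSurrogate(text.charAt(index));
    }

    // Like Matcher.find(), but ignores matches that start or end
    // inside a surrogate pair.
    static boolean findOnCodePointBoundary(Pattern pattern, CharSequence text) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            if (!insidePair(text, m.start()) && !insidePair(text, m.end())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern low = Pattern.compile("\\udc00");
        // The only candidate match starts mid-pair, so it is rejected.
        System.out.println(findOnCodePointBoundary(low, "\ud800\udc00")); // false
        // An isolated low surrogate is still allowed to match.
        System.out.println(findOnCodePointBoundary(low, "x\udc00"));
    }
}
```

The filter only inspects the matches that `Matcher.find()` happens to report, so a rejected match can hide an overlapping match that would have been acceptable; it is a mitigation rather than a substitute for code point aware matching.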
- relates to
- JDK-8235495 Match error for Pattern containing an inverted range of surrogate pairs
- Open