Type: Bug
Resolution: Won't Fix
Priority: P4
Fix Version/s: None
Affects Version/s: 8u72, 9
Component/s: core-libs
Labels:

Subcomponent: java.util.regex
CPU: generic
OS: generic

FULL PRODUCT VERSION :
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Linux tigermilk 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

A DESCRIPTION OF THE PROBLEM :
Regex patterns, or parts of patterns, that denote low surrogate code units match the second half of a complete surrogate pair.

Example patterns that cause the problem (written as Java string literals):

- "\\udc00"
- "\\x{dc00}"
- "[\\udc00-\\udfff]"
- "[\\x{dc00}-\\x{dfff}]"
- "[\\p{blk=Low Surrogates}]"

The patterns above match the second half of a complete surrogate pair such as "\ud800\udc00", which represents the single code point U+10000.

This behavior violates requirement "RL1.7 Supplementary Code Points" of "Unicode Technical Standard #18" (http://www.unicode.org/reports/tr18/), which says:

"A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units."

and:

"a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching."

The standard permits a match against "an isolated surrogate code point," but not against part of a complete surrogate pair.
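
The difference between code units and code points can be seen directly with the String and Character APIs. The following is a minimal sketch and not part of the original report; the class name CodePointView and the helper isTrailingHalfOfPair are illustrative names only.

```java
public class CodePointView {
    // Illustrative helper (an assumption, not a JDK method): true if the char
    // at 'index' is the trailing half of a complete surrogate pair.
    static boolean isTrailingHalfOfPair(CharSequence text, int index) {
        return index > 0
                && index < text.length()
                && Character.isSurrogatePair(text.charAt(index - 1), text.charAt(index));
    }

    public static void main(String[] args) {
        String pair = "\ud800\udc00";  // one code point, U+10000
        String isolated = "a\udc00";   // 'a' followed by an isolated low surrogate

        System.out.println(pair.length());                             // 2 code units
        System.out.println(pair.codePointCount(0, pair.length()));     // 1 code point
        System.out.println(Integer.toHexString(pair.codePointAt(0)));  // 10000
        System.out.println(isTrailingHalfOfPair(pair, 1));             // true
        System.out.println(isTrailingHalfOfPair(isolated, 1));         // false
    }
}
```

Under RL1.7, only the second string contains a low surrogate that a pattern such as "\\udc00" may match.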

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile and execute the source code below.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Standard output:

false
false
false
false
false
false
false
ACTUAL -
Standard output:

true
true
true
true
true
false
false

REPRODUCIBILITY :
This bug is always reproducible.

---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {
    public static void main(String[] args) {
        String text = "\ud800\udc00"; // U+010000

        // Patterns which wrongly match the latter half of the surrogate pair.
        System.out.println(Pattern.compile("\\udc00").matcher(text).find());
        System.out.println(Pattern.compile("\\x{dc00}").matcher(text).find());
        System.out.println(Pattern.compile("[\\udc00-\\udfff]").matcher(text).find());
        System.out.println(Pattern.compile("[\\x{dc00}-\\x{dfff}]").matcher(text).find());
        System.out.println(Pattern.compile("[\\p{blk=Low Surrogates}]").matcher(text).find());

        // These patterns contain the surrogate as a literal char rather than
        // a regex escape, and do not cause the problem.
        System.out.println(Pattern.compile("\udc00").matcher(text).find());
        System.out.println(Pattern.compile("[\udc00-\udfff]").matcher(text).find());
    }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Pass the surrogate to Pattern.compile as a literal character rather than as a regex escape, for example "\udc00" or "[\udc00-\udfff]" (a single backslash in the Java string literal).
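
Where the pattern text cannot be changed, another option is to filter the matches after the fact. The sketch below is an assumption and not part of the submitted workaround: the class IsolatedSurrogateFinder and its method findIsolated are hypothetical names, and the helper only inspects the start of each match, so it suits patterns that match a single low surrogate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsolatedSurrogateFinder {
    // Hypothetical helper: returns the index of the first match that does not
    // start on the trailing half of a complete surrogate pair, or -1 if none.
    static int findIsolated(Pattern pattern, CharSequence text) {
        Matcher m = pattern.matcher(text);
        int from = 0;
        // Matcher.find(int) resets the matcher and searches from 'from'.
        while (from <= text.length() && m.find(from)) {
            int start = m.start();
            boolean insidePair = start > 0
                    && start < text.length()
                    && Character.isSurrogatePair(text.charAt(start - 1), text.charAt(start));
            if (!insidePair) {
                return start;
            }
            from = start + 1; // skip the rejected match and keep searching
        }
        return -1;
    }

    public static void main(String[] args) {
        Pattern low = Pattern.compile("\\udc00");
        System.out.println(findIsolated(low, "\ud800\udc00")); // -1
        System.out.println(findIsolated(low, "a\udc00"));      // 1
    }
}
```

The result does not depend on whether the underlying Matcher reports the match inside the pair, because such matches are discarded either way.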

Attachment: JI9029322.java (0.8 kB, 2016-02-09 02:00)

Relates to: JDK-8235495 Match error for Pattern containing an inverted range of surrogate pairs (Open)

Assignee: Xueming Shen
Reporter: Webbug Group
Votes: 0
Watchers: 4

Created: 2016-02-07 04:47
Updated: 2024-10-09 12:36
Resolved: 2016-05-25 14:37
