JDK-8149446

Low surrogates in regex patterns match the latter half of complete surrogate pairs


      FULL PRODUCT VERSION :
      java version "1.8.0_72"
      Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)


      ADDITIONAL OS VERSION INFORMATION :
      Linux tigermilk 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

      A DESCRIPTION OF THE PROBLEM :
      Regex patterns or partial patterns representing low surrogate codes match the latter half of complete surrogate pairs.

      Example patterns that cause the problem (written as Java string literals):

      - "\\udc00"
      - "\\x{dc00}"
      - "[\\udc00-\\udfff]"
      - "[\\x{dc00}-\\x{dfff}]"
      - "[\\p{blk=Low Surrogates}]"

      The patterns above match the latter half of a complete surrogate pair such as "\ud800\udc00", which represents the single code point U+10000.
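
      To make the "single code point" claim concrete, here is a minimal sketch that inspects the example string with standard java.lang.String and java.lang.Character methods. The class name SurrogatePairDemo is hypothetical and not part of the original report.

```java
public class SurrogatePairDemo {
    // Example text from the report: one supplementary code point stored as two chars.
    public static final String TEXT = "\ud800\udc00";

    public static void main(String[] args) {
        // Number of UTF-16 code units (chars) in the string.
        System.out.println(TEXT.length());                            // 2
        // Number of Unicode code points in the string.
        System.out.println(TEXT.codePointCount(0, TEXT.length()));    // 1
        // The code point value, printed in hexadecimal.
        System.out.println(Integer.toHexString(TEXT.codePointAt(0))); // 10000
        // The first char is the high (leading) surrogate, the second the low (trailing) one.
        System.out.println(Character.isHighSurrogate(TEXT.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(TEXT.charAt(1)));  // true
    }
}
```

      A matcher that works by code point should therefore see one item in this text, not two.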

      This behavior violates the requirement stated under "RL1.7 Supplementary Code Points" in "Unicode Technical Standard #18" (http://www.unicode.org/reports/tr18/), which says:

      "A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units."

      and:

      "a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching."

      The standard permits a match against "an isolated surrogate code point," but does not permit a match against part of a complete surrogate pair.
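
      The distinction between an isolated surrogate and half of a pair can be sketched as follows. The class and method names (IsolatedSurrogateDemo, finds) are hypothetical. The second result is left without an expected value because, per this report, it depends on the JDK build in use.

```java
import java.util.regex.Pattern;

public class IsolatedSurrogateDemo {
    // Returns true if the regex finds a match anywhere in the text.
    public static boolean finds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // A low surrogate with no high surrogate before it is an isolated surrogate code point.
        String isolated = "a\udc00b";
        // The same low surrogate as the second half of a complete pair.
        String paired = "\ud800\udc00";

        // Matching an isolated surrogate is permitted by the standard.
        System.out.println(finds("\\udc00", isolated)); // true
        // The report observes true here on 1.8.0_72, while the standard calls for false.
        System.out.println(finds("\\udc00", paired));
    }
}
```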


      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Compile and execute the source code below.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Standard output:

      false
      false
      false
      false
      false
      false
      false
      ACTUAL -
      Standard output:

      true
      true
      true
      true
      true
      false
      false


      REPRODUCIBILITY :
      This bug can always be reproduced.

      ---------- BEGIN SOURCE ----------
      import java.util.regex.Pattern;
      import java.util.regex.Matcher;

      public class Regex {
          public static void main(String[] args) {
              String text = "\ud800\udc00"; // U+010000

              // Patterns which wrongly match the latter half of the surrogate pair.
              System.out.println(Pattern.compile("\\udc00").matcher(text).find());
              System.out.println(Pattern.compile("\\x{dc00}").matcher(text).find());
              System.out.println(Pattern.compile("[\\udc00-\\udfff]").matcher(text).find());
              System.out.println(Pattern.compile("[\\x{dc00}-\\x{dfff}]").matcher(text).find());
              System.out.println(Pattern.compile("[\\p{blk=Low Surrogates}]").matcher(text).find());

              // These patterns do not cause the problem.
              System.out.println(Pattern.compile("\udc00").matcher(text).find());
              System.out.println(Pattern.compile("[\udc00-\udfff]").matcher(text).find());
          }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      Pass the surrogate code units to Pattern.compile as literal characters rather than as regex escapes, for example "\udc00" or "[\udc00-\udfff]".
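
      Where the pattern text cannot be changed, another possible approach (an assumption of this sketch, not something proposed in the report) is to discard any match that begins between the two halves of a surrogate pair. The class SurrogateAwareFind and its method findOutsidePairs are hypothetical names. The check covers only the start of a match, not its end.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateAwareFind {
    // Returns true if the pattern has at least one match that does not
    // begin in the middle of a complete surrogate pair.
    public static boolean findOutsidePairs(Pattern pattern, CharSequence text) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            int start = m.start();
            // The match splits a pair when the char before the match start is a
            // high surrogate and the char at the match start is a low surrogate.
            boolean splitsPair = start > 0
                    && start < text.length()
                    && Character.isHighSurrogate(text.charAt(start - 1))
                    && Character.isLowSurrogate(text.charAt(start));
            if (!splitsPair) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern lowSurrogate = Pattern.compile("\\udc00");
        // Complete pair: any match on the low half is filtered out.
        System.out.println(findOutsidePairs(lowSurrogate, "\ud800\udc00")); // false
        // Isolated low surrogate: the match is kept.
        System.out.println(findOutsidePairs(lowSurrogate, "a\udc00b"));     // true
    }
}
```

      The filter gives the same result whether or not the underlying matcher reports the low half of a pair, so it should remain harmless on a JDK where the behavior has changed.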

            sherman Xueming Shen
            webbuggrp Webbug Group