JDK-8029966

Broken supplementary character support in regex

    • Type: Bug
    • Resolution: Unresolved
    • Priority: P4
    • None
    • 7u45
    • core-libs

      FULL PRODUCT VERSION :
      java version "1.7.0_45"
      Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
      Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

      A DESCRIPTION OF THE PROBLEM :
      The regex engine matches supplementary characters correctly only when the pattern itself contains a supplementary character (this is how the compile() method of the java.util.regex.Pattern class decides whether to enable that support). However, some patterns are supposed to take supplementary characters into account even though they contain none.

      For example, the pattern "(?U)[\W]" is supposed to match every non-word character, taking all Unicode code points into account. It should not match any part of a surrogate pair that represents a word character, yet the low surrogate of such a pair is still matched.
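The following sketch illustrates why the low surrogate is the part that gets matched. It does not reproduce the internals of java.util.regex.Pattern; the class name SurrogateInspection and the helpers charStarts and codePointStarts are hypothetical names introduced only for this illustration. It assumes that a scan which is not supplementary-aware tries every char index as a match start, while a supplementary-aware scan advances one code point at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateInspection {

    /** Start indices visited by a scan that advances one char at a time. */
    public static List<Integer> charStarts(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            starts.add(i);
        }
        return starts;
    }

    /** Start indices visited by a scan that advances one code point at a time. */
    public static List<Integer> codePointStarts(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            starts.add(i);
        }
        return starts;
    }

    public static void main(String[] args) {
        String str = "\ud84c\udfb4";          // the test string from this report
        int cp = str.codePointAt(0);

        System.out.println(Integer.toHexString(cp));                   // 233b4
        System.out.println(Character.isLetter(cp));                    // true
        // Taken on its own, the low surrogate is neither a letter nor a digit.
        System.out.println(Character.isLetterOrDigit(str.charAt(1)));  // false

        System.out.println(charStarts(str));       // [0, 1]
        System.out.println(codePointStarts(str));  // [0]
    }
}
```

Under that assumption, a char-based scan reaches index 1, where the isolated low surrogate looks like a non-word character, whereas a code-point-based scan never starts there.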

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the following code:

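      import java.util.regex.Pattern;

      // U+233B4, a supplementary CJK ideograph, encoded as a surrogate pair
      String str = "\ud84c\udfb4";
      System.out.println(Pattern.compile("(?U)[\\W]").matcher(str).find());
      System.out.println(Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]").matcher(str).find());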

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      The program should print:

      false
      false
      ACTUAL -
      The program prints:

      true
      false

      REPRODUCIBILITY :
      This bug can be reproduced always.

      CUSTOMER SUBMITTED WORKAROUND :
      As shown above, you can add a supplementary character to the pattern in a way that does not affect matching. For example, you could add (?!\uDB80\uDC00), a negative lookahead for U+F0000, which is a supplementary character in the private use range.
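A minimal sketch of how this workaround could be wrapped for reuse. The class SupplementaryAwarePattern and its compile method are hypothetical and not part of the JDK. The sketch assumes the input never contains U+F0000, because the lookahead rejects a match starting at that character.

```java
import java.util.regex.Pattern;

public class SupplementaryAwarePattern {

    // U+F0000, the first code point of Supplementary Private Use Area-A,
    // written as a surrogate pair inside a negative lookahead.
    private static final String NO_OP_LOOKAHEAD = "(?!\uDB80\uDC00)";

    /**
     * Hypothetical helper: puts a supplementary character into the pattern
     * so that the regex engine is compiled with supplementary support.
     * The regex is wrapped in a non-capturing group so that a top-level
     * alternation stays entirely behind the lookahead.
     */
    public static Pattern compile(String regex) {
        return Pattern.compile(NO_OP_LOOKAHEAD + "(?:" + regex + ")");
    }

    public static void main(String[] args) {
        String str = "\ud84c\udfb4";  // the test string from this report

        // The surrogate pair is a word character, so nothing should match.
        System.out.println(compile("(?U)[\\W]").matcher(str).find());    // false

        // Ordinary non-word characters are still found.
        System.out.println(compile("(?U)[\\W]").matcher("a b").find());  // true
    }
}
```

Group numbering is unchanged by the wrapper, since both the lookahead and the added group are non-capturing.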

      Assignee: Xueming Shen (sherman)
      Reporter: Webbug Group (webbuggrp)