JDK-8235495

Match error for Pattern containing an inverted range of surrogate pairs


    • Type: Bug
    • Resolution: Unresolved
    • Priority: P4
    • tbd
    • 8, 9, 11, 13, 14
    • core-libs
    • None

      It seems that java.util.regex.Pattern and java.util.regex.Matcher have problems matching a pattern that is an inverted (negated) range of code points represented by surrogate pairs.

      The following example creates two patterns. The first one is the range of Unicode code points (CJK ideographs) 0x205A0 to 0x205AF (see [1,2]), which are represented by the surrogate pairs [0xD841 0xDDA0] and [0xD841 0xDDAF] respectively. The second pattern is the negation of this range (i.e. all characters which are not within the range).
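
      The surrogate pairs quoted above can be checked with a small sketch like the following. The class and method names (SurrogatePairs, utf16Units) are made up for illustration and are not part of the original reproducer; only the standard Character.toChars call is used.

```java
// Hypothetical helper, not part of the original report.
class SurrogatePairs {
    /** Formats a code point as its UTF-16 code units, e.g. "[0xD841 0xDDA0]". */
    static String utf16Units(int codePoint) {
        // Character.toChars returns one char for BMP code points
        // and a high/low surrogate pair for supplementary ones.
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < units.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("0x%04X", (int) units[i]));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x205A0)); // prints [0xD841 0xDDA0]
        System.out.println(utf16Units(0x205AF)); // prints [0xD841 0xDDAF]
    }
}
```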

      We then match a string consisting of the single CJK ideograph 0x205AC [3], represented by the surrogate pair [0xD841 0xDDAC], against the two patterns. Because 0x205AC is within the range, the first pattern should match, while the second shouldn't.

      import java.lang.String;
      import java.util.regex.Pattern;
      import java.util.regex.Matcher;

      public class InvertSurrogateMatch {
        public static void main(String[] args) {
          String s = new StringBuilder().appendCodePoint(0x205AC).toString();
          System.out.printf("s.length() = %d, s[0] = %x, s[1] = %x\n", s.length(), (int)s.charAt(0), (int)s.charAt(1));
          Pattern normal = Pattern.compile("[\\x{205A0}-\\x{205AF}]");
          Pattern inverted = Pattern.compile("[^\\x{205A0}-\\x{205AF}]");
          Matcher m;

          m = normal.matcher(s);
          if (m.find()) System.out.println("Normal: Found match at: " + m.start());

          m = inverted.matcher(s);
          if (m.find()) System.out.println("Inverted: Found match at: " + m.start());
        }
      }

      The output of the example is as follows (no matter which version of Java 8 to 14 we use):

      s.length() = 2, s[0] = d841, s[1] = ddac
      Normal: Found match at: 0
      Inverted: Found match at: 1

      As you can see, not only the first, but also the second pattern matches, which is strange because logically one character cannot be within a character range and within the inverted (or negated) range at the same time.

      I'm not sure, but I think this issue may be related to "JDK-8149446: Low surrogates in regex patterns match the latter half of complete surrogate pairs" [4]. One may argue that "0xDDAC", the low surrogate part of 0x205AC, is an isolated surrogate code point which is obviously not in the range "[0x205A0-0x205AF]" and is thus part of the inverted range "[^0x205A0-0x205AF]". And now we're exactly at the point where JDK-8149446 comes into play: "Low surrogates in regex patterns match the latter half of complete surrogate pairs" :)
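
      To make the "match starts at the low surrogate" observation concrete, here is a minimal caller-side sketch that skips any match whose start index falls between the two halves of a surrogate pair. It is only an illustration of the reasoning above, not a fix for the JDK and not anything proposed in JDK-8149446; the names SurrogateAwareFind, splitsSurrogatePair and findOnCodePointBoundary are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or the original report.
class SurrogateAwareFind {
    /** True if index points at a low surrogate directly preceded by a high surrogate. */
    static boolean splitsSurrogatePair(CharSequence text, int index) {
        return index > 0
            && index < text.length()
            && Character.isLowSurrogate(text.charAt(index))
            && Character.isHighSurrogate(text.charAt(index - 1));
    }

    /** Start of the first match that begins on a code point boundary, or -1 if none. */
    static int findOnCodePointBoundary(Pattern pattern, CharSequence text) {
        Matcher m = pattern.matcher(text);
        int from = 0;
        while (from <= text.length() && m.find(from)) {
            if (!splitsSurrogatePair(text, m.start())) {
                return m.start();
            }
            from = m.start() + 1; // retry after the rejected position
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = new StringBuilder().appendCodePoint(0x205AC).toString();
        Pattern normal = Pattern.compile("[\\x{205A0}-\\x{205AF}]");
        Pattern inverted = Pattern.compile("[^\\x{205A0}-\\x{205AF}]");
        System.out.println(findOnCodePointBoundary(normal, s));   // prints 0
        System.out.println(findOnCodePointBoundary(inverted, s)); // prints -1
    }
}
```

      With this filter the inverted pattern no longer reports a match for the single ideograph, because the only candidate position (index 1) lies inside the surrogate pair.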

      So maybe we should really fix JDK-8149446?

      [1] https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=205A0
      [2] https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=205AF
      [3] https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=205AC
      [4] https://bugs.openjdk.java.net/browse/JDK-8149446

            rgiulietti Raffaello Giulietti
            simonis Volker Simonis