JDK-8249446

CANON_EQ causes StringIndexOutOfBoundsException when pattern contains supplementary codepoint


      ADDITIONAL SYSTEM INFORMATION :
      > uname -a
      Linux marcy 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64 GNU/Linux

      > java -version
      openjdk version "12.0.1" 2019-04-16
      OpenJDK Runtime Environment AdoptOpenJDK (build 12.0.1+12)
      OpenJDK 64-Bit Server VM AdoptOpenJDK (build 12.0.1+12, mixed mode, sharing)


      A DESCRIPTION OF THE PROBLEM :
      Pattern.compile throws a StringIndexOutOfBoundsException if the pattern contains a literal supplementary codepoint (a surrogate pair) and the flags include CANON_EQ. The problem does not occur if supplementary codepoints are written only in \x{...} notation.

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Pass a String containing a valid surrogate pair to Pattern.compile, along with a flags argument that includes CANON_EQ.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Pattern should compile successfully.
      ACTUAL -
      Pattern.compile method throws this exception:

      Exception in thread "main" java.lang.StringIndexOutOfBoundsException: index 2,length 2
      at java.base/java.lang.String.checkIndex(String.java:3369)
      at java.base/java.lang.String.codePointAt(String.java:736)
      at java.base/java.util.regex.Pattern.normalizeSlice(Pattern.java:1517)
      at java.base/java.util.regex.Pattern.normalize(Pattern.java:1475)
      at java.base/java.util.regex.Pattern.compile(Pattern.java:1740)
      at java.base/java.util.regex.Pattern.<init>(Pattern.java:1427)
      at java.base/java.util.regex.Pattern.compile(Pattern.java:1094)
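
The "index 2,length 2" message is consistent with a two-char string (one surrogate pair) being read one position past its end. The following is only a minimal sketch of that String behavior; it does not reproduce the internal logic of Pattern.normalizeSlice, and the class and helper names are hypothetical.

```java
class SurrogateIndexDemo {
    // Hypothetical helper: reports whether codePointAt(index) throws
    // StringIndexOutOfBoundsException for the given string.
    static boolean throwsOutOfBounds(String s, int index) {
        try {
            s.codePointAt(index);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // U+1D434 as a surrogate pair: one codepoint stored in two chars.
        String s = "\ud835\udc34";

        System.out.println(s.length());                             // 2
        System.out.println(s.codePointCount(0, s.length()));        // 1
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d434

        // Index 2 equals the length, so it is out of bounds.
        System.out.println(throwsOutOfBounds(s, 2));                // true
    }
}
```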


      ---------- BEGIN SOURCE ----------
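      import java.util.regex.Pattern;

      public class RegexSupplementaryBugDemo {
          public static void main(String[] args) {
              // U+1D434 written as a regex escape with CANON_EQ: compiles.
              System.out.println("Testing escaped codepoint with CANON_EQ");
              Pattern.compile("\\x{1d434}", Pattern.CANON_EQ);

              // U+1D434 embedded as a literal surrogate pair, no flags: compiles.
              System.out.println("Testing codepoint without CANON_EQ");
              Pattern.compile("\ud835\udc34");

              // Same literal surrogate pair with CANON_EQ: throws
              // StringIndexOutOfBoundsException.
              System.out.println("Testing codepoint with CANON_EQ");
              Pattern.compile("\ud835\udc34", Pattern.CANON_EQ);
          }
      }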

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
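      Use \x{...} notation instead of directly embedding supplementary codepoints. However, this is not a viable option if the text is going to be passed to the Pattern.quote method, because quoted text is matched literally and a \x{...} sequence inside it would not be interpreted as an escape.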
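
A minimal sketch of how the workaround might be automated, assuming the pattern contains no \Q...\E quoted sections and no supplementary codepoint directly after a backslash (the rewrite would change the meaning in both cases). The class SupplementaryEscaper and the method escapeSupplementary are hypothetical names, not part of the JDK.

```java
import java.util.regex.Pattern;

class SupplementaryEscaper {
    // Hypothetical helper: rewrites every supplementary codepoint in the
    // regex as \x{h...h} and copies all other codepoints unchanged.
    // Assumption: the regex has no quoted sections, where the escape
    // would be treated as literal text.
    static String escapeSupplementary(String regex) {
        StringBuilder sb = new StringBuilder(regex.length());
        regex.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("\\x{").append(Integer.toHexString(cp)).append('}');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escapeSupplementary("a\ud835\udc34b");
        System.out.println(escaped); // prints a\x{1d434}b

        // The escaped form avoids a literal surrogate pair in the pattern.
        Pattern.compile(escaped, Pattern.CANON_EQ);
    }
}
```

This does not help with Pattern.quote itself; text that must be matched literally would need its metacharacters escaped individually rather than being wrapped in a quoted section.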

      FREQUENCY : always

            jjose Johny Jose
            pnarayanaswa Praveen Narayanaswamy