JDK-8281315

Unicode, (?i) flag and backreference throwing IndexOutOfBounds Exception

    • b11
    • generic
    • generic
    • Verified

      A DESCRIPTION OF THE PROBLEM :
      I ran into the problem described in this Stack Overflow question: https://stackoverflow.com/questions/16008974/strange-java-unicode-regular-expression-stringindexoutofboundsexception

      When the (?i) flag is combined with a backreference, certain Unicode sequences (most notably emoji) cause a StringIndexOutOfBoundsException to be thrown. One of the answers identifies the cause and proposes a fix, yet the question was asked years ago and the issue persists in the current implementation.

      The issue is not limited to replaceAll; it also occurs when searching the string with Matcher.find().
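
To check whether a particular runtime shows the failure on both code paths, a small probe along the following lines may help. This is only a sketch: the class name BackrefProbe and its method names are hypothetical, and the "$1" replacement string is an arbitrary choice made for illustration. The input is the same three-emoji string used in the report, written as surrogate-pair escapes.

```java
import java.util.regex.Pattern;

final class BackrefProbe {
    // Three U+1F495 characters, each written as a surrogate pair.
    static final String INPUT = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";

    // Pattern from the report: case-insensitive flag plus a repeated backreference.
    static final Pattern PATTERN = Pattern.compile("(?i)(.)\\1{2,}");

    // Returns true if Matcher.find() throws on this runtime.
    static boolean findThrows() {
        try {
            PATTERN.matcher(INPUT).find();
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    // Returns true if Matcher.replaceAll() throws on this runtime.
    static boolean replaceAllThrows() {
        try {
            PATTERN.matcher(INPUT).replaceAll("$1");
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("find throws: " + findThrows());
        System.out.println("replaceAll throws: " + replaceAllThrows());
    }
}
```

On an affected runtime both lines are expected to report true; on a runtime where the problem has been fixed, both should report false.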

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the code in the source section below and observe the issue.



      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Both print statements should produce the same result, true.
      ACTUAL -
      The first call succeeds, but the second one fails, throwing a StringIndexOutOfBoundsException.

      Exception in thread "main" java.lang.StringIndexOutOfBoundsException: index 6, length 6
      at java.base/java.lang.String.checkIndex(String.java:3710)
      at java.base/java.lang.StringUTF16.checkIndex(StringUTF16.java:1624)
      at java.base/java.lang.StringUTF16.charAt(StringUTF16.java:1421)
      at java.base/java.lang.String.charAt(String.java:713)
      at java.base/java.lang.Character.codePointAt(Character.java:8874)
      at java.base/java.util.regex.Pattern$CIBackRef.match(Pattern.java:5077)
      at java.base/java.util.regex.Pattern$Curly.match(Pattern.java:4369)
      at java.base/java.util.regex.Pattern$GroupTail.match(Pattern.java:4832)
      at java.base/java.util.regex.Pattern$CharProperty.match(Pattern.java:3943)
      at java.base/java.util.regex.Pattern$GroupHead.match(Pattern.java:4801)
      at java.base/java.util.regex.Pattern$Start.match(Pattern.java:3620)
      at java.base/java.util.regex.Matcher.search(Matcher.java:1728)
      at java.base/java.util.regex.Matcher.find(Matcher.java:745)
      at com.company.Main.main(Main.java:13)
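
The "index 6, length 6" in the trace is consistent with the input being six UTF-16 code units long: each of the three emoji occupies a surrogate pair, so the failing read was one position past the end of the string. The sketch below only illustrates that arithmetic; the class name CodeUnitCount is hypothetical and the code says nothing about where in the regex engine the overrun originates.

```java
final class CodeUnitCount {
    // Same content as the reproducer's input: three U+1F495 characters.
    static final String LINE = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";

    public static void main(String[] args) {
        // Number of UTF-16 code units (what String.length() and charAt() index).
        System.out.println(LINE.length());                            // 6
        // Number of Unicode code points.
        System.out.println(LINE.codePointCount(0, LINE.length()));    // 3
        // Code units occupied by the first code point.
        System.out.println(Character.charCount(LINE.codePointAt(0))); // 2
    }
}
```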

      ---------- BEGIN SOURCE ----------
      import java.util.regex.Pattern;

      public class Main {
          public static void main(String[] args) {
              String line = "💕💕💕";

              var pattern1 = Pattern.compile("(.)\\1{2,}");
              System.out.println(pattern1.matcher(line).find());

              var pattern2 = Pattern.compile("(?i)(.)\\1{2,}");
              System.out.println(pattern2.matcher(line).find());
          }
      }
      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      Omit the (?i) flag and instead call toLowerCase on the input string before passing it to the regex.
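
A minimal sketch of this workaround might look as follows. The class name CaseFoldWorkaround and the helper hasRepeatedRun are hypothetical, and the use of Locale.ROOT is an assumption made here to keep lowercasing independent of the default locale. Lowercasing the input is not guaranteed to be equivalent to (?i) matching for every character, so treat this as an approximation.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class CaseFoldWorkaround {
    // Same pattern as in the report, but without the (?i) flag.
    private static final Pattern REPEATED = Pattern.compile("(.)\\1{2,}");

    // Hypothetical helper: lowercases the input first, then searches for a
    // character repeated three or more times in a row.
    static boolean hasRepeatedRun(String input) {
        String folded = input.toLowerCase(Locale.ROOT); // assumption: locale-neutral folding
        return REPEATED.matcher(folded).find();
    }

    public static void main(String[] args) {
        // Three U+1F495 characters, as in the reproducer.
        System.out.println(hasRepeatedRun("\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95"));
        // Mixed case run that only matches after lowercasing.
        System.out.println(hasRepeatedRun("aAa"));
        // No repeated run.
        System.out.println(hasRepeatedRun("abc"));
    }
}
```

Because the pattern is compiled without (?i), the case-insensitive backreference code path seen in the stack trace (Pattern$CIBackRef) is not involved.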

      FREQUENCY : always


            igraves Ian Graves
            webbuggrp Webbug Group