JDK-8032926

Regular Expressions: CANON_EQ is broken if used with \Q\E


    • Type: Bug
    • Resolution: Unresolved
    • Priority: P3
    • Fix Version: None
    • Affects Version: 7u51
    • Component: core-libs

      FULL PRODUCT VERSION :
      java version "1.7.0_51"
      Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
      Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


      ADDITIONAL OS VERSION INFORMATION :
      Linux mclane 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

      A DESCRIPTION OF THE PROBLEM :
      When the CANON_EQ flag of java.util.regex.Pattern is combined with the quotation escapes \Q and \E, the resulting matcher does not behave as expected: decomposable characters inside the quoted section are not matched.

      Debugging and some analysis of the source code suggest that the problem lies in the Pattern.normalize method. This method appears to scan the original pattern string and replace each decomposable character with a non-capturing group containing the canonically equivalent alternatives. However, it does not seem to account for the case where the replacement falls inside a \Q...\E sequence. Because these escapes appear to be retained, later processing treats the inserted group as literal text, i.e.

      \Qü\E becomes \Q(?:...|...|...)\E

      so that the resulting matcher actually looks for an open parenthesis, a question mark, a colon and so on.
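The literal interpretation described above can be seen without CANON_EQ at all. The following is a minimal sketch, not taken from the JDK sources: the class name QuotedGroupDemo and the helper quotedMatches are hypothetical, and the group (?:a|b) merely stands in for whatever alternatives the normalization step would insert.

```java
import java.util.regex.Pattern;

public class QuotedGroupDemo {

    // Hypothetical helper: wraps the given text in \Q...\E and compiles it
    // without any flags, so the text is treated as a literal string.
    static boolean quotedMatches(String quotedText, String input) {
        return Pattern.compile("\\Q" + quotedText + "\\E").matcher(input).matches();
    }

    public static void main(String[] args) {
        // Stand-in for a group inserted into the pattern by normalization.
        String group = "(?:a|b)";

        // Inside \Q...\E the group syntax is matched character by character.
        System.out.println(quotedMatches(group, "(?:a|b)")); // true
        System.out.println(quotedMatches(group, "a"));       // false

        // Outside \Q...\E the same text is a real non-capturing group.
        System.out.println(Pattern.compile(group).matcher("a").matches()); // true
    }
}
```

If the normalization step does insert such a group between \Q and \E, this is the behavior one would expect to see in p2 of the test program.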


      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the supplied test program.


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      p1 matches: true
      p2 matches: true

      ACTUAL -
      p1 matches: true
      p2 matches: false


      REPRODUCIBILITY :
      This bug can always be reproduced.

      ---------- BEGIN SOURCE ----------
      import java.util.regex.Pattern;
      import java.util.regex.Matcher;

      public class Test
      {
        public static void main (String[] args)
        {
          String test = "\u00fc"; // u umlaut (precomposed, U+00FC)

          // Unquoted pattern: CANON_EQ works as expected.
          Pattern p1 = Pattern.compile ("\u00fc", Pattern.CANON_EQ);
          System.out.println ("p1 matches: " + p1.matcher (test).matches ());

          // Same character inside \Q...\E: expected to match, but does not.
          Pattern p2 = Pattern.compile ("\\Q\u00fc\\E", Pattern.CANON_EQ);
          System.out.println ("p2 matches: " + p2.matcher (test).matches ());
        }

      }


      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      None, other than not using CANON_EQ and \Q...\E together.
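One way to avoid the combination for purely literal patterns might be to normalize both the literal and the input up front and then quote the literal without CANON_EQ. The sketch below assumes that whole-string literal matching is all that is needed. The class name QuotedCanonicalMatch and the helper matchesLiteral are hypothetical, and the choice of NFC is an assumption (any single normalization form applied to both sides should serve the same purpose).

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class QuotedCanonicalMatch {

    // Hypothetical helper, not part of the JDK: brings the literal and the
    // input to the same normalization form (NFC), then matches the quoted
    // literal without the CANON_EQ flag.
    static boolean matchesLiteral(String literal, String input) {
        String nfcLiteral = Normalizer.normalize(literal, Normalizer.Form.NFC);
        String nfcInput = Normalizer.normalize(input, Normalizer.Form.NFC);
        // Pattern.quote wraps the literal in \Q...\E.
        return Pattern.compile(Pattern.quote(nfcLiteral)).matcher(nfcInput).matches();
    }

    public static void main(String[] args) {
        // Precomposed u umlaut against itself.
        System.out.println(matchesLiteral("\u00fc", "\u00fc"));  // true
        // Precomposed u umlaut against 'u' + combining diaeresis (U+0308).
        System.out.println(matchesLiteral("\u00fc", "u\u0308")); // true
    }
}
```

Note that the matcher then operates on the normalized input, so any match offsets refer to the normalized string rather than the original one.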

            Assignee: Xueming Shen (sherman)
            Reporter: Webbug Group (webbuggrp)