JDK-4829857

java.util.regex.Pattern does not support Unicode classes Pi, Pf, LC


    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • Fix Version: 6
    • Affects Version: 1.4.1, 6
    • Component: core-libs
    • Resolved In Build: b14
    • CPU: generic, x86
    • OS: generic, windows_2000

      Name: rmT116609 Date: 03/10/2003


      FULL PRODUCT VERSION :
      java version "1.4.1_02"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_02-b06)
      Java HotSpot(TM) Client VM (build 1.4.1_02-b06, mixed mode)


      FULL OS VERSION :
      Microsoft Windows 2000 [Version 5.00.2195] (sp2)

      A DESCRIPTION OF THE PROBLEM :
      The Javadoc for java.util.regex.Pattern says:
      "The supported blocks and categories are those of The Unicode Standard, Version 3.0. ... The category names are those defined in table 4-5 of the Standard (p. 88), both normative and informative."
      http://java.sun.com/j2se/1.4.1/docs/api/java/util/regex/Pattern.html#ubc

      The categories listed in this table include
        Pi = Punctuation, initial quote
        Pf = Punctuation, final quote
      http://www.unicode.org/book/ch04.pdf (page 88)

      These are used to identify quotation marks as listed in
      http://www.unicode.org/Public/UNIDATA/PropList.txt

      However, categories Pi and Pf are not supported by java.util.regex.Pattern.
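For comparison, java.lang.Character exposes the same two categories as the constants INITIAL_QUOTE_PUNCTUATION and FINAL_QUOTE_PUNCTUATION. The following is a minimal sketch, not part of the original report, that checks how Character classifies the quotation marks used in the test case. The class and method names (QuoteCategories, isInitialQuote, isFinalQuote) are hypothetical.

```java
final class QuoteCategories {
    // True if the character's general category is Pi (initial quote punctuation).
    static boolean isInitialQuote(char c) {
        return Character.getType(c) == Character.INITIAL_QUOTE_PUNCTUATION;
    }

    // True if the character's general category is Pf (final quote punctuation).
    static boolean isFinalQuote(char c) {
        return Character.getType(c) == Character.FINAL_QUOTE_PUNCTUATION;
    }

    public static void main(String[] args) {
        // U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK
        System.out.println("U+201C is Pi: " + isInitialQuote('\u201C'));
        System.out.println("U+201D is Pf: " + isFinalQuote('\u201D'));
    }
}
```

If Character reports these categories while Pattern rejects the names, the gap is in the regex category lookup rather than in the underlying character data.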


      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Evaluate the following expression:
      Pattern.compile("\\p{Pi}[^\\p{Pf}]*\\p{Pf}").matcher("\u201Cquoted text\u201D").matches()

      It throws:
      java.util.regex.PatternSyntaxException: Unknown character category {Pi} near index 5
      \p{Pi}[^\p{Pf}]*\p{Pf}
           ^

      ERROR MESSAGES/STACK TRACES THAT OCCUR :
      java.util.regex.PatternSyntaxException: Unknown character category {Pi} near index 5
      \p{Pi}[^\p{Pf}]*\p{Pf}
           ^
      at java.util.regex.Pattern.error(Pattern.java:1489)
      at java.util.regex.Pattern.familyError(Pattern.java:2160)
      at java.util.regex.Pattern.retrieveCategoryNode(Pattern.java:2151)
      at java.util.regex.Pattern.family(Pattern.java:2123)
      at java.util.regex.Pattern.sequence(Pattern.java:1559)
      at java.util.regex.Pattern.expr(Pattern.java:1506)
      at java.util.regex.Pattern.compile(Pattern.java:1274)
      at java.util.regex.Pattern.<init>(Pattern.java:1030)
      at java.util.regex.Pattern.compile(Pattern.java:777)
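
Because the failure surfaces as a PatternSyntaxException at compile time, code that must run on both affected and fixed releases could probe for support before relying on a category name. The following is a minimal sketch under that assumption; CategoryProbe and supportsCategory are hypothetical names, not part of the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class CategoryProbe {
    // Returns true if this runtime's Pattern accepts \p{name} as a category.
    // The name is assumed to be a plain category token such as "Pi" or "Lu".
    static boolean supportsCategory(String name) {
        try {
            Pattern.compile("\\p{" + name + "}");
            return true;
        } catch (PatternSyntaxException e) {
            // Thrown for names the regex engine does not recognize.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Pi supported: " + supportsCategory("Pi"));
        System.out.println("Pf supported: " + supportsCategory("Pf"));
    }
}
```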


      REPRODUCIBILITY :
      This bug can always be reproduced.

      ---------- BEGIN SOURCE ----------
      import java.util.regex.Pattern;

      public class PatternPiPf {
        public static void main(String[] ignore) {
          // Expected to print true; on 1.4.1_02 compile() throws PatternSyntaxException.
          System.out.println(Pattern.compile("\\p{Pi}[^\\p{Pf}]*\\p{Pf}")
              .matcher("\u201Cquoted text\u201D").matches());
        }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      Use data from http://www.unicode.org/Public/UNIDATA/PropList.txt to list the quotation marks explicitly:

      import java.util.regex.Pattern;

      public class PatternPiPfWorkaround {
        public static void main(String[] ignore) {
          // Original pattern, which fails to compile:
          // System.out.println(Pattern.compile("\\p{Pi}[^\\p{Pf}]*\\p{Pf}")
          //     .matcher("\u201Cquoted text\u201D").matches());
          // Same pattern with the initial and final quotation marks spelled out.
          System.out.println(Pattern.compile("[\u00AB\u2018\u201B\u201C\u201F\u2039][^\u00BB\u2019\u201D\u203A]*[\u00BB\u2019\u201D\u203A]")
              .matcher("\u201Cquoted text\u201D").matches());
        }
      }
      (Review ID: 182377)
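
As an alternative to hard-coding the characters, the two classes could be generated from java.lang.Character at run time. The following is a minimal sketch, not the submitter's workaround. It assumes a Java 5 or later runtime (for StringBuilder and String.format), covers only the Basic Multilingual Plane, and uses the hypothetical names QuoteClassBuilder, classBody and quotedText.

```java
import java.util.regex.Pattern;

final class QuoteClassBuilder {
    // Builds the body of a regex character class (without the brackets)
    // listing every BMP character whose Character.getType equals the given type.
    static String classBody(int type) {
        StringBuilder sb = new StringBuilder();
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (Character.getType((char) c) == type) {
                // Emit a four-digit hexadecimal regex escape so that no
                // character is interpreted as regex syntax.
                sb.append(String.format("\\u%04X", c));
            }
        }
        return sb.toString();
    }

    // Equivalent in intent to the pattern in the report: an initial quote,
    // any run of characters that are not final quotes, then a final quote.
    static Pattern quotedText() {
        String pi = classBody(Character.INITIAL_QUOTE_PUNCTUATION);
        String pf = classBody(Character.FINAL_QUOTE_PUNCTUATION);
        return Pattern.compile("[" + pi + "][^" + pf + "]*[" + pf + "]");
    }

    public static void main(String[] args) {
        System.out.println(quotedText().matcher("\u201Cquoted text\u201D").matches());
    }
}
```

The generated classes follow whatever Unicode data the running JDK ships, so they may contain more characters than the hand-written list above.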
      ======================================================================

            Assignee: Martin Buchholz
            Reporter: Ranjith Mandala (Inactive)