JDK-6486934

RegEx case_insensitive match is broken


    • Bug
    • Resolution: Fixed
    • P3
    • 7
    • 5.0, 6
    • core-libs
    • b06
    • generic, x86
    • generic, windows_xp
    • Verified

        The case-folding spec for regex clearly says:

        CASE_INSENSITIVE
         By default, case-insensitive matching assumes that only characters
        in the US-ASCII charset are being matched. Unicode-aware case-insensitive
        matching can be enabled by specifying the UNICODE_CASE flag in conjunction
        with this flag.

        UNICODE_CASE
         When this flag is specified then case-insensitive matching, when enabled
        by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode
        Standard. By default, case-insensitive matching assumes that only characters
        in the US-ASCII charset are being matched.
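
        Read literally, the two flag descriptions above imply the behavior sketched below. This is only an illustration of the spec text: the class name CaseFlagsSketch and the helper matches are made up for this example, and a release affected by this bug may print different values than the spec calls for.

```java
import java.util.regex.Pattern;

public class CaseFlagsSketch {
    // Hypothetical helper: compile with the given flags and test a full match.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        int ciUc = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        int uc = Pattern.UNICODE_CASE;

        // CASE_INSENSITIVE alone: the spec promises folding for US-ASCII only.
        System.out.println("CI,    ASCII:    " + matches("a", "A", ci));
        System.out.println("CI,    Cyrillic: " + matches("\u0430", "\u0410", ci));

        // CASE_INSENSITIVE | UNICODE_CASE: folding follows the Unicode Standard.
        System.out.println("CI+UC, Cyrillic: " + matches("\u0430", "\u0410", ciUc));

        // UNICODE_CASE alone: per the spec it only modifies CASE_INSENSITIVE,
        // so on its own it should not enable any case folding.
        System.out.println("UC,    ASCII:    " + matches("a", "A", uc));
    }
}
```

        Per the spec, the first and third lines should report a match and the second and fourth should not.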

        But our implementation disagrees with our own spec in two ways:

        (1) UNICODE_CASE is mostly treated as UNICODE_CASE_INSENSITIVE,
        which means the match is case-insensitive whether or not
        CASE_INSENSITIVE is enabled. We only "accidentally" follow
        the spec in the character class case, when the specified character is
        in Basic Latin (ASCII) or Latin-1 Supplement (<= 0xff).

        1.4.x does follow the spec; the "regression" started with Tiger. Based
        on the SCCS history, the change was introduced by the fix for
        #4908476 (which I believe was a mistake: the test cases shown in that
        bug report use (?u) alone instead of (?iu)).
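
        The difference between (?u) and (?iu) can be checked with embedded flags directly. The following is a minimal sketch, not the test from #4908476; the class name EmbeddedFlagsSketch and its helper are illustrative only. Under the spec, (?u) alone should leave matching case-sensitive, while (?iu) should fold case per Unicode.

```java
import java.util.regex.Pattern;

public class EmbeddedFlagsSketch {
    // Hypothetical helper: full match using only flags embedded in the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // (?u) is the embedded form of UNICODE_CASE, (?i) of CASE_INSENSITIVE.
        System.out.println("(?u)  ASCII:    " + matches("(?u)a", "A"));
        System.out.println("(?u)  Cyrillic: " + matches("(?u)\u0430", "\u0410"));
        System.out.println("(?i)  ASCII:    " + matches("(?i)a", "A"));
        System.out.println("(?iu) Cyrillic: " + matches("(?iu)\u0430", "\u0410"));
    }
}
```

        A release with the behavior described in (1) would be expected to report a match for the (?u) lines as well, which is what the spec rules out.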

        (2) When CASE_INSENSITIVE is not accompanied by UNICODE_CASE,
        case-insensitive matching is still done for

        a)Class_Single Latin-1 supplement
        b)Class_Range Non-ASCII
        c)BackReference Non-ASCII

        We have had this buggy behavior since day one. It might be OK (really???) to extend
        the interpretation of ASCII a little to cover all characters up to \u00ff, but
        the inconsistency between different constructs is really a big deal.
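
        One way to see the inconsistency is to run the same Latin-1 letter through each construct with CASE_INSENSITIVE alone and compare the answers. This is a small sketch under that assumption; the class name ConstructConsistencySketch, the CASES table and the results helper are invented for illustration. By the spec every construct should give the same answer (no match), whereas the behavior described in (2) would make some constructs match and others not.

```java
import java.util.regex.Pattern;

public class ConstructConsistencySketch {
    // One label/pattern/input triple per regex construct named in the report,
    // all using Latin-1 Supplement letters that differ only by case.
    static final String[][] CASES = {
        {"single char",    "\u00e0",          "\u00c0"},
        {"slice",          "\u00e0\u00e1",    "\u00c0\u00c1"},
        {"class single",   "[\u00e0]",        "\u00c0"},
        {"class range",    "[\u00e0-\u00e5]", "\u00c2"},
        {"back reference", "(\u00e0)\\1",     "\u00e0\u00c0"},
    };

    // Hypothetical helper: full-match result of every case under the given flags.
    static boolean[] results(int flags) {
        boolean[] out = new boolean[CASES.length];
        for (int i = 0; i < CASES.length; i++) {
            out[i] = Pattern.compile(CASES[i][1], flags)
                            .matcher(CASES[i][2])
                            .matches();
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] r = results(Pattern.CASE_INSENSITIVE);
        for (int i = 0; i < CASES.length; i++) {
            System.out.println(CASES[i][0] + ": " + r[i]);
        }
    }
}
```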


        The test case is attached.

        import java.util.regex.*;

        public class Foo {
            public static void main(String[] args) {
                int failCount = 0;
                Pattern pattern;
                Matcher matcher;
                int flags = 0;
                // ASCII              \u0061 "a"
                // Latin-1 Supplement \u00e0 "a" + grave
                // Cyrillic           \u0430 Cyrillic "a"
                String[] patterns = new String[] {
                    // single char
                    "a", "\u00e0", "\u0430",
                    // slice of chars
                    "ab", "\u00e0\u00e1", "\u0430\u0431",
                    // class single
                    "[a]", "[\u00e0]", "[\u0430]",
                    // class range
                    "[a-b]", "[\u00e0-\u00e5]", "[\u0430-\u0431]",
                    // back reference
                    "(a)\\1", "(\u00e0)\\1", "(\u0430)\\1"
                };
                // Inputs that differ from the patterns only by case.
                String[] texts = new String[] {
                    "A", "\u00c0", "\u0410",
                    "AB", "\u00c0\u00c1", "\u0410\u0411",
                    "A", "\u00c0", "\u0410",
                    "B", "\u00c2", "\u0411",
                    "aA", "\u00e0\u00c0", "\u0430\u0410"
                };
                // Expected results for CASE_INSENSITIVE alone: only ASCII folds.
                boolean[] expected = new boolean[] {
                    true, false, false,
                    true, false, false,
                    true, false, false,
                    true, false, false,
                    true, false, false
                };

                // CASE_INSENSITIVE alone: ASCII-only case folding.
                flags = Pattern.CASE_INSENSITIVE;
                for (int i = 0; i < patterns.length; i++) {
                    pattern = Pattern.compile(patterns[i], flags);
                    matcher = pattern.matcher(texts[i]);
                    if (matcher.matches() != expected[i]) {
                        System.out.println("<CI> Failed at " + i);
                        failCount++;
                    }
                }

                // CASE_INSENSITIVE | UNICODE_CASE: every pair should match.
                flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
                for (int i = 0; i < patterns.length; i++) {
                    pattern = Pattern.compile(patterns[i], flags);
                    matcher = pattern.matcher(texts[i]);
                    if (!matcher.matches()) {
                        System.out.println("<CI+UC> Failed at " + i);
                        failCount++;
                    }
                }

                // UNICODE_CASE alone should do nothing: no pair should match.
                flags = Pattern.UNICODE_CASE;
                for (int i = 0; i < patterns.length; i++) {
                    pattern = Pattern.compile(patterns[i], flags);
                    matcher = pattern.matcher(texts[i]);
                    if (matcher.matches()) {
                        System.out.println("<UC> Failed at " + i);
                        failCount++;
                    }
                }
                System.out.println("Total failure :" + failCount);
            }
        }

              sherman Xueming Shen