Type: Bug
Resolution: Fixed
Priority: P3
Fix Version/s: 7
Affects Version/s: 5.0, 6
Component/s: core-libs
Labels: verify-pit
Subcomponent: java.util.regex
Resolved In Build: b06
CPU: generic, x86
OS: generic, windows_xp
Verification: Verified

Issue	Fix Version	Assignee	Priority	Status	Resolution	Resolved In Build
JDK-2145560	6u2	Xueming Shen	P3	Resolved	Fixed	b01
JDK-2152938	OpenJDK6	Xueming Shen	P3	Closed	Not an Issue

The case-folding spec in java.util.regex.Pattern clearly says:

CASE_INSENSITIVE
By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched. Unicode-aware case-insensitive
matching can be enabled by specifying the UNICODE_CASE flag in conjunction
with this flag.

UNICODE_CASE
When this flag is specified then case-insensitive matching, when enabled
by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode
Standard. By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched.
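
As a minimal sketch of what the two flag descriptions above imply, the following checks an ASCII pair and a Cyrillic pair under each flag combination. The class name CaseFlagsSketch and the helper matches are made up for illustration. The results noted in the comments are what the spec text requires; an affected release may print something different.

```java
import java.util.regex.Pattern;

public class CaseFlagsSketch {
    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        int ciUc = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // ASCII letters fold under CASE_INSENSITIVE alone (spec: true).
        System.out.println("CI,    a vs A:           " + matches("a", ci, "A"));
        // Cyrillic letters are outside US-ASCII, so CASE_INSENSITIVE alone
        // should not fold them (spec: false).
        System.out.println("CI,    U+0430 vs U+0410: " + matches("\u0430", ci, "\u0410"));
        // Adding UNICODE_CASE enables Unicode-aware folding (spec: true).
        System.out.println("CI+UC, U+0430 vs U+0410: " + matches("\u0430", ciUc, "\u0410"));
    }
}
```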

But our implementation disagrees with our own spec in two ways.

(1) UNICODE_CASE is mostly treated as UNICODE_CASE_INSENSITIVE:
the match is case-insensitive whether or not CASE_INSENSITIVE is
enabled. We only "accidentally" follow the spec in the character
class case, when the specified character is in Basic Latin (ASCII)
or the Latin-1 Supplement (<= 0xff).

1.4.x does follow the spec; the "regression" started with Tiger (5.0). Based
on the SCCS history, the change was introduced by the fix for
#4908476 (which I believe was a mistake: the test cases shown in that
bug report use (?u) alone instead of (?iu)).
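
The difference between (?u) and (?iu) mentioned above can be checked directly with embedded flags. This is only a small sketch, not the test from #4908476; the class name EmbeddedFlagSketch and the helper fullMatch are hypothetical. Per the spec, (?u) on its own should leave matching case-sensitive, so only the (?iu) line is expected to report a match.

```java
import java.util.regex.Pattern;

public class EmbeddedFlagSketch {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String lower = "\u0430"; // Cyrillic small letter a
        String upper = "\u0410"; // Cyrillic capital letter a

        // UNICODE_CASE alone: the spec says this should not enable case folding.
        System.out.println("(?u)  Cyrillic: " + fullMatch("(?u)" + lower, upper));
        System.out.println("(?u)  ASCII:    " + fullMatch("(?u)a", "A"));
        // CASE_INSENSITIVE plus UNICODE_CASE: Unicode-aware folding is expected.
        System.out.println("(?iu) Cyrillic: " + fullMatch("(?iu)" + lower, upper));
    }
}
```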

(2) When CASE_INSENSITIVE is not accompanied by UNICODE_CASE,
case-insensitive matching is still done for

a) Class_Single, Latin-1 Supplement
b) Class_Range, non-ASCII
c) BackReference, non-ASCII

We have had this buggy behavior since day one. It might be OK (really???) to extend
the interpretation of ASCII a little to cover all characters up to \u00ff, but
the inconsistency between different constructs is really a big deal.
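
One way to make the inconsistency visible is to run the same non-ASCII letter pair through each construct and compare the answers. The sketch below does that for a-grave (U+00E0 and U+00C0); the class name ConstructConsistencySketch and its helpers are invented for this illustration. Whatever policy is chosen, all four constructs should agree for a given flag set; under the spec they should all be false for CASE_INSENSITIVE alone and all true once UNICODE_CASE is added.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ConstructConsistencySketch {
    // Latin-1 Supplement pair: a with grave, lower and upper case.
    static final String LOWER = "\u00e0";
    static final String UPPER = "\u00c0";

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Runs the lower-case pattern against upper-case text for each construct.
    static Map<String, Boolean> results(int flags) {
        Map<String, Boolean> r = new LinkedHashMap<>();
        r.put("single char", matches(LOWER, flags, UPPER));
        r.put("class single", matches("[" + LOWER + "]", flags, UPPER));
        r.put("class range", matches("[" + LOWER + "-\u00e5]", flags, UPPER));
        r.put("back reference", matches("(" + LOWER + ")\\1", flags, LOWER + UPPER));
        return r;
    }

    // True if every construct gave the same answer.
    static boolean consistent(Map<String, Boolean> r) {
        return new HashSet<>(r.values()).size() == 1;
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        int ciUc = ci | Pattern.UNICODE_CASE;
        System.out.println("CI:    " + results(ci) + " consistent=" + consistent(results(ci)));
        System.out.println("CI+UC: " + results(ciUc) + " consistent=" + consistent(results(ciUc)));
    }
}
```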

The attached test case follows.

import java.util.regex.*;

public class Foo {
    public static void main(String[] args) {
        int failCount = 0;
        Pattern pattern;
        Matcher matcher;
        int flags = 0;
        // Each group of three covers one construct, in this order:
        //   ASCII              \u0061 "a"
        //   Latin-1 Supplement \u00e0 "a" + grave
        //   Cyrillic           \u0430 cyrillic "a"
        String[] patterns = new String[] {
            // single char
            "a", "\u00e0", "\u0430",
            // slice of chars
            "ab", "\u00e0\u00e1", "\u0430\u0431",
            // class single
            "[a]", "[\u00e0]", "[\u0430]",
            // class range
            "[a-b]", "[\u00e0-\u00e5]", "[\u0430-\u0431]",
            // back reference
            "(a)\\1", "(\u00e0)\\1", "(\u0430)\\1"
        };
        // Upper-case counterparts of the patterns above.
        String[] texts = new String[] {
            "A", "\u00c0", "\u0410",
            "AB", "\u00c0\u00c1", "\u0410\u0411",
            "A", "\u00c0", "\u0410",
            "B", "\u00c2", "\u0411",
            "aA", "\u00e0\u00c0", "\u0430\u0410"
        };
        // With CASE_INSENSITIVE alone, only the ASCII entries should match.
        boolean[] expected = new boolean[] {
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false
        };

        flags = Pattern.CASE_INSENSITIVE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches() != expected[i]) {
                System.out.println("<CI> Failed at " + i);
                failCount++;
            }
        }

        // With both flags, every entry should match.
        flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (!matcher.matches()) {
                System.out.println("<CI+UC> Failed at " + i);
                failCount++;
            }
        }

        // flag unicode_case alone should do nothing, so no entry should match
        flags = Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches()) {
                System.out.println("<UC> Failed at " + i);
                failCount++;
            }
        }
        System.out.println("Total failure :" + failCount);
    }
}

backported by

JDK-2145560 RegEx case_insensitive match is broken

Resolved

JDK-2152938 RegEx case_insensitive match is broken

Closed

duplicates

JDK-6487160 Pattern.UNICODE_CASE makes character class ranges case insensitive

Closed

relates to

JDK-4908476 UNICODE_CASE doesn't work

Resolved

Assignee: Xueming Shen
Reporter: Xueming Shen
Votes: 0
Watchers: 0
Created: 2006-10-26 15:49
Updated: 2017-05-16 16:19
Resolved: 2011-03-07 16:10
Imported: 17/Sep/12 11:21 PM
Indexed: 22/Aug/12 3:16 AM
