Type: Bug
Resolution: Fixed
Priority: P3
Affected Version: 5.0, 6
Resolved In Build: b06
CPU: generic, x86
OS: generic, windows_xp
Verification: Verified
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-2145560 | 6u2 | Xueming Shen | P3 | Resolved | Fixed | b01 |
JDK-2152938 | OpenJDK6 | Xueming Shen | P3 | Closed | Not an Issue | |
The case-folding spec for regex clearly says:
CASE_INSENSITIVE
By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched. Unicode-aware case-insensitive
matching can be enabled by specifying the UNICODE_CASE flag in conjunction
with this flag.
UNICODE_CASE
When this flag is specified then case-insensitive matching, when enabled
by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode
Standard. By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched.
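As a minimal sketch of what the quoted spec text implies (this is not the attached test case), the following probes one ASCII pair, one Latin-1 Supplement pair, and one Cyrillic pair under each flag combination. The class name CaseFlagsDemo and the helper matchesWith are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of the JDK.
class CaseFlagsDemo {

    // Compiles regex with the requested flag combination and tests a full match.
    static boolean matchesWith(boolean caseInsensitive, boolean unicodeCase,
                               String regex, String input) {
        int flags = 0;
        if (caseInsensitive) flags |= Pattern.CASE_INSENSITIVE;
        if (unicodeCase) flags |= Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Each row: lowercase pattern, uppercase text.
        String[][] pairs = {
            {"a", "A"},            // ASCII
            {"\u00e0", "\u00c0"},  // Latin-1 Supplement
            {"\u0430", "\u0410"}   // Cyrillic
        };
        for (String[] p : pairs) {
            System.out.printf("U+%04X: CI=%b UC=%b CI+UC=%b%n",
                (int) p[0].charAt(0),
                matchesWith(true, false, p[0], p[1]),
                matchesWith(false, true, p[0], p[1]),
                matchesWith(true, true, p[0], p[1]));
        }
    }
}
```

Read literally, the spec implies CI=true only for the ASCII pair, UC=false for all three pairs, and CI+UC=true for all three. Any other output on a given JDK points at the kind of discrepancy this report describes.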
But our implementation disagrees with our own spec in two ways:
(1) UNICODE_CASE is mostly treated as UNICODE_CASE_INSENSITIVE,
which means the match is case-insensitive whether or not
CASE_INSENSITIVE is enabled. We only "accidentally" follow
the spec in the character class case when the specified character is
Basic Latin (ASCII) or Latin-1 Supplement (<= 0xff).
1.4.x does follow the spec; the "regression" started with Tiger. Based
on the SCCS history, the change was introduced by the fix for
#4908476 (which I believe was a mistake: the test cases shown in that
bug report use (?u) alone instead of (?iu)).
(2) When CASE_INSENSITIVE is not accompanied by UNICODE_CASE,
case-insensitive matching is still done for
a) Class_Single: Latin-1 Supplement
b) Class_Range: non-ASCII
c) BackReference: non-ASCII
We have had this buggy behavior from day one. It might be OK (really???) to extend
the interpretation of ASCII a little to cover all characters up to \u00ff, but
the inconsistency between different constructs is a real problem.
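The inconsistency between constructs can be made visible with a small probe. The sketch below assumes CASE_INSENSITIVE alone and runs one lower/upper pair through each construct named above. ConstructConsistencyCheck, probe, and consistent are hypothetical names, and the inputs are assumed to be plain letters (no regex metacharacters).

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical probe; not part of the JDK or of the attached test case.
class ConstructConsistencyCheck {

    // Full match with CASE_INSENSITIVE only (no UNICODE_CASE).
    private static boolean ci(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    // Tries the lowercase letter as a pattern against its uppercase form,
    // once per construct. Assumes lower is a letter, not a metacharacter.
    static Map<String, Boolean> probe(char lower, char upper) {
        String lo = String.valueOf(lower);
        String up = String.valueOf(upper);
        String next = String.valueOf((char) (lower + 1));
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("single char", ci(lo, up));
        results.put("slice", ci(lo + lo, up + up));
        results.put("class single", ci("[" + lo + "]", up));
        results.put("class range", ci("[" + lo + "-" + next + "]", up));
        results.put("back reference", ci("(" + lo + ")\\1", lo + up));
        return results;
    }

    // True when every construct gave the same answer for this pair.
    static boolean consistent(Map<String, Boolean> results) {
        return new HashSet<>(results.values()).size() <= 1;
    }

    public static void main(String[] args) {
        char[][] pairs = {{'a', 'A'}, {'\u00e0', '\u00c0'}, {'\u0430', '\u0410'}};
        for (char[] p : pairs) {
            Map<String, Boolean> r = probe(p[0], p[1]);
            System.out.printf("U+%04X consistent=%b %s%n",
                (int) p[0], consistent(r), r);
        }
    }
}
```

On a JDK that follows the spec, every construct should report the same value for a given pair: true for the ASCII pair and false for the two non-ASCII pairs. A mix of true and false within one row is the inconsistency described in (2).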
The test case is attached:
import java.util.regex.*;

public class Foo {
    public static void main(String[] args) {
        int failCount = 0;
        Pattern pattern;
        Matcher matcher;
        int flags = 0;

        // ASCII               \u0061 "a"
        // Latin-1 Supplement  \u00e0 "a" + grave
        // Cyrillic            \u0430 cyrillic "a"
        String[] patterns = new String[] {
            //single char
            "a", "\u00e0", "\u0430",
            //slice of chars
            "ab", "\u00e0\u00e1", "\u0430\u0431",
            //class single
            "[a]", "[\u00e0]", "[\u0430]",
            //class range
            "[a-b]", "[\u00e0-\u00e5]", "[\u0430-\u0431]",
            //back reference
            "(a)\\1", "(\u00e0)\\1", "(\u0430)\\1"
        };
        String[] texts = new String[] {
            "A", "\u00c0", "\u0410",
            "AB", "\u00c0\u00c1", "\u0410\u0411",
            "A", "\u00c0", "\u0410",
            "B", "\u00c2", "\u0411",
            "aA", "\u00e0\u00c0", "\u0430\u0410"
        };
        // expected results for CASE_INSENSITIVE alone: only ASCII matches
        boolean[] expected = new boolean[] {
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false
        };

        flags = Pattern.CASE_INSENSITIVE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches() != expected[i]) {
                System.out.println("<CI> Failed at " + i);
                failCount++;
            }
        }

        // both flags together: every pair should match
        flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (!matcher.matches()) {
                System.out.println("<CI+UC> Failed at " + i);
                failCount++;
            }
        }

        // flag unicode_case alone should do nothing
        flags = Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches()) {
                System.out.println("<UC> Failed at " + i);
                failCount++;
            }
        }
        System.out.println("Total failure :" + failCount);
    }
}
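The description notes that the test cases in #4908476 use (?u) alone rather than (?iu). The same distinction can be checked with embedded flag expressions instead of compile-time flags. This is a minimal sketch under that assumption; EmbeddedFlagCheck and its matches helper are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical check of embedded flags; names are illustrative only.
class EmbeddedFlagCheck {

    // Full match using only the flags embedded in the regex itself.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String upper = "\u0410"; // Cyrillic capital A
        // (?u) alone: UNICODE_CASE without CASE_INSENSITIVE
        System.out.println("(?u)  : " + matches("(?u)\u0430", upper));
        // (?i) alone: case-insensitive, US-ASCII only per the spec
        System.out.println("(?i)  : " + matches("(?i)\u0430", upper));
        // (?iu): Unicode-aware case-insensitive matching
        System.out.println("(?iu) : " + matches("(?iu)\u0430", upper));
    }
}
```

Per the spec text, only the (?iu) line should report a match for this Cyrillic pair. A match under (?u) alone would be the behavior described in (1).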
backported by
- JDK-2145560 RegEx case_insensitive match is broken (Resolved)
- JDK-2152938 RegEx case_insensitive match is broken (Closed)
duplicates
- JDK-6487160 Pattern.UNICODE_CASE makes character class ranges case insensitive (Closed)
relates to
- JDK-4908476 UNICODE_CASE doesn't work (Resolved)