Type: Bug
Resolution: Unresolved
Priority: P4
Affected versions: 8, 25
OS: generic
CPU: generic
ADDITIONAL SYSTEM INFORMATION :
Reproducible at least from OpenJDK 1.8.0_452 to 24.0.1.
Tested on Ubuntu Linux, but the problem is probably not OS-specific.
A DESCRIPTION OF THE PROBLEM :
When a `Pattern` is compiled with `CASE_INSENSITIVE | UNICODE_CASE`, and the pattern contains a character class with a *range* of non-ASCII characters, and one of the characters in that range case-folds to an ASCII character, the `Pattern` does *not* match the ASCII letters it should match.
For example, the character class in the pattern `"[\u017F-\u0180]"` contains `\u017F` LATIN SMALL LETTER LONG S, which case-folds to 's' (ASCII). It should therefore match a single `"s"` or `"S"`, but it does not.
* If the character appears in the character class as a single element rather than as part of a range, the pattern matches.
* If the range contains the ASCII characters `s` or `S` instead, the pattern matches.
* In every variant, the pattern does match `"\u017F"` itself.
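The folding relation described above can be checked directly with `Character`. The following is a minimal sketch, not the code path `Pattern` uses internally: the class name `LongSFold` and the helper `simpleFold` are made up for illustration, and the helper approximates a simple case fold as `toLowerCase(toUpperCase(cp))`.

```java
public class LongSFold {
    // Hypothetical helper: approximates a simple case fold by upper-casing
    // and then lower-casing the code point. This is an assumption made for
    // illustration, not the algorithm java.util.regex is documented to use.
    static int simpleFold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    public static void main(String[] args) {
        int longS = 0x017F; // LATIN SMALL LETTER LONG S

        // Print the upper-case mapping and the approximated fold of U+017F.
        System.out.printf("upper(U+%04X) = U+%04X%n",
                longS, Character.toUpperCase(longS));
        System.out.printf("fold(U+%04X)  = U+%04X%n",
                longS, simpleFold(longS));

        // Under this approximation, 's', 'S' and U+017F share one fold.
        System.out.println(simpleFold('s') == simpleFold(longS));
        System.out.println(simpleFold('S') == simpleFold(longS));
    }
}
```

If the three characters share one fold, as this check suggests, a case-insensitive range containing U+017F would be expected to accept `s` and `S` as well.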
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Save the provided "Test Case Code" to `BugUnicodeCase.java`.
2. Compile with `javac BugUnicodeCase.java`
3. Run with `java -cp . BugUnicodeCase`
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
All 21 lines print `true`.
ACTUAL -
```
true
...
true // 18 so far
false
false
true
```
---------- BEGIN SOURCE ----------
```java
import java.util.regex.Pattern;
import static java.util.regex.Pattern.*;
public class BugUnicodeCase {
    public static void main(String[] args) {
        // U+017F is LATIN SMALL LETTER LONG S, which folds to 's' in CaseFolding.txt
        // Expected: true everywhere.
        // Actual: the two marked lines for p7 with "s" and "S" are false.
        Pattern p1 = Pattern.compile("s", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p1.matcher("s").matches());
        System.out.println(p1.matcher("S").matches());
        System.out.println(p1.matcher("\u017F").matches());
        Pattern p2 = Pattern.compile("S", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p2.matcher("s").matches());
        System.out.println(p2.matcher("S").matches());
        System.out.println(p2.matcher("\u017F").matches());
        Pattern p3 = Pattern.compile("\u017F", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p3.matcher("s").matches());
        System.out.println(p3.matcher("S").matches());
        System.out.println(p3.matcher("\u017F").matches());
        Pattern p4 = Pattern.compile("[p-u]", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p4.matcher("s").matches());
        System.out.println(p4.matcher("S").matches());
        System.out.println(p4.matcher("\u017F").matches());
        Pattern p5 = Pattern.compile("[P-U]", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p5.matcher("s").matches());
        System.out.println(p5.matcher("S").matches());
        System.out.println(p5.matcher("\u017F").matches());
        Pattern p6 = Pattern.compile("[\u017F\u0180]", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p6.matcher("s").matches());
        System.out.println(p6.matcher("S").matches());
        System.out.println(p6.matcher("\u017F").matches());
        Pattern p7 = Pattern.compile("[\u017F-\u0180]", CASE_INSENSITIVE | UNICODE_CASE);
        System.out.println(p7.matcher("s").matches()); // false!
        System.out.println(p7.matcher("S").matches()); // false!
        System.out.println(p7.matcher("\u017F").matches());
    }
}
```
---------- END SOURCE ----------
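Since the enumerated class `"[\u017F\u0180]"` (p6) matches where the range form (p7) does not, one possible workaround is to expand a small range into an explicit list of code points before compiling. The sketch below is only an illustration under that assumption: `ExpandRangeWorkaround` and `expandRange` are hypothetical names, and the approach is impractical for large ranges.

```java
import java.util.regex.Pattern;

public class ExpandRangeWorkaround {
    // Hypothetical helper: builds a character class that lists every code
    // point from lo to hi individually, using the \x{h...h} escape so that
    // no code point is misread as a regex metacharacter.
    static String expandRange(int lo, int hi) {
        StringBuilder sb = new StringBuilder("[");
        for (int cp = lo; cp <= hi; cp++) {
            sb.append("\\x{").append(Integer.toHexString(cp)).append('}');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Enumerated equivalent of the range U+017F to U+0180.
        String cls = expandRange(0x017F, 0x0180);
        Pattern p = Pattern.compile(cls,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        // Same three inputs as in the test case above.
        System.out.println(p.matcher("s").matches());
        System.out.println(p.matcher("S").matches());
        System.out.println(p.matcher("\u017F").matches());
    }
}
```

This only sidesteps the range handling; it does not change how `Pattern` treats ranges written with `-`.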