Summary
Support supplementary characters' case-mappings in java.lang.String
methods that perform case-insensitive comparing/matching.
Problem
String.regionMatches(ignoreCase=true, ...), String.equalsIgnoreCase(), and String.compareToIgnoreCase() are supposed to match/compare strings in a case-insensitive manner. However, their specs and implementations are char based, so they cannot handle supplementary characters correctly. For example,
"\ud83a\udd2e".regionMatches(true, 0, "\ud83a\udd0c", 0, 2)
returns false (conforming to the existing spec), although "\ud83a\udd2e" is the 'ADLAM SMALL LETTER O' character, which has the code point U+1E92E, and "\ud83a\udd0c" is the 'ADLAM CAPITAL LETTER O' character, which has the code point U+1E90C. Thus it should return true if it is to be true to the meaning of "ignore case." This behavior contradicts the fact that each of the following statements returns true:
"\ud83a\udd2e".toUpperCase(Locale.ROOT).equals("\ud83a\udd0c")
Character.toUpperCase(0x1e92e) == 0x1e90c
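The inconsistency can be observed with a short program. This is a minimal sketch: the class name AdlamCaseDemo and its constants are illustrative only, and it assumes a JDK whose Unicode data includes Adlam (JDK 9 or later). The regionMatches result depends on whether the running JDK includes this change, so it is printed without stating an expected value.

```java
import java.util.Locale;

final class AdlamCaseDemo {
    static final String SMALL_O = "\ud83a\udd2e";   // U+1E92E ADLAM SMALL LETTER O
    static final String CAPITAL_O = "\ud83a\udd0c"; // U+1E90C ADLAM CAPITAL LETTER O

    public static void main(String[] args) {
        // Case mapping already understands the supplementary character.
        System.out.println(SMALL_O.toUpperCase(Locale.ROOT).equals(CAPITAL_O));
        System.out.println(Character.toUpperCase(0x1e92e) == 0x1e90c);

        // Case-insensitive matching: false under the char-based spec,
        // true under the code point based spec proposed here.
        System.out.println(SMALL_O.regionMatches(true, 0, CAPITAL_O, 0, 2));
    }
}
```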
Solution
Change the specs of String.regionMatches(boolean, ...), String.equalsIgnoreCase(), and String.compareToIgnoreCase() to perform code point comparison for supplementary characters. Characters in the Basic Multilingual Plane (<= \uFFFF) continue to be compared as the code units returned by the charAt() method.
Although this change alters how the strings are traversed for comparison, the rationale is that these String methods should behave consistently across all characters (code points), whether or not they are in the Basic Multilingual Plane. There is no reason to exclude supplementary characters from case-insensitive string comparison.
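The difference between the two traversals can be illustrated as follows. This is a sketch only: the class CharVersusCodePoint and the helpers foldChar and foldCodePoint are hypothetical names, not JDK APIs. It relies on the fact that surrogate code units have no case mappings of their own.

```java
final class CharVersusCodePoint {
    // Case folding applied to a single UTF-16 code unit.
    static char foldChar(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // The same folding applied to a whole code point.
    static int foldCodePoint(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    public static void main(String[] args) {
        String small = "\ud83a\udd2e";   // U+1E92E
        String capital = "\ud83a\udd0c"; // U+1E90C

        // Code unit view: the low surrogates are left unchanged by folding,
        // so they still differ.
        System.out.println(foldChar(small.charAt(1)) == foldChar(capital.charAt(1)));

        // Code point view: both letters fold to the same value.
        System.out.println(foldCodePoint(small.codePointAt(0))
                == foldCodePoint(capital.codePointAt(0)));
    }
}
```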
Specification
Change the method description of String.regionMatches(boolean, ...) as follows:
* A substring of this {@code String} object is compared to a substring
* of the argument {@code other}. The result is {@code true} if these
- * substrings represent character sequences that are the same, ignoring
- * case if and only if {@code ignoreCase} is true. The substring of
- * this {@code String} object to be compared begins at index
- * {@code toffset} and has length {@code len}. The substring of
- * {@code other} to be compared begins at index {@code ooffset} and
- * has length {@code len}. The result is {@code false} if and only if
- * at least one of the following is true:
- * <ul><li>{@code toffset} is negative.
- * <li>{@code ooffset} is negative.
- * <li>{@code toffset+len} is greater than the length of this
+ * substrings represent Unicode code point sequences that are the same,
+ * ignoring case if and only if {@code ignoreCase} is true.
+ * The sequences {@code tsequence} and {@code osequence} are compared,
+ * where {@code tsequence} is the sequence produced as if by calling
+ * {@code this.substring(toffset, toffset + len).codePoints()} and {@code osequence}
+ * is the sequence produced as if by calling
+ * {@code other.substring(ooffset, ooffset + len).codePoints()}.
+ * The result is {@code true} if and only if all of the following
+ * are true:
+ * <ul><li>{@code toffset} is non-negative.
+ * <li>{@code ooffset} is non-negative.
+ * <li>{@code toffset+len} is less than or equal to the length of this
* {@code String} object.
- * <li>{@code ooffset+len} is greater than the length of the other
+ * <li>{@code ooffset+len} is less than or equal to the length of the other
* argument.
- * <li>{@code ignoreCase} is {@code false} and there is some nonnegative
- * integer <i>k</i> less than {@code len} such that:
- * <blockquote><pre>
- * this.charAt(toffset+k) != other.charAt(ooffset+k)
- * </pre></blockquote>
- * <li>{@code ignoreCase} is {@code true} and there is some nonnegative
- * integer <i>k</i> less than {@code len} such that:
- * <blockquote><pre>
- * Character.toLowerCase(Character.toUpperCase(this.charAt(toffset+k))) !=
- * Character.toLowerCase(Character.toUpperCase(other.charAt(ooffset+k)))
- * </pre></blockquote>
+ * <li>if {@code ignoreCase} is {@code false}, all pairs of corresponding Unicode
+ * code points are equal integer values; or if {@code ignoreCase} is {@code true},
+ * {@link Character#toLowerCase(int) Character.toLowerCase(}
+ * {@link Character#toUpperCase(int)}{@code )} on all pairs of Unicode code points
+ * results in equal integer values.
* </ul>
- * @param len the number of characters to compare.
+ * @param len the number of characters (Unicode code units -
+ * 16bit {@code char} value) to compare.
* @return {@code true} if the specified subregion of this string
* matches the specified subregion of the string argument;
* {@code false} otherwise. Whether the matching is exact
* or case insensitive depends on the {@code ignoreCase}
* argument.
+ * @see #codePoints()
*/
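The conditions in this description could be modeled roughly as below. This is a sketch under stated assumptions, not the JDK implementation: RegionMatchSketch and its methods are hypothetical names, and the region is taken with toffset + len as the end index because String.substring(int, int) expects a begin index and an end index. A non-positive len is treated as an empty region.

```java
final class RegionMatchSketch {
    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    static boolean regionMatches(String t, boolean ignoreCase, int toffset,
                                 String o, int ooffset, int len) {
        // Offsets must be non-negative and both regions must fit.
        // The long arithmetic avoids overflow for extreme len values.
        if (toffset < 0 || ooffset < 0
                || toffset > (long) t.length() - len
                || ooffset > (long) o.length() - len) {
            return false;
        }
        if (len <= 0) {
            return true; // nothing to compare
        }
        int[] tsequence = t.substring(toffset, toffset + len).codePoints().toArray();
        int[] osequence = o.substring(ooffset, ooffset + len).codePoints().toArray();
        if (tsequence.length != osequence.length) {
            return false;
        }
        for (int i = 0; i < tsequence.length; i++) {
            int a = tsequence[i];
            int b = osequence[i];
            if (a == b) {
                continue;
            }
            if (!ignoreCase || fold(a) != fold(b)) {
                return false;
            }
        }
        return true;
    }
}
```

With this model, the two Adlam letters from the Problem section match when ignoreCase is true and do not match when it is false.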
Change the method description of String.equalsIgnoreCase() as follows:
/**
* Compares this {@code String} to another {@code String}, ignoring case
* considerations. Two strings are considered equal ignoring case if they
- * are of the same length and corresponding characters in the two strings
- * are equal ignoring case.
+ * are of the same length and corresponding Unicode code points in the two
+ * strings are equal ignoring case.
*
- * <p> Two characters {@code c1} and {@code c2} are considered the same
+ * <p> Two Unicode code points are considered the same
* ignoring case if at least one of the following is true:
* <ul>
- * <li> The two characters are the same (as compared by the
+ * <li> The two Unicode code points are the same (as compared by the
* {@code ==} operator)
- * <li> Calling {@code Character.toLowerCase(Character.toUpperCase(char))}
- * on each character produces the same result
+ * <li> Calling {@code Character.toLowerCase(Character.toUpperCase(int))}
+ * on each Unicode code point produces the same result
* </ul>
*
* @see #equals(Object)
+ * @see #codePoints()
*/
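The two-part rule for code points could be expressed as a small predicate and applied pairwise. This is an illustrative sketch, not the JDK implementation; EqualsIgnoreCaseSketch and its methods are hypothetical names. Each string is walked with its own index so that the sketch does not depend on corresponding code points occupying the same number of code units.

```java
final class EqualsIgnoreCaseSketch {
    // True if the code points are identical or fold to the same value.
    static boolean sameIgnoringCase(int cp1, int cp2) {
        return cp1 == cp2
                || Character.toLowerCase(Character.toUpperCase(cp1))
                   == Character.toLowerCase(Character.toUpperCase(cp2));
    }

    static boolean equalsIgnoreCase(String a, String b) {
        if (b == null || a.length() != b.length()) {
            return false;
        }
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cp1 = a.codePointAt(i);
            int cp2 = b.codePointAt(j);
            if (!sameIgnoringCase(cp1, cp2)) {
                return false;
            }
            // Advance each index by the width of its own code point.
            i += Character.charCount(cp1);
            j += Character.charCount(cp2);
        }
        return i == a.length() && j == b.length();
    }
}
```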
Change the method description of String.compareToIgnoreCase() as follows:
/**
* Compares two strings lexicographically, ignoring case
* differences. This method returns an integer whose sign is that of
- * calling {@code compareTo} with normalized versions of the strings
+ * calling {@code compareTo} with case folded versions of the strings
* where case differences have been eliminated by calling
- * {@code Character.toLowerCase(Character.toUpperCase(character))} on
- * each character.
+ * {@code Character.toLowerCase(Character.toUpperCase(int))} on
+ * each Unicode code point.
* <p>
* @see java.text.Collator
+ * @see #codePoints()
* @since 1.2
*/
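Read literally, this description says the result has the same sign as compareTo applied to case folded copies of the strings. A minimal sketch of that reading follows; CompareIgnoreCaseSketch, caseFold, and compareIgnoreCase are hypothetical names, and a real implementation would not need to build intermediate strings.

```java
final class CompareIgnoreCaseSketch {
    // Builds a copy of s with every code point case folded.
    static String caseFold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .map(cp -> Character.toLowerCase(Character.toUpperCase(cp)))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Only the sign of the result is meaningful in this sketch.
    static int compareIgnoreCase(String a, String b) {
        return caseFold(a).compareTo(caseFold(b));
    }
}
```

Under this model the Adlam small and capital letters compare as equal, because both fold to the same code point.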
CSR of JDK-8248655: Support supplementary characters in String case insensitive operations (Resolved)