
Type: CSR
Resolution: Approved
Priority: P4
Fix Version/s: 16
Component/s: core-libs
Labels:
None

Subcomponent:
java.lang
Compatibility Kind:

behavioral
Compatibility Risk:
low
Compatibility Risk Description:

Applications that depend on the current behavior could break for supplementary code points that have case mappings. The operations proposed in this CSR are how these methods should have behaved when supplementary character support was introduced in the JDK. Thus, even though this is technically an incompatibility, very few applications are expected to be affected.
Interface Kind:

Java API
Scope:
SE

Summary

Support supplementary characters' case mappings in the java.lang.String methods that perform case-insensitive comparison and matching.

Problem

String.regionMatches(ignoreCase=true, ...), String.equalsIgnoreCase(), and String.compareToIgnoreCase() are supposed to match or compare strings in a case-insensitive manner. However, their specifications and implementations are char-based, so they cannot handle supplementary characters correctly. For example,

"\ud83a\udd2e".regionMatches(true, 0, "\ud83a\udd0c", 0, 2)

returns false (conforming to the existing spec), although "\ud83a\udd2e" is ADLAM SMALL LETTER O (code point U+1E92E) and "\ud83a\udd0c" is ADLAM CAPITAL LETTER O (code point U+1E90C). It should return true if the method is to be true to the meaning of "ignore case". This behavior is inconsistent with the following expressions:

"\ud83a\udd2e".toUpperCase(Locale.ROOT).equals("\ud83a\udd0c")
Character.toUpperCase(0x1e92e) == 0x1e90c

Each of these expressions evaluates to true.
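The mismatch can be reproduced independently of the JDK version by spelling out the char-based folding that the existing spec describes. The following is only an illustrative sketch: the class `LegacyCharFoldDemo` and the method `charBasedEqualsIgnoreCase` are hypothetical names, not JDK code, and the second printed line assumes a JDK whose Unicode data includes the Adlam case mappings.

```java
// Hypothetical demo class; it mimics the char-based rule, not the JDK source.
final class LegacyCharFoldDemo {

    // Compares two strings one UTF-16 code unit at a time, folding each
    // char with Character.toLowerCase(Character.toUpperCase(char)).
    static boolean charBasedEqualsIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            char c1 = a.charAt(i);
            char c2 = b.charAt(i);
            if (c1 == c2) {
                continue;
            }
            // Surrogate chars have no case mapping, so they fold to themselves.
            char f1 = Character.toLowerCase(Character.toUpperCase(c1));
            char f2 = Character.toLowerCase(Character.toUpperCase(c2));
            if (f1 != f2) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String small = "\ud83a\udd2e";   // U+1E92E ADLAM SMALL LETTER O
        String capital = "\ud83a\udd0c"; // U+1E90C ADLAM CAPITAL LETTER O

        // The low surrogates differ and cannot be folded: prints false.
        System.out.println(charBasedEqualsIgnoreCase(small, capital));
        // The int overload sees the whole code point.
        System.out.println(Character.toUpperCase(0x1e92e) == 0x1e90c);
        // BMP letters are unaffected: prints true.
        System.out.println(charBasedEqualsIgnoreCase("abc", "ABC"));
    }
}
```

The char overloads only ever see one half of a surrogate pair, which is why the char-based rule cannot recognize the case mapping.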

Solution

Change the specifications of String.regionMatches(boolean, ...), String.equalsIgnoreCase(), and String.compareToIgnoreCase() so that supplementary characters are compared by code point. Characters in the Basic Multilingual Plane (<= \uFFFF) continue to be compared as code units obtained from the charAt() method.

Although this change alters the semantics of how the strings are traversed for comparison, the rationale is that these String methods should behave consistently across all characters (code points), whether or not they are in the Basic Multilingual Plane. There is no reason to exclude supplementary characters from case-insensitive string comparison.
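One way to picture the traversal described in this section is a loop that compares code units from charAt() and switches to code points only when both strings hold a valid surrogate pair at the current index. This is a minimal sketch under that assumption; `MixedTraversalSketch`, `fold`, and `sameIgnoringCase` are hypothetical names, and the actual JDK implementation may be structured differently.

```java
// Hypothetical sketch of the traversal; not the JDK implementation.
final class MixedTraversalSketch {

    // Case folding applied to a code point (also valid for BMP values).
    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    // True if a valid surrogate pair starts at index i of s.
    static boolean pairAt(String s, int i) {
        return i + 1 < s.length()
                && Character.isHighSurrogate(s.charAt(i))
                && Character.isLowSurrogate(s.charAt(i + 1));
    }

    static boolean sameIgnoringCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int i = 0;
        while (i < a.length()) {
            if (pairAt(a, i) && pairAt(b, i)) {
                // Supplementary characters: compare whole code points.
                int cp1 = Character.toCodePoint(a.charAt(i), a.charAt(i + 1));
                int cp2 = Character.toCodePoint(b.charAt(i), b.charAt(i + 1));
                if (cp1 != cp2 && fold(cp1) != fold(cp2)) {
                    return false;
                }
                i += 2;
            } else {
                // BMP characters (or unpaired surrogates): compare code units.
                char c1 = a.charAt(i);
                char c2 = b.charAt(i);
                if (c1 != c2 && fold(c1) != fold(c2)) {
                    return false;
                }
                i++;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameIgnoringCase("a\ud83a\udd2e", "A\ud83a\udd0c"));
        System.out.println(sameIgnoringCase("Hello", "hELLO"));
        System.out.println(sameIgnoringCase("abc", "abd"));
    }
}
```

With this traversal, BMP-only strings are compared exactly as before, so the behavioral change is confined to strings that contain supplementary characters with case mappings.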

Specification

Change the description of the String.regionMatches(boolean, ...) method as follows:

   * A substring of this {@code String} object is compared to a substring
   * of the argument {@code other}. The result is {@code true} if these
-  * substrings represent character sequences that are the same, ignoring
-  * case if and only if {@code ignoreCase} is true. The substring of 
-  * this {@code String} object to be compared begins at index
-  * {@code toffset} and has length {@code len}. The substring of 
-  * {@code other} to be compared begins at index {@code ooffset} and
-  * has length {@code len}. The result is {@code false} if and only if 
-  * at least one of the following is true:
-  * <ul><li>{@code toffset} is negative.
-  * <li>{@code ooffset} is negative.
-  * <li>{@code toffset+len} is greater than the length of this
+  * substrings represent Unicode code point sequences that are the same,
+  * ignoring case if and only if {@code ignoreCase} is true.
+  * The sequences {@code tsequence} and {@code osequence} are compared,
+  * where {@code tsequence} is the sequence produced as if by calling
+  * {@code this.substring(toffset, toffset + len).codePoints()} and {@code osequence}
+  * is the sequence produced as if by calling
+  * {@code other.substring(ooffset, ooffset + len).codePoints()}.
+  * The result is {@code true} if and only if all of the following
+  * are true:
+  * <ul><li>{@code toffset} is non-negative.
+  * <li>{@code ooffset} is non-negative.
+  * <li>{@code toffset+len} is less than or equal to the length of this
   * {@code String} object.
-  * <li>{@code ooffset+len} is greater than the length of the other
+  * <li>{@code ooffset+len} is less than or equal to the length of the other
   * argument.
-  * <li>{@code ignoreCase} is {@code false} and there is some nonnegative
-  * integer <i>k</i> less than {@code len} such that:
-  * <blockquote><pre>
-  * this.charAt(toffset+k) != other.charAt(ooffset+k)
-  * </pre></blockquote>
-  * <li>{@code ignoreCase} is {@code true} and there is some nonnegative
-  * integer <i>k</i> less than {@code len} such that:
-  * <blockquote><pre>
-  * Character.toLowerCase(Character.toUpperCase(this.charAt(toffset+k))) != 
-  * Character.toLowerCase(Character.toUpperCase(other.charAt(ooffset+k)))
-  * </pre></blockquote>
+  * <li>if {@code ignoreCase} is {@code false}, all pairs of corresponding Unicode
+  * code points are equal integer values; or if {@code ignoreCase} is {@code true},
+  * {@link Character#toLowerCase(int) Character.toLowerCase(}
+  * {@link Character#toUpperCase(int)}{@code )} on all pairs of Unicode code points
+  * results in equal integer values.
   * </ul>

-  * @param   len          the number of characters to compare.
+  * @param   len          the number of characters (Unicode code units -
+  *                       16bit {@code char} value) to compare.
   * @return  {@code true} if the specified subregion of this string
   *          matches the specified subregion of the string argument;
   *          {@code false} otherwise. Whether the matching is exact
   *          or case insensitive depends on the {@code ignoreCase}
   *          argument.
+  * @see     #codePoints()
   */
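The revised wording can be read as the following reference-style sketch, which builds the two code point sequences with codePoints() and compares them pairwise. It is an illustration only: `RegionMatchesSketch` and its static `regionMatches` are hypothetical, the receiver is passed explicitly as `self`, and the end index is written as `toffset + len` because `String.substring` takes a begin index and an end index. Treating a non-positive `len` as a trivially matching empty region is an assumption of this sketch.

```java
// Hypothetical reference-style reading of the revised spec; not JDK code.
final class RegionMatchesSketch {

    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    static boolean regionMatches(String self, boolean ignoreCase, int toffset,
                                 String other, int ooffset, int len) {
        // Bounds conditions from the bullet list; long arithmetic avoids overflow.
        if (toffset < 0 || ooffset < 0
                || (long) toffset + len > self.length()
                || (long) ooffset + len > other.length()) {
            return false;
        }
        if (len <= 0) {
            return true; // assumption: an empty region matches trivially
        }
        int[] tsequence = self.substring(toffset, toffset + len).codePoints().toArray();
        int[] osequence = other.substring(ooffset, ooffset + len).codePoints().toArray();
        if (tsequence.length != osequence.length) {
            return false;
        }
        for (int k = 0; k < tsequence.length; k++) {
            if (tsequence[k] == osequence[k]) {
                continue;
            }
            if (!ignoreCase || fold(tsequence[k]) != fold(osequence[k])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(regionMatches("\ud83a\udd2e", true, 0, "\ud83a\udd0c", 0, 2));
        System.out.println(regionMatches("\ud83a\udd2e", false, 0, "\ud83a\udd0c", 0, 2));
        System.out.println(regionMatches("Hello World", true, 6, "WORLD", 0, 5));
    }
}
```

Note that `len` still counts 16-bit code units, so the Adlam example needs `len` equal to 2 to cover a single supplementary character.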

Change the description of the String.equalsIgnoreCase() method as follows:

  /**
   * Compares this {@code String} to another {@code String}, ignoring case
   * considerations.  Two strings are considered equal ignoring case if they
-  * are of the same length and corresponding characters in the two strings
-  * are equal ignoring case.
+  * are of the same length and corresponding Unicode code points in the two
+  * strings are equal ignoring case.
   *
-  * <p> Two characters {@code c1} and {@code c2} are considered the same
+  * <p> Two Unicode code points are considered the same
   * ignoring case if at least one of the following is true:
   * <ul>
-  *   <li> The two characters are the same (as compared by the
+  *   <li> The two Unicode code points are the same (as compared by the
   *        {@code ==} operator)
-  *   <li> Calling {@code Character.toLowerCase(Character.toUpperCase(char))}
-  *        on each character produces the same result
+  *   <li> Calling {@code Character.toLowerCase(Character.toUpperCase(int))}
+  *        on each Unicode code point produces the same result
   * </ul>
   *

   * @see  #equals(Object)
+  * @see  #codePoints()
   */

Change the description of the String.compareToIgnoreCase() method as follows:

  /**
   * Compares two strings lexicographically, ignoring case
   * differences. This method returns an integer whose sign is that of
-  * calling {@code compareTo} with normalized versions of the strings
+  * calling {@code compareTo} with case folded versions of the strings
   * where case differences have been eliminated by calling
-  * {@code Character.toLowerCase(Character.toUpperCase(character))} on
-  * each character.
+  * {@code Character.toLowerCase(Character.toUpperCase(int))} on
+  * each Unicode code point.
   * <p>

   * @see     java.text.Collator
+  * @see     #codePoints()
   * @since   1.2
   */
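The "case folded versions" wording can be illustrated by folding every code point first and then delegating to compareTo. This is a sketch of the specified result sign only, not of how the JDK computes it; `CompareIgnoreCaseSketch`, `foldCase`, and `compareIgnoreCase` are hypothetical names.

```java
// Hypothetical illustration of the specified sign; not the JDK implementation.
final class CompareIgnoreCaseSketch {

    // Builds a copy of s with every code point replaced by
    // Character.toLowerCase(Character.toUpperCase(codePoint)).
    static String foldCase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .map(cp -> Character.toLowerCase(Character.toUpperCase(cp)))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // The sign of the result is that of compareTo on the folded strings.
    static int compareIgnoreCase(String a, String b) {
        return foldCase(a).compareTo(foldCase(b));
    }

    public static void main(String[] args) {
        // Adlam small vs. capital O fold to the same code point.
        System.out.println(compareIgnoreCase("\ud83a\udd2e", "\ud83a\udd0c") == 0);
        System.out.println(compareIgnoreCase("apple", "BANANA") < 0);
        System.out.println(compareIgnoreCase("b", "A") > 0);
    }
}
```

Building the folded strings allocates, so this form is useful mainly as a specification aid or as a test oracle.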


Attachments:
specdiff_v2.zip (115 kB, 2020-07-14 14:22)
specdiff.zip (214 kB, 2020-07-13 16:06)

csr of

JDK-8248655 Support supplementary characters in String case insensitive operations

Resolved
