JDK-8365675

Add String Unicode Case-Folding Support

    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P3
    • 26
    • None
    • core-libs
    • None
    • Fix Understood

      Summary

      Case folding is a key operation for case-insensitive matching (e.g., string comparison, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.

      The JDK does not currently expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:

      (1) String.equalsIgnoreCase(String)
          - Unicode-aware, locale-independent.
          - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
          - Limited: does not support 1:M mapping defined in Unicode case folding.

      (2) Character.toLowerCase(int) / Character.toUpperCase(int)
          - Locale-independent, single code point only.
          - No support for 1:M mappings.

      (3) String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)
          - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
          - Intended primarily for presentation/display, not structural case-insensitive matching.
          - Requires converting both strings in full before they can be compared, which is less efficient than comparing in place.


      Example. 1:M mapping, U+00DF (ß)

          - "ß".toUpperCase(Locale.ROOT) → "SS"
          - Case folding produces "ss", matching Unicode caseless comparison rules.

      jshell> "\u00df".equalsIgnoreCase("ss")
      $22 ==> false

      jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
      $24 ==> true
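
The difference between the two results above comes from where the 1:M mapping can be applied. The following is a minimal sketch using only existing JDK methods; the class and helper names (SharpSMappingDemo, simpleRoundTrip, upperRoot) are made up for illustration and are not part of the proposal.

```java
import java.util.Locale;

class SharpSMappingDemo {
    /** Per-code-point round trip, the mapping described for equalsIgnoreCase above. */
    static int simpleRoundTrip(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    /** String-level mapping, which may change the length of the string. */
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int sharpS = 0x00DF; // ß
        // A single code point can only map to a single code point, so ß stays ß.
        System.out.println(simpleRoundTrip(sharpS) == sharpS); // true
        // The String-level conversion applies the 1:M mapping ß -> "SS".
        System.out.println(upperRoot("\u00df"));                // SS
        System.out.println(upperRoot("\u00df").length());       // 2
    }
}
```

Because the per-code-point path never produces two characters from one, "ß" and "ss" can only compare equal once both strings have been converted at the String level.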

      Motivation & Direction

      Adding Unicode-compliant comparison methods to the JDK brings Java in line with other languages and libraries, and makes Unicode-compliant caseless matching simpler and more efficient.

          - Unicode-compliant full case folding.
          - Simpler, stable and more efficient case-less matching without workarounds.
          - Consistency with other programming languages/libraries.

      This enhancement proposes introducing the following comparison APIs in the String class:

          - boolean equalsFoldCase(String anotherString)
          - int compareToFoldCase(String str)
          - Comparator<String> UNICODE_CASEFOLD_ORDER

      These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
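
To make the intended semantics concrete, here is a rough sketch of fold-based comparison on today's JDK. It is an illustration under stated assumptions, not the proposed implementation: the class CaseFoldSketch, its fold helper, and the two-entry FULL_FOLDS table are hypothetical; the table copies only two of the full (1:M) entries from CaseFolding.txt; and the fallback round trip through Character.toUpperCase/toLowerCase only approximates simple case folding.

```java
import java.util.Map;

class CaseFoldSketch {
    // Tiny hand-picked excerpt of the full (1:M) folds in CaseFolding.txt.
    // The real data file contains many more entries.
    private static final Map<Integer, String> FULL_FOLDS = Map.of(
            0x00DF, "ss",   // LATIN SMALL LETTER SHARP S
            0xFB01, "fi");  // LATIN SMALL LIGATURE FI

    /** Hypothetical helper: folds each code point, expanding 1:M entries. */
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String full = FULL_FOLDS.get(cp);
            if (full != null) {
                sb.append(full);
            } else {
                // Approximation of simple folding; not identical to CaseFolding.txt.
                sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp)));
            }
        });
        return sb.toString();
    }

    /** Equality of the folded forms, mirroring the proposed equalsFoldCase. */
    static boolean equalsFoldCase(String a, String b) {
        return b != null && fold(a).equals(fold(b));
    }

    /** Ordering of the folded forms, mirroring the proposed compareToFoldCase. */
    static int compareToFoldCase(String a, String b) {
        return fold(a).compareTo(fold(b));
    }

    public static void main(String[] args) {
        System.out.println(equalsFoldCase("Ma\u00dfe", "MASSE"));     // true
        System.out.println("Ma\u00dfe".equalsIgnoreCase("MASSE"));    // false
        System.out.println(compareToFoldCase("Ma\u00dfe", "MASSE"));  // 0
    }
}
```

The sketch materializes the folded strings only to keep the logic readable; a production implementation could fold and compare incrementally without allocating intermediate strings.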

      Note:

      An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
      However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.


      The New API

          /**
           * Compares this {@code String} to another {@code String} for equality,
           * using <em>Unicode case folding</em>. Two strings are considered equal
           * by this method if their case-folded forms are identical.
           * <p>
           * Case folding is defined by the Unicode Standard in
           * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
           * including 1:M mappings. For example, {@code "Maße".equalsFoldCase("MASSE")}
           * returns {@code true}, since the character {@code U+00DF} (sharp s) folds
           * to {@code "ss"}.
           * <p>
           * Case folding is locale-independent and language-neutral, unlike
           * locale-sensitive transformations such as {@link #toLowerCase()} or
           * {@link #toUpperCase()}. It is intended for caseless matching,
           * searching, and indexing.
           *
           * @apiNote
           * This method is the Unicode-compliant alternative to
           * {@link #equalsIgnoreCase(String)}. It implements full case folding as
           * defined by the Unicode Standard, which may differ from the simpler
           * per-character mapping performed by {@code equalsIgnoreCase}.
           * For example:
           * {@snippet lang=java :
           * String a = "Maße";
           * String b = "MASSE";
           * boolean equalsFoldCase = a.equalsFoldCase(b); // returns true
           * boolean equalsIgnoreCase = a.equalsIgnoreCase(b); // returns false
           * }
           *
           * @param anotherString
           * The {@code String} to compare this {@code String} against
           *
           * @return {@code true} if the given object is not {@code null} and represents
           * the same sequence of characters as this string under Unicode case
           * folding; {@code false} otherwise.
           *
           * @see #compareToFoldCase(String)
           * @see #equalsIgnoreCase(String)
           * @since 26
           */
          public boolean equalsFoldCase(String anotherString)

          /**
           * Compares two strings lexicographically using <em>Unicode case folding</em>.
           * This method returns an integer whose sign is that of calling {@code compareTo}
           * on the Unicode case-folded versions of the strings. Unicode case folding
           * eliminates differences in case according to the Unicode Standard, using the
           * mappings defined in
           * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
           * including 1:M mappings, such as {@code "ß"} → {@code "ss"}.
           * <p>
           * Case folding is a locale-independent, language-neutral form of case mapping,
           * primarily intended for caseless matching. Unlike {@link #compareToIgnoreCase(String)},
           * which applies a simpler locale-insensitive uppercase mapping, this method
           * follows Unicode <em>full</em> case folding, providing stable and
           * consistent results across all environments.
           * <p>
           * Note that this method does <em>not</em> take locale into account, and may
           * produce results that differ from locale-sensitive ordering. Use
           * {@link java.text.Collator} for locale-sensitive comparison.
           *
           * @apiNote
           * This method is the Unicode-compliant alternative to
           * {@link #compareToIgnoreCase(String)}. It implements the <em>full</em> case folding
           * as defined by the Unicode Standard, which may differ from the simpler
           * per-character mapping performed by {@code compareToIgnoreCase}.
           * For example:
           * {@snippet lang=java :
           * String a = "Maße";
           * String b = "MASSE";
           * int cmpFoldCase = a.compareToFoldCase(b); // returns 0
           * int cmpIgnoreCase = a.compareToIgnoreCase(b); // returns > 0
           * }
           *
           * @param str the {@code String} to be compared.
           * @return a negative integer, zero, or a positive integer as the specified
           * String is greater than, equal to, or less than this String,
           * ignoring case considerations by case folding.
           * @see #equalsFoldCase(String)
           * @see #compareToIgnoreCase(String)
           * @see java.text.Collator
           * @since 26
           */
          public int compareToFoldCase(String str)

          /**
           * A Comparator that orders {@code String} objects as by
           * {@link #compareToFoldCase(String) compareToFoldCase()}.
           *
           * @see #compareToFoldCase(String)
           * @since 26
           */
          public static final Comparator<String> UNICODE_CASEFOLD_ORDER;
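
A comparator of this kind matters most in sorted or deduplicating collections. The sketch below approximates it on current JDKs using the toUpperCase(Locale.ROOT)/toLowerCase(Locale.ROOT) workaround from the Summary. FOLD_ORDER_APPROX, CaseFoldOrderSketch and distinctCount are hypothetical names, and the stand-in is not guaranteed to agree with full case folding for every input.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class CaseFoldOrderSketch {
    // Hypothetical stand-in for the proposed String.UNICODE_CASEFOLD_ORDER,
    // built from the upper-then-lower workaround (an approximation only).
    static final Comparator<String> FOLD_ORDER_APPROX =
            Comparator.comparing(
                    (String s) -> s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT));

    /** Counts the entries a TreeSet keeps when ordered by the given comparator. */
    static int distinctCount(Comparator<String> order, List<String> words) {
        Set<String> set = new TreeSet<>(order);
        set.addAll(words);
        return set.size();
    }

    public static void main(String[] args) {
        List<String> words = List.of("Ma\u00dfe", "MASSE", "masse");
        // Per-character comparison keeps "Maße" apart from "MASSE"/"masse".
        System.out.println(distinctCount(String.CASE_INSENSITIVE_ORDER, words)); // 2
        // With the 1:M mapping applied, all three collapse to one entry.
        System.out.println(distinctCount(FOLD_ORDER_APPROX, words));             // 1
    }
}
```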


      Refs

       - Unicode Standard 5.18.4 Caseless Matching (https://www.unicode.org/versions/latest/core-spec/chapter-5/#G21790)
       - Unicode® Standard Annex #44: 5.6 Case and Case Mapping (https://www.unicode.org/reports/tr44/#Casemapping)
       - Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches (https://www.unicode.org/reports/tr18/#Simple_Loose_Matches)
       - Unicode SpecialCasing.txt (https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt)
       - Unicode CaseFolding.txt (https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt)

      Other Languages

      (1) Python str.casefold()

      The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.

      (2) Perl’s fc()

      Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
      Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
      Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.

      (3) ICU4J UCharacter.foldCase (Java)

      Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
      Method Signature (String based): public static String foldCase(String str, int options)
      Method Signature (CharSequence & Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits)
      Key Features:
          - Case Folding: Converts a string to its case-folded equivalent.
          - Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
          - Context Insensitive: The mapping of a character is not affected by surrounding characters.
          - Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
          - Result Length: The resulting string can be longer or shorter than the original.
          - Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.

      (4) u_strFoldCase (C/C++)

      A lower-level C API function for case folding a string.
      Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
      Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.

            sherman Xueming Shen