JDK-8365675

Add String Unicode Case-Folding Support #26892


    • Enhancement
    • Resolution: Unresolved
    • P3
    • 26
    • None
    • core-libs
    • None
    • Fix Understood

      Summary

      Case folding is a key operation for case-insensitive matching (e.g., string equality, hashing, indexing, or regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.

      Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:

      (1) String.equalsIgnoreCase(String)

          - Unicode-aware, locale-independent.
          - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
          - Limited: does not support the 1:M mappings defined in Unicode case folding.

      (2) Character.toLowerCase(int) / Character.toUpperCase(int)

          - Locale-independent, single code point only.
          - No support for 1:M mappings.

      (3) String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)

          - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
          - Intended primarily for presentation/display, not structural case-insensitive matching.
          - Not fully aligned with Unicode case folding rules.
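
The gap between the per-code-point and string-level mappings listed above can be seen with U+00DF. The following is a minimal sketch that uses only existing JDK methods; the class and helper names (`ExistingCaseApis`, `upperCodePoint`, `upperString`) are made up for illustration and are not part of the proposal.

```java
import java.util.Locale;

class ExistingCaseApis {
    // Per-code-point mapping: U+00DF has no single-code-point uppercase form,
    // so Character.toUpperCase(int) returns it unchanged.
    static int upperCodePoint(int cp) {
        return Character.toUpperCase(cp);
    }

    // String-level mapping applies the 1:M rules and may change the length.
    static String upperString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(upperCodePoint(0x00DF))); // df
        System.out.println(upperString("\u00df"));                       // SS
        System.out.println("\u00df".equalsIgnoreCase("ss"));             // false
    }
}
```

Because equalsIgnoreCase() works one character at a time, it cannot match a one-character string against a two-character one, which is why the last line prints false.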

      Examples of differences

      Some cases where current APIs differ from Unicode case folding:

      (1) Greek sigma forms

          - U+03A3 (Σ), U+03C2 (ς), U+03C3 (σ)
          - equalsIgnoreCase() matches correctly
          - toUpperCase().toLowerCase() does not unify final sigma (ς) with normal sigma (σ)
          - Case folding maps all forms consistently.

      jshell> "ΜΙΚΡΟΣ Σ".equalsIgnoreCase("μικροσ σ")
      $20 ==> true

      jshell> "ΜΙΚΡΟΣ Σ".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("μικροσ σ")
      $21 ==> false

      (2) 1:M mappings, e.g. U+00DF (ß)

          - "ß".toUpperCase(Locale.ROOT) → "SS"
          - Case folding produces "ss", matching Unicode caseless comparison rules.

      jshell> "\u00df".equalsIgnoreCase("ss")
      $22 ==> false

      jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
      $24 ==> true


      Motivation & Direction

      Adding a direct API in the JDK aligns Java with other languages and makes Unicode-compliant case-less matching simpler and more efficient.

          - Unicode-compliant full case folding.
          - Simpler, stable and more efficient case-less matching without workarounds.
          - Consistency with other programming languages/libraries (Python str.casefold(), Perl fc(), ICU4J UCharacter.foldCase(), etc.).

      The initial proposal included a String.toCaseFold() method returning a new case-folded string.
      However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.

      Instead, the PR now introduces only the comparison APIs,

          - boolean equalsCaseFold(String anotherString)
          - int compareToCaseFold(String anotherString)
          - Comparator<String> CASE_FOLD_ORDER

      with the intent of guiding developers toward these methods when Unicode-compliant caseless matching is required.

      The New API

      /**
       * Compares this {@code String} to another {@code String} for equality,
       * using <em>Unicode case folding</em>.
       * <p>
       * Two strings are considered equal by this method if their case-folded
       * forms are identical. Case folding is defined by the Unicode Standard in
       * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
       * including 1:M mappings. For example, {@code "Maße".equalsCaseFold("MASSE")}
       * returns {@code true}, since the character {@code U+00DF} (sharp s) folds
       * to {@code "ss"}.
       * <p>
       * Case folding is locale-independent and language-neutral, unlike
       * locale-sensitive transformations such as {@link #toLowerCase()} or
       * {@link #toUpperCase()}. It is intended for caseless matching,
       * searching, and indexing.
       *
       * @apiNote
       * This method is the Unicode-compliant alternative to
       * {@link #equalsIgnoreCase(String)}. It implements full case folding as
       * defined by the Unicode Standard, which may differ from the simpler
       * per-character mapping performed by {@code equalsIgnoreCase}.
       * For example:
       * <pre>{@code
       * String a = "Maße";
       * String b = "MASSE";
       * boolean equalCaseFold = a.equalsCaseFold(b); // returns true
       * boolean equalIgnoreCase = a.equalsIgnoreCase(b); // returns false
       * }</pre>
       *
       * @param anotherString
       * The {@code String} to compare this {@code String} against
       *
       * @return {@code true} if the given object is a {@code String}
       * that represents the same sequence of characters as this
       * string under Unicode case folding; {@code false} otherwise.
       *
       * @see #compareToCaseFold(String)
       * @see #equalsIgnoreCase(String)
       * @see java.text.Collator
       * @since 26
       */
      public boolean equalsCaseFold(String anotherString)

      /**
       * Compares two strings lexicographically using Unicode case folding.
       * <p>
       * This method returns an integer whose sign is that of calling {@code compareTo}
       * on the case-folded versions of the strings. Unicode case folding eliminates
       * differences in case according to the Unicode Standard, using the mappings
       * defined in
       * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
       * including 1:M mappings, such as {@code "ß"} → {@code "ss"}.
       * <p>
       * Case folding is a locale-independent, language-neutral form of case mapping,
       * primarily intended for caseless matching. Unlike {@link #compareToIgnoreCase(String)},
       * which applies a simpler locale-insensitive uppercase mapping, this method
       * follows the Unicode-defined <em>full</em> case folding, providing stable and
       * consistent results across all environments.
       * <p>
       * Note that this method does <em>not</em> take locale into account, and may
       * produce results that differ from locale-sensitive ordering. For locale-aware
       * comparisons, use {@link java.text.Collator}.
       *
       * @apiNote
       * This method is the Unicode-compliant alternative to
       * {@link #compareToIgnoreCase(String)}. It implements full case folding
       * as defined by the Unicode Standard, which may differ from the simpler
       * per-character mapping performed by {@code compareToIgnoreCase}.
       * For example:
       * <pre>{@code
       * String a = "Maße";
       * String b = "MASSE";
       * int cmpCaseFold = a.compareToCaseFold(b); // returns 0
       * int cmpIgnoreCase = a.compareToIgnoreCase(b); // returns > 0
       * }</pre>
       *
       * @param str the {@code String} to be compared.
       * @return a negative integer, zero, or a positive integer as the specified
       * String is greater than, equal to, or less than this String,
       * ignoring case considerations by case folding.
       * @see #equalsCaseFold(String)
       * @see #compareToIgnoreCase(String)
       * @see java.text.Collator
       * @since 26
       */
      public int compareToCaseFold(String str)

      /**
       * A Comparator that orders {@code String} objects as by
       * {@link #compareToCaseFold(String) compareToCaseFold()}.
       *
       * @see #compareToCaseFold(String)
       * @since 26
       */
      public static final Comparator<String> CASE_FOLD_ORDER;
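
For contrast with the proposed comparator, the sketch below shows how the existing String.CASE_INSENSITIVE_ORDER behaves when used as the ordering of a sorted set. The class and method names (`IgnoreCaseOrderDemo`, `distinctKeys`) are hypothetical; only existing JDK APIs are called.

```java
import java.util.List;
import java.util.TreeSet;

class IgnoreCaseOrderDemo {
    // Counts how many keys remain distinct under String.CASE_INSENSITIVE_ORDER.
    static int distinctKeys(List<String> keys) {
        TreeSet<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(keys);
        return set.size();
    }

    public static void main(String[] args) {
        // "Strasse" and "STRASSE" collapse into one key; the sharp-s spelling
        // stays separate because the comparison is per character.
        List<String> keys = List.of("Strasse", "STRASSE", "stra\u00dfe");
        System.out.println(distinctKeys(keys)); // 2
    }
}
```

Going by the Javadoc above, where compareToCaseFold returns 0 for "Maße" and "MASSE", a set ordered by CASE_FOLD_ORDER would presumably treat all three spellings as the same key.
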

      Usage Examples

      // Sharp s (U+00DF) case-folds to "ss"
      "straße".equalsCaseFold("strasse"); // true

      // Greek sigma variants fold consistently
      "ΜΙΚΡΟΣ Σ".equalsCaseFold("μικροσ σ"); // true

      // Kelvin sign (U+212A) is case-folded to "k"
      "\u212A".equalsCaseFold("k"); // true

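The three cases above can be approximated on current JDKs with a toy fold. This is only a sketch under stated assumptions: `CaseFoldSketch`, `foldSketch` and `equalsFoldSketch` are hypothetical names, the only 1:M mapping handled is U+00DF, and every other code point falls back to the upper-then-lower mapping. It does not implement the full CaseFolding.txt table and is not the PR's implementation.

```java
class CaseFoldSketch {
    // Toy fold: U+00DF expands to "ss"; everything else goes through
    // Character.toLowerCase(Character.toUpperCase(cp)).
    static String foldSketch(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        s.codePoints().forEach(cp -> {
            if (cp == 0x00DF) {
                sb.append("ss");
            } else {
                sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp)));
            }
        });
        return sb.toString();
    }

    // Compares the toy-folded forms for equality.
    static boolean equalsFoldSketch(String a, String b) {
        return foldSketch(a).equals(foldSketch(b));
    }

    public static void main(String[] args) {
        // Sharp s against "SS"
        System.out.println(equalsFoldSketch("stra\u00dfe", "STRASSE")); // true
        // Capital sigmas against a lowercase form that uses final sigma (U+03C2)
        System.out.println(equalsFoldSketch(
                "\u039c\u0399\u039a\u03a1\u039f\u03a3 \u03a3",
                "\u03bc\u03b9\u03ba\u03c1\u03bf\u03c2 \u03c3")); // true
        // Kelvin sign (U+212A) against "k"
        System.out.println(equalsFoldSketch("\u212a", "k")); // true
    }
}
```

Comparing folded copies allocates two intermediate strings per call, which is one reason a built-in comparison method is attractive.
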
      Refs

          - Unicode Standard 5.18.4 Caseless Matching
          - Unicode® Standard Annex #44: 5.6 Case and Case Mapping
          - Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
          - Unicode SpecialCasing.txt
          - Unicode CaseFolding.txt

      Other Languages

      (1) Python str.casefold()

      The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.

      (2) Perl’s fc()

      Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
      Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
      Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.

      (3) ICU4J UCharacter.foldCase (Java)

      Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
      Method Signature (String based): public static String foldCase(String str, int options)
      Method Signature (CharSequence & Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits)
      Key Features:

          - Case Folding: Converts a string to its case-folded equivalent.
          - Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
          - Context Insensitive: The mapping of a character is not affected by surrounding characters.
          - Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
          - Result Length: The resulting string can be longer or shorter than the original.
          - Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.

      (4) u_strFoldCase (C/C++)

      A lower-level C API function for case folding a string.
      Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
      Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.

            sherman Xueming Shen