Context
Java currently provides String.toLowerCase(Locale) and String.toUpperCase(Locale) for case mapping. These methods are primarily intended for locale-sensitive text transformations and display. For example, in Turkish, "I".toLowerCase(Locale.forLanguageTag("tr")) produces "ı" (dotless i, U+0131), which is linguistically correct but differs from the "i" produced under other locales, so the result is not stable across locales.
For case-insensitive comparisons, applications today often rely on toLowerCase(Locale.ROOT) or toUpperCase(Locale.ROOT). While effective in many cases, these methods are not fully aligned with the Unicode Standard’s recommendations for caseless matching, and can produce inconsistent results: for instance, "straße" and "STRASSE" compare equal after uppercasing but not after lowercasing.
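The two behaviours described above can be reproduced with existing JDK methods. The following is a minimal sketch; the class and method names (CaseMappingPitfalls, turkishLower, equalViaRootLower, equalViaRootUpper) are made up for illustration and are not part of any proposed API.

```java
import java.util.Locale;

final class CaseMappingPitfalls {
    // Lowercases "I" under the Turkish locale; the result is U+0131 (dotless i).
    static String turkishLower() {
        return "I".toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Compares two strings after lowercasing both with Locale.ROOT.
    static boolean equalViaRootLower(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    // Compares two strings after uppercasing both with Locale.ROOT.
    static boolean equalViaRootUpper(String a, String b) {
        return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String strasse = "stra\u00DFe"; // "strasse" spelled with sharp s (U+00DF)

        System.out.println(turkishLower().equals("\u0131"));          // true
        System.out.println("I".toLowerCase(Locale.ROOT).equals("i")); // true
        System.out.println(equalViaRootLower(strasse, "STRASSE"));    // false
        System.out.println(equalViaRootUpper(strasse, "STRASSE"));    // true
    }
}
```

Lowercasing and uppercasing disagree on the last pair because toUpperCase expands the sharp s to "SS" while toLowerCase leaves it untouched.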
Proposal
As the first step, introduce a new API for locale-independent, Unicode-compliant case folding, based on the Unicode Character Database’s CaseFolding.txt mappings:
/**
 * Returns a string whose characters are case folded according to
 * the Unicode Standard's CaseFolding.txt mappings.
 *
 * This transformation is locale-independent and language-neutral, designed
 * for case-insensitive comparisons, search, and canonicalization.
 */
public String toCaseFold();
Rationale/Refs: Unicode Standard, Chapter 5 §5.18.4 “Caseless Matching”
Definition: Caseless matching is implemented using case folding, the process of mapping characters of different case to a single form so that case distinctions are erased. This allows fast comparisons using simple binary equality.
More than lowercase: Case folding is not equivalent to toLowerCase(). For example:
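Greek sigma: "όσος" and "ΌΣΟΣ" match with case folding but not with naive per-character lowercasing, because final sigma ς (U+03C2) and σ (U+03C3) both fold to σ while lowercasing keeps them distinct.
German sharp s: ß (U+00DF) case folds to "ss", whereas toLowerCase() leaves it unchanged.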
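The two examples can be checked with a small stand-in for the proposed method. This is only a sketch under stated assumptions: foldSketch is a hypothetical helper that hard-codes two CaseFolding.txt entries and otherwise falls back to per-character lowercasing, which is not a complete or exact case folding.

```java
final class FoldingExamples {
    // "Naive" lowercase: maps each code point on its own, with no context rules.
    static String naiveLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(Character.toLowerCase(cp)));
        return sb.toString();
    }

    // Hypothetical stand-in for toCaseFold(): two hand-copied CaseFolding.txt
    // entries, with per-character lowercase as an approximate fallback.
    static String foldSketch(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case 0x00DF: sb.append("ss"); break;            // sharp s -> "ss"
                case 0x03C2: sb.appendCodePoint(0x03C3); break; // final sigma -> sigma
                default: sb.appendCodePoint(Character.toLowerCase(cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String lower = "\u03CC\u03C3\u03BF\u03C2"; // small letters, ending in final sigma
        String upper = "\u038C\u03A3\u039F\u03A3"; // the same word in capitals

        System.out.println(naiveLower(lower).equals(naiveLower(upper))); // false
        System.out.println(foldSketch(lower).equals(foldSketch(upper))); // true
        System.out.println(foldSketch("stra\u00DFe").equals(foldSketch("STRASSE"))); // true
    }
}
```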
Stability & consistency: Case folding is locale-independent and language-neutral, avoiding the instability of locale-sensitive transformations. Unicode guarantees the stability of case folding mappings across versions (≥ Unicode 5.0).
Preserves semantics: Typically, applications store the original string alongside a case-folded form for fast comparisons. The folded form is not meant to replace the source string, but to enable reliable equality, search, and indexing.
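One way to picture the "original plus folded key" pattern is an index keyed by the folded form. The sketch below is illustrative only: FoldedIndex is a made-up class, and the folding function is passed in as a parameter because toCaseFold() does not exist yet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class FoldedIndex {
    private final UnaryOperator<String> fold;            // folding function supplied by the caller
    private final Map<String, List<String>> byKey = new HashMap<>();

    FoldedIndex(UnaryOperator<String> fold) {
        this.fold = fold;
    }

    // Stores the original string unchanged, filed under its folded key.
    void add(String original) {
        byKey.computeIfAbsent(fold.apply(original), k -> new ArrayList<>()).add(original);
    }

    // Returns every stored original whose folded key equals the folded query.
    List<String> lookup(String query) {
        return byKey.getOrDefault(fold.apply(query), List.of());
    }
}
```

A caller today might construct it with s -> s.toLowerCase(Locale.ROOT) as a placeholder and later swap in String::toCaseFold if the proposed method is adopted.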
Implementation source: CaseFolding.txt (part of the Unicode Character Database) defines the mappings, including single-character and multi-character foldings. The Turkic-specific mappings for dotted and dotless I are listed separately (status T) and are not part of the default folding, so default matching stays locale-independent.
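As a rough illustration of how the data file could drive the mappings, the sketch below parses lines in the CaseFolding.txt layout (code; status; mapping; # name) and keeps only the C (common) and F (full) entries, skipping S (simple) and T (Turkic). CaseFoldingTable, parse, and fold are hypothetical names, and a real implementation would more likely generate compact tables at build time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CaseFoldingTable {
    // Builds a code point -> folded string table from CaseFolding.txt-style lines.
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> table = new HashMap<>();
        for (String raw : lines) {
            int hash = raw.indexOf('#');                   // strip trailing comment
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) continue;
            String[] fields = line.split(";");
            if (fields.length < 3) continue;
            String status = fields[1].trim();
            if (!status.equals("C") && !status.equals("F")) continue; // skip S and T
            int source = Integer.parseInt(fields[0].trim(), 16);
            StringBuilder target = new StringBuilder();
            for (String hex : fields[2].trim().split("\\s+")) {
                target.appendCodePoint(Integer.parseInt(hex, 16));
            }
            table.put(source, target.toString());
        }
        return table;
    }

    // Folds a string code point by code point; unmapped code points pass through.
    static String fold(String s, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) sb.append(mapped); else sb.appendCodePoint(cp);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = parse(List.of(
            "# sample entries in the CaseFolding.txt layout",
            "0049; C; 0069; # LATIN CAPITAL LETTER I",
            "0049; T; 0131; # LATIN CAPITAL LETTER I",
            "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
            "03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA"));
        System.out.println(fold("I\u00DF", table)); // iss
    }
}
```

Because the T entry is skipped, capital I folds to plain "i" regardless of the default locale.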
Efficiency: The algorithm is context-insensitive and language-independent, which allows it to be implemented efficiently and consistently across platforms.
Next Steps
1) Prototype String.toCaseFold() using generated mappings from CaseFolding.txt.
2) Provide microbenchmarks comparing toCaseFold() vs. toLowerCase(Locale.ROOT).
3) Evaluate whether an optional equalsIgnoreCaseUnicode() convenience method is worthwhile.