JDK-8041791

String.toLowerCase regression - violates Unicode standard

    • b14
    • generic
    • generic
    • Verified

        The change JDK-8020037 "String.toLowerCase incorrectly increases length, if string contains \u0130 char" seems to be wrong, according to my reading of the Unicode standard.

        The title "String.toLowerCase incorrectly increases length" assumes that the length increase is a problem, but it is not. The documentation specifically says: "Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String."

        Looking at http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt, I see:

        # Preserve canonical equivalence for I with dot. Turkic is handled below.

        0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

        My understanding is that in all locales *except* those handled specially ('az', 'lt', and 'tr'), "\u0130" and "\u0069\u0307" should convert bi-directionally:
        lowercasing "\u0130" should yield "\u0069\u0307", and
        converting "\u0069\u0307" to uppercase or titlecase should yield "\u0130".

        Note this allows round-trip conversions, which is why it is specified this way.

        Java 7 correctly does the former conversion, but not the latter.
        Java 8 does neither.
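
        A minimal sketch for checking the conversions described above on a given JDK. The class name SpecialCasingCheck and its helpers hex, lower, and upper are made up for this illustration and are not part of the JDK. It prints the code points produced by String.toLowerCase and String.toUpperCase; the first two results depend on the release, so no expected output is claimed for them here.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper class, written only to illustrate the report.
final class SpecialCasingCheck {

    /** Formats each code point of s as at least four hex digits, e.g. "0069 0307". */
    static String hex(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("%04X", cp)));
        return out.toString();
    }

    /** Lowercases s in the given locale and returns the code points of the result. */
    static String lower(String s, Locale locale) {
        return hex(s.toLowerCase(locale));
    }

    /** Uppercases s in the given locale and returns the code points of the result. */
    static String upper(String s, Locale locale) {
        return hex(s.toUpperCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("java.version: " + System.getProperty("java.version"));

        // Non-Turkic locale: these two results are what the report is about.
        System.out.println("lower(0130, ROOT)      = " + lower("\u0130", Locale.ROOT));
        System.out.println("upper(0069 0307, ROOT) = " + upper("\u0069\u0307", Locale.ROOT));

        // Turkic locale: covered by separate rules in SpecialCasing.txt.
        System.out.println("lower(0130, tr)        = " + lower("\u0130", turkish));

        // A length-changing mapping that is not in dispute: U+00DF uppercases to "SS".
        System.out.println("upper(00DF, ROOT)      = " + upper("\u00DF", Locale.ROOT));
    }
}
```

        Locale.ROOT is used so that the result does not depend on the default locale of the machine running the check.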

              naoto Naoto Sato
              pbothner Per Bothner (Inactive)
