JDK-8138824

java.lang.String: spec doesn't match impl when ignoring case - equalsIgnoreCase(), regionMatches()

Details

    • b89
    • Verified

      Description

        The spec for String.equalsIgnoreCase() and String.regionMatches(boolean ignoreCase, ...) does not match what the code does.

        From the equalsIgnoreCase() JavaDoc:
        --
        "Two characters c1 and c2 are considered the same ignoring case if at least one of the following is true:
            The two characters are the same (as compared by the == operator)
            Applying the method Character.toUpperCase(char) to each character produces the same result
            Applying the method Character.toLowerCase(char) to each character produces the same result"
        --

        From regionMatches(boolean ignoreCase, ...):
        --
        "The result is {@code false} if and only if at least one of the following is true:
        ...
        ignoreCase is true and there is some nonnegative integer k less than len such that:
             Character.toLowerCase(this.charAt(toffset+k)) !=
                 Character.toLowerCase(other.charAt(ooffset+k))
        and:
             Character.toUpperCase(this.charAt(toffset+k)) !=
                     Character.toUpperCase(other.charAt(ooffset+k))"
        --

        These methods compare Strings one character at a time. The stated procedure for ignoring case is to call toUpperCase() and toLowerCase() for each character in the Strings, and compare the respective results.

        However, the code does something slightly different. From regionMatches():
          if (ignoreCase) {
              // If characters don't match but case may be ignored,
              // try converting both characters to uppercase.
              // If the results match, then the comparison scan should
              // continue.
              char u1 = Character.toUpperCase(c1);
              char u2 = Character.toUpperCase(c2);
              if (u1 == u2) {
                  continue;
              }
              // Unfortunately, conversion to uppercase does not work properly
              // for the Georgian alphabet, which has strange rules about case
              // conversion. So we need to make one last check before
              // exiting.
              if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                  continue;
              }
          }

        After comparing the result of toUpperCase(), toLowerCase() is called not on the original characters, but *on the result of toUpperCase()*.
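
To make the difference concrete, here is a minimal sketch that puts the documented rule next to the rule the code actually applies. The class and method names (CaseRuleDemo, documentedRule, implementedRule) are made up for illustration and are not part of the JDK. The character pair U+0131 / U+0130 is an example chosen here, on the assumption that the platform uses the standard Unicode simple case mappings for these two characters. It is not taken from the bug reports cited in this issue.

```java
class CaseRuleDemo {
    // The rule as written in the JavaDoc: ==, or equal after toUpperCase,
    // or equal after toLowerCase, each applied to the ORIGINAL characters.
    static boolean documentedRule(char c1, char c2) {
        return c1 == c2
            || Character.toUpperCase(c1) == Character.toUpperCase(c2)
            || Character.toLowerCase(c1) == Character.toLowerCase(c2);
    }

    // The rule as coded in regionMatches(): toLowerCase is applied to the
    // RESULT of toUpperCase, not to the original characters.
    static boolean implementedRule(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        char u1 = Character.toUpperCase(c1);
        char u2 = Character.toUpperCase(c2);
        return u1 == u2
            || Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    public static void main(String[] args) {
        char dotlessSmallI = '\u0131';   // LATIN SMALL LETTER DOTLESS I
        char dottedCapitalI = '\u0130';  // LATIN CAPITAL LETTER I WITH DOT ABOVE

        System.out.println("documented rule:  "
            + documentedRule(dotlessSmallI, dottedCapitalI));
        System.out.println("implemented rule: "
            + implementedRule(dotlessSmallI, dottedCapitalI));
        // What the real method reports on the running JDK, for comparison.
        System.out.println("equalsIgnoreCase: "
            + String.valueOf(dotlessSmallI)
                    .equalsIgnoreCase(String.valueOf(dottedCapitalI)));
    }
}
```

For this pair the two rules should disagree: neither the uppercase forms nor the lowercase forms of the original characters match, but lowercasing the uppercase forms yields the same character for both.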

        I've not found a specific reason for calling toLowerCase() with the result of toUpperCase(), instead of with the original character (beyond the "Georgian alphabet" comment). But the code has worked like this since JDK 1.0.2, and is consistent with String.compareToIgnoreCase(), added in JDK 1.2.

        I presume we did the best we could with the Unicode rules of the time. The long-standing behavior should be maintained for compatibility. Unicode's case mapping rules have evolved over time (addition of SpecialCasing and CaseFolding), as has the Unicode support in the JDK (addition of facilities for context- and locale-aware text handling in java.text).

        Over the years, bugs (e.g. JDK-4146417, JDK-4120540) have popped up questioning the Character.toLowerCase(Character.toUpperCase(char)) approach used by equalsIgnoreCase/regionMatches/compareToIgnoreCase. They were all determined to be "Not an Issue". Where the String API does not account for locale/language as people would want or expect, the answer has been to use the locale-sensitive APIs in java.text, specifically Collator (JDK-4204589, JDK-4425387, JDK-4120540).

        A JavaDoc update for equalsIgnoreCase() and regionMatches() is in order, to something along the lines of String.compareToIgnoreCase():
        "...with normalized versions of the strings where case differences have been eliminated by calling Character.toLowerCase(Character.toUpperCase(character)) on each character."
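
As a rough sketch of what that wording describes, the per-character normalization can be written as a small helper. The class CaseNormalizeDemo and the method normalize are hypothetical names used only for this illustration. Like the String methods discussed here, the sketch works on individual char values and makes no attempt to handle supplementary characters or locale-specific rules.

```java
class CaseNormalizeDemo {
    // Eliminates case differences the way the proposed wording describes:
    // Character.toLowerCase(Character.toUpperCase(c)) for each char.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(Character.toLowerCase(Character.toUpperCase(c)));
        }
        return sb.toString();
    }

    // Two strings are "equal ignoring case" under this description
    // when their normalized forms are identical.
    static boolean equalNormalized(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello World"));
        System.out.println(equalNormalized("Hello", "hELLO"));
    }
}
```

Comparing normalized forms gives the same answer as the two-step check in regionMatches(): if the uppercase forms match, so do their lowercase forms.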

        It would also be worth adding references to java.text.Collator.
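
For readers who do need locale-aware behavior, a Collator-based comparison might look like the sketch below. The class CollatorCaseDemo and the method equalIgnoringCase are hypothetical names. Using SECONDARY strength to ignore case while keeping accent differences significant is one reasonable choice, not the only one, and exact results depend on the collation rules for the chosen locale.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorCaseDemo {
    // Compares two strings under the given locale's collation rules,
    // treating case differences (a tertiary difference) as insignificant.
    static boolean equalIgnoringCase(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(equalIgnoringCase("hello", "HELLO", Locale.ENGLISH));
        System.out.println(equalIgnoringCase("hello", "hallo", Locale.ENGLISH));
    }
}
```

Unlike equalsIgnoreCase(), this takes the locale as an explicit input, so callers decide which language's rules apply.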

              People

                bchristi Brent Christian