Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8166994

Improve sun.util.locale.LocaleMatcher

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: P3 P3
    • 9
    • None
    • core-libs
    • None

      A colleague reports:
      -----------
       (1) is an observable bug
      and (2) is fragile code that only works accidentally; it could easily
      cause bugs in future.

      I'm attaching two patches:

       - LocaleMatcher-bugfixonly.patch fixes (1)
       - LocaleMatcher-bugfix-plus-regex-cleanup.patch fixes (1) and
         also cleans up (2).

      To verify that each patch fixes the issue (1), I have run the test
      case below which demonstrates the bug on OpenJDK9. I have not run any
      other tests or performed any other verification so my patch for (2)
      may not be correct.

      Test case that fails against OpenJDK9 but passes after the patch:
      === cut ===
      public void testParseConsistency() {
        assertEquals(Arrays.asList("ccq-aa", "ybd-aa", "rki-aa"), parseRange("ccq-aa"));
             
        // Second invocation fails with:
        // Expected :[ccq-aa, ybd-aa, rki-aa]
        // Actual :[ccq-aa, ybd-aa-aa, rki-aa-aa]
        assertEquals(Arrays.asList("ccq-aa", "ybd-aa", "rki-aa"), parseRange("ccq-aa"));
      }

      private List<String> parseRange(String s) {
        return LocaleMatcher.parse(s).stream().map(LanguageRange::getRange).collect(Collectors.toList());
      }
      === cut ===


      ===================================================
      Issue (1):

      LocaleMatcher.getEquivalentsForLanguage(String range) (line 346)
      modifies the statically referenced LocaleEquivalentMaps.multiEquivsMap
      which is intended to be constant. This is why LocaleMatcher behaves
      differently in the second invocation from the test case.

      Here's a code snippet of the buggy area in LocaleMatcher.java:

      334 private static String[] getEquivalentsForLanguage(String range) {
      335 String r = range;
      336
      337 while (r.length() > 0) {
      338 if (LocaleEquivalentMaps.singleEquivMap.containsKey(r)) {
      339 String equiv = LocaleEquivalentMaps.singleEquivMap.get(r);
      340 // Return immediately for performance if the first matching
      341 // subtag is found.
      342 return new String[] {range.replaceFirst(r, equiv)};
      343 } else if (LocaleEquivalentMaps.multiEquivsMap.containsKey(r)) {
      344 String[] equivs = LocaleEquivalentMaps.multiEquivsMap.get(r);
      345 for (int i = 0; i < equivs.length; i++) {
      346 equivs[i] = range.replaceFirst(r, equivs[i]);
      347 }
      348 return equivs;
      349 }

      ===================================================

      Issue (2):

      A few lines in LocaleMatcher.java (including the one with the bug)
      call String.replaceFirst(a, b) for values a that are literal Strings,
      thus they should be String.replaceFirst(Pattern.quote(a), b) or some
      other literal String matching logic.

      Other lines in LocaleMatcher use String.replaceAll() that could
      be rewritten to faster and more readable String.replace().

      Specifically:

       - there are two occurrences of
           range.replaceAll("\\x2A", "\\\\p{Alnum}*");
         Because unicode codepoint # 0x2A is '*', this is equivalent to:
           range.replace("*", "\\\\p{Alnum}*");
         which is both more readable and faster.

       - LocaleMatcher.filterTags() contains the line:
           range = range.replaceAll("-[*]", "");
         which I believe is exactly equivalent to the faster:
           range = range.replace("-*", "");

       - LocaleMatcher.getEquivalentForRegionAndVariant(String) in line 376
         reads:
           return range.replaceFirst(subtag, LocaleEquivalentMaps.regionVariantEquivMap.get(subtag));
         but subtag is a literal string rather than a regex. This happens
         to work because it will only consist of characters from [-a-z]
         but this is fragile.

       - LocaleMatcher.getEquivalentsForLanguage(String range) lines
         342 and 346 call range.replaceFirst(r, ...). The value of r
         here is again a literal string. It happens to work because r
         contains only alphanumeric characters but it is fragile.

            nishjain Nishit Jain
            martin Martin Buchholz
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: