[Col] Collator has problems with Turkish Locale and SECONDARY or PRIMARY strength

Details

    • b96
    • x86
    • windows_2000, windows_xp

    Description

      Name: nt126004 Date: 05/21/2002


      FULL PRODUCT VERSION :
      java version "1.4.0"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-b92)
      Java HotSpot(TM) Client VM (build 1.4.0-b92, mixed mode)

      FULL OPERATING SYSTEM VERSION :
      Microsoft Windows 2000 [Version 5.00.2195]

      A DESCRIPTION OF THE PROBLEM :
      Turkish has 2 unique letter pairs:
      '\u0130' & 'i' ('İ' & 'i'), which correspond to
      English 'I' & 'i',
      and
      'I' & '\u0131' ('I' & 'ı'), which don't exist as
      letters in English and represent the back-vowel counterparts of
      English 'I' & 'i'.

      If the characters above do not display correctly, you can see them at:
      http://www.prustinteractive.com/toolbox/font/

      In other words, the Turkish counterparts of English I and i are both
      dotted, and their back-vowel versions are both dotless.
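
The pairing described above can be seen in the JDK's own locale-sensitive case mappings. The following is a minimal sketch, not part of the original report: the class and helper names (TurkishCaseDemo, upperTr, lowerTr, codePoints) are hypothetical, and Locale.forLanguageTag("tr-TR") is assumed to be an acceptable stand-in for the report's new Locale("tr", "TR") on newer JDKs.

```java
import java.util.Locale;

public class TurkishCaseDemo {
    // Assumed equivalent of the report's new Locale("tr", "TR").
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Hypothetical helpers wrapping the locale-sensitive JDK case mappings.
    static String upperTr(String s) { return s.toUpperCase(TURKISH); }
    static String lowerTr(String s) { return s.toLowerCase(TURKISH); }

    // Formats each char as U+XXXX so the result is readable on any console.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) ch));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Dotted pair: lowercase i maps to capital I with dot above.
        System.out.println(codePoints(upperTr("i")));      // U+0130
        System.out.println(codePoints(lowerTr("\u0130"))); // U+0069
        // Dotless pair: capital I maps to lowercase dotless i.
        System.out.println(codePoints(lowerTr("I")));      // U+0131
        System.out.println(codePoints(upperTr("\u0131"))); // U+0049
    }
}
```

This only illustrates case mapping in String; it says nothing about how Collator orders the four letters, which is the subject of this report.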

      From the API documentation, it appears that any one of the following:
      langCollator.setStrength(Collator.PRIMARY)
      or
      langCollator.setStrength
      (Collator.SECONDARY|Collator.CANONICAL_DECOMPOSITION);
      or
      langCollator.setStrength(Collator.SECONDARY);

      should capture the difference between the 2 pairs, but
      none of them does.

      All combinations involving PRIMARY & SECONDARY fail to
      distinguish between the dotted and the dotless letters. The only
      thing that makes both of them compare != 0 is TERTIARY or
      (a logical | with) Collator.FULL_DECOMPOSITION. But the
      moment I do that, I am no longer able to ignore case.
      Besides, the Collator still treats the 2 pairs as the same
      letter and intermingles, for example, the words starting with
      any of them when sorted.
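
A possible explanation for part of the behavior above, offered as a sketch rather than a diagnosis: in java.text.Collator the strength constants (PRIMARY, SECONDARY, TERTIARY, IDENTICAL) and the decomposition constants (NO_DECOMPOSITION, CANONICAL_DECOMPOSITION, FULL_DECOMPOSITION) are plain ints in the same small range, and decomposition has its own setter, setDecomposition. OR-ing one of each into setStrength therefore just selects another strength. The class and helper names below (StrengthVsDecomposition, configured) are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthVsDecomposition {
    // Hypothetical helper: applies strength and decomposition through
    // their separate setters instead of OR-ing them together.
    static Collator configured(Locale locale, int strength, int decomposition) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        c.setDecomposition(decomposition);
        return c;
    }

    public static void main(String[] args) {
        // SECONDARY and CANONICAL_DECOMPOSITION are both 1, so this is SECONDARY.
        System.out.println(Collator.SECONDARY | Collator.CANONICAL_DECOMPOSITION); // 1
        // SECONDARY (1) | FULL_DECOMPOSITION (2) is 3, the value of IDENTICAL.
        System.out.println(Collator.SECONDARY | Collator.FULL_DECOMPOSITION);      // 3

        Collator c = configured(Locale.forLanguageTag("tr-TR"),
                Collator.SECONDARY, Collator.CANONICAL_DECOMPOSITION);
        System.out.println(c.getStrength() == Collator.SECONDARY);
        System.out.println(c.getDecomposition() == Collator.CANONICAL_DECOMPOSITION);
    }
}
```

If that reading is right, the "| FULL_DECOMPOSITION" variant compares at IDENTICAL strength, which would account for case no longer being ignored. It does not explain why the dotted and dotless letters are not distinguished at PRIMARY or SECONDARY strength.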

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      1. Compile and run the source code below.
      2. Compare the results with the expected values.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      Expected, in the source code below:
      1) should be != 0
      2) should be == 0

      Actual:
      1) == 0
      It can be made != 0 with TERTIARY or FULL_DECOMPOSITION,
      but then 2) also becomes != 0.
      In addition, the 2 letter pairs are treated as 1 pair in sorting.

      getRules() returns a string identical to the one for the US
      Locale, which might be the root of the problem.
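
The getRules() observation can be checked with a few lines. This is a sketch under the assumption that Collator.getInstance returns a RuleBasedCollator (the API does not guarantee it, hence the instanceof check); the names CompareRules and rulesFor are hypothetical, and the result may differ between JDK releases, so no expected output is given.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CompareRules {
    // Hypothetical helper: returns the collation rules for a locale,
    // or null when the collator is not rule-based.
    static String rulesFor(Locale locale) {
        Collator c = Collator.getInstance(locale);
        return (c instanceof RuleBasedCollator)
                ? ((RuleBasedCollator) c).getRules()
                : null;
    }

    public static void main(String[] args) {
        String tr = rulesFor(Locale.forLanguageTag("tr-TR"));
        String us = rulesFor(Locale.US);
        System.out.println("tr_TR rule length: " + (tr == null ? "n/a" : tr.length()));
        System.out.println("en_US rule length: " + (us == null ? "n/a" : us.length()));
        System.out.println("identical rules:   " + (tr != null && tr.equals(us)));
    }
}
```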

      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------
      import java.text.*;
      import java.util.*;

      public class collate {
        public static void main(String args[])
        {
          Collator coll = Collator.getInstance(new Locale("tr", "TR"));
      //workaround place

          coll.setStrength(Collator.TERTIARY);
          System.out.println(coll.compare("a","A"));//nonzero: case differs at TERTIARY
          coll.setStrength(Collator.SECONDARY);
          System.out.println(coll.compare("a","A"));//0: case is ignored at SECONDARY

          coll.setStrength(Collator.SECONDARY);
          System.out.println(coll.compare("\u0131","i"));//1) should be != 0
          System.out.println(coll.compare("\u0130","i"));//2) should be == 0

          coll.setStrength(Collator.PRIMARY);
          System.out.println(coll.compare("a","\u00e0"));//0: accents are ignored at PRIMARY

          coll.setStrength(Collator.IDENTICAL);
          System.out.println(coll.compare("a","b"));//nonzero: different letters

          CollationKey key1 = coll.getCollationKey("abc");
          CollationKey key2 = coll.getCollationKey("def");
          System.out.println(key1.compareTo(key2));//nonzero: "abc" sorts before "def"
        }
      }

      ---------- END SOURCE ----------

      CUSTOMER WORKAROUND :
      The line marked "workaround place" above should be
      replaced with:

      RuleBasedCollator tr_Collator = null;
      try {
        tr_Collator = new RuleBasedCollator(
          "<a,A<b,B<c,C<?,?<d,D<e,E<f,F<g,G<\u011f,\u011e<?,?<h,H<?;\u0131,I<i,\u0130;?<j,J" +
          "<k,K<l,L<m,M<n,N<o,O<?,?<p,P<r,R<s,S<\u015f,\u015e<?,?<t,T<u,U<?,?<v,V<y,Y<z,Z<'-'<' '<q,Q<w,W<x,X");
      } catch (ParseException ex) {
        ex.printStackTrace();
      }
      // The next two lines are optional, as the rules already ensure a
      // letter-level difference. Decomposition has its own setter and
      // is not OR-ed into the strength.
      tr_Collator.setStrength(Collator.SECONDARY);
      tr_Collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
      /*
      The letters ?,?, ?, ?, ?,? are not part of the Turkish alphabet
      but are ASCII correspondences; they are included in an
      attempt to provide for their ordering as well under
      CANONICAL_DECOMPOSITION. The letters q,Q, w,W, x,X are not part
      of the Turkish alphabet, so they follow Z.
      Note: while the spec says "All non-mentioned Unicode characters
      are at the end of the collation order.", my '?' characters
      (included only for testing) got ranked at the end of the
      a-words, not after 'Z' or 'X'. That might be another bug,
      but one that won't concern most users of the Turkish version of
      Collator.
      */
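
For reference, the idea behind the workaround can be reduced to a self-contained sketch. This is not the reporter's rule string: it is an assumed tailoring that lists only the 29 letters of the standard Turkish alphabet, leaves out the extra test characters and q, w, x, and uses hypothetical names (TurkishRulesSketch, RULES, build). In RuleBasedCollator rule syntax, "<" marks a primary (letter) difference and "," a tertiary (case) difference.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class TurkishRulesSketch {
    // Assumed tailoring: each lowercase/uppercase pair differs only at the
    // tertiary level; dotless and dotted i are separate primary letters.
    static final String RULES =
        "< a,A < b,B < c,C < \u00e7,\u00c7 < d,D < e,E < f,F < g,G < \u011f,\u011e"
      + " < h,H < \u0131,I < i,\u0130 < j,J < k,K < l,L < m,M < n,N < o,O"
      + " < \u00f6,\u00d6 < p,P < r,R < s,S < \u015f,\u015e < t,T < u,U"
      + " < \u00fc,\u00dc < v,V < y,Y < z,Z";

    // Hypothetical factory: case-insensitive collator built from RULES.
    static RuleBasedCollator build() {
        try {
            RuleBasedCollator c = new RuleBasedCollator(RULES);
            c.setStrength(Collator.SECONDARY);                     // ignore case
            c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);  // separate setter
            return c;
        } catch (ParseException ex) {
            throw new IllegalStateException("invalid collation rules", ex);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator c = build();
        System.out.println(c.compare("\u0131", "i") != 0); // case 1) of the report
        System.out.println(c.compare("\u0130", "i") == 0); // case 2) of the report
        String[] words = { "iz", "\u0131rmak", "Irak", "\u0130zmir", "hal", "jet" };
        Arrays.sort(words, c);
        System.out.println(Arrays.toString(words));
    }
}
```

Characters not mentioned in the rules are left to the collator's handling of unmapped characters, so this sketch is only suitable for text restricted to the listed letters.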
      (Review ID: 146774)
      ======================================================================
      ###@###.### 10/14/04 00:39 GMT


            People

              jtusla Jiri Tusla (Inactive)
              nthompsosunw Nathanael Thompson (Inactive)