- Bug
- Resolution: Fixed
- P3
- 1.4.0, 6
- b96
- x86
- windows_2000, windows_xp
Name: nt126004 Date: 05/21/2002
FULL PRODUCT VERSION :
java version "1.4.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-b92)
Java HotSpot(TM) Client VM (build 1.4.0-b92, mixed mode)
FULL OPERATING SYSTEM VERSION :
Microsoft Windows 2000 [Version 5.00.2195]
A DESCRIPTION OF THE PROBLEM :
Turkish has 2 unique letter pairs:
'\u0130' & 'i' ('İ' & 'i'), which correspond to
English 'I' & 'i',
and
'I' & '\u0131' ('I' & 'ı'), which don't exist as a
pair in English and are the back-vowel counterparts of
English 'I' & 'i'.
If the characters above do not display correctly, you can check them at:
http://www.prustinteractive.com/toolbox/font/
In other words, the Turkish equivalents of English I and i both
carry a dot, and their back-vowel counterparts are both dotless.
From the API it appears that any of:
langCollator.setStrength(Collator.PRIMARY)
or
langCollator.setStrength(Collator.SECONDARY|Collator.CANONICAL_DECOMPOSITION);
or
langCollator.setStrength(Collator.SECONDARY);
should capture the difference between the 2 pairs, but
none does.
All combinations involving PRIMARY & SECONDARY fail to
distinguish between the dotted and the dotless letters. The only
thing that makes them compare != 0 is TERTIARY or
(a bitwise | with) Collator.FULL_DECOMPOSITION. But the
moment I do that I am no longer able to ignore case.
Besides, the Collator still treats the 2 pairs as the same
letter and intermingles, for example, words starting with
any of them when sorted.
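A side note on the calls above, offered as a minimal sketch rather than as part of the original report: in java.text.Collator the strength and decomposition constants are plain ints with overlapping values (SECONDARY and CANONICAL_DECOMPOSITION are both 1, FULL_DECOMPOSITION is 2, IDENTICAL is 3). OR-ing a decomposition mode into setStrength() therefore selects a different strength instead of combining two settings, which would be consistent with case no longer being ignored once FULL_DECOMPOSITION is OR-ed in. The class name StrengthVsDecomposition and its helper methods are made up for illustration.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthVsDecomposition {

    // Value that results from OR-ing a strength with a decomposition mode.
    static int orOf(int strength, int decomposition) {
        return strength | decomposition;
    }

    // Configures strength and decomposition through their own setters.
    static Collator configure(Locale locale, int strength, int decomposition) {
        Collator coll = Collator.getInstance(locale);
        coll.setStrength(strength);
        coll.setDecomposition(decomposition);
        return coll;
    }

    public static void main(String[] args) {
        // SECONDARY | CANONICAL_DECOMPOSITION collapses to SECONDARY.
        System.out.println(
            orOf(Collator.SECONDARY, Collator.CANONICAL_DECOMPOSITION) == Collator.SECONDARY);
        // SECONDARY | FULL_DECOMPOSITION collapses to IDENTICAL.
        System.out.println(
            orOf(Collator.SECONDARY, Collator.FULL_DECOMPOSITION) == Collator.IDENTICAL);

        Collator coll = configure(Locale.forLanguageTag("tr-TR"),
                Collator.SECONDARY, Collator.CANONICAL_DECOMPOSITION);
        System.out.println("strength=" + coll.getStrength()
                + " decomposition=" + coll.getDecomposition());
    }
}
```

Keeping the two settings in separate setter calls avoids the accidental change of strength.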
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Use the source code below
2. Compare results
EXPECTED VERSUS ACTUAL BEHAVIOR :
In the source code below:
1) should be != 0
2) should be == 0
Actual:
1) == 0
It can be made != 0 with TERTIARY or FULL_DECOMPOSITION,
but then 2) also becomes != 0.
In addition, the 2 letter pairs are treated as 1 pair when sorting.
getRules() returns a string identical to that for the US
Locale, which might be the root of the problem.
This bug can always be reproduced.
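The getRules() observation can be checked with a short program. This is a sketch under the assumption that Collator.getInstance() returns a RuleBasedCollator on the JDK in use; the class name CompareRules and the helper rulesFor are hypothetical. The result depends on the JDK build, so no expected output is given.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CompareRules {

    // Returns the rule string for a locale, or null if the Collator is not rule-based.
    static String rulesFor(Locale locale) {
        Collator coll = Collator.getInstance(locale);
        return (coll instanceof RuleBasedCollator)
                ? ((RuleBasedCollator) coll).getRules()
                : null;
    }

    public static void main(String[] args) {
        String tr = rulesFor(Locale.forLanguageTag("tr-TR"));
        String us = rulesFor(Locale.US);
        if (tr == null || us == null) {
            System.out.println("Collator is not rule-based on this JDK");
            return;
        }
        System.out.println("tr_TR rules length: " + tr.length());
        System.out.println("en_US rules length: " + us.length());
        // Identical rule strings would mean the Turkish locale has no tailoring.
        System.out.println("identical: " + tr.equals(us));
    }
}
```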
---------- BEGIN SOURCE ----------
import java.text.*;
import java.util.*;
public class collate {
    public static void main(String args[])
    {
        Collator coll = Collator.getInstance(new Locale("tr", "TR"));
        //workaround place
        coll.setStrength(Collator.TERTIARY);
        System.out.println(coll.compare("a","A"));//expect != 0 (case is significant)
        coll.setStrength(Collator.SECONDARY);
        System.out.println(coll.compare("a","A"));//expect 0 (case is ignored)
        coll.setStrength(Collator.SECONDARY);
        System.out.println(coll.compare("\u0131","i"));//1) should be != 0
        System.out.println(coll.compare("\u0130","i"));//2) should be == 0
        coll.setStrength(Collator.PRIMARY);
        System.out.println(coll.compare("a","\u00e0"));//expect 0 (accent is ignored)
        coll.setStrength(Collator.IDENTICAL);
        System.out.println(coll.compare("a","b"));//expect != 0
        CollationKey key1 = coll.getCollationKey("abc");
        CollationKey key2 = coll.getCollationKey("def");
        System.out.println(key1.compareTo(key2));//expect != 0
    }
}
---------- END SOURCE ----------
CUSTOMER WORKAROUND :
The line indicated above as workaround place should be
replaced with:
RuleBasedCollator tr_Collator;
try {
    tr_Collator = new RuleBasedCollator(
        "<a,A<b,B<c,C<?,?<d,D<e,E<f,F<g,G<\u011f,\u011e<?,?<h,H<?;\u0131,I<i,\u0130;?<j,J" +
        "<k,K<l,L<m,M<n,N<o,O<?,?<p,P<r,R<s,S<\u015f,\u015e<?,?<t,T<u,U<?,?<v,V<y,Y<z,Z<'-'<' '<q,Q<w,W<x,X");
    // The next two lines are optional, as the rule ensures letter-grade difference.
    // Strength and decomposition are set through separate setters.
    tr_Collator.setStrength(Collator.SECONDARY);
    tr_Collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
} catch (ParseException ex) {
    ex.printStackTrace();
}
/*
The letters ?,?, ?, ?, ?,? are not part of the Turkish alphabet
but are ASCII correspondences, and are included in an
attempt to provide for their ordering as well under
CANONICAL_DECOMPOSITION. The letters q,Q, w,W, x,X are not part
of the Turkish alphabet, so they follow Z.
Note: while the spec says "All non-mentioned Unicode characters
are at the end of the collation order.", my '?' characters
(included only for testing) got ranked at the end of the a-words,
not after 'Z' or 'X'. That might be another bug,
but one that won't concern most users of the Turkish version of
Collator.
*/
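For comparison, a smaller tailoring can be sketched by appending a single rule to the default rule set instead of spelling out the whole alphabet. This is an illustration only: it assumes the default Collator is a RuleBasedCollator, and the rule string "& H < \u0131 , I < i , \u0130" is an assumption made for this example, not the reporter's rule and not necessarily what any JDK ships. The class name TurkishITailoring is hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Locale;

public class TurkishITailoring {

    // Assumed tailoring: dotless i/I becomes its own letter right after H,
    // followed by dotted i/I-with-dot as the next letter. Case stays a
    // tertiary difference (the "," operator).
    static final String I_TAILORING = "& H < \u0131 , I < i , \u0130";

    static RuleBasedCollator build() {
        Collator base = Collator.getInstance(Locale.ROOT);
        if (!(base instanceof RuleBasedCollator)) {
            throw new IllegalStateException("default Collator is not rule-based");
        }
        try {
            RuleBasedCollator tailored = new RuleBasedCollator(
                    ((RuleBasedCollator) base).getRules() + I_TAILORING);
            tailored.setStrength(Collator.SECONDARY);   // ignore case
            tailored.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
            return tailored;
        } catch (ParseException ex) {
            throw new IllegalStateException("tailoring did not parse", ex);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator coll = build();
        System.out.println(coll.compare("\u0131", "i"));  // case 1) from the report
        System.out.println(coll.compare("\u0130", "i"));  // case 2) from the report

        String[] words = { "iz", "\u0131rmak", "\u0130zmir", "Irak", "jet", "hal" };
        Arrays.sort(words, coll);
        System.out.println(Arrays.toString(words));
    }
}
```

With this tailoring the dotless pair and the dotted pair are separated at the primary level, so SECONDARY strength can still ignore case within each pair.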
(Review ID: 146774)
======================================================================
###@###.### 10/14/04 00:39 GMT
- relates to
- JDK-6328620 Collation of characters non-existing in Turkish locale
- Closed