- Bug
- Resolution: Unresolved
- P4
- 7u45
- x86_64
- windows_7
FULL PRODUCT VERSION :
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows [Version 6.1.7601]
A DESCRIPTION OF THE PROBLEM :
Sorting text using the current Romanian collation gives incorrect results. There are two issues here:
1. The Romanian letter "â Â" is sorted before the Romanian letter "ă Ă", which is incorrect. The correct order is: a, A < ă, Ă < â, Â. It seems that Java considers the Romanian character "â Â" as being equal to "a A" even if
collator.setStrength(Collator.IDENTICAL) is used. This is the output of sorting some strings using Collator.IDENTICAL strength: a, a ,â ,aa, âi ,az, ăt
The expected result is: a, a , aa, az, ăt, â , âi
2. The Romanian letters "ș Ș" (Unicode name scomma) and "ț Ț" (Unicode name tcomma) are considered equal to "s" and "t" respectively. This is incorrect; the correct order is: s, S < ș, Ș < t, T < ț, Ț < u, U.
This is the output of sorting some strings using Collator.IDENTICAL strength: ti, ți, tt
The expected result is: ti, tt, ți
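The two orderings above could be reproduced with a small harness along the following lines. This is only a sketch: the class name RomanianSortRepro and the helper sortRomanian are made up for illustration, and the Romanian letters are written as Unicode escapes so that the source file encoding cannot change what is being sorted. The printed order depends on the locale data shipped with the JDK in use, so no output is shown.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class RomanianSortRepro {
    // Hypothetical helper: sorts a copy of the input with the JDK's
    // built-in Romanian collator at IDENTICAL strength.
    static List<String> sortRomanian(List<String> input) {
        Collator collator = Collator.getInstance(new Locale("ro", "RO"));
        collator.setStrength(Collator.IDENTICAL);
        List<String> sorted = new ArrayList<String>(input);
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        // Issue 1: U+0103 is a with breve, U+00E2 is a with circumflex.
        System.out.println(sortRomanian(
                Arrays.asList("\u00E2", "\u0103t", "az", "aa", "a")));
        // Issue 2: U+021B is t with comma below.
        System.out.println(sortRomanian(
                Arrays.asList("tt", "\u021Bi", "ti")));
    }
}
```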
Using the following code:
Locale locale = new Locale("ro", "RO");
Collator collator = Collator.getInstance(locale);
System.out.println(((RuleBasedCollator)collator).getRules());
reveals that Java uses the erroneous legacy characters "ş" and "ţ" (Unicode names scedilla and tcedilla) instead of the correct Romanian characters "ș" and "ț".
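Because the cedilla and comma-below forms look nearly identical on screen, it may be easier to test the returned rules for specific code points than to read them. The sketch below assumes the collator is a RuleBasedCollator (it checks before casting); the class name RomanianRulesCheck and the helper mentions are hypothetical.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class RomanianRulesCheck {
    // Hypothetical helper: true if the rule string mentions the character.
    static boolean mentions(String rules, char c) {
        return rules.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("ro", "RO"));
        if (!(collator instanceof RuleBasedCollator)) {
            System.out.println("Collator is not rule based; rules cannot be inspected");
            return;
        }
        String rules = ((RuleBasedCollator) collator).getRules();
        // Legacy cedilla forms: U+015F (s cedilla), U+0163 (t cedilla).
        System.out.println("U+015F present: " + mentions(rules, '\u015F'));
        System.out.println("U+0163 present: " + mentions(rules, '\u0163'));
        // Correct comma-below forms: U+0219 (s comma), U+021B (t comma).
        System.out.println("U+0219 present: " + mentions(rules, '\u0219'));
        System.out.println("U+021B present: " + mentions(rules, '\u021B'));
    }
}
```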
This is a legacy problem: the incorrect characters were used to define Romanian fonts and Romanian keyboard mappings up to the Windows XP SP3 era. That issue was resolved afterwards, and the correct mappings for fonts and keyboards are now used.
The fix should not break backward compatibility, though: some old fonts still in use today never updated these mappings, and there is old data that uses the incorrect characters.
A proposed solution is:
s, S < ș=ş, Ș=Ş < t, T < ț=ţ, Ț=Ţ < u, U
This way we support the correct Romanian alphabetical order and remain backward compatible.
REPRODUCIBILITY :
This bug can always be reproduced.
CUSTOMER SUBMITTED WORKAROUND :
//I am not sure this is enough but this works for now
//(needs java.text.Collator and java.text.RuleBasedCollator; the constructor can throw java.text.ParseException)
String romanian = "< a, A < ă, Ă < â, Â < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I" +
"< î, Î < j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R" +
"< s, S < ș=ş, Ș=Ş < t, T < ț=ţ, Ț=Ţ < u, U < v, V < w, W < x, X < y, Y < z, Z";
Collator collator = new RuleBasedCollator(romanian);
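To check that the workaround produces the expected orderings from the description, something like the following could be used. It is a sketch, not a vetted fix: the class name RomanianWorkaroundCheck and the helper sortWithWorkaround are hypothetical, the rule string is the one above rewritten with Unicode escapes, and the collator is left at its default strength.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RomanianWorkaroundCheck {
    // Same rules as the workaround, with non-ASCII letters escaped:
    // U+0103/U+0102 a breve, U+00E2/U+00C2 a circumflex, U+00EE/U+00CE i circumflex,
    // U+0219/U+0218 s comma, U+015F/U+015E s cedilla,
    // U+021B/U+021A t comma, U+0163/U+0162 t cedilla.
    static final String ROMANIAN_RULES =
            "< a, A < \u0103, \u0102 < \u00E2, \u00C2 < b, B < c, C < d, D"
            + " < e, E < f, F < g, G < h, H < i, I < \u00EE, \u00CE < j, J"
            + " < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R"
            + " < s, S < \u0219=\u015F, \u0218=\u015E"
            + " < t, T < \u021B=\u0163, \u021A=\u0162"
            + " < u, U < v, V < w, W < x, X < y, Y < z, Z";

    // Hypothetical helper: sorts a copy of the input with the custom rules.
    static List<String> sortWithWorkaround(List<String> input) {
        Collator collator;
        try {
            collator = new RuleBasedCollator(ROMANIAN_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string did not parse", e);
        }
        List<String> sorted = new ArrayList<String>(input);
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        // The report expects the t-comma string to sort after "tt".
        System.out.println(sortWithWorkaround(
                Arrays.asList("tt", "\u021Bi", "ti")));
        // The report expects a-breve before a-circumflex, both after "az".
        System.out.println(sortWithWorkaround(
                Arrays.asList("\u00E2", "\u0103t", "az", "aa", "a")));
    }
}
```

Note that characters not listed in such a hand-written rule string (digits, punctuation, letters from other alphabets) are not covered by it, which may be what the "not sure this is enough" comment refers to.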