Bug
Resolution: Fixed
P4
1.4.0
b64
x86
linux
Name: gm110360 Date: 07/15/2003
FULL PRODUCT VERSION :
java version "1.4.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_01-b03)
Java HotSpot(TM) Client VM (build 1.4.0_01-b03, mixed mode)
FULL OPERATING SYSTEM VERSION : Red Hat Linux 8.0
ADDITIONAL OPERATING SYSTEMS :
This bug is independent of the OS and occurs on
all operating systems; I am specifying Linux only
because there is no choice for "all OSes".
As such, the kernel version, glibc version and other
details are not relevant to this bug.
A DESCRIPTION OF THE PROBLEM :
EUC_KR and JOHAB converters in JDK have not
been updated to include two new characters
added to KS X 1001:1998 in 1998 (the previous versions
of this character set standard were issued
under the designation of KS C 5601-1987,
KS C 5601-1992 and KS X 1001:1997).
The two new characters are:
                          GL            GR
U+20AC  EURO SIGN         (0x22,0x66)   (0xA2,0xE6)
U+00AE  REGISTERED SIGN   (0x22,0x67)   (0xA2,0xE7)
For EUC-KR converters, they have to be in GR positions.
For JOHAB converters, their code points have to be
translated in the same way as the code points of other
symbol characters are translated from GL or GR
positions. Their positions in JOHAB are 0xD9E6 and
0xD9E7.
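The translation described above can be sketched in Java. This is only an
illustration: the class and method names are hypothetical, and the JOHAB
row/column arithmetic is an assumption based on the usual symbol-area
layout, not something taken from the JDK converters. It does reproduce the
two positions listed above.

```java
// Sketch only: names are hypothetical, and the JOHAB arithmetic is an
// assumption checked against the two code points given in this report.
final class Ksx1001Symbols {

    /** Move a 7-bit GL byte pair to GR (the EUC-KR form) by setting both high bits. */
    static int glToGr(int gl) {
        return gl | 0x8080;
    }

    /**
     * Map a KS X 1001 GL code from the symbol rows (lead byte 0x21..0x2C)
     * to a two-byte JOHAB code.
     */
    static int glSymbolToJohab(int gl) {
        int row = (gl >> 8) & 0xFF;
        int col = gl & 0xFF;
        if (row < 0x21 || row > 0x2C || col < 0x21 || col > 0x7E) {
            throw new IllegalArgumentException(
                    String.format("not in the symbol rows: 0x%04X", gl));
        }
        int t = row + 0x191;                    // two KS X 1001 rows share one JOHAB lead byte
        int trail = col + ((t & 1) != 0 ? 0x5E : 0); // odd t: second row of the pair
        trail += (trail < 0x6F) ? 0x10 : 0x22;  // trail bytes skip 0x7F..0x90
        return ((t >> 1) << 8) | trail;
    }

    public static void main(String[] args) {
        int[] glCodes = {0x2266, 0x2267};       // EURO SIGN, REGISTERED SIGN
        for (int gl : glCodes) {
            System.out.printf("GL 0x%04X -> GR 0x%04X, JOHAB 0x%04X%n",
                    gl, glToGr(gl), glSymbolToJohab(gl));
        }
    }
}
```

For 0x2266 and 0x2267 this yields GR 0xA2E6 and 0xA2E7 and JOHAB 0xD9E6 and
0xD9E7, matching the values given in this report.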
Last March,
I also contacted the Solaris I18N team and was told that
the next release of Solaris would add these two characters
to the EUC-KR codeset of Solaris. Mozilla/Netscape was updated
(http://bugzilla.mozilla.org/show_bug.cgi?id=134749)
and Sybase would do the same in their products.
Linux glibc fixed this problem a long time ago (in
late 1999 or early 2000). Microsoft was probably
the first to add these two characters to their
Windows-949. The MS949 converter in JDK 1.4 correctly
handles these two characters, too.
IBM and Oracle were also notified.
It would be nice if Java took care of this
problem soon.
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Make a simple UTF-8 file containing only the two characters
U+20AC and U+00AE with your favorite text editor
(capable of UTF-8 handling) and save it to a file
named 'text'.
2. Run the following three commands:
$ native2ascii -encoding UTF-8 text | native2ascii -reverse -encoding EUC_KR
$ native2ascii -encoding UTF-8 text | native2ascii -reverse -encoding Johab
$ native2ascii -encoding UTF-8 text | native2ascii -reverse -encoding MS949
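The same check can also be made without native2ascii by using the
java.nio.charset API. The following is a minimal sketch, not part of the
original test: the class and method names are hypothetical, and the charset
names "EUC-KR", "x-Johab" and "x-windows-949" are assumed to be the NIO
names of the EUC_KR, Johab and MS949 converters. A charset that is not
installed is reported the same way as an unmappable character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Sketch only: class and method names are hypothetical.
final class KoreanConverterCheck {

    /** Format bytes as space-separated upper-case hex, e.g. "A2 E6". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encode strictly; returns null if the charset is missing or a character is unmappable. */
    static byte[] encodeStrict(String charsetName, String text) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        try {
            ByteBuffer out = Charset.forName(charsetName).newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            return null; // the converter has no mapping for some character
        }
    }

    public static void main(String[] args) {
        String text = "\u20AC\u00AE"; // EURO SIGN, REGISTERED SIGN
        // Assumed NIO names for the EUC_KR, Johab and MS949 converters.
        String[] names = {"EUC-KR", "x-Johab", "x-windows-949"};
        for (String name : names) {
            byte[] bytes = encodeStrict(name, text);
            System.out.println(name + ": "
                    + (bytes == null ? "unavailable or not representable" : hex(bytes)));
        }
    }
}
```

Using CodingErrorAction.REPORT matters here: the default String.getBytes
behavior silently substitutes a replacement byte for unmappable characters,
which would hide the missing mappings.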
EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected results:
The first and the last commands emit the octet stream
'0xA2 0xE6 0xA2 0xE7' and the second
command outputs the octet sequence
'0xD9 0xE6 0xD9 0xE7'. (Use 'hexdump' on Solaris/Linux
or another binary-viewing tool of your choice
to examine the output.)
Actual results:
Instead, the first two
commands output
\u20ac\u00ae, which means the characters are not representable
in EUC_KR and Johab as far as the JDK is concerned.
On the other hand, the last command (before
piping through 'hexdump') emits the
octet sequence '0xA2 0xE6 0xA2 0xE7' as expected.
This clearly shows that the MS949 converter was updated
to include the two new characters in KS X 1001:1998,
but the EUC_KR and Johab converters have not been.
REPRODUCIBILITY :
This bug can always be reproduced.
---------- BEGIN SOURCE ----------
N/A.
I believe testing this problem with native2ascii is sufficient to
demonstrate the issue at hand.
---------- END SOURCE ----------
(Incident Review ID: 166984)
======================================================================