-
Bug
-
Resolution: Not an Issue
-
P5
-
None
-
5.0
-
x86
-
windows_xp
FULL PRODUCT VERSION :
java version "1.5.0_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_05-b05)
Java HotSpot(TM) Client VM (build 1.5.0_05-b05, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows XP [Version 5.1.2600]
A DESCRIPTION OF THE PROBLEM :
When I run the test below on an "MS949"-encoded byte sequence, the round trip through String does not return the original bytes, which appears to indicate that String.getBytes() does not properly interpret the encoding.
This indicates that in logical terms:
bytes != new String(bytes, "MS949").getBytes("MS949");
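The inequality above can be checked directly by comparing the two arrays with Arrays.equals. The following is a minimal sketch and not part of the original report: the class name RoundTripCheck and the method roundTrips are hypothetical names chosen for illustration, the Charset-based String overloads it uses require Java 6 or later, and MS949 is only exercised if the running JVM supports it.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class RoundTripCheck {
    // Hypothetical helper: returns true if decoding the bytes with the named
    // charset and encoding the result again yields the same byte sequence.
    public static boolean roundTrips(byte[] bytes, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        // String(byte[], Charset) and getBytes(Charset) need Java 6 or later.
        byte[] again = new String(bytes, cs).getBytes(cs);
        return Arrays.equals(bytes, again);
    }

    public static void main(String[] args) {
        // The same eight bytes used in the report.
        byte[] b = { (byte) 0xb4, (byte) 0x40, (byte) 0xb4, (byte) 0x41,
                     (byte) 0xb4, (byte) 0x42, (byte) 0xb4, (byte) 0x43 };
        String[] charsets = { "ISO-8859-1", "MS949" };
        for (String name : charsets) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JVM");
                continue;
            }
            System.out.println(name + ": round trip "
                    + (roundTrips(b, name) ? "preserved" : "changed")
                    + " the bytes");
        }
    }
}
```

Comparing with Arrays.equals avoids comparing array references, which is what == or != on two byte[] values would do in Java.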
I used four Korean characters:
B440(2);; # HANGUL SYLLABLE TIKEUT YO RIEULSIOS
B441(2);; # HANGUL SYLLABLE TIKEUT YO RIEULTHIEUTH
B442(2);; # HANGUL SYLLABLE TIKEUT YO RIEULPHIEUPH
B443(2);; # HANGUL SYLLABLE TIKEUT YO RIEULHIEUH
These are taken from http://www.iana.org/assignments/idn/kr-korean.html
MS949 is listed as the Windows Korean character set here:
http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the test code included below, between the BEGIN SOURCE and END SOURCE markers.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
I would expect the two byte arrays to be equal, as they are with the default and ISO-8859-1 character sets.
ACTUAL -
The byte array returned by String.getBytes("MS949") is not equal to the byte array passed to new String(bytes, "MS949").
ERROR MESSAGES/STACK TRACES THAT OCCUR :
No errors are reported
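The absence of errors does not by itself show that the input bytes are valid in the charset. The String API documentation leaves the behavior of new String(bytes, charsetName) unspecified for invalid input, and the Charset-based constructor is documented to replace malformed and unmappable sequences with the charset's default replacement string. One way to check validity is to decode with a CharsetDecoder configured to report errors. The sketch below is an illustration only: StrictDecode and isValid are hypothetical names, and it makes no claim about which of the four byte pairs MS949 accepts.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    // Hypothetical helper: returns true if the whole input decodes without
    // any malformed or unmappable byte sequence in the named charset.
    public static boolean isValid(byte[] bytes, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of silently substituting a replacement character.
            return false;
        }
    }

    public static void main(String[] args) {
        // The same eight bytes used in the report.
        byte[] b = { (byte) 0xb4, (byte) 0x40, (byte) 0xb4, (byte) 0x41,
                     (byte) 0xb4, (byte) 0x42, (byte) 0xb4, (byte) 0x43 };
        for (String name : new String[] { "ISO-8859-1", "MS949" }) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JVM");
                continue;
            }
            System.out.println(name + ": input is "
                    + (isValid(b, name) ? "valid" : "not valid"));
        }
    }
}
```

If isValid returns false for a charset, a byte-for-byte round trip through String cannot be expected for that input, because the decoded string no longer carries the original bytes.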
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
public void testStringParsing() {
    // Four two-byte sequences: B440, B441, B442, B443.
    byte[] b = new byte[] {(byte)0xb4, (byte)0x40,
                           (byte)0xb4, (byte)0x41,
                           (byte)0xb4, (byte)0x42,
                           (byte)0xb4, (byte)0x43};
    try {
        String[] charsets = new String[] {"ISO-8859-1", "MS949"};
        for (int i = 0; i < charsets.length; i++) {
            System.out.println("Using encoding "+charsets[i]+" for Korean characters.");
            // Decode once with the named charset and once with the platform default.
            String koreanEncodedString = new String(b, charsets[i]);
            String defaultEncodedString = new String(b);
            System.out.println("KOREAN 1: "+koreanEncodedString);
            System.out.println("DEFAULT 1: "+defaultEncodedString);
            // Encode back to bytes, then decode a second time.
            byte[] koreanBytes = koreanEncodedString.getBytes(charsets[i]);
            byte[] defaultBytes = defaultEncodedString.getBytes();
            String stringFromKoreanBytes = new String(koreanBytes, charsets[i]);
            String stringFromDefaultBytes = new String(defaultBytes);
            System.out.println("KOREAN 2: "+stringFromKoreanBytes);
            System.out.println("DEFAULT 2: "+stringFromDefaultBytes);
            // Sanity check: a string always equals itself.
            if (koreanEncodedString.equals(koreanEncodedString))
                System.out.println("Korean String 1 matches Korean String 1");
            else
                System.out.println("Korean String 1 does not match Korean String 1");
            if (koreanEncodedString.equals(stringFromKoreanBytes))
                System.out.println("Korean String 1 matches Korean String 2");
            else
                System.out.println("Korean String 1 does not match Korean String 2");
            if (defaultEncodedString.equals(stringFromDefaultBytes))
                System.out.println("Default String 1 matches Default String 2");
            else
                System.out.println("Default String 1 does not match Default String 2");
            // Print the original bytes and the round-tripped bytes as hex.
            StringBuffer sb = new StringBuffer();
            formatByteArray(b, 0, b.length, true, sb);
            System.out.println(sb.toString());
            StringBuffer sb2 = new StringBuffer();
            formatByteArray(koreanBytes, 0, koreanBytes.length, true, sb2);
            System.out.println(sb2.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

// Appends the bytes raw[start .. start+length-1] to result as two-digit
// uppercase hex values, optionally separated by spaces.
private static void formatByteArray(byte[] raw, int start, int length,
                                    boolean useSpace, StringBuffer result)
{
    if (raw == null)
        return;
    for (int i = start; i < start + length; i++) {
        if (useSpace && i != start)
            result.append(" ");
        try {
            int b = raw[i] & 0xFF;
            if (b < 0x10)
                result.append("0");
            result.append(Integer.toHexString(b).toUpperCase());
        } catch (ArrayIndexOutOfBoundsException e) {
            result.append(" ");
        }
    }
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
I guess you could always force the "ISO-8859-1" encoding.