JDK-6378476

Java String.getBytes() does not return proper bytes for "MS949" characterset

    • Type: Bug
    • Resolution: Not an Issue
    • Priority: P5
    • None
    • 5.0
    • core-libs

      FULL PRODUCT VERSION :
      java version "1.5.0_05"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_05-b05)
      Java HotSpot(TM) Client VM (build 1.5.0_05-b05, mixed mode)


      ADDITIONAL OS VERSION INFORMATION :
      Microsoft Windows XP [Version 5.1.2600]

      A DESCRIPTION OF THE PROBLEM :

      When I run the following test on an "MS949"-encoded string, I get a result that appears to indicate that String.getBytes() does not properly interpret the encoding.

      This indicates that in logical terms:
      bytes != new String(bytes, "MS949").getBytes("MS949");
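
      The round-trip property stated above can be checked directly. The following is a minimal sketch, not part of the original report: the class and method names (RoundTripCheck, roundTrips) are made up for illustration. It relies on the documented behavior that new String(bytes, charset) replaces malformed input with the charset's replacement string, so a round trip is only expected to succeed when every input byte sequence is valid in that charset. UTF-8 is used for the failing case because it is always available; the MS949 result is printed without assuming what it will be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {

    /** Returns true if decoding and then re-encoding reproduces the input bytes. */
    public static boolean roundTrips(byte[] bytes, Charset cs) {
        return Arrays.equals(bytes, new String(bytes, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xB4, (byte) 0x40};

        // ISO-8859-1 maps every byte value to exactly one char, so this prints true.
        System.out.println(roundTrips(b, StandardCharsets.ISO_8859_1));

        // 0xB4 alone is malformed UTF-8 and is decoded as U+FFFD, so this prints false.
        System.out.println(roundTrips(b, StandardCharsets.UTF_8));

        // MS949 is an optional charset; the result is printed, not assumed.
        if (Charset.isSupported("MS949")) {
            System.out.println(roundTrips(b, Charset.forName("MS949")));
        }
    }
}
```

      A false result here does not by itself show a defect in getBytes(); it can also mean that the input bytes were not a valid sequence in the chosen charset.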

      I used four Korean characters:
      B440(2);; # HANGUL SYLLABLE TIKEUT YO RIEULSIOS
      B441(2);; # HANGUL SYLLABLE TIKEUT YO RIEULTHIEUTH
      B442(2);; # HANGUL SYLLABLE TIKEUT YO RIEULPHIEUPH
      B443(2);; # HANGUL SYLLABLE TIKEUT YO RIEULHIEUH

      taken from http://www.iana.org/assignments/idn/kr-korean.html
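
      As a cross-check of the list above, the jamo names can be derived arithmetically from Unicode code points in the Hangul Syllables block (U+AC00 to U+D7A3). The sketch below is not part of the original report; the class name HangulCodePoints and its describe method are hypothetical. It treats B440 to B443 as Unicode code points, decomposes them into initial, vowel, and final jamo, and then prints whatever bytes MS949 produces for the same characters, without assuming what those bytes are.

```java
import java.nio.charset.Charset;

public class HangulCodePoints {
    static final String[] INITIAL = {
        "KIYEOK", "SSANGKIYEOK", "NIEUN", "TIKEUT", "SSANGTIKEUT", "RIEUL", "MIEUM",
        "PIEUP", "SSANGPIEUP", "SIOS", "SSANGSIOS", "IEUNG", "CIEUC", "SSANGCIEUC",
        "CHIEUCH", "KHIEUKH", "THIEUTH", "PHIEUPH", "HIEUH"};
    static final String[] VOWEL = {
        "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE", "OE",
        "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"};
    static final String[] FINAL = {
        "", "KIYEOK", "SSANGKIYEOK", "KIYEOKSIOS", "NIEUN", "NIEUNCIEUC", "NIEUNHIEUH",
        "TIKEUT", "RIEUL", "RIEULKIYEOK", "RIEULMIEUM", "RIEULPIEUP", "RIEULSIOS",
        "RIEULTHIEUTH", "RIEULPHIEUPH", "RIEULHIEUH", "MIEUM", "PIEUP", "PIEUPSIOS",
        "SIOS", "SSANGSIOS", "IEUNG", "CIEUC", "CHIEUCH", "KHIEUKH", "THIEUTH",
        "PHIEUPH", "HIEUH"};

    /** Decomposes a precomposed Hangul syllable code point into jamo names. */
    public static String describe(int codePoint) {
        if (codePoint < 0xAC00 || codePoint > 0xD7A3) {
            throw new IllegalArgumentException("not a Hangul syllable code point");
        }
        int s = codePoint - 0xAC00;
        int initial = s / (21 * 28);       // 21 vowels x 28 finals per initial
        int vowel = (s % (21 * 28)) / 28;  // 28 finals per vowel
        int fin = s % 28;                  // index 0 means no final consonant
        return (INITIAL[initial] + " " + VOWEL[vowel] + " " + FINAL[fin]).trim();
    }

    public static void main(String[] args) {
        for (int cp = 0xB440; cp <= 0xB443; cp++) {
            System.out.printf("U+%04X %s%n", cp, describe(cp));
            // MS949 is optional; print its bytes for this character if available.
            if (Charset.isSupported("MS949")) {
                byte[] encoded = new String(Character.toChars(cp))
                        .getBytes(Charset.forName("MS949"));
                StringBuilder hex = new StringBuilder();
                for (byte x : encoded) {
                    hex.append(String.format("%02X ", x & 0xFF));
                }
                System.out.println("  MS949 bytes: " + hex.toString().trim());
            }
        }
    }
}
```

      For U+B440 the decomposition yields TIKEUT YO RIEULSIOS, which matches the first entry in the list. Comparing the printed MS949 bytes with the B4 40 ... B4 43 values used in the test shows whether the listed values are byte sequences or code points.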
       
      MS949 is listed as the Windows Korean character set here:
       
      http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
       

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the test code included in the source section below.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      I would expect the two byte arrays to be equal, as they are with the default and ISO-8859-1 character sets.
      ACTUAL -
      The byte array returned by String.getBytes("MS949") is not equal to the byte array passed to new String(bytes, "MS949").

      ERROR MESSAGES/STACK TRACES THAT OCCUR :
      No errors are reported
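
      The absence of errors is expected with the String constructors, which silently substitute a replacement character for input they cannot decode. A stricter check is possible with java.nio.charset.CharsetDecoder configured to report problems. The sketch below is an illustration only: the class name StrictDecode and the method decodeStrict are hypothetical, and the outcome for MS949 is printed rather than assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {

    /** Decodes bytes, throwing instead of substituting replacement characters. */
    public static String decodeStrict(byte[] bytes, Charset cs)
            throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xB4, (byte) 0x40, (byte) 0xB4, (byte) 0x41,
                    (byte) 0xB4, (byte) 0x42, (byte) 0xB4, (byte) 0x43};
        // MS949 is an optional charset, so fall back to UTF-8 if it is missing.
        Charset cs = Charset.isSupported("MS949")
                ? Charset.forName("MS949")
                : Charset.forName("UTF-8");
        try {
            System.out.println(cs + " decoded: " + decodeStrict(b, cs));
        } catch (CharacterCodingException e) {
            // Reached when the input contains a sequence the charset rejects.
            System.out.println(cs + " rejected the input: " + e);
        }
    }
}
```

      If the decoder reports an error for the test bytes, the mismatch seen in the test comes from the decoding step rather than from getBytes().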

      REPRODUCIBILITY :
      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------

      public void testStringParsing() {
          byte[] b = new byte[] {(byte)0xb4, (byte)0x40,
              (byte)0xb4, (byte)0x41,
              (byte)0xb4, (byte)0x42,
              (byte)0xb4, (byte)0x43};
          try {
            
            String[] charsets = new String[] {"ISO-8859-1", "MS949"};
            
            for (int i = 0; i < charsets.length; i++) {
              
              System.out.println("Using encoding "+charsets[i]+" for Korean characters.");
              
              String koreanEncodedString = new String(b, charsets[i]);
              String defaultEncodedString = new String(b);
              System.out.println("KOREAN 1: "+koreanEncodedString);
              System.out.println("DEFAULT 1: "+defaultEncodedString);
            
              byte[] koreanBytes = koreanEncodedString.getBytes(charsets[i]);
              byte[] defaultBytes = defaultEncodedString.getBytes();
            
              String stringFromKoreanBytes = new String(koreanBytes, charsets[i]);
              String stringFromDefaultBytes = new String(defaultBytes);
            
              System.out.println("KOREAN 2: "+stringFromKoreanBytes);
              System.out.println("DEFAULT 2: "+stringFromDefaultBytes);
            
              // Sanity check: a string always equals itself, so this always prints "matches".
              if (koreanEncodedString.equals(koreanEncodedString))
                System.out.println("Korean String 1 matches Korean String 1");
              else
                System.out.println("Korean String 1 does not match Korean String 1");
              
            
              if (koreanEncodedString.equals(stringFromKoreanBytes))
                System.out.println("Korean String 1 matches Korean String 2");
              else
                System.out.println("Korean String 1 does not match Korean String 2");
            
              if (defaultEncodedString.equals(stringFromDefaultBytes))
                System.out.println("Default String 1 matches Default String 2");
              else
                System.out.println("Default String 1 does not match Default String 2");
            
              
              StringBuffer sb = new StringBuffer();
              formatByteArray(b, 0, b.length, true, sb);
              System.out.println(sb.toString());
              
              StringBuffer sb2 = new StringBuffer();
              formatByteArray(koreanBytes, 0, koreanBytes.length, true, sb2);
              System.out.println(sb2.toString());
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
        
        private static void formatByteArray(byte[] raw, int start, int length,
            boolean useSpace, StringBuffer result)
        {
          if (raw == null)
            return;
          for (int i = start; i < start + length; i++) {
            if (useSpace && i != start)
              result.append(" ");
            try {
              int b = raw[i] & 0xFF;
              if (b < 0x10)
                result.append("0");
              result.append(Integer.toHexString(b).toUpperCase());
            } catch (ArrayIndexOutOfBoundsException e) {
              result.append(" ");
            }
          }
        }
      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      I guess you could always force "ISO-8859-1" encoding

            Assignee: Unassigned
            Reporter: Nelson Dcosta (ndcosta, Inactive)