JDK-4415511

Cp1252 and US-ASCII byte[] to String encoding produces ?

    • Type: Bug
    • Resolution: Not an Issue
    • Priority: P4
    • Fix Version/s: None
    • Affects Version/s: 1.3.0
    • Component/s: core-libs



      Name: boT120536 Date: 02/14/2001


      java version "1.3.0"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
      Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)

      Using the class defined below, I determined that when the Cp1252 character set
      is used to create a String from an array of bytes, any instances of the bytes
      0x81, 0x8d, 0x8f, 0x90, and 0x9d are replaced with the character 0x3f ('?'). In JDK
      1.2 the byte values 0x8e and 0x9e were also handled this way. US-ASCII
      exhibits this behavior for all byte values greater than 0x7f, since these are all
      undefined. Although this problem was raised and then dismissed in Bug
      4227538, I wish to bring it to your attention again. Although no guarantee is
      made that an array of bytes can be converted to a String and back to the same
      array of bytes using a given character encoding, I would argue that in this case
      there are three good reasons to provide one.

      1) For the developer, the symmetry of being able to convert from a byte array to
      a Cp1252 String and back to the same byte array is both logical and convenient.
      2) A Java String is a data type and therefore simply contains data to be
      manipulated; it should not make any assumptions about display. Display is a
      function of the application manipulating the String data, or even of the
      terminal to which the string is printed, not of the data type itself.
      Substituting '?' for characters undefined in the character set amounts to the
      data type corrupting itself and deciding how it should be displayed.
      3) The ISO-8859-1 character encoding, supported by all JDKs, doesn't exhibit
      this problem: the 32 otherwise-undefined byte values are simply passed through
      and their display is left up to the application. I also noticed that the
      US-ASCII implementation performs the same '?' replacement. Consistency among
      the various character sets would be a good thing for Java developers.
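      Reason 3 can be checked directly: ISO-8859-1 assigns a character to every one
      of the 256 byte values, so a byte array survives the round trip unchanged. A
      minimal sketch, using the StandardCharsets API (which postdates the JDK 1.3
      release this report concerns):

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.Arrays;

      public class Latin1RoundTrip {
          public static void main(String[] args) {
              // Every possible byte value, 0x00 through 0xff.
              byte[] bytes = new byte[256];
              for (int i = 0; i < bytes.length; i++)
                  bytes[i] = (byte) i;

              // bytes -> String -> bytes, both directions via ISO-8859-1.
              String s = new String(bytes, StandardCharsets.ISO_8859_1);
              byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);

              System.out.println(Arrays.equals(bytes, back)); // prints "true"
          }
      }
      ```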

      If the US-ASCII or Cp1252 specifications indicate that a '?' should be
      substituted for undefined bytes, then I stand corrected and agree with the
      current implementation; if not, my reasons for disagreeing with the current
      implementation are stated above.
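      For what it's worth, later JDK releases (1.4 and up, via java.nio.charset)
      made this behavior configurable per decoder, so an application can choose
      between strict failure and replacement rather than inheriting a fixed '?'
      substitution. A hedged sketch; the CharsetDecoder API shown did not exist in
      JDK 1.3:

      ```java
      import java.nio.ByteBuffer;
      import java.nio.charset.CharacterCodingException;
      import java.nio.charset.CharsetDecoder;
      import java.nio.charset.CodingErrorAction;
      import java.nio.charset.StandardCharsets;

      public class DecodeModes {
          public static void main(String[] args) throws CharacterCodingException {
              // 'A' followed by a byte that is undefined in US-ASCII.
              byte[] input = { 0x41, (byte) 0x80 };

              // Strict mode: undefined input raises CharacterCodingException.
              CharsetDecoder strict = StandardCharsets.US_ASCII.newDecoder()
                      .onMalformedInput(CodingErrorAction.REPORT)
                      .onUnmappableCharacter(CodingErrorAction.REPORT);
              try {
                  strict.decode(ByteBuffer.wrap(input));
                  System.out.println("strict: decoded without error");
              } catch (CharacterCodingException e) {
                  System.out.println("strict: " + e);
              }

              // Lenient mode: undefined input is replaced, here with '?'.
              CharsetDecoder lenient = StandardCharsets.US_ASCII.newDecoder()
                      .onMalformedInput(CodingErrorAction.REPLACE)
                      .onUnmappableCharacter(CodingErrorAction.REPLACE)
                      .replaceWith("?");
              System.out.println("lenient: " + lenient.decode(ByteBuffer.wrap(input)));
          }
      }
      ```

      The strict decoder fails on the 0x80 byte, while the lenient one yields "A?";
      the choice belongs to the application rather than to the String class.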



      public class Foo {
          public static void main(String[] argv) {
              try {
                  String encoding = System.getProperty("file.encoding");
                  System.out.println("Default System Encoding: " + encoding);

                  if (argv.length > 0)
                      encoding = argv[0];
                  System.out.println("Using Character Encoding: " + encoding);

                  // Build an array containing every possible byte value.
                  byte[] bytes = new byte[256];
                  for (int i = 0; i < bytes.length; i++)
                      bytes[i] = (byte) i;

                  // Round-trip: bytes -> String -> bytes with the chosen encoding.
                  String string = new String(bytes, encoding);
                  byte[] sBytes = string.getBytes(encoding);

                  if (bytes.length != sBytes.length)
                      throw new Exception("Byte arrays differ in length (" +
                                          bytes.length + " != " + sBytes.length + ")!!!");

                  // Report every byte value that did not survive the round trip.
                  for (int i = 0; i < bytes.length; i++)
                      if (bytes[i] != sBytes[i])
                          System.out.println( "Byte " + Integer.toHexString(i) +
                                              " is different (" +
                                              Integer.toHexString(0xff & bytes[i]) +
                                              " != " +
                                              Integer.toHexString(0xff & sBytes[i]) +
                                              ")!" );
              } catch (Throwable t) {
                  t.printStackTrace();
                  System.exit(1);
              }
              System.exit(0);
          }
      }
      (Review ID: 116984)
      ======================================================================

      Name: yyT116575 Date: 05/16/2001


      java version "1.3.0"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
      Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)

      By default, characters in the range 0x80 - 0x9f (Cp1252, inclusive) are shown as '?' (a blank box). After commenting out all of the exclusion entries in font.properties, Swing should display all characters. Windows NT 4.0 does, but Windows 98 does not display them all: for instance, the euro sign and the trademark symbol are missing. Windows ME has the same problem.

      Typing the euro sign into a Swing text editor displays nothing, yet something does seem to be entered (my application reports that the character count increases). This happens on Windows 98 and Windows ME, but not on Windows NT 4.0.
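      One way to narrow down whether the missing euro glyph is a font limitation rather than an input or encoding problem is java.awt.Font.canDisplay (available since 1.2). A small diagnostic sketch; "Dialog" is just an example logical font name, not the one this report necessarily used:

      ```java
      import java.awt.Font;

      public class GlyphCheck {
          public static void main(String[] args) {
              Font font = new Font("Dialog", Font.PLAIN, 12);
              // U+20AC is the euro sign, U+2122 the trade mark sign.
              System.out.println("euro:  " + font.canDisplay('\u20ac'));
              System.out.println("trade: " + font.canDisplay('\u2122'));
          }
      }
      ```

      If canDisplay returns true but nothing is drawn, the problem lies elsewhere (e.g. in the font.properties mapping); if it returns false, the underlying platform font simply lacks the glyph.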
      (Review ID: 124121)
      ======================================================================

            Assignee: ilittlesunw Ian Little (Inactive)
            Reporter: bonealsunw Bret O'neal (Inactive)