-
Bug
-
Resolution: Not an Issue
-
P4
-
None
-
1.3.0
-
generic, x86
-
generic, solaris_2.5.1
Name: boT120536 Date: 02/14/2001
java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)
Using the class defined below, I determined that when the Cp1252 character set
is used to create a String from an array of bytes, every instance of the bytes
0x81, 0x8d, 0x8f, 0x90, and 0x9d is replaced with the character 0x3f ('?'). In JDK
1.2 the byte values 0x8e and 0x9e were also handled this way. US-ASCII
exhibits this behavior for all byte values greater than 0x7f, since these are all
undefined. Although this problem was raised and then dismissed in Bug
4227538, I wish to bring it to your attention again. No guarantee is
made that an array of bytes can be converted to a String and back to the same
array of bytes using a given character encoding, but I would argue that in this case
there are three good reasons to preserve the original bytes.
1) For the developer, the symmetry of being able to convert from a byte array to
a Cp1252 String and back to the same byte array is both logical and convenient.
2) A Java String is a data type and therefore simply contains data to be
manipulated and should not make any assumptions about display. Display is a
function of the application manipulating the String data or even the terminal
to which the string is printed, not of the data type itself. Substituting ?
for undefined characters in the character set amounts to the data type
corrupting itself and determining that it should be displayed in a particular
way.
3) The ISO-8859-1 character encoding supported by all JDKs doesn't exhibit
this problem, since all 32 of its undefined byte values are simply passed through
unchanged and their display is left up to the application. I also noticed that the
US-ASCII implementation does the same ? replacement. Consistency among the various
character sets would be a good thing for Java developers.
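The contrast drawn in reason 3 can be checked with a small sketch. It assumes a current JDK (the `Charset` overloads of `String` used here did not exist in 1.3.0), and the class and method names are illustrative only, not part of the original report. It counts how many of the 256 byte values come back changed after a decode/encode round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of the original report.
class RoundTripCount {
    // Returns how many of the 256 possible byte values come back changed
    // after bytes -> String -> bytes in the given charset, or -1 if the
    // re-encoded array has a different length.
    static int changedBytes(Charset charset) {
        byte[] in = new byte[256];
        for (int i = 0; i < in.length; i++) {
            in[i] = (byte) i;
        }
        byte[] out = new String(in, charset).getBytes(charset);
        if (out.length != in.length) {
            return -1; // a per-byte comparison would be meaningless
        }
        int changed = 0;
        for (int i = 0; i < in.length; i++) {
            if (in[i] != out[i]) {
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII,
            Charset.forName("windows-1252") // the charset the report calls Cp1252
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + changedBytes(cs) + " byte value(s) changed");
        }
    }
}
```

ISO-8859-1 should report no changed values and US-ASCII should report 128, which matches the behavior described above. The Cp1252 count depends on the mapping table shipped with the JDK in use.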
If the US-ASCII or Cp1252 specs indicate that a ? should be substituted for
undefined bytes, then I stand corrected and agree with the current
implementation. If not, my reasons for disagreeing with the current
implementation are stated above.
public class Foo {
    public static void main(String[] argv) {
        try {
            String encoding = System.getProperty("file.encoding");
            System.out.println("Default System Encoding: " + encoding);
            if (argv.length > 0)
                encoding = argv[0];
            System.out.println("Using Character Encoding: " + encoding);
            byte[] bytes = new byte[256];
            for (int i = 0; i < bytes.length; i++)
                bytes[i] = (byte) i;
            String string = new String(bytes, encoding);
            byte[] sBytes = string.getBytes(encoding);
            if (bytes.length != sBytes.length)
                throw new Exception("Byte arrays differ in length (" +
                        bytes.length + " != " + sBytes.length + ")!!!");
            for (int i = 0; i < bytes.length; i++)
                if (bytes[i] != sBytes[i])
                    System.out.println("Byte " + Integer.toHexString(i) +
                            " is different (" +
                            Integer.toHexString(0xff & bytes[i]) +
                            " != " +
                            Integer.toHexString(0xff & sBytes[i]) +
                            ")!");
        } catch (Throwable t) {
            t.printStackTrace();
            System.exit(1);
        }
        System.exit(0);
    }
}
(Review ID: 116984)
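For readers who need to detect undefined bytes rather than have them silently replaced, one option on later JDKs is a strict decoder. The following is a minimal sketch under stated assumptions: it uses the `java.nio.charset` API, which was introduced after the 1.3.0 release this report was filed against, it is only meaningful for single-byte encodings, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Illustrative helper, not part of the original report.
class StrictDecode {
    // Returns true if the single byte value decodes without error in the
    // given charset, false if the decoder reports it as malformed or unmappable.
    static boolean isDefined(Charset charset, int byteValue) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(new byte[] {(byte) byteValue}));
            return true;
        } catch (CharacterCodingException e) {
            return false; // the decoder refused the byte instead of substituting
        }
    }

    public static void main(String[] args) {
        // "windows-1252" is the charset the report calls Cp1252.
        Charset cs = Charset.forName("windows-1252");
        for (int b = 0; b < 256; b++) {
            if (!isDefined(cs, b)) {
                System.out.println("Undefined in " + cs.name() + ": 0x" + Integer.toHexString(b));
            }
        }
    }
}
```

With `CodingErrorAction.REPORT` the caller decides what to do with an undefined byte, which avoids the silent substitution objected to above, at the cost of handling an exception.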
======================================================================
Name: yyT116575 Date: 05/16/2001
java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)
By default, characters in the range 0x80 to 0x9f (Cp1252, hexadecimal, inclusive) are shown as ? (a blank glyph). After commenting out all exclusion entries in font.properties, Swing should display all of these characters. It does on Windows NT 4.0, but on Windows 98 some characters are still missing, for instance the euro symbol and the trademark symbol. Windows Me has the same problem.
Typing the euro symbol into a Swing text editor displays nothing, although something does appear to be entered (my application reports that the number of characters increases). This happens on Windows 98 and Windows Me, but not on Windows NT 4.0.
(Review ID: 124121)
======================================================================
- relates to
-
JDK-4475338 Unsupported character encoding values (codes 129, 141, 143, 144, 157 in Cp1252)
-
- Closed
-