Name: nl37777 Date: 04/12/2004
Several parts of the serviceability agent software seem to
use inappropriate ways of accessing strings that are encoded in
modified UTF-8. For more information on modified UTF-8, see
http://ccc.sfbay/4915107
http://webwork.sfbay/developer/technicalArticles/Intl/Supplementary/index.html
Problems I've noticed:
agent/make/ClosureFinder.java reads strings from the constant pool
using the following code:
int len = (int) dis.readShort();
byte[] buf = new byte[len];
dis.read(buf);
strings[cpIndex] = new String(buf, "UTF-8");
This seems incorrect, because the class file uses modified UTF-8, while
the UTF-8 converter used by the String class uses standard UTF-8. The
two differ in how they encode U+0000 and supplementary characters. A
better API to use is DataInputStream.readUTF, which uses modified UTF-8
just like the JVM. The resulting code would also be shorter:
strings[cpIndex] = dis.readUTF();
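To illustrate the difference, here is a minimal sketch that round-trips one
string through both decoders. The class and method names are hypothetical and
not part of the serviceability agent. The standard UTF-8 path uses
readUnsignedShort and readFully rather than the quoted calls, so that the only
difference being shown is the decoder itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the agent sources.
final class ModifiedUtf8Demo {
    // Writes s as a u2 length followed by modified UTF-8 bytes,
    // the same layout as a string entry in the constant pool.
    static byte[] writeEntry(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Decodes the entry with DataInputStream.readUTF (modified UTF-8).
    static String readWithReadUTF(byte[] entry) {
        try {
            return new DataInputStream(new ByteArrayInputStream(entry)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the same entry with the standard UTF-8 converter.
    static String readWithStandardUtf8(byte[] entry) {
        try {
            DataInputStream dis = new DataInputStream(new ByteArrayInputStream(entry));
            int len = dis.readUnsignedShort();
            byte[] buf = new byte[len];
            dis.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Contains U+0000 and the supplementary character U+1D50A,
        // the two cases where modified and standard UTF-8 disagree.
        String original = "a\u0000b\uD835\uDD0A";
        byte[] entry = writeEntry(original);
        System.out.println(original.equals(readWithReadUTF(entry)));
        System.out.println(original.equals(readWithStandardUtf8(entry)));
    }
}
```

Strings that contain neither U+0000 nor supplementary characters decode
identically either way, which is probably why the problem has gone unnoticed.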
agent/src/share/classes/sun/jvm/hotspot/oops/Symbol.java reads strings
from the constant pool using the following code:
return new String(asByteArray(), "UTF-8");
I can't quite see from the code what the underlying string
representation is. If it is modified UTF-8, then this code has the same
problem as above.
agent/src/share/classes/sun/jvm/hotspot/oops/Symbol.java also has a
startsWith method, which works by comparing the bytes of the UTF-8
array against the chars of the String one by one. Since String is based
on UTF-16, such a comparison is wrong if either of the two strings
contains a non-ASCII character. The length comparison is also invalid
in the presence of non-ASCII characters, because the UTF-8 and UTF-16
representations of the same string generally have different lengths.
For a valid comparison, the UTF-8 bytes must first be converted to a
String.
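A rough sketch of the difference follows. The naive method is my own
reconstruction of a byte-against-char comparison, not the actual Symbol.java
code, and all names here are hypothetical. The decoding helper assumes the
bytes are modified UTF-8 and at most 65535 bytes long, so that it can reuse
DataInputStream.readUTF by prepending the length.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not taken from the agent sources.
final class SymbolPrefixDemo {
    // Decodes modified UTF-8 bytes by framing them with the u2 length
    // prefix that DataInputStream.readUTF expects.
    static String decodeModifiedUtf8(byte[] bytes) {
        if (bytes.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for readUTF framing");
        }
        byte[] framed = new byte[bytes.length + 2];
        framed[0] = (byte) (bytes.length >>> 8);
        framed[1] = (byte) bytes.length;
        System.arraycopy(bytes, 0, framed, 2, bytes.length);
        try {
            return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reconstruction of the problematic approach: compares encoded bytes
    // against UTF-16 chars one by one, and compares their lengths directly.
    static boolean naiveStartsWith(byte[] symbolBytes, String prefix) {
        if (prefix.length() > symbolBytes.length) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            if ((char) symbolBytes[i] != prefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Converts the bytes to a String first, then compares in UTF-16.
    static boolean startsWith(byte[] symbolBytes, String prefix) {
        return decodeModifiedUtf8(symbolBytes).startsWith(prefix);
    }
}
```

For example, with the bytes C3 A9 31 (U+00E9 followed by "1") and the prefix
"\u00E9", the naive version compares the byte C3 against the char U+00E9 and
reports no match, while the decoding version matches.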
agent/src/share/classes/sun/jvm/hotspot/runtime/PerfDataEntry.java
encodes a byte array into a String using the following code:
str = new String(byteArrayValue(), "UTF-8");
I can't tell from the immediate context what the content of this byte
array is and why it needs to be converted to a String. However, if the
intent is that the byte array can contain arbitrary data, and it needs
to be converted to a String to satisfy the needs of a protocol that
doesn't otherwise support byte arrays, then this is not going to work.
UTF-8 is a character encoding that carefully restricts permissible byte
sequences - you can tell from the start byte of a sequence how many
bytes must follow, and the continuation bytes must have "10" in the
topmost bits. Any byte sequences that don't meet these requirements
will result in conversion errors and loss of data. If the intent is as
described above, a less demanding encoding should be used, such as
ISO-8859-1.
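The following sketch shows the data loss, assuming the byte array really can
hold arbitrary values. The class name and the sample bytes are made up for
illustration. It uses the Charset overloads of the String API, which are
documented to replace malformed input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not taken from the agent sources.
final class ByteRoundTripDemo {
    // Converts bytes to a String and back using the same charset.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // 0x80 is a continuation byte with no start byte, 0xC3 is a start
        // byte with no valid continuation, and 0xFF never occurs in UTF-8.
        byte[] data = { 0x00, 0x7F, (byte) 0x80, (byte) 0xC3, (byte) 0xFF };
        System.out.println(Arrays.equals(data, roundTrip(data, StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.equals(data, roundTrip(data, StandardCharsets.UTF_8)));
    }
}
```

ISO-8859-1 maps each of the 256 byte values to exactly one character, so the
round trip preserves the array, whereas the UTF-8 round trip does not.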
agent/src/share/classes/sun/jvm/hotspot/memory/SymbolTable.java
converts a String to bytes in standard UTF-8 for a symbol table lookup:
return probe(name.getBytes("UTF-8"));
If the symbol table used here follows the JVM standard, then strings
will be encoded in modified UTF-8. As a consequence, this lookup may
not find symbols that are actually present.
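One possible way to produce modified UTF-8 bytes for such a lookup is to let
DataOutputStream.writeUTF do the encoding and drop its two-byte length prefix.
This is only a sketch: the class and method names are hypothetical, and
writeUTF rejects strings whose encoded form exceeds 65535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper class, not taken from the agent sources.
final class ModifiedUtf8Encoder {
    // Encodes s in modified UTF-8 without the u2 length prefix.
    static byte[] toModifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = bos.toByteArray();
        // Skip the two length bytes that writeUTF emits first.
        return Arrays.copyOfRange(framed, 2, framed.length);
    }
}
```

With such a helper, the lookup could become something like
probe(ModifiedUtf8Encoder.toModifiedUtf8(name)), assuming probe expects the
raw symbol bytes.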
agent/src/share/classes/sun/jvm/hotspot/tools/jcore/ClassWriter.java
has several comments that mention "UTF-8". I believe in all cases
"modified UTF-8" is meant.
======================================================================