JDK-5030676

SA utilities: Inappropriate access to strings in modified UTF-8

    • b51
    • generic
    • generic



      Name: nl37777 Date: 04/12/2004

      Several parts of the serviceability agent software seem to
      use inappropriate ways of accessing strings that are encoded in
      modified UTF-8. For more information on modified UTF-8, see
      http://ccc.sfbay/4915107
      http://webwork.sfbay/developer/technicalArticles/Intl/Supplementary/index.html

      Problems I've noticed:

      agent/make/ClosureFinder.java reads strings from the constant pool
      using the following code:
             int len = (int) dis.readShort();
             byte[] buf = new byte[len];
             dis.read(buf);
             strings[cpIndex] = new String(buf, "UTF-8");
      This seems incorrect, because the class file uses modified UTF-8, while
      the UTF-8 converter used by the String class uses standard UTF-8. A
      better API to use is DataInputStream.readUTF, which uses modified UTF-8
      just like the JVM. The resulting code would also be shorter:
             strings[cpIndex] = dis.readUTF();
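
      A minimal sketch of the difference, not taken from the SA sources: the
      class name ModifiedUtf8ReadDemo and the sample bytes are made up for
      illustration. The entry holds "A", then U+0000 in its modified UTF-8
      form (C0 80), then "B". DataInputStream.readUTF decodes it as the JVM
      would, while the standard UTF-8 decoder in current JDKs should treat
      C0 80 as malformed input.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the serviceability agent.
class ModifiedUtf8ReadDemo {
    // Constant-pool style payload: u2 length, then modified UTF-8 bytes.
    static final byte[] ENTRY = {
        0x00, 0x04,                            // length = 4 bytes
        0x41, (byte) 0xC0, (byte) 0x80, 0x42   // 'A', U+0000 as C0 80, 'B'
    };

    // Decodes the entry with readUTF, which understands modified UTF-8.
    static String viaReadUTF() {
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(ENTRY))) {
            return dis.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the same entry by hand with the standard UTF-8 charset.
    static String viaStandardUtf8() {
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(ENTRY))) {
            int len = dis.readUnsignedShort();
            byte[] buf = new byte[len];
            dis.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "A\u0000B";
        System.out.println("readUTF matches:        " + viaReadUTF().equals(expected));
        System.out.println("standard UTF-8 matches: " + viaStandardUtf8().equals(expected));
    }
}
```

      The hand-written variant uses readUnsignedShort and readFully so that
      the only difference between the two methods is the decoder.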

      agent/src/share/classes/sun/jvm/hotspot/oops/Symbol.java reads strings
      from the constant pool using the following code:
             return new String(asByteArray(), "UTF-8");
      I can't quite see from the code what the underlying string
      representation is. If it is modified UTF-8, then this code has the same
      problem as above.

      agent/src/share/classes/sun/jvm/hotspot/oops/Symbol.java also has a
      startsWith method, which works by comparing the bytes of the UTF-8
      array against the chars of the String one by one. Since String is based
      on UTF-16, such a comparison is wrong if either of the two strings
      contains a non-ASCII character. Also, the length comparison is invalid
      in the presence of non-ASCII characters, since the lengths of the UTF-8
      and UTF-16 representations of the same string are not the same. For a
      valid comparison, the UTF-8 bytes must be converted to a String.
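
      The following sketch illustrates the problem under stated assumptions.
      naiveStartsWith is a reconstruction from the description above, not a
      copy of the Symbol.java code, and the class and helper names are
      hypothetical. The decoding helper assumes the bytes are modified UTF-8
      no longer than 65535 bytes, so that DataInputStream.readUTF can be
      reused by prepending a two-byte length.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the serviceability agent.
class SymbolPrefixDemo {
    // Modified UTF-8 bytes of "\u00e9t\u00e9" (each U+00E9 is C3 A9).
    static final byte[] SYMBOL = {
        (byte) 0xC3, (byte) 0xA9, 0x74, (byte) 0xC3, (byte) 0xA9
    };

    // Byte-versus-char comparison, reconstructed from the report's description.
    static boolean naiveStartsWith(byte[] symbol, String prefix) {
        if (prefix.length() > symbol.length) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            if (symbol[i] != prefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Decodes modified UTF-8 by framing the bytes the way readUTF expects.
    static String decodeModifiedUtf8(byte[] bytes) {
        if (bytes.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for readUTF framing");
        }
        byte[] framed = new byte[bytes.length + 2];
        framed[0] = (byte) (bytes.length >>> 8);
        framed[1] = (byte) bytes.length;
        System.arraycopy(bytes, 0, framed, 2, bytes.length);
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(framed))) {
            return dis.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Converts to a String first, then compares UTF-16 against UTF-16.
    static boolean decodedStartsWith(byte[] symbol, String prefix) {
        return decodeModifiedUtf8(symbol).startsWith(prefix);
    }

    public static void main(String[] args) {
        String prefix = "\u00e9";
        System.out.println("naive:   " + naiveStartsWith(SYMBOL, prefix));
        System.out.println("decoded: " + decodedStartsWith(SYMBOL, prefix));
    }
}
```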

      agent/src/share/classes/sun/jvm/hotspot/runtime/PerfDataEntry.java
      encodes a byte array into a String using the following code:
             str = new String(byteArrayValue(), "UTF-8");
      I can't tell from the immediate context what the content of this byte
      array is and why it needs to be converted to a String. However, if the
      intent is that the byte array can contain arbitrary data, and it needs
      to be converted to a String to satisfy the needs of a protocol that
      doesn't otherwise support byte arrays, then this is not going to work.
      UTF-8 is a character encoding that carefully restricts permissible byte
      sequences - you can tell from the start byte of a sequence how many
      bytes must follow, and the continuation bytes must have "10" in the
      topmost bits. Any byte sequences that don't meet these requirements
      will result in conversion errors and loss of data. If the intent is as
      described above, a less demanding encoding should be used, such as
      ISO-8859-1.
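
      As a small illustration of this point (a sketch with made-up sample
      bytes and a hypothetical class name, not a statement about what
      PerfDataEntry actually stores): arbitrary bytes survive a round trip
      through ISO-8859-1, but not through UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the serviceability agent.
class ByteRoundTripDemo {
    // Arbitrary data: includes bytes that are not valid UTF-8 sequences.
    static final byte[] DATA = {
        (byte) 0x80, (byte) 0xFF, 0x00, 0x41, (byte) 0xC3
    };

    // Returns true if bytes -> String -> bytes reproduces the input exactly.
    static boolean roundTrips(byte[] data, Charset charset) {
        String asString = new String(data, charset);
        return Arrays.equals(asString.getBytes(charset), data);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 round trip: "
                + roundTrips(DATA, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 round trip:      "
                + roundTrips(DATA, StandardCharsets.UTF_8));
    }
}
```

      ISO-8859-1 maps each of the 256 byte values to exactly one char, which
      is why it can carry arbitrary bytes through a String.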

      agent/src/share/classes/sun/jvm/hotspot/memory/SymbolTable.java
      converts a String to bytes in standard UTF-8 for a symbol table lookup:
             return probe(name.getBytes("UTF-8"));
      If the symbol table used here follows the JVM standard, then strings
      will be encoded in modified UTF-8. As a consequence, this lookup may
      not find symbols that are actually present.
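
      One possible way to build a modified UTF-8 lookup key, sketched under
      the assumption that the name encodes to at most 65535 bytes:
      DataOutputStream.writeUTF produces a two-byte length followed by the
      modified UTF-8 bytes, so the length prefix can be dropped. The class
      and method names are hypothetical and say nothing about how
      SymbolTable.java should be structured.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the serviceability agent.
class ModifiedUtf8KeyDemo {
    // Encodes a String as modified UTF-8 without the writeUTF length prefix.
    static byte[] toModifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s); // throws UTFDataFormatException if too long
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = bos.toByteArray();
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    // True when both encodings of s yield identical bytes.
    static boolean sameAsStandardUtf8(String s) {
        return Arrays.equals(toModifiedUtf8(s), s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Pure ASCII names encode identically in both forms.
        System.out.println("ASCII name same:    " + sameAsStandardUtf8("java/lang/Object"));
        // U+10400 is a supplementary character (surrogate pair in UTF-16).
        System.out.println("supplementary same: " + sameAsStandardUtf8("\uD801\uDC00"));
        // U+0000 is encoded as two bytes in modified UTF-8.
        System.out.println("NUL same:           " + sameAsStandardUtf8("\u0000"));
    }
}
```

      Names that differ between the two encodings are exactly the ones a
      standard UTF-8 key could fail to match.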

      agent/src/share/classes/sun/jvm/hotspot/tools/jcore/ClassWriter.java
      has several comments that mention "UTF-8". I believe in all cases
      "modified UTF-8" is meant.

      ======================================================================

            sundar Sundararajan Athijegannathan
            nlindenbsunw Norbert Lindenberg (Inactive)