JDK-8303623

Compiler should disallow non-standard UTF-8 string encodings


    • Type: Bug
    • Resolution: Fixed
    • Priority: P4
    • 21
    • None
    • tools
    • None
    • jdk-21+12-20-gf3abc4063de

    • b16

      In Java classfiles, "Modified UTF-8" encoding is used to encode 16-bit Unicode characters.

      When reading UTF-8 strings from classfiles, the compiler does the minimum amount of work possible to decode each character. In particular, it does not validate that the characters are properly encoded:

      * It doesn't verify that the 2nd and 3rd bytes of a multi-byte sequence have 10 as their top two bits
      * It doesn't verify that \u0000 is encoded in two bytes (as is required for "Modified UTF-8")
      * It doesn't verify that the shortest possible encoding was used for each character.
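
The three missing checks above can be sketched as a small standalone validator. This is only an illustration of what strict checking might look like, not the compiler's actual code: the class name `ModifiedUtf8Check` and the method `isStrict` are hypothetical, and the sketch assumes the one-, two-, and three-byte forms that classfile strings use.

```java
// Hypothetical sketch: not taken from javac.
final class ModifiedUtf8Check {

    /** Returns true if the bytes are a strictly valid Modified UTF-8 sequence. */
    static boolean isStrict(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i] & 0xFF;
            if (b0 == 0) {
                return false;               // \u0000 must use the two-byte form
            }
            if (b0 < 0x80) {                // 0xxxxxxx: one-byte form
                i += 1;
                continue;
            }
            if ((b0 & 0xE0) == 0xC0) {      // 110xxxxx: two-byte form
                if (i + 1 >= bytes.length) {
                    return false;           // truncated sequence
                }
                int b1 = bytes[i + 1] & 0xFF;
                if ((b1 & 0xC0) != 0x80) {
                    return false;           // 2nd byte lacks the 10 prefix
                }
                int ch = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
                if (ch != 0 && ch < 0x80) {
                    return false;           // not the shortest encoding
                }
                i += 2;
                continue;
            }
            if ((b0 & 0xF0) == 0xE0) {      // 1110xxxx: three-byte form
                if (i + 2 >= bytes.length) {
                    return false;           // truncated sequence
                }
                int b1 = bytes[i + 1] & 0xFF;
                int b2 = bytes[i + 2] & 0xFF;
                if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
                    return false;           // 2nd or 3rd byte lacks the 10 prefix
                }
                int ch = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
                if (ch < 0x800) {
                    return false;           // not the shortest encoding
                }
                i += 3;
                continue;
            }
            return false;                   // stray continuation byte or 1111xxxx lead byte
        }
        return true;
    }
}
```

With this sketch, the bytes `C0 80` (the required encoding of \u0000) are accepted, while a raw `00` byte and the overlong pair `C1 81` (a two-byte spelling of "A") are rejected.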

      This lack of validation means the compiler will accept classfiles that the JVM would reject, which is somewhat bad.

      But a worse problem is that because it does not strictly validate the UTF-8 encoding, the compiler allows multiple encodings for the same character sequence. This is bad because the Names table, which is supposed to guarantee uniqueness, does that by hashing the UTF-8 data. So if the compiler reads a classfile that includes the same Name encoded in two different ways, it will add a duplicate Name to the table.
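
To make the duplicate-Name problem concrete, here is a minimal sketch under stated assumptions. `DuplicateNameDemo`, `lenientDecode`, and `distinctByBytes` are hypothetical names, the decoder only imitates the "minimum work" style described above, and a `HashSet` of `ByteBuffer` stands in for a table that is keyed on the raw UTF-8 bytes. None of this is the compiler's real Names implementation.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: not taken from javac.
final class DuplicateNameDemo {

    /**
     * Decodes without any validation: lead byte selects the length,
     * payload bits are masked out and combined. Assumes the input is
     * not truncated.
     */
    static String lenientDecode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i++] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);
            } else if (b0 < 0xE0) {
                int b1 = bytes[i++] & 0x3F;
                sb.append((char) (((b0 & 0x1F) << 6) | b1));
            } else {
                int b1 = bytes[i++] & 0x3F;
                int b2 = bytes[i++] & 0x3F;
                sb.append((char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2));
            }
        }
        return sb.toString();
    }

    /** Counts entries in a set keyed on the encoded bytes, not the decoded text. */
    static int distinctByBytes(byte[]... encodedNames) {
        Set<ByteBuffer> seen = new HashSet<>();
        for (byte[] name : encodedNames) {
            seen.add(ByteBuffer.wrap(name)); // equals/hashCode compare byte content
        }
        return seen.size();
    }
}
```

Feeding this sketch the byte `41` and the overlong pair `C1 81` gives the same decoded string "A" from `lenientDecode`, yet `distinctByBytes` counts two entries, which is the duplicate the report describes.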

            acobbs Archie Cobbs