JDK-8304447

Compiler should disallow non-standard UTF-8 string encodings


    • CSR
    • Resolution: Approved
    • P4
    • 21
    • tools
    • None
    • binary, behavioral
    • minimal
    • The compiler will warn about (release 21) or reject (releases > 21) classfiles containing illegal "alternate" UTF-8 strings that it previously would not have rejected. However, as far as I know, no version of the JDK compiler has ever actually written out such classfiles, so such classfiles should be relatively rare.
    • Class file construct, Other
    • Implementation

      Summary

      Update the Java compiler to reject invalid classfiles, specifically those that contain invalid "Modified UTF-8" strings as defined in the JVMS.

      Problem

      In Java classfiles, the "Modified UTF-8" encoding is used to encode 16-bit Unicode characters. Modified UTF-8 is like standard UTF-8 except that: (a) only one-, two-, and three-byte sequences are used, and (b) the NUL character \u0000 is encoded in two bytes instead of one.
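For illustration, `java.io.DataOutputStream.writeUTF()` produces exactly this encoding (a two-byte big-endian length prefix followed by Modified UTF-8 bytes, the same layout as a constant-pool UTF-8 entry); note how the NUL character comes out as the two-byte sequence C0 80 rather than a single zero byte:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Demo {

    /** Encode a string as a length-prefixed Modified UTF-8 payload. */
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // A single NUL character: length prefix 00 02, then C0 80 --
        // never a bare 00 byte.
        for (byte b : encode("\u0000"))
            System.out.printf("%02X ", b);   // prints: 00 02 C0 80
        System.out.println();
    }
}
```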

      When reading UTF-8 strings from classfiles, the compiler currently does the minimum amount of work possible to decode each character: it determines whether the byte sequence is one, two, or three bytes long (or none of the above) but then does no further validation. This results in a decoding that accepts byte sequences that are technically invalid (i.e., violate the classfile specification).

      Specifically, the compiler is too lenient in the following three ways:

      • It doesn't verify that the second and third bytes have 10 as their top two bits (these bits are simply masked off and discarded)
      • It doesn't verify that \u0000 is encoded in two bytes (as is required for Modified UTF-8) instead of one
      • It doesn't verify that the shortest possible encoding is used for each character.

      This lax validation means the compiler will accept classfiles that the JVM would reject, which is somewhat bad.

      But a worse problem is that because it does not strictly validate the UTF-8 encoding, the compiler allows multiple encodings for the same character sequence. This is bad because the Names table, which is supposed to guarantee uniqueness, does that by hashing the UTF-8 data. So if the compiler reads a classfile that includes the same Name encoded in two different ways, it will add a duplicate Name to the table. This could cause confusion, or worse, a potential security issue.
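To make the risk concrete, here is a sketch using `DataInputStream.readUTF()`, whose decoder happens to be similarly lenient about overlong two-byte forms: two different byte sequences decode to the same string, so a uniqueness table keyed on the raw encoded bytes would end up holding two distinct entries for what should be one name. (This uses `readUTF()` purely for illustration; it is not the compiler's own decoder.)

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class DuplicateNameDemo {

    /** Decode a length-prefixed (Modified) UTF-8 payload. */
    static String decode(byte[] payload) {
        try {
            return new DataInputStream(new ByteArrayInputStream(payload)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "A" in its canonical one-byte encoding (length prefix 00 01, then 0x41)...
        byte[] canonical = {0x00, 0x01, 0x41};
        // ...and the same character in an illegal overlong two-byte encoding.
        byte[] overlong = {0x00, 0x02, (byte) 0xC1, (byte) 0x81};

        // Both decode to the same string, but the raw bytes differ, so a
        // uniqueness table that hashes the encoded bytes sees two "names".
        System.out.println(decode(canonical).equals(decode(overlong)));   // true
        System.out.println(Arrays.equals(canonical, overlong));           // false
    }
}
```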

      Solution

      Tighten the compiler's validation of Modified UTF-8 so that the compiler rejects any classfiles containing UTF-8 encoded strings that are not strictly compliant with the JVMS corresponding to their major/minor version number.
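A strict decoder closing the three gaps listed above might look like the following sketch (a hypothetical helper, not javac's actual implementation): it checks continuation-byte prefixes, rejects a one-byte NUL, and rejects overlong encodings.

```java
/**
 * Hypothetical sketch of a strict Modified UTF-8 validator.
 * Returns true iff buf is a legal JVMS Modified UTF-8 byte sequence.
 */
public class StrictMutf8 {
    public static boolean isValid(byte[] buf) {
        int i = 0;
        while (i < buf.length) {
            int b1 = buf[i++] & 0xff;
            if (b1 < 0x80) {
                if (b1 == 0x00)                 // gap 2: NUL must use two bytes
                    return false;
            } else if ((b1 & 0xe0) == 0xc0) {   // two-byte sequence
                if (i >= buf.length)
                    return false;
                int b2 = buf[i++] & 0xff;
                if ((b2 & 0xc0) != 0x80)        // gap 1: check continuation bits
                    return false;
                int ch = ((b1 & 0x1f) << 6) | (b2 & 0x3f);
                if (ch != 0 && ch < 0x80)       // gap 3: overlong (only NUL may be two bytes)
                    return false;
            } else if ((b1 & 0xf0) == 0xe0) {   // three-byte sequence
                if (i + 1 >= buf.length)
                    return false;
                int b2 = buf[i++] & 0xff;
                int b3 = buf[i++] & 0xff;
                if ((b2 & 0xc0) != 0x80 || (b3 & 0xc0) != 0x80)
                    return false;               // gap 1: check continuation bits
                int ch = ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f);
                if (ch < 0x800)                 // gap 3: overlong three-byte form
                    return false;
            } else {
                return false;                   // stray continuation byte or 4+ byte form
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[]{(byte) 0xC0, (byte) 0x80}));  // true: two-byte NUL
        System.out.println(isValid(new byte[]{(byte) 0xC1, (byte) 0x81}));  // false: overlong 'A'
    }
}
```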

      Since there may be some classfiles out in the wild taking advantage of the compiler's current lax approach, the compiler will only generate a warning in release 21; in releases 22 and later, the compiler will generate an error.

      Specification

      The specification is JVMS §4.4.7 "The CONSTANT_Utf8_info Structure".

      Note there is a historical anomaly we need to take care of: classfiles with major version < 48 (i.e., pre-Java 1.4) are allowed to use longer-than-necessary encodings. So the compiler must also accept these encodings when encountering these older classfiles.
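That exception can be expressed by gating the overlong-encoding check on the classfile's major version; a hypothetical sketch (the names and the shape of the check are illustrative, not javac's actual code):

```java
/**
 * Hypothetical sketch: decide whether an overlong (longer-than-necessary)
 * Modified UTF-8 encoding should be rejected, based on classfile version.
 */
public class VersionGatedCheck {
    // Java 1.4 classfiles carry major version 48; older classfiles
    // were historically allowed to use longer-than-necessary encodings.
    static final int JAVA_1_4_MAJOR = 48;

    static boolean rejectOverlong(int majorVersion) {
        return majorVersion >= JAVA_1_4_MAJOR;
    }

    public static void main(String[] args) {
        System.out.println(rejectOverlong(45));   // false: legacy classfile, accept
        System.out.println(rejectOverlong(65));   // true: Java 21 classfile, reject
    }
}
```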

            acobbs Archie Cobbs
            Vicente Arturo Romero Zaldivar