Summary
Update the Java compiler to reject invalid classfiles, specifically those that contain invalid "Modified UTF-8" strings as defined in the JVMS.
Problem
In Java classfiles, "Modified UTF-8" encoding is used to encode strings of 16-bit Unicode characters. Modified UTF-8 is like standard UTF-8 except that: (a) only one-, two-, and three-byte sequences are used, and (b) the NUL character \u0000 is encoded in two bytes instead of one.
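As a quick illustration of these two differences, the sketch below uses `java.io.DataOutputStream.writeUTF`, which is documented to emit Modified UTF-8 preceded by a two-byte length. The class and helper names (`ModifiedUtf8Demo`, `encode`, `hex`) are made up for this example and are not part of the compiler.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    /** Encodes s as Modified UTF-8, dropping the 2-byte length prefix written by writeUTF. */
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("A")));       // 41        (one byte)
        System.out.println(hex(encode("\0")));      // C0 80     (NUL takes two bytes)
        System.out.println(hex(encode("\u00E9")));  // C3 A9     (two bytes)
        System.out.println(hex(encode("\u20AC")));  // E2 82 AC  (three bytes)
    }
}
```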
When reading UTF-8 strings from classfiles, the compiler currently does the minimum amount of work possible to decode each character: it determines whether the byte sequence is one, two, or three bytes long (or none of the above) but then does no further validation. This results in a decoding that accepts byte sequences that are technically invalid (i.e., violate the classfile specification).
Specifically, the compiler is too lenient in the following three ways:
- It doesn't verify that the 2nd and 3rd bytes have binary 10 as their top two bits (these bits are simply masked off and discarded).
- It doesn't verify that \u0000 is encoded in two bytes (as is required for "Modified UTF-8") instead of one.
- It doesn't verify that the shortest possible encoding is used for each character.
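The three missing checks can be sketched as a standalone validator. This is a minimal illustration under stated assumptions, not the compiler's actual code: the class name `StrictModifiedUtf8` and method `isValid` are hypothetical, and the sketch ignores the relaxed rules for older classfiles.

```java
public class StrictModifiedUtf8 {
    /** Returns true if b is strictly valid Modified UTF-8 (illustrative sketch only). */
    static boolean isValid(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) {
                // Check 2: a single zero byte is illegal; NUL must be C0 80.
                if (b0 == 0) return false;
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {
                if (i + 1 >= b.length) return false;            // truncated sequence
                int b1 = b[i + 1] & 0xFF;
                // Check 1: continuation byte must look like 10xxxxxx.
                if ((b1 & 0xC0) != 0x80) return false;
                int ch = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
                // Check 3: shortest form required, except for the two-byte NUL.
                if (ch < 0x80 && ch != 0) return false;
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {
                if (i + 2 >= b.length) return false;            // truncated sequence
                int b1 = b[i + 1] & 0xFF;
                int b2 = b[i + 2] & 0xFF;
                // Check 1 again, for both continuation bytes.
                if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return false;
                int ch = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
                // Check 3: values below 0x800 fit in one or two bytes.
                if (ch < 0x800) return false;
                i += 3;
            } else {
                // Lead bytes 10xxxxxx and 1111xxxx never start a valid sequence.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[] {0x41}));                      // true
        System.out.println(isValid(new byte[] {0x00}));                      // false
        System.out.println(isValid(new byte[] {(byte) 0xC0, (byte) 0x80}));  // true
        System.out.println(isValid(new byte[] {(byte) 0xC1, (byte) 0x81}));  // false
        System.out.println(isValid(new byte[] {(byte) 0xC3, 0x29}));         // false
    }
}
```

Surrogate values in three-byte form are deliberately not rejected here, since Modified UTF-8 represents supplementary characters as surrogate pairs.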
Because of this leniency, the compiler will accept classfiles that the JVM would reject, which is undesirable.
But a worse problem is that because it does not strictly validate the UTF-8 encoding, the compiler allows multiple encodings for the same character sequence. This is bad because the Names table, which is supposed to guarantee uniqueness, does that by hashing the UTF-8 data. So if the compiler reads a classfile that includes the same Name encoded in two different ways, it will add a duplicate Name to the table. This could cause confusion, or worse, a potential security issue.
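To make the duplicate-name hazard concrete, the following sketch decodes two different byte sequences with a lenient decoder of the kind described above and then stores the raw bytes in a set keyed on their content. Everything here is illustrative: `DuplicateNameDemo` and `lenientDecode` are hypothetical names, and the `HashSet` of `ByteBuffer` keys only stands in for a byte-hashed table; it is not the compiler's Names table.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DuplicateNameDemo {
    /**
     * Lenient decoding: classify each sequence by its lead byte, mask off the
     * prefix bits, and perform no further validation. Assumes the input is
     * not truncated.
     */
    static String lenientDecode(byte[] b) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < b.length) {
            int b0 = b[i++] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);
            } else if ((b0 & 0xE0) == 0xC0) {
                int b1 = b[i++] & 0x3F;
                sb.append((char) (((b0 & 0x1F) << 6) | b1));
            } else if ((b0 & 0xF0) == 0xE0) {
                int b1 = b[i++] & 0x3F;
                int b2 = b[i++] & 0x3F;
                sb.append((char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2));
            } else {
                throw new IllegalArgumentException("not a 1-, 2-, or 3-byte lead byte");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] canonical = {0x41};                        // 'A' in its shortest form
        byte[] overlong  = {(byte) 0xC1, (byte) 0x81};    // 'A' as an invalid two-byte form

        System.out.println(lenientDecode(canonical));            // A
        System.out.println(lenientDecode(overlong));             // A
        System.out.println(Arrays.equals(canonical, overlong));  // false

        // A table keyed on the encoded bytes sees two distinct entries.
        Set<ByteBuffer> table = new HashSet<>();
        table.add(ByteBuffer.wrap(canonical));
        table.add(ByteBuffer.wrap(overlong));
        System.out.println(table.size());                        // 2
    }
}
```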
Solution
Tighten the compiler's validation of Modified UTF-8 so that the compiler rejects any classfiles containing UTF-8 encoded strings that are not strictly compliant with the JVMS corresponding to their major/minor version number.
Since there may be some classfiles out in the wild taking advantage of the compiler's current lax approach, the compiler will only generate a warning in release 21; in releases 22 and later, the compiler will generate an error.
Specification
The specification is JVMS §4.4.7 "The CONSTANT_Utf8_info Structure".
Note there is a historical anomaly we need to take care of: classfiles with major version < 48 (i.e., pre-Java 1.4) are allowed to use longer-than-necessary encodings. So the compiler must also accept these encodings when encountering these older classfiles.
- csr of: JDK-8303623 Compiler should disallow non-standard UTF-8 string encodings (Resolved)