In Java classfiles, "Modified UTF-8" encoding is used to encode 16-bit Unicode characters.
When reading UTF-8 strings from classfiles, the compiler does the minimum amount of work possible to decode each character. In particular, it does not validate that the characters are properly encoded:
* It doesn't verify that the 2nd and 3rd bytes have 10 as the top two bits
* It doesn't verify that \u0000 is encoded in two bytes (as is required for "Modified UTF-8")
* It doesn't verify that the shortest possible encoding was used for each character.
This lack of validation means the compiler will accept classfiles that the JVM would not, which is somewhat bad.
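The checks listed above can be sketched as a strict validator. This is only an illustration, not the compiler's actual code: the class and method names (`ModifiedUtf8Check`, `isStrict`) are hypothetical. Besides the three checks above, it also rejects lead bytes in the range 0xF0 to 0xFF, which the classfile format does not permit.

```java
final class ModifiedUtf8Check {
    /** Returns true if the bytes are strictly valid Modified UTF-8 (hypothetical helper). */
    static boolean isStrict(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i] & 0xFF;
            if (b0 == 0) {
                return false;               // \u0000 must be encoded as C0 80, never as a raw zero byte
            }
            if (b0 < 0x80) {                // one byte: \u0001..\u007F
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {   // two bytes: 110xxxxx 10xxxxxx
                if (i + 1 >= bytes.length) return false;
                int b1 = bytes[i + 1] & 0xFF;
                if ((b1 & 0xC0) != 0x80) return false;      // top two bits must be 10
                int ch = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
                if (ch != 0 && ch < 0x80) return false;     // overlong, except for \u0000
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {   // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                if (i + 2 >= bytes.length) return false;
                int b1 = bytes[i + 1] & 0xFF;
                int b2 = bytes[i + 2] & 0xFF;
                if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return false;
                int ch = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
                if (ch < 0x800) return false;               // overlong three-byte form
                i += 3;
            } else {
                return false;               // stray continuation byte, or a 1111xxxx lead byte
            }
        }
        return true;
    }
}
```

With this sketch, `{0xC0, 0x80}` (the required encoding of \u0000) passes, while `{0xC1, 0x81}` (an overlong encoding of `A`) is rejected.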
But a worse problem is that because it does not strictly validate the UTF-8 encoding, the compiler allows multiple encodings for the same character sequence. This is bad because the Names table, which is supposed to guarantee uniqueness, does that by hashing the UTF-8 data. So if the compiler reads a classfile that includes the same Name encoded in two different ways, it will add a duplicate Name to the table.
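The duplicate-name problem can be reproduced with a toy model. The sketch below is not the compiler's Names table: `DuplicateNameDemo`, `lenientDecode`, and `countDistinctByBytes` are hypothetical names, and a `HashSet` keyed on the raw bytes stands in for a table that hashes the UTF-8 data. The lenient decoder assumes the input is not truncated.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

final class DuplicateNameDemo {
    /** Lenient decoder: trusts the lead byte and never checks continuation bits or shortest form. */
    static String lenientDecode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i++] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);
            } else if (b0 < 0xE0) {
                // Treated as a two-byte sequence without further validation.
                int b1 = bytes[i++] & 0x3F;
                sb.append((char) (((b0 & 0x1F) << 6) | b1));
            } else {
                // Treated as a three-byte sequence without further validation.
                int b1 = bytes[i++] & 0x3F;
                int b2 = bytes[i++] & 0x3F;
                sb.append((char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2));
            }
        }
        return sb.toString();
    }

    /** Counts entries in a toy table that de-duplicates by the encoded bytes, not by the decoded text. */
    static int countDistinctByBytes(byte[]... encodedNames) {
        Set<ByteBuffer> table = new HashSet<>();
        for (byte[] name : encodedNames) {
            table.add(ByteBuffer.wrap(name));   // ByteBuffer equality and hash are content-based
        }
        return table.size();
    }
}
```

Here `{0x41}` and the overlong `{0xC1, 0x81}` both decode to `A` under `lenientDecode`, yet `countDistinctByBytes` reports two entries for them, which is the duplicate the paragraph above describes.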
CSR for: JDK-8304447 Compiler should disallow non-standard UTF-8 string encodings (Closed)