JDK-8304447

Compiler should disallow non-standard UTF-8 string encodings


    • CSR
    • Resolution: Approved
    • P4
    • 21
    • tools
    • None
    • binary, behavioral
    • minimal
    • The compiler will warn about (release 21) or reject (releases > 21) classfiles containing illegal "alternate" UTF-8 strings that it previously would not have rejected. However, as far as I know, no version of the JDK compiler has ever actually written out such classfiles, so such classfiles should be relatively rare.
    • Class file construct, Other
    • Implementation

      Summary

      Update the Java compiler to reject invalid classfiles, specifically those that contain invalid "Modified UTF-8" strings as defined in the JVMS.

      Problem

      In Java classfiles, the "Modified UTF-8" encoding is used to encode 16-bit Unicode characters. Modified UTF-8 is like standard UTF-8 except that: (a) only one-, two-, and three-byte sequences are used, and (b) the NUL character \u0000 is encoded in two bytes instead of one.
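For illustration, `java.io.DataOutputStream.writeUTF()` produces exactly this encoding (a two-byte big-endian length prefix followed by Modified UTF-8 bytes, the same layout as a constant-pool UTF-8 entry); note how the NUL character comes out as the two-byte sequence C0 80 rather than a single zero byte:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Demo {

    /** Encode a string as a length-prefixed Modified UTF-8 payload. */
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // A single NUL character: length prefix 00 02, then C0 80 --
        // never a bare 00 byte.
        for (byte b : encode("\u0000"))
            System.out.printf("%02X ", b);   // prints: 00 02 C0 80
        System.out.println();
    }
}
```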

      When reading UTF-8 strings from classfiles, the compiler currently does the minimum amount of work possible to decode each character: it determines whether the byte sequence is one, two, or three bytes long (or none of the above) but then does no further validation. This results in a decoding that accepts byte sequences that are technically invalid (i.e., violate the classfile specification).

      Specifically, the compiler is too lenient in the following three ways:

      • It doesn't verify that the second and third bytes have 10 as their top two bits (these bits are simply masked off and discarded)
      • It doesn't verify that \u0000 is encoded in two bytes (as is required for Modified UTF-8) instead of one
      • It doesn't verify that the shortest possible encoding is used for each character.

      This lax validation means the compiler will accept classfiles that the JVM would reject, which is somewhat bad.

      But a worse problem is that because it does not strictly validate the UTF-8 encoding, the compiler allows multiple encodings for the same character sequence. This is bad because the Names table, which is supposed to guarantee uniqueness, does that by hashing the UTF-8 data. So if the compiler reads a classfile that includes the same Name encoded in two different ways, it will add a duplicate Name to the table. This could cause confusion, or worse, a potential security issue.
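To make the risk concrete, here is a sketch using `DataInputStream.readUTF()`, whose decoder happens to be similarly lenient about overlong two-byte forms: two different byte sequences decode to the same string, so a uniqueness table keyed on the raw encoded bytes would end up holding two distinct entries for what should be one name. (This uses `readUTF()` purely for illustration; it is not the compiler's own decoder.)

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class DuplicateNameDemo {

    /** Decode a length-prefixed (Modified) UTF-8 payload. */
    static String decode(byte[] payload) {
        try {
            return new DataInputStream(new ByteArrayInputStream(payload)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "A" in its canonical one-byte encoding (length prefix 00 01, then 0x41)...
        byte[] canonical = {0x00, 0x01, 0x41};
        // ...and the same character in an illegal overlong two-byte encoding.
        byte[] overlong = {0x00, 0x02, (byte) 0xC1, (byte) 0x81};

        // Both decode to the same string, but the raw bytes differ, so a
        // uniqueness table that hashes the encoded bytes sees two "names".
        System.out.println(decode(canonical).equals(decode(overlong)));   // true
        System.out.println(Arrays.equals(canonical, overlong));           // false
    }
}
```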

      Solution

      Tighten the compiler's validation of Modified UTF-8 so that the compiler rejects any classfiles containing UTF-8 encoded strings that are not strictly compliant with the JVMS corresponding to their major/minor version number.
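A strict decoder closing the three gaps listed above might look like the following sketch (a hypothetical helper, not javac's actual implementation): it checks continuation-byte prefixes, rejects a one-byte NUL, and rejects overlong encodings.

```java
/**
 * Hypothetical sketch of a strict Modified UTF-8 validator.
 * Returns true iff buf is a legal JVMS Modified UTF-8 byte sequence.
 */
public class StrictMutf8 {
    public static boolean isValid(byte[] buf) {
        int i = 0;
        while (i < buf.length) {
            int b1 = buf[i++] & 0xff;
            if (b1 < 0x80) {
                if (b1 == 0x00)                 // gap 2: NUL must use two bytes
                    return false;
            } else if ((b1 & 0xe0) == 0xc0) {   // two-byte sequence
                if (i >= buf.length)
                    return false;
                int b2 = buf[i++] & 0xff;
                if ((b2 & 0xc0) != 0x80)        // gap 1: check continuation bits
                    return false;
                int ch = ((b1 & 0x1f) << 6) | (b2 & 0x3f);
                if (ch != 0 && ch < 0x80)       // gap 3: overlong (only NUL may be two bytes)
                    return false;
            } else if ((b1 & 0xf0) == 0xe0) {   // three-byte sequence
                if (i + 1 >= buf.length)
                    return false;
                int b2 = buf[i++] & 0xff;
                int b3 = buf[i++] & 0xff;
                if ((b2 & 0xc0) != 0x80 || (b3 & 0xc0) != 0x80)
                    return false;               // gap 1: check continuation bits
                int ch = ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f);
                if (ch < 0x800)                 // gap 3: overlong three-byte form
                    return false;
            } else {
                return false;                   // stray continuation byte or 4+ byte form
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[]{(byte) 0xC0, (byte) 0x80}));  // true: two-byte NUL
        System.out.println(isValid(new byte[]{(byte) 0xC1, (byte) 0x81}));  // false: overlong 'A'
    }
}
```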

      Since there may be some classfiles out in the wild taking advantage of the compiler's current lax approach, the compiler will only generate a warning in release 21; in releases 22 and later, the compiler will generate an error.

      Specification

      The specification is JVMS §4.4.7 "The CONSTANT_Utf8_info Structure".

      Note there is a historical anomaly we need to take care of: classfiles with major version < 48 (i.e., pre-Java 1.4) are allowed to use longer-than-necessary encodings. So the compiler must also accept these encodings when encountering these older classfiles.
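That exception can be expressed by gating the overlong-encoding check on the classfile's major version; a hypothetical sketch (the names and the shape of the check are illustrative, not javac's actual code):

```java
/**
 * Hypothetical sketch: decide whether an overlong (longer-than-necessary)
 * Modified UTF-8 encoding should be rejected, based on classfile version.
 */
public class VersionGatedCheck {
    // Java 1.4 classfiles carry major version 48; older classfiles
    // were historically allowed to use longer-than-necessary encodings.
    static final int JAVA_1_4_MAJOR = 48;

    static boolean rejectOverlong(int majorVersion) {
        return majorVersion >= JAVA_1_4_MAJOR;
    }

    public static void main(String[] args) {
        System.out.println(rejectOverlong(45));   // false: legacy classfile, accept
        System.out.println(rejectOverlong(65));   // true: Java 21 classfile, reject
    }
}
```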

            acobbs Archie Cobbs
            Vicente Arturo Romero Zaldivar