Type: Bug
Resolution: Unresolved
Priority: P3
Affects Version/s: 16, 26
Component/s: tools
ADDITIONAL SYSTEM INFORMATION :
Edition Windows 10 Enterprise
Version 22H2
Installed on 19.07.2022
OS Build 19045.6456
openjdk version "25.0.1" 2025-10-21
OpenJDK Runtime Environment (build 25.0.1+8-27)
OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing)
A DESCRIPTION OF THE PROBLEM :
According to the Java Language Specification, the lexical grammar for an entire compilation unit is:
Input: { InputElement } [ Sub ]
Here Sub is defined as the ASCII SUB character (control-Z). JLS §3.5 further explains that this SUB character may be ignored only when it is the last character of the escaped input stream, as a compatibility concession for some operating systems.
In other words, at most one trailing SUB, and only at the very end of the escaped input, can be ignored; any other characters that appear after the final InputElement and are not part of a token should be rejected as illegal input. Other control characters (such as U+0001) do not have any such special status.
However, with javac in JDK 25 (and also in JDK 21, and reportedly since JDK 16) the compiler behaves differently. If a U+001A character appears after the last valid token of the compilation unit and is not part of a literal or comment, the compiler silently ignores not only that SUB but also all subsequent characters in the file, including non-SUB control characters.
Concretely, consider a source file whose last line consists of \u001A\u0001\u001A written as raw Unicode escapes after the closing brace of a class. After Unicode-escape processing (§3.3), this produces the three characters U+001A, U+0001, U+001A at the very end of the file, none of which belongs to a token. According to the JLS, only the final SUB could be discarded; the two characters before it (the first U+001A and the U+0001) should each cause a lexical error.
Instead, javac compiles such a file successfully and completely ignores the whole trailing sequence. The same U+0001 on its own (without the preceding U+001A) correctly produces illegal character: '\u0001'. This shows that the presence of U+001A after the last token effectively turns the remainder of the file into a kind of “implicit end-of-file comment”, which is not described anywhere in the specification.
Importantly, U+001A inside tokens works correctly: for instance System.out.println("\u001A"); compiles and prints the U+001A character at run time, as expected. The problematic behaviour only occurs when U+001A appears as a standalone character following the last token of the compilation unit.
Independent public discussions of this behaviour (sometimes referred to as an “end-of-file comment” bug introduced in JDK 16) also indicate that JDK 15 behaved according to the specification, and the regression appeared starting from JDK 16 when the compiler front end was reworked.
In summary, javac currently allows extra non-SUB characters after a trailing U+001A to be silently discarded, which contradicts the grammar and explanatory text in JLS §3.3–§3.5 and can hide malformed source input that ought to be rejected.
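The rule that the report derives from JLS §3.5 can be sketched as a small check. This is only an illustration of the expected behaviour, not javac's implementation: the class and method names are hypothetical, and the sketch assumes the text after the last token contains no comments, so that only white space and at most one final SUB are acceptable there.

```java
final class TrailingSubRule {
    static final char SUB = 0x1A;

    // Drops one SUB if, and only if, it is the last character of the escaped input.
    static String dropTrailingSub(String escaped) {
        int n = escaped.length();
        return (n > 0 && escaped.charAt(n - 1) == SUB) ? escaped.substring(0, n - 1) : escaped;
    }

    // Returns the index of the first character at or after tailStart that should be
    // rejected, or -1 if the tail is acceptable. tailStart is the index just past
    // the last token (hypothetical parameter; a real lexer would track this itself).
    static int firstIllegalTailChar(String escaped, int tailStart) {
        String rest = dropTrailingSub(escaped);
        for (int i = tailStart; i < rest.length(); i++) {
            char c = rest.charAt(i);
            boolean whiteSpace = c == ' ' || c == '\t' || c == '\f' || c == '\n' || c == '\r';
            if (!whiteSpace) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Tail of the reproducer after escape translation: SUB, U+0001, SUB.
        String tail = "}\n" + SUB + (char) 1 + SUB;
        System.out.println(firstIllegalTailChar(tail, 1));
    }
}
```

Under this reading, the reproducer's tail is rejected at its first U+001A, because that character is not the last one in the escaped input; a file ending in exactly one SUB passes.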
REGRESSION : Last worked in version 15
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a file SubBug.java with the following exact contents, with no extra spaces or blank lines after the last line:
public class SubBug {
    public static void main(String[] args) {
        System.out.println("\u001A");
    }
}
\u001A\u0001\u001A
The last line of the file consists of the three Unicode escapes \u001A\u0001\u001A outside of any string or comment, with no newline after them. After Unicode-escape translation (§3.3), the escaped input stream ends with the three characters U+001A, U+0001, U+001A after the closing brace.
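To make the translation step concrete, here is a simplified sketch of Unicode-escape processing in the spirit of JLS §3.3. It is an assumption-laden illustration rather than javac's code: the class and method names are hypothetical, and malformed escapes are reported with an exception instead of a compiler diagnostic.

```java
final class UnicodeEscapeSketch {
    // Value of an ASCII hexadecimal digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Translates backslash-u escapes in raw source text into the characters they denote.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous raw backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // A backslash starts an escape only if an even number of backslashes precede it.
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // one or more 'u' characters are allowed
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("truncated Unicode escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            backslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw text of the closing brace plus the last line of SubBug.java.
        String escaped = translate("}\n\\u001A\\u0001\\u001A");
        for (int k = 2; k < escaped.length(); k++) {
            System.out.printf("U+%04X%n", (int) escaped.charAt(k));
        }
    }
}
```

Applied to the last line of SubBug.java, the sketch yields exactly the three characters named above, which is the input the lexer then has to accept or reject.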
Compile the file:
javac SubBug.java
Optionally run it:
java SubBug
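Because many editors append a newline on save, it may be easier to generate the file programmatically. The helper below is a hypothetical convenience, not part of the original report; it writes the reproducer with the escapes as the final bytes of the file. The indentation inside the class body is added for readability and is assumed not to affect the behaviour.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class WriteSubBug {
    // Exact text of SubBug.java; the last line has no line terminator after it.
    static String source() {
        return String.join("\n",
                "public class SubBug {",
                "    public static void main(String[] args) {",
                "        System.out.println(\"\\u001A\");",
                "    }",
                "}",
                "\\u001A\\u0001\\u001A");
    }

    // Writes SubBug.java into the given directory and returns its path.
    static Path write(Path dir) throws IOException {
        Path file = dir.resolve("SubBug.java");
        Files.write(file, source().getBytes(StandardCharsets.US_ASCII));
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Wrote " + write(Path.of(".")));
    }
}
```

After running it, compile the generated SubBug.java with javac as described in the steps above.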
---------- BEGIN SOURCE ----------
See the minimal reproducible example in the “Steps to reproduce” section above:
public class SubBug {
    public static void main(String[] args) {
        System.out.println("\u001A");
    }
}
\u001A\u0001\u001A
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Avoid placing any U+001A characters after the last token of a compilation unit; keep the file strictly limited to valid tokens plus at most one trailing SUB as allowed by the specification, or remove such characters entirely using a text editor or a pre-processing step.
FREQUENCY :
ALWAYS
Caused by: JDK-8254073 Tokenizer improvements (revised) (Resolved)