Summary
A bug introduced in the JDK 16 javac compiler changed the interpretation of the Unicode escape \u005c in two respects: 1) its role as an escaping backslash and 2) its non-effect on subsequent Unicode escapes.
Problem
This issue relates to Unicode escapes, described in section 3.3 of the JLS. javac interprets Unicode escapes while reading ASCII characters from source. Later on, javac interprets escape sequences, described in section 3.10.7 of the JLS, during the tokenization of character literals, string literals, and text blocks. Escape sequences are only indirectly affected by this bug.
During reading, a normal backslash (that is, the ASCII \ character, not the corresponding Unicode escape \u005c) followed by another normal backslash is treated collectively as a pair of backslash characters. No further interpretation is done. This means that if a normal backslash immediately precedes the sequence \uABCD, which would "normally" be interpreted as a Unicode escape, then the interpretation of that sequence as a Unicode escape is suppressed.
For example, the sequence \u2022 would be interpreted as the • character, whereas \\u2022 would be interpreted as the seven characters \ \ u 2 0 2 2.
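The two stages can be observed from an ordinary Java program. The following is a minimal sketch; the class and field names are illustrative and do not come from javac. The first literal contains a Unicode escape that is translated during reading. The second is preceded by a normal backslash, so Unicode translation is suppressed and the later escape-sequence step reduces the backslash pair to a single backslash.

```java
public class UnicodeEscapeStages {
    // A Unicode escape for U+2022 (bullet): translated while the source is read,
    // so the literal holds a single character.
    public static final String TRANSLATED = "\u2022";

    // The same six characters preceded by a normal backslash: the reader leaves
    // them alone, and the escape-sequence step later turns the backslash pair
    // into one backslash. The literal holds backslash, u, 2, 0, 2, 2.
    public static final String SUPPRESSED = "\\u2022";

    public static void main(String[] args) {
        System.out.println(TRANSLATED.length()); // 1
        System.out.println(SUPPRESSED.length()); // 6
        System.out.println(SUPPRESSED.charAt(0) == (char) 0x5c); // true
    }
}
```

Neither literal involves the \u005c escape, so this behaviour should be the same before, in, and after JDK 16.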
An issue arises when Java developers choose to use a Unicode escape backslash \u005c in their source code, instead of a normal backslash. Prior to JDK 16, if the Unicode escape backslash was followed by a second Unicode escape, then the second Unicode escape was always interpreted. The normal backslash at the beginning of the second Unicode escape (immediately followed by u) was not paired with the preceding Unicode escape backslash. Otherwise, any following normal backslash was paired with the \u005c.
For example, the sequence \u005c\u2022 would be interpreted as \ and •, whereas \u005c\tXYZ would be interpreted as \ \ t X Y Z.
The bug in JDK 16 treated \u005c as having no effect on Unicode interpretation. Using the example from compiler-dev discussions, \u005c\\u005d:
- Prior to JDK 16, it was interpreted as \ \ ]
- JDK 16 interpreted it as \ \ \ u 0 0 5 d, which would produce a syntax error downstream in the lexer because the escape sequence \u is invalid.
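The difference between the two behaviours can be reproduced with a small stand-alone simulator. The sketch below is a simplification and not the javac UnicodeReader: the class UnicodeEscapeSim, the method translate, and the jdk16Bug switch are hypothetical names, surrogate handling is omitted, and a broken escape throws an exception instead of being reported and skipped. The state updates follow the wasBackslash and wasUnicodeEscape logic of the patch under Specification.

```java
public class UnicodeEscapeSim {

    /**
     * Translates Unicode escapes in raw source text.
     * jdk16Bug = true models the JDK 16 behaviour described above;
     * jdk16Bug = false models the patched (pre-JDK 16) behaviour.
     */
    public static String translate(String src, boolean jdk16Bug) {
        StringBuilder out = new StringBuilder();
        boolean wasBackslash = false;     // last character was an unpaired backslash
        boolean wasUnicodeEscape = false; // last character came from a Unicode escape
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i++);
            if (c == '\\' && (!wasBackslash || wasUnicodeEscape)) {
                int j = i;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // one or more 'u' characters may follow the backslash
                }
                if (j > i) {
                    // Unicode escape: exactly four hex digits must follow.
                    if (j + 4 > src.length()) {
                        throw new IllegalArgumentException("broken escape at " + (i - 1));
                    }
                    int value = 0;
                    for (int k = j; k < j + 4; k++) {
                        int digit = Character.digit(src.charAt(k), 16);
                        if (digit < 0) {
                            throw new IllegalArgumentException("broken escape at " + (i - 1));
                        }
                        value = (value << 4) | digit;
                    }
                    c = (char) value;
                    i = j + 4;
                    // JDK 16 forgot that an escaped backslash is still a backslash.
                    wasBackslash = !jdk16Bug && c == '\\' && !wasBackslash;
                    wasUnicodeEscape = !jdk16Bug;
                } else {
                    // Plain backslash: it opens a pair or closes one.
                    wasUnicodeEscape = false;
                    wasBackslash = !wasBackslash;
                }
            } else {
                wasBackslash = false;
                wasUnicodeEscape = false;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Every backslash of the raw input is doubled inside these Java literals.
        String raw = "\\u005c\\\\u005d";
        System.out.println(translate(raw, false)); // backslash, backslash, ]
        System.out.println(translate(raw, true));  // three backslashes, then u005d
    }
}
```

Under these assumptions, the two modes differ only in what they remember after a valid escape. The patched mode records that the escape produced a backslash, so a following normal backslash can pair with it.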
Solution
The proposed fix is to reintroduce the pre-JDK 16 behavior of \u005c\.
Specification
diff --git a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
index c51be0fdf07..b089cf396cc 100644
--- a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
+++ b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
@@ -85,6 +85,11 @@ public class UnicodeReader {
*/
private boolean wasBackslash;
+ /**
+ * true if the last character was derived from an unicode escape sequence.
+ */
+ private boolean wasUnicodeEscape;
+
/**
* Log for error reporting.
*/
@@ -105,6 +110,7 @@ public class UnicodeReader {
this.character = '\0';
this.codepoint = 0;
this.wasBackslash = false;
+ this.wasUnicodeEscape = false;
this.log = sf.log;
nextCodePoint();
@@ -161,17 +167,22 @@ public class UnicodeReader {
// Fetch next character.
nextCodeUnit();
- // If second backslash is detected.
- if (wasBackslash) {
- // Treat like a normal character (not part of unicode escape.)
- wasBackslash = false;
- } else if (character == '\\') {
- // May be an unicode escape.
+ if (character == '\\' && (!wasBackslash || wasUnicodeEscape)) {
+ // Is a backslash and may be an unicode escape.
switch (unicodeEscape()) {
- case BACKSLASH -> wasBackslash = true;
- case VALID_ESCAPE -> wasBackslash = false;
+ case BACKSLASH -> {
+ wasUnicodeEscape = false;
+ wasBackslash = !wasBackslash;
+ }
+ case VALID_ESCAPE -> {
+ wasUnicodeEscape = true;
+ wasBackslash = character == '\\' && !wasBackslash;
+ }
case BROKEN_ESCAPE -> nextUnicodeInputCharacter(); //skip broken unicode escapes
}
+ } else {
+ wasBackslash = false;
+ wasUnicodeEscape = false;
}
// Codepoint and character match if not surrogate.
@@ -297,6 +308,7 @@ public class UnicodeReader {
position = pos;
width = 0;
wasBackslash = false;
+ wasUnicodeEscape = false;
nextCodePoint();
}
- csr of: JDK-8269150 UnicodeReader not translating \u005c\\u005d to \\] (Closed)
- relates to: JDK-8269406 3.3: Clarify the effect of Unicode escape processing (Resolved)