Type: CSR
Resolution: Approved
Priority: P2
Fix Version/s: 17
Component/s: tools
Labels: None

Subcomponent: javac
Compatibility Kind: source
Compatibility Risk: low
Compatibility Risk Description: This is a rarely used idiom that only showed up because of corpus work at Google.
Interface Kind: Java API, Language construct
Scope: Implementation

Summary

A bug introduced in the JDK 16 javac compiler changed two aspects of how the Unicode escape \u005c is interpreted: 1) its role as an escaping backslash, and 2) the fact that it has no effect on a Unicode escape that follows it.

Problem

This issue relates to Unicode escapes, described in section 3.3 of the JLS. javac interprets Unicode escapes during the reading of ASCII characters from source. Later on, javac interprets escape sequences, described in section 3.7 of the JLS, during the tokenization of character literals, string literals, and text blocks. Escape sequences are only indirectly affected by this bug.

During reading, a normal backslash (that is, the ASCII \ character, not the corresponding Unicode escape \u005c) followed by another normal backslash is treated collectively as a pair of backslash characters. No further interpretation is done. This means that if a normal backslash immediately precedes the sequence \ u A B C D, which would "normally" be interpreted as a Unicode escape, then the interpretation of that sequence as a Unicode escape is suppressed.

For example, the sequence \u2022 would be interpreted as the • character, whereas \\u2022 would be interpreted as the seven characters \ \ u 2 0 2 2.
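The two cases above can be observed from ordinary Java source. The following is a minimal sketch; the class and field names are illustrative only. Note that inside a string literal the later escape-sequence pass (JLS 3.7) additionally collapses the backslash pair into a single backslash, so the seven characters left by the reader become a six-character string.

```java
class UnicodeEscapeDemo {
    // The reader translates this Unicode escape to a single bullet character
    // before the string literal is tokenized.
    static final String BULLET = "\u2022";

    // Here the second backslash is paired with the first, so the reader does
    // not recognize a Unicode escape. The tokenizer then treats the backslash
    // pair as one escape sequence, leaving a backslash followed by "u2022".
    static final String NOT_AN_ESCAPE = "\\u2022";

    public static void main(String[] args) {
        System.out.println(BULLET.length());                  // prints 1
        System.out.println((int) BULLET.charAt(0) == 0x2022); // prints true
        System.out.println(NOT_AN_ESCAPE.length());           // prints 6
    }
}
```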

An issue arises when Java developers choose to use a Unicode escape backslash \u005c in their source code, instead of a normal backslash. Prior to JDK 16, if the Unicode escape backslash was followed by a second Unicode escape, then the second Unicode escape was always interpreted. The normal backslash at the beginning of the second Unicode escape (immediately followed by u) was not paired with the preceding Unicode escape backslash. Otherwise, any following normal backslash was paired with the \u005c.

For example, the sequence \u005c\u2022 would be interpreted as \ and •, whereas \u005c\tXYZ would be interpreted as \ \ t X Y Z.

With the bug, JDK 16 treated \u005c as having no effect on the interpretation of Unicode escapes. Take the example from the compiler-dev discussions, \u005c\\u005d:

Prior to JDK 16, it was interpreted as \ \ ]
JDK 16 interpreted it as \ \ \ u 0 0 5 d which would produce a syntax error downstream in the lexer because the escape sequence \u is invalid.
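The difference between the two behaviors can be sketched with a small standalone model of the reader's state machine. This is a simplification written for illustration, not javac's UnicodeReader: the class name, the `Mode` enum, and the `translate` method are hypothetical, surrogate handling is omitted, and a malformed escape throws an exception where javac would report an error and skip it. The two flags mirror the `wasBackslash` and `wasUnicodeEscape` fields of the patch.

```java
class UnicodeEscapeModel {
    /** RESTORED models the pre-JDK 16 (and fixed) rule; JDK16 models the regression. */
    enum Mode { RESTORED, JDK16 }

    /** Applies the modeled Unicode-escape pass to a string of raw source characters. */
    static String translate(String raw, Mode mode) {
        StringBuilder out = new StringBuilder();
        boolean wasBackslash = false;     // previous character was an unpaired backslash
        boolean wasUnicodeEscape = false; // previous character came from a Unicode escape
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i++);
            // A backslash is examined unless it completes a pair. In RESTORED mode a
            // backslash that follows a Unicode-escaped backslash is still examined.
            boolean eligible = c == '\\'
                    && (!wasBackslash || (mode == Mode.RESTORED && wasUnicodeEscape));
            if (!eligible) {
                // Ordinary character, or the second backslash of a pair.
                wasBackslash = false;
                wasUnicodeEscape = false;
                out.append(c);
                continue;
            }
            int j = i;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // one or more 'u' markers may follow the backslash
            }
            if (j == i) {
                // Plain backslash: it either opens a pair or closes one.
                wasBackslash = !wasBackslash;
                wasUnicodeEscape = false;
                out.append(c);
                continue;
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("broken escape at index " + (i - 1));
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = Character.digit(raw.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("broken escape at index " + (i - 1));
                }
                value = value * 16 + digit;
            }
            char decoded = (char) value;
            i = j + 4;
            // Only RESTORED lets an escaped backslash open a pair; JDK16 forgets it.
            wasBackslash = mode == Mode.RESTORED && decoded == '\\' && !wasBackslash;
            wasUnicodeEscape = true;
            out.append(decoded);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw input is the 13 characters of the compiler-dev example; backslashes
        // are doubled here only because this is a Java string literal.
        String raw = "\\u005c\\\\u005d";
        System.out.println(translate(raw, Mode.RESTORED));
        System.out.println(translate(raw, Mode.JDK16));
    }
}
```

Under these assumptions, the `RESTORED` mode turns \u005c\\u005d into \ \ ], while the `JDK16` mode leaves \ \ \ u 0 0 5 d, matching the two interpretations listed above. Both modes agree on the earlier examples \u005c\u2022 and \u005c\tXYZ.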

Solution

The proposed fix is to reintroduce the pre-JDK 16 behavior of \u005c\.

Specification

    diff --git a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
    index c51be0fdf07..b089cf396cc 100644
    --- a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
    +++ b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
    @@ -85,6 +85,11 @@ public class UnicodeReader {
          */
         private boolean wasBackslash;

    +    /**
    +     * true if the last character was derived from an unicode escape sequence.
    +     */
    +    private boolean wasUnicodeEscape;
    +
         /**
          * Log for error reporting.
          */
    @@ -105,6 +110,7 @@ public class UnicodeReader {
             this.character = '\0';
             this.codepoint = 0;
             this.wasBackslash = false;
    +        this.wasUnicodeEscape = false;
             this.log = sf.log;

             nextCodePoint();
    @@ -161,17 +167,22 @@ public class UnicodeReader {
             // Fetch next character.
             nextCodeUnit();

    -        // If second backslash is detected.
    -        if (wasBackslash) {
    -            // Treat like a normal character (not part of unicode escape.)
    -            wasBackslash = false;
    -        } else if (character == '\\') {
    -            // May be an unicode escape.
    +        if (character == '\\' && (!wasBackslash || wasUnicodeEscape)) {
    +            // Is a backslash and may be an unicode escape.
                 switch (unicodeEscape()) {
    -                case BACKSLASH -> wasBackslash = true;
    -                case VALID_ESCAPE -> wasBackslash = false;
    +                case BACKSLASH -> {
    +                    wasUnicodeEscape = false;
    +                    wasBackslash = !wasBackslash;
    +                }
    +                case VALID_ESCAPE -> {
    +                    wasUnicodeEscape = true;
    +                    wasBackslash = character == '\\' && !wasBackslash;
    +                }
                     case BROKEN_ESCAPE -> nextUnicodeInputCharacter(); //skip broken unicode escapes
                 }
    +        } else {
    +            wasBackslash = false;
    +            wasUnicodeEscape = false;
             }

             // Codepoint and character match if not surrogate.
    @@ -297,6 +308,7 @@ public class UnicodeReader {
             position = pos;
             width = 0;
             wasBackslash = false;
    +        wasUnicodeEscape = false;
             nextCodePoint();
         }

csr of: JDK-8269150 UnicodeReader not translating \u005c\\u005d to \\] (Closed)

relates to: JDK-8269406 3.3: Clarify the effect of Unicode escape processing (Resolved)
