JDK-8269290: UnicodeReader not translating \u005c\\u005d to \\]

    • Type: CSR
    • Resolution: Approved
    • Priority: P2
    • 17
    • tools
    • None
    • source
    • low
    • This is a rarely used idiom that only showed up because of corpus work at Google.
    • Java API, Language construct
    • Implementation

      Summary

      A bug introduced in the JDK 16 javac compiler changed how the Unicode escape \u005c is interpreted in two respects: 1) its role as an escaping backslash and 2) its lack of effect on subsequent Unicode escapes.

      Problem

      This issue relates to Unicode escapes, described in section 3.3 of the JLS. javac interprets Unicode escapes during the reading of ASCII characters from source. Later on, javac interprets escape sequences, described in section 3.7 of the JLS, during the tokenization of character literals, string literals, and text blocks. Escape sequences are only indirectly affected by this bug.

      During reading, a normal backslash (that is, the ASCII \ character, not the corresponding Unicode escape \u005c) followed by another normal backslash is treated collectively as a pair of backslash characters. No further interpretation is done. This means that if a normal backslash immediately precedes the sequence \ u A B C D, which would "normally" be interpreted as a Unicode escape, then the interpretation of that sequence as a Unicode escape is suppressed.

      For example, the sequence \u2022 would be interpreted as the single character •, whereas \\u2022 would be interpreted as the seven characters \ \ u 2 0 2 2.
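
      The normal-backslash case can be checked from ordinary Java code. The following is a minimal sketch; the class name UnicodeEscapeDemo and its method names are made up for illustration. Note that the lengths are measured after the lexer has also processed escape sequences, so the pair of backslashes that the reader passes through untranslated ends up as a single backslash in the resulting string (six characters rather than seven).

```java
public class UnicodeEscapeDemo {

    // The reader translates this Unicode escape before tokenization,
    // so the literal holds exactly one character (the bullet).
    static String translated() {
        return "\u2022";
    }

    // The first normal backslash pairs with the second one, which suppresses
    // Unicode translation. The lexer then reads the pair as the escape
    // sequence for one backslash, followed by the five characters u 2 0 2 2.
    static String suppressed() {
        return "\\u2022";
    }

    public static void main(String[] args) {
        System.out.println(translated().length()); // prints 1
        System.out.println(suppressed().length()); // prints 6
    }
}
```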

      An issue arises when Java developers choose to use a Unicode escape backslash \u005c in their source code instead of a normal backslash. Prior to JDK 16, if the Unicode escape backslash was followed by a second Unicode escape, then the second Unicode escape was always interpreted. The normal backslash at the beginning of the second Unicode escape (immediately followed by u) was not paired with the preceding Unicode escape backslash. Otherwise, any following normal backslash was paired with the \u005c.

      For example, the sequence \u005c\u2022 would be interpreted as \ and •, whereas \u005c\tXYZ would be interpreted as \ \ t X Y Z.

      The bug in JDK 16 treated \u005c as having no effect on the interpretation of Unicode escapes. Using the example from the compiler-dev discussions, \u005c\\u005d:

      • Prior to JDK 16, it was interpreted as \ \ ]
      • JDK 16 interpreted it as \ \ \ u 0 0 5 d which would produce a syntax error downstream in the lexer because the escape sequence \u is invalid.

      Solution

      The proposed fix is to reintroduce the pre-JDK 16 behavior of \u005c\.

      Specification

          diff --git a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
          index c51be0fdf07..b089cf396cc 100644
          --- a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
          +++ b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
          @@ -85,6 +85,11 @@ public class UnicodeReader {
                */
               private boolean wasBackslash;
      
          +    /**
          +     * true if the last character was derived from an unicode escape sequence.
          +     */
          +    private boolean wasUnicodeEscape;
          +
               /**
                * Log for error reporting.
                */
          @@ -105,6 +110,7 @@ public class UnicodeReader {
                   this.character = '\0';
                   this.codepoint = 0;
                   this.wasBackslash = false;
          +        this.wasUnicodeEscape = false;
                   this.log = sf.log;
      
                   nextCodePoint();
          @@ -161,17 +167,22 @@ public class UnicodeReader {
                   // Fetch next character.
                   nextCodeUnit();
      
          -        // If second backslash is detected.
          -        if (wasBackslash) {
          -            // Treat like a normal character (not part of unicode escape.)
          -            wasBackslash = false;
          -        } else if (character == '\\') {
          -            // May be an unicode escape.
          +        if (character == '\\' && (!wasBackslash || wasUnicodeEscape)) {
          +            // Is a backslash and may be an unicode escape.
                       switch (unicodeEscape()) {
          -                case BACKSLASH -> wasBackslash = true;
          -                case VALID_ESCAPE -> wasBackslash = false;
          +                case BACKSLASH -> {
          +                    wasUnicodeEscape = false;
          +                    wasBackslash = !wasBackslash;
          +                }
          +                case VALID_ESCAPE -> {
          +                    wasUnicodeEscape = true;
          +                    wasBackslash = character == '\\' && !wasBackslash;
          +                }
                           case BROKEN_ESCAPE -> nextUnicodeInputCharacter(); //skip broken unicode escapes
                       }
          +        } else {
          +            wasBackslash = false;
          +            wasUnicodeEscape = false;
                   }
      
                   // Codepoint and character match if not surrogate.
          @@ -297,6 +308,7 @@ public class UnicodeReader {
                   position = pos;
                   width = 0;
                   wasBackslash = false;
          +        wasUnicodeEscape = false;
                   nextCodePoint();
               }
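
      The flag handling in the diff above can be exercised in isolation. The following is a simplified sketch, not the javac implementation: the class UnicodeEscapeSketch and its translate method are hypothetical, the input is a plain String rather than a source buffer, surrogate handling is omitted, and a broken escape throws an exception instead of being reported and skipped. Only the wasBackslash and wasUnicodeEscape updates are modeled on the patch.

```java
public class UnicodeEscapeSketch {

    /** Translates Unicode escapes in src using the two flags from the patch. */
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        boolean wasBackslash = false;     // previous character was an unpaired backslash
        boolean wasUnicodeEscape = false; // previous character came from a Unicode escape
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i++);
            if (c == '\\' && (!wasBackslash || wasUnicodeEscape)) {
                if (i < src.length() && src.charAt(i) == 'u') {
                    // Skip one or more 'u' characters, then read four hex digits.
                    while (i < src.length() && src.charAt(i) == 'u') {
                        i++;
                    }
                    int value = 0;
                    for (int k = 0; k < 4; k++) {
                        int digit = i < src.length() ? Character.digit(src.charAt(i), 16) : -1;
                        if (digit < 0) {
                            // Simplification: javac logs an error and skips instead.
                            throw new IllegalArgumentException("broken Unicode escape");
                        }
                        value = (value << 4) | digit;
                        i++;
                    }
                    c = (char) value;
                    // Corresponds to the VALID_ESCAPE case.
                    wasUnicodeEscape = true;
                    wasBackslash = c == '\\' && !wasBackslash;
                } else {
                    // Corresponds to the BACKSLASH case.
                    wasUnicodeEscape = false;
                    wasBackslash = !wasBackslash;
                }
            } else {
                wasBackslash = false;
                wasUnicodeEscape = false;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below spells the thirteen-character example from the
        // compiler-dev discussion, written with doubled backslashes.
        System.out.println(translate("\\u005c\\\\u005d")); // prints \\]
    }
}
```

      Under these assumptions the sketch reproduces the examples given in the Problem section: the escaped backslash pairs with a following normal backslash, but does not suppress a Unicode escape that follows it directly.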

            jlaskey Jim Laskey
            Jan Lahoda