Specification for JEP 8196004: Raw String Literals (2018-02-23)

This document proposes changes to the Java Language Specification to support raw string literals. See JEP 8196004 for an overview.

(Production difficulties prevent the notes and examples which are new in this document from being colored green.)

3 Lexical Structure

...

The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (3.5), which are white space (3.6), comments (3.7), and tokens. The tokens are the identifiers (3.8), keywords (3.9), literals (3.10), separators (3.11), and operators (3.12) of the syntactic grammar.

Among the input elements, raw string literals (3.10.7) are special because they effectively opt out of the lexical translations. As a result, they can directly include textual fragments of other programs which themselves include Unicode escapes and other escape sequences.

3.1 Unicode

Except for comments (3.7), identifiers, and the contents of character, string, and raw string literals (3.10.4, 3.10.5, 3.10.7), all input elements (3.5) in a program are formed only from ASCII characters (or Unicode escapes (3.3) which result in ASCII characters).

3.3 Unicode Escapes

...

If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then the eligible \ and all the u characters which follow are treated as RawInputCharacters and remain part of the escaped Unicode stream. If the third step of lexical translation (3.5) results in these RawInputCharacters becoming part of an input element that is not a raw string literal (3.10.7), then a compile-time error occurs.

Thus, this is legal:

String tm = "The \u2122 symbol";

But the following code, which truncates the Unicode escape, is not legal:

String tm = "The \u212 symbol";
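
The eligibility rule can be made concrete with a small scanner. The following is a minimal sketch, not the compiler's implementation; the names UnicodeEscapeScan and firstTruncatedEscape are hypothetical. It treats a \ as eligible when an even number of contiguous \ characters precede it, and reports the index of the first eligible \ that is followed by one or more u characters but not by four hexadecimal digits. It only detects a truncated escape; whether that is an error depends on which input element the characters later become part of.

```java
final class UnicodeEscapeScan {

    // Returns the index of the first eligible backslash that begins a
    // truncated Unicode escape, or -1 if the input contains none.
    static int firstTruncatedEscape(CharSequence raw) {
        int backslashRun = 0; // contiguous backslashes immediately before index i
        for (int i = 0; i < raw.length(); i++) {
            if (raw.charAt(i) != '\\') {
                backslashRun = 0;
                continue;
            }
            boolean eligible = backslashRun % 2 == 0;
            backslashRun++;
            if (!eligible || i + 1 >= raw.length() || raw.charAt(i + 1) != 'u') {
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // any number of u characters may follow the backslash
            }
            if (!hasFourHexDigits(raw, j)) {
                return i;
            }
            i = j + 3;        // skip the four hexadecimal digits of a complete escape
            backslashRun = 0;
        }
        return -1;
    }

    // True if four ASCII hexadecimal digits start at index 'start'.
    private static boolean hasFourHexDigits(CharSequence raw, int start) {
        if (start + 4 > raw.length()) {
            return false;
        }
        for (int k = start; k < start + 4; k++) {
            char h = raw.charAt(k);
            boolean hex = (h >= '0' && h <= '9')
                    || (h >= 'a' && h <= 'f')
                    || (h >= 'A' && h <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The backslashes are doubled here so that this file itself compiles.
        System.out.println(firstTruncatedEscape("The \\u2122 symbol")); // -1
        System.out.println(firstTruncatedEscape("The \\u212 symbol"));  // 4
    }
}
```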

Raw string literals are unique in that they avoid Unicode escape processing. The string literal:

"\\u2122=\u2122"

contains, after Unicode escape translation, the nine characters (TM stands for the trademark symbol):

\ \ u 2 1 2 2 = TM

whereas the raw string literal:

`\\u2122=\u2122`

represents a string of 14 characters:

\ \ u 2 1 2 2 = \ u 2 1 2 2
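
The counts can be checked with ordinary string literals on an existing compiler. This is only a sketch: raw string literals are a proposal, so the raw form is emulated by doubling every backslash in an ordinary string literal, and the class name EscapeCounts is hypothetical. The nine characters listed for the string literal are counted before escape sequence processing (3.10.6), which then turns the doubled backslash into a single one, so the String observed at run time has eight characters.

```java
final class EscapeCounts {

    // Ordinary string literal: the escape after the equals sign is translated
    // to the trademark character, then the doubled backslash collapses to one.
    static final String ORDINARY = "\\u2122=\u2122";

    // Emulation of the proposed raw literal: every backslash is doubled so
    // that neither Unicode escapes nor escape sequences alter the text.
    static final String RAW_EMULATED = "\\\\u2122=\\u2122";

    public static void main(String[] args) {
        System.out.println(ORDINARY.length());     // 8
        System.out.println(RAW_EMULATED.length()); // 14
    }
}
```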

Since raw string literals do not contain Unicode escapes that could be considered truncated, this is legal:

String tm = `The \u212 symbol`;

but the comment in the following code is not legal, since it contains a truncated Unicode escape outside the raw string literal:

String tm = `The \u212 symbol`;  // We use \u212 because ...

3.10 Literals

Literal:

IntegerLiteral

FloatingPointLiteral

BooleanLiteral

CharacterLiteral

StringLiteral

RawStringLiteral

NullLiteral

3.10.7 Raw String Literals

A raw string literal consists of one or more characters enclosed in matching delimiters, each a sequence of one or more ASCII backtick characters. Characters that would be represented with escape sequences (3.10.6) in a string literal, such as newlines and double quotes, can be represented directly in a raw string literal. A raw string literal can also represent character sequences that would denote Unicode escapes anywhere else in the program; for this reason, the string represented by a raw string literal is derived from the raw Unicode characters of the program rather than from the result of Unicode escape translation.

RawStringLiteral:

RawStringDelimiter RawStringBody RawStringDelimiter

RawStringDelimiter:

` {`}

RawStringBody:

UnicodeInputCharacter {UnicodeInputCharacter}

It is a compile-time error if any backtick character in a RawStringDelimiter was lexically translated from the Unicode escape \u0060.

It is undesirable to allow the Unicode escape \u0060 (`) to serve as the opening delimiter because the same six-character sequence cannot serve as the closing delimiter. Thus, the following is illegal:

String s = \u0060Hi Bob`;

This boundary between the "outside" and the "inside" of a raw string literal is the only place in the Java programming language where a Unicode escape is disallowed.

The delimiters of a raw string literal must be balanced. It is a compile-time error if the opening RawStringDelimiter is not identical to the closing RawStringDelimiter.

The body of a raw string literal is the sequence of input characters and line terminators that served as input to the third step of lexical translation (3.5) in order to yield the RawStringBody of the literal. However, the string represented by a raw string literal is not the body. Instead, the string represented by a raw string literal is based on the sequence of raw Unicode characters that served as input to the first step of lexical translation (3.2) and subsequently became the body after the first and second steps. In particular, the string is the sequence of raw Unicode characters with the following translations applied, in order:

  1. an ASCII CR character followed by an ASCII LF character is translated to an ASCII LF character.

  2. an ASCII CR character is translated to an ASCII LF character.
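
The two translations can be sketched with ordinary String operations. This is an illustration only, assuming the raw Unicode characters that became the body are already available as a String; the names RawStringValue and normalizeLineTerminators are hypothetical.

```java
final class RawStringValue {

    // Applies the two translations in order: every CR LF pair becomes LF,
    // then every remaining CR becomes LF. Other characters are unchanged.
    static String normalizeLineTerminators(String rawBody) {
        return rawBody.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String value = normalizeLineTerminators("Hi,\r\n Bob\r");
        System.out.println(value.equals("Hi,\n Bob\n")); // true
    }
}
```

Because the translations operate on raw characters, a body that spells out a Unicode escape for CR or LF is left untouched; only actual CR characters in the input are affected.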

Examples of raw string literals:

`raw`         // the three characters r a w
`Hi, "Bob".`  // the ten characters H i , SP " B o b " .
`\(.\)\1`     // the seven characters \ ( . \ ) \ 1
`Hi,
 Bob
`             // the nine characters H i , LF SP B o b LF
````````````
Hello, world
````````````    // the 14 characters LF H e l l o , SP w o r l d LF
`\n`            // the two characters \ n (not LF)
`\uvw`          // the four characters \ u v w (not a Unicode escape)
`\u0060`        // the six characters \ u 0 0 6 0 (not `)
`\u000a`        // the six characters \ u 0 0 0 a (not LF)
`\u000d\u000a`  // the 12 characters \ u 0 0 0 d \ u 0 0 0 a

When this specification says that a raw string literal contains a particular character or sequence of characters, or that a particular character or sequence of characters is in a raw string literal, it means that the string represented by the raw string literal (as opposed to the body of the raw string literal) contains the character or sequence of characters.

A raw string literal may contain a backtick character in any position except the beginning or the end.

The lexical grammar implies that the string represented by a raw string literal is non-empty, and does not begin or end with a backtick character. Denoting an empty raw string literal:

String s = ``;  // Illegal

is not possible because the backticks are interpreted as the opening delimiter of a raw string literal that does not finish before the end of the compilation unit. Beginning the string with a backtick:

String s = `` is the backtick character`;

is not possible for the same reason. Ending the string with a backtick is not possible because the delimiters are unbalanced:

String s = `Don't forget the backtick character ``;

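If a string must begin or end with a backtick, then the backtick must be kept apart from the delimiters of the raw string literal, either by padding the raw string literal with another character or, as here, by supplying the backtick in a separate string literal and concatenating: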

String s = "`" + ` is the backtick character, and so is ` + "`";

The number of backtick characters in the opening delimiter (and thus, in the closing delimiter) must be chosen with regard for the presence of backtick characters in the raw string literal. If a raw string literal contains a sequence of one or more backtick characters preceded and followed by non-backtick characters, then the length of the sequence must differ from the number of backticks in the opening delimiter, or a compile-time error occurs.

Examples of raw string literals that contain backticks:

    `Hi, ``Bob`` and ```Jim```.`

    ``Hi, `Bob` and ```Jim```.``

    ```Hi, `Bob` and ``Jim``.```

    `` ` ``      // the three characters SP ` SP
    ``` `` ```   // the four characters SP ` ` SP
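
The interaction between delimiter length and embedded backticks can be sketched as a scanner. The code below shows one straightforward way to recognize a raw string literal under the rules above; it is not an algorithm mandated by this specification, and the names RawStringScan and bodyOf are hypothetical. It counts the backticks of the opening delimiter, then looks for the next maximal run of backticks of exactly that length. Runs of any other length are treated as part of the body.

```java
final class RawStringScan {

    // Given text whose first character is a backtick, returns the body of the
    // raw string literal starting there, or null if no run of backticks
    // matching the opening delimiter is found before the end of the text.
    static String bodyOf(String source) {
        int open = runLength(source, 0);
        if (open == 0) {
            throw new IllegalArgumentException("source must start with a backtick");
        }
        int i = open;
        while (i < source.length()) {
            int run = runLength(source, i);
            if (run == 0) {
                i++;                               // ordinary body character
            } else if (run == open) {
                return source.substring(open, i);  // matching closing delimiter
            } else {
                i += run;                          // run of another length: body
            }
        }
        return null;                               // literal never finishes
    }

    // Length of the maximal run of backticks starting at index 'from'.
    private static int runLength(String s, int from) {
        int n = 0;
        while (from + n < s.length() && s.charAt(from + n) == '`') {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(bodyOf("``Hi, `Bob` and ```Jim```.``"));
        System.out.println(bodyOf("``;")); // null
    }
}
```

In this sketch a run of the same length as the opening delimiter always ends the literal, which is consistent with the rule that such a run may not appear inside the intended content.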

A raw string literal is always of type String (4.3.3).

At run time, a raw string literal is evaluated to a reference to an instance of type String that corresponds to the string represented by the raw string literal. Raw string literals are interned in the same manner as string literals.
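
For reference, the interning behavior of ordinary string literals can be observed as follows. The sketch uses only string literals, since raw string literals are merely proposed here; the class and method names are hypothetical. Under this proposal a raw string literal would be expected to behave like the literals shown.

```java
final class InternSketch {

    // Literals and constant expressions with the same content refer to the
    // same String instance.
    static boolean literalsShareInstance() {
        String a = "cde";
        String b = "c" + "de"; // constant expression, computed at compile time
        return a == b;
    }

    // A String created explicitly at run time is a distinct instance, but
    // interning it yields the instance shared by the literals.
    static boolean freshInstanceIsDistinct() {
        String a = "cde";
        String c = new String("cde");
        return a != c && a == c.intern();
    }

    public static void main(String[] args) {
        System.out.println(literalsShareInstance());   // true
        System.out.println(freshInstanceIsDistinct()); // true
    }
}
```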

Raw string literals can be used wherever an instance of String is allowed, such as in the string concatenation operator (15.18.1) and when calling methods of String:

    System.out.println("abc" + `cde`);
    `1+1 is ` + String.valueOf(2)
    String cde = `abcde`.substring(2);

Additional changes