Unicode escapes which denote normal 7-bit ASCII characters are very often a bad idea, making Java code hard to read.
(There may be some very specialized use cases where an entire Java program should be encoded as Unicode escapes, for example to avoid whitespace and most of the rest of the ASCII alphabet. But those are vanishingly rare, even if they are a legitimate use of the Java language.)
The worst cases are when the encoded 7-bit ASCII character itself controls the boundary of a lexical element of the Java source, notably a line terminator, a comment, a string or character literal delimiter, or an escape itself.
Here is a taste of what I mean. The following lines are all difficult to read because unicode escapes expand to ASCII characters which modify token boundaries:
```
/*XTCOMMENT \u002A/ int x */
//XCOMMENT \u000A int x
static final String XNEWLINE = "\u000A";
static final String XDQUOTE = "\u0022";
static final String XSQUOTE = "" + '\u0027';
static final String XESCAPE = "\u005C";
```
Exercise for the reader: Which are legal and which are illegal inputs?
More details are at https://bugs.openjdk.java.net/browse/JDK-8269406?focusedCommentId=14430157&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14430157
Such an obnoxiously obfuscated program is the wrong kind of OOP, indeed!
There should be a lint option (turned on along with the other options by a plain `-Xlint`) which flags such token-changing unicode escapes.
A very narrowly focused lint check would emit lint warnings in the above cases, and no others: If a unicode escape encoded a newline, or a / or * which introduces or terminates a comment, or one of the characters " ' \ outside of a comment, then it would be flagged as an obfuscation.
A more widely conceived lint check would simply flag any unicode escape for a printing character (or a space or either newline) in 7-bit ASCII.
The check could ignore the interiors of comments, but certainly not the edges of them. That is, encoded newlines in //-comments and any encoding in a */ that would end a comment should be flagged.
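To make the two rules concrete, here is a rough sketch of a scanner over raw source text. It is only an illustration, not how javac would do it: the class and method names (`UnicodeEscapeLint`, `scan`, and so on) are hypothetical, it counts backslashes in the raw text only (which may not match javac in corner cases such as a backslash that itself comes from an escape), and it does not track comment state. The narrow rule is therefore over-approximated: every encoded `/`, `*`, quote, or backslash is flagged, wherever it appears.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the narrow and wide rules; not javac's implementation. */
final class UnicodeEscapeLint {

    /** Narrow rule, over-approximated: characters that can move a token boundary. */
    static boolean isBoundaryChar(int c) {
        return c == '\n' || c == '\r' || c == '/' || c == '*'
            || c == '"' || c == '\'' || c == '\\';
    }

    /** Wide rule: any printing 7-bit ASCII character, a space, LF, or CR. */
    static boolean isPlainAscii(int c) {
        return (c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\r';
    }

    /** Value of one ASCII hex digit, or -1 if the character is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes four hex digits starting at index from, or returns -1 if malformed. */
    private static int decodeHex4(String s, int from) {
        if (from + 4 > s.length()) return -1;
        int value = 0;
        for (int k = 0; k < 4; k++) {
            int h = hexValue(s.charAt(from + k));
            if (h < 0) return -1;
            value = value * 16 + h;
        }
        return value;
    }

    /** Returns the offsets of the backslashes that start a flagged escape. */
    static List<Integer> scan(String source, boolean wide) {
        List<Integer> offsets = new ArrayList<>();
        int n = source.length();
        int i = 0;
        while (i < n) {
            if (source.charAt(i) != '\\') { i++; continue; }
            int runStart = i;
            while (i < n && source.charAt(i) == '\\') i++;
            // Only the last backslash of an odd-length run can start an escape.
            if ((i - runStart) % 2 == 0 || i >= n || source.charAt(i) != 'u') continue;
            int j = i;
            while (j < n && source.charAt(j) == 'u') j++;   // one or more 'u' markers
            int value = decodeHex4(source, j);
            if (value < 0) continue;                        // malformed escape, not our concern here
            if (wide ? isPlainAscii(value) : isBoundaryChar(value)) offsets.add(i - 1);
            i = j + 4;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Raw text containing an encoded double quote and an encoded letter A.
        String line = "String q = \"\\u0022\"; int \\u0041 = 1;";
        System.out.println("narrow: " + scan(line, false));
        System.out.println("wide:   " + scan(line, true));
    }
}
```

A real check would sit in the tokenizer, where comment and literal context is already known, so that comment interiors can be skipped while comment edges are still checked.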
We can roughly classify uses of unicode escapes into three buckets: Legitimate, obnoxious, and marginal.
Legitimate uses encode names, string literals, and character literals which cannot otherwise be written in a reduced character set (such as 7-bit ASCII). These uses are why we have unicode escapes in the first place.
A second legitimate use is a unicode escape for a 7-bit ASCII *control* character, such as NUL ('\0'), within a string or character literal body. Since C-style escapes are arguably clearer, one might argue for a lint warning if a unicode escape encodes one of the ASCII values below 32. (Here I do *not* mean whitespace; \n \r \t are both whitespace and control characters, and there is no legitimate reason to obfuscate them.) But I think some users would object to that warning, on the grounds that, if you are going to use an escape at all, you should be able to pick one style consistently, and unicode escapes cover the most code points.
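A small illustration of the two legitimate uses, kept in a pure-ASCII source file. The class name `LegitimateEscapes` and its members are made up for this example.

```java
/** Hypothetical examples of escapes that a lint check should leave alone. */
final class LegitimateEscapes {

    // Non-ASCII text in an ASCII-only file: e-acute written as an escape.
    static final String CAFE = "caf\u00e9";

    // The same trick works in a name, since e-acute is a legal Java letter.
    static final int caf\u00e9Count = 1;

    // A control character written two ways: unicode escape and C-style octal escape.
    static final char NUL_UNICODE = '\u0000';
    static final char NUL_OCTAL = '\0';

    static int count() {
        return caf\u00e9Count;
    }

    public static void main(String[] args) {
        System.out.println(CAFE.length());
        System.out.println(NUL_UNICODE == NUL_OCTAL);
        System.out.println(count());
    }
}
```

Both spellings of NUL denote the same character, which is why a warning here would be a matter of taste rather than of safety.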
Obnoxious uses are those described above: They obfuscate token boundaries.
The main marginal cases are any use of unicode escapes to encode 7-bit characters which comprise Java tokens. These include:
- whitespace: `\u0009` for tab (maybe `\u0020` for space, NOT `\u000A` for LF)
- names: `\u006foop` for `ooop`
- expression operators: `i\u002b=2` for `i+=2`
- braces and brackets of various sorts
- string and character literals: `'\u005cn'` for `'\n'`
These usages are probably rare enough that `-Xlint:unicode-escapes`, when turned on, should flag them as well, under the more expansive rule of "no unicode escapes for 7-bit ASCII code points", with the exception of non-whitespace invisible control characters.
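For a feel of how the marginal cases read in practice, here is a short sketch that I would expect to compile, since the escapes are translated before tokenization. The class name `MarginalEscapes` is hypothetical, and only the name, operator, and bracket cases are shown.

```java
/** Hypothetical example: legal Java that the expansive rule would flag. */
final class MarginalEscapes {

    static int compute() {
        int \u006foop = 1;                 // declares a variable named ooop
        ooop \u002b= 2;                    // same as: ooop += 2
        int[] a = \u007b ooop \u007d;      // same as: { ooop }
        return a\u005b0\u005d;             // same as: a[0]
    }

    public static void main(String[] args) {
        System.out.println(compute());
    }
}
```

None of these escapes moves a token boundary, which is what separates them from the obnoxious cases, but none of them helps a reader either.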
Relates to:
- JDK-8269150 UnicodeReader not translating \u005c\\u005d to \\] (Closed)
- JDK-8269406 3.3: Clarify the effect of Unicode escape processing (Resolved)
- JDK-8278542 javac could produce a warning for suspicious uses of bi-directional Unicode control characters (Open)