Summary
Introduce a new method, String::encodeEscapes, which replaces non-ASCII and non-printing characters in the string with escape sequences or Unicode escapes.
Problem
If a string contains non-printing or non-ASCII characters, then the content may be unparsable, unreadable or insecure when:
- part of source for a programming language
- application input
- debug reporting
There is also the serious issue of raw input from an external source being injected into source text that is sensitive to injection vulnerabilities (e.g. SQL).
Solution
Convert the string content to use escape sequences for non-ASCII and non-printing characters. This would allow developers to construct parsable/readable strings.
Note that this method is the near reciprocal of String::translateEscapes. That is, string.encodeEscapes().translateEscapes().equals(string) will always be true. However, string.translateEscapes().encodeEscapes().equals(string) will not always be true, since the source string may write the same character in either escaped or non-escaped form.
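The two directions can be illustrated with a minimal sketch. The static helper encodeEscapes below is a hypothetical stand-in that mirrors the proposed behaviour; it is not an existing JDK API. The sketch is deliberately limited to escape sequences that String::translateEscapes has handled since Java 15 (named escapes and octal escapes), so it makes no claim about how translateEscapes treats Unicode escapes.

```java
public class EscapeRoundTrip {
    // Hypothetical stand-in for the proposed String::encodeEscapes (assumption, not a JDK API).
    public static String encodeEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length() + (s.length() >> 2));
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '\b': sb.append("\\b"); break;
                case '\t': sb.append("\\t"); break;
                case '\n': sb.append("\\n"); break;
                case '\f': sb.append("\\f"); break;
                case '\r': sb.append("\\r"); break;
                case '"':  sb.append("\\\""); break;
                case '\'': sb.append("\\'"); break;
                case '\\': sb.append("\\\\"); break;
                default:
                    if (' ' <= ch && ch <= '~') {
                        sb.append(ch);               // visible ASCII is kept as-is
                    } else {
                        // backslash, 'u', then four lowercase hex digits
                        sb.append(String.format("\\u%04x", (int) ch));
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Forward direction: encoding and then translating restores the original.
        String original = "say \"hi\"\n\tC:\\temp";
        System.out.println(encodeEscapes(original).translateEscapes().equals(original)); // prints true

        // Reverse direction: the octal escape for U+0009 and the named tab escape
        // denote the same character, so the original spelling is not recoverable.
        String source = "a\\11b"; // the five characters: a, backslash, 1, 1, b
        System.out.println(encodeEscapes(source.translateEscapes()).equals(source));     // prints false
    }
}
```

In the second case the translated string contains a tab, which is encoded with the named tab escape rather than the octal form the source used.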
Specification
Update translateEscapes to reference encodeEscapes:
diff --git a/src/java.base/share/classes/java/lang/String.java b/src/java.base/share/classes/java/lang/String.java
index 67ee641fba1..ab4ebcfb4b9 100644
--- a/src/java.base/share/classes/java/lang/String.java
+++ b/src/java.base/share/classes/java/lang/String.java
@@ -4238,6 +4238,8 @@ private static int outdent(List<String> lines) {
* @jls 3.10.7 Escape Sequences
* @jls 3.3 Unicode Escapes
*
+ * @see String#encodeEscapes()
+ *
* @since 15
*/
public String translateEscapes() {
@@ -4324,6 +4326,77 @@ public String translateEscapes() {
return new String(chars, 0, to);
}
Add encodeEscapes:
+ /**
+ * Translate characters to their escaped equivalents, if necessary, such that the
+ * resulting string, when embedded in double quotes, can be parsed by the compiler
+ * to reproduce this string.
+ * <p>
+ * Characters are translated as follows:
+ * <table class="striped">
+ * <caption style="display:none">Translation</caption>
+ * <thead>
+ * <tr>
+ * <th scope="col">Character</th>
+ * <th scope="col">Name</th>
+ * <th scope="col">Translation</th>
+ * </tr>
+ * </thead>
+ * <tbody>
+ * <tr>
+ * <th scope="row">{@code U+0008}</th>
+ * <td>backspace {@code (\u005Cb)}</td>
+ * <td>{@code \u005C\u005Cb}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+0009}</th>
+ * <td>horizontal tab {@code (\u005Ct)}</td>
+ * <td>{@code \u005C\u005Ct}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+000A}</th>
+ * <td>line feed {@code (\u005Cn)}</td>
+ * <td>{@code \u005C\u005Cn}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+000C}</th>
+ * <td>form feed {@code (\u005Cf)}</td>
+ * <td>{@code \u005C\u005Cf}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+000D}</th>
+ * <td>carriage return {@code (\u005Cr)}</td>
+ * <td>{@code \u005C\u005Cr}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+0022}</th>
+ * <td>double quote {@code (\u005C")}</td>
+ * <td>{@code \u005C\u005C"}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+0027}</th>
+ * <td>single quote {@code (\u005C')}</td>
+ * <td>{@code \u005C\u005C'}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+005C}</th>
+ * <td>backslash {@code (\u005C\u005C)}</td>
+ * <td>{@code \u005C\u005C\u005C\u005C}</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+0020-U+007E}</th>
+ * <td>visible ASCII characters, excluding<p>double quote, single quote and backslash</td>
+ * <td>as-is</td>
+ * </tr>
+ * <tr>
+ * <th scope="row">{@code U+XXXX}</th>
+ * <td>all other characters</td>
+ * <td>{@code \u005C\u005CuXXXX}</td>
+ * </tr>
+ * </tbody>
+ * </table>
+ * Example:
+ * {@snippet lang=JAVA :
+ * String encoded = "\u2022This is a line.\n\t followed by this line.".encodeEscapes();
+ * System.out.println(encoded.equals("\\u2022This is a line.\\n\\t followed by this line."));
+ * }
+ * will print out {@code true}.
+ * <p>
+ * The result of this method is lossless and as such the original string can always be
+ * reproduced using {@link String#translateEscapes()}. Also because of losslessness,
+ * specific characters that don't require escaping can be reverted by simply using a
+ * replace method. For example: if newlines need to be maintained in the resulting
+ * string then just apply {@code replace("\\n", "\n")} to the result.
+ *
+ * @return string with characters encoded using escape sequences and Unicode escapes
+ *
+ * @since 23
+ *
+ * @see #translateEscapes()
+ *
+ * @jls 3.10.7 Escape Sequences
+ * @jls 3.3 Unicode Escapes
+ *
+ * @implNote If no characters were translated then the original string will be returned.
+ *
+ * @implSpec {@code string.encodeEscapes().translateEscapes().equals(string)} will
+ * always be {@code true}. However, this method is not the reciprocal of
+ * {@link #translateEscapes()}, as any string yielded by
+ * {@link #translateEscapes()} may correspond to many variations of the original string.
+ */
+ public String encodeEscapes() {
+     int length = length();
+     StringBuilder sb = new StringBuilder(length + (length >> 2));
+     for (int i = 0; i < length; i++) {
+         char ch = charAt(i);
+         switch (ch) {
+             case '\b': sb.append('\\'); sb.append('b'); break;
+             case '\t': sb.append('\\'); sb.append('t'); break;
+             case '\n': sb.append('\\'); sb.append('n'); break;
+             case '\f': sb.append('\\'); sb.append('f'); break;
+             case '\r': sb.append('\\'); sb.append('r'); break;
+             case '\"': sb.append('\\'); sb.append('\"'); break;
+             case '\'': sb.append('\\'); sb.append('\''); break;
+             case '\\': sb.append('\\'); sb.append('\\'); break;
+             default: if (' ' <= ch && ch <= '~') {
+                 sb.append(ch);
+             } else {
+                 String hex = Integer.toHexString(ch);
+                 sb.append('\\');
+                 sb.append('u');
+                 sb.repeat('0', 4 - hex.length());
+                 sb.append(hex);
+             }
+         }
+     }
+     return sb.length() != length ? sb.toString() : this;
+ }
+
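The translation table can be exercised with a short sketch. The static method encodeEscapes here is a hypothetical stand-in for the proposed instance method, written to follow the table and the implNote above; it is an illustration under those assumptions, not the JDK implementation.

```java
public class EncodeEscapesDemo {
    // Hypothetical static stand-in for the proposed String::encodeEscapes (assumption, not a JDK API).
    public static String encodeEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length() + (s.length() >> 2));
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '\b' -> sb.append("\\b");
                case '\t' -> sb.append("\\t");
                case '\n' -> sb.append("\\n");
                case '\f' -> sb.append("\\f");
                case '\r' -> sb.append("\\r");
                case '"'  -> sb.append("\\\"");
                case '\'' -> sb.append("\\'");
                case '\\' -> sb.append("\\\\");
                default -> {
                    if (' ' <= ch && ch <= '~') {
                        sb.append(ch);   // visible ASCII is kept as-is
                    } else {
                        // backslash, 'u', then four lowercase hex digits
                        sb.append(String.format("\\u%04x", (int) ch));
                    }
                }
            }
        }
        // Every translation lengthens the output, so equal lengths mean nothing changed.
        return sb.length() != s.length() ? sb.toString() : s;
    }

    public static void main(String[] args) {
        // Non-ASCII characters (a bullet and an e-acute) become Unicode escapes.
        System.out.println(encodeEscapes("\u2022 caf\u00e9"));

        // Characters are processed per UTF-16 code unit, so a surrogate pair
        // produces two Unicode escapes.
        System.out.println(encodeEscapes("\uD83D\uDE00"));

        // A string that needs no translation is returned unchanged.
        String plain = "plain ASCII";
        System.out.println(encodeEscapes(plain) == plain); // prints true
    }
}
```

Because the loop works on char values, a supplementary character is emitted as the escapes of its two surrogates, which the compiler reads back as the same pair.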
CSR of: JDK-8253438 Add String::encodeEscapes (Open)