CSR
Resolution: Approved
P3
None
behavioral
medium
Java API, System or security property
SE
Summary
Use UTF-8 as the default charset for the Java SE APIs, so that APIs which depend on the default charset behave consistently across all JDK implementations and independently of the user’s operating system, locale, and configuration.
Problem
APIs that use the default charset are a hazard for developers who are new to the Java platform. They are also a bugbear for experienced developers. Consider an application that creates a java.io.FileWriter with its 1-arg constructor and uses it to write some text to a file. Writing the text encodes it into a sequence of bytes using the default charset. Another application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its 1-arg constructor and uses it to read the text from the file. Reading the file decodes the bytes into a sequence of characters using the default charset. If the default charset is different when reading, the resulting text may be silently corrupted or incomplete (these APIs replace erroneous input rather than fail).
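The hazard can be illustrated with a minimal sketch that simulates a writer and a reader whose default charsets differ. The class and method names here (CharsetMismatchDemo, writeThenRead) are hypothetical, and the charsets are passed explicitly only to stand in for two differently configured machines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // mimicking a writer and a reader that disagree on the default charset.
    static String writeThenRead(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytes = text.getBytes(writerCharset);
        return new String(bytes, readerCharset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"

        // Same charset on both sides: the text survives the round trip.
        System.out.println(writeThenRead(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Different charsets: no exception is thrown, but the text is corrupted.
        System.out.println(writeThenRead(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

In the second call the two UTF-8 bytes of "é" are decoded as two separate ISO-8859-1 characters, so the result has five characters instead of four and no error is reported.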
Developers that are familiar with the hazard may choose to use methods that specify the charset (either by charset name or Charset) but the resulting code is more verbose. Furthermore, using APIs that specify the charset may inhibit the use of some Java Language features (Method References in particular). Sometimes developers attempt to set the default charset by means of the system property file.encoding but this has never been a supported mechanism (and may not actually be effective, especially when changed after the Java virtual machine has been initialized).
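As a sketch of the more verbose alternative, the following passes the charset explicitly to both the writer and the reader, so the result does not depend on the default charset. It assumes JDK 11 or later, where FileWriter and FileReader have constructors that accept a Charset; the class name and temporary file are hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ExplicitCharsetDemo {
    // Writes one line of text to a temporary file and reads it back,
    // naming UTF-8 explicitly on both sides.
    static String roundTrip(String text) {
        try {
            File file = Files.createTempFile("charset-demo", ".txt").toFile();
            try {
                try (FileWriter writer = new FileWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(text);
                }
                try (BufferedReader reader =
                         new BufferedReader(new FileReader(file, StandardCharsets.UTF_8))) {
                    return reader.readLine();
                }
            } finally {
                file.delete(); // best-effort cleanup of the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("caf\u00e9"));
    }
}
```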
In JDK 17 and earlier, the name default is recognized as an alias for the US-ASCII charset. That is, Charset.forName("default") produces the same result as Charset.forName("US-ASCII"). The default alias was introduced in JDK 5 to ensure that legacy code which used sun.io converters could migrate to the java.nio.charset framework introduced in JDK 1.4.
It would be extremely confusing for JDK 18 to preserve default as an alias for US-ASCII when the default charset is specified to be UTF-8. It would also be confusing for default to mean US-ASCII if the user configures the default charset to its pre-JDK 18 value by setting -Dfile.encoding=COMPAT on the command line. Redefining default in JDK 18 to be an alias for the default charset (whether UTF-8 or user-configured) would cause subtle behavioral changes in the (few) programs that call Charset.forName("default").
Continuing to recognize default in JDK 18 would be prolonging a poor choice. It is not defined by the Java SE Platform, nor is it recognized by IANA as the name or alias of any character set. In fact, for ASCII-based network protocols, IANA encourages use of the canonical name US-ASCII rather than just ASCII or obscure aliases such as ANSI_X3.4-1968; plainly, use of the JDK-specific alias default runs counter to that advice. Java programs can use the constant StandardCharsets.US_ASCII to make their intent clear, rather than passing a string to Charset.forName(...).
Solution
The specification of the Charset.defaultCharset() API will be changed to specify that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. All APIs that use the default charset will link to Charset.defaultCharset() if they don't already do so. System.out and System.err are the exceptions: they continue to use the Console.charset() charset as their default charset.
To mitigate the compatibility impact, the file.encoding property will be documented (in an implementation note) so that it can be set on the command line to the value "COMPAT" (i.e. -Dfile.encoding=COMPAT). When started with this value, the default charset will be determined from the locale and default encoding, as in the long-standing behavior; this is the same encoding as the value of the native.encoding system property.
In addition, the file.encoding property will also be documented to allow it to be set on the command line with the value "UTF-8", which is essentially a no-op.
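To see the effect of these settings, a small program can report the default charset alongside the two system properties. This is only a sketch: the class name is hypothetical, and native.encoding may be null on releases that predate the property.

```java
import java.nio.charset.Charset;

class DefaultCharsetReport {
    // Builds a one-line summary of the default charset and the
    // file.encoding and native.encoding system properties.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", native.encoding = " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with no options and once with -Dfile.encoding=COMPAT on the command line shows whether the default charset follows UTF-8 or the native encoding on a given machine.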
With regard to the charset name default, Charset.forName("default") will throw an UnsupportedCharsetException in JDK 18. This will give developers a chance to detect use of the idiom and migrate to either US-ASCII or to the result of Charset.defaultCharset().
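One possible migration, sketched below, is to stop passing the name default to Charset.forName and to map it explicitly to US-ASCII where legacy configuration may still contain it. The helper names resolve and legacyAliasStillRecognized are hypothetical and not part of any JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

class DefaultAliasMigration {
    // Resolves a configured charset name, mapping the legacy "default"
    // alias to US-ASCII explicitly so the result is the same on every JDK.
    static Charset resolve(String name) {
        if ("default".equalsIgnoreCase(name)) {
            return StandardCharsets.US_ASCII;
        }
        return Charset.forName(name);
    }

    // Reports whether the running JDK still recognizes the "default" alias.
    static boolean legacyAliasStillRecognized() {
        try {
            Charset.forName("default");
            return true;
        } catch (UnsupportedCharsetException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("resolve(\"default\") = " + resolve("default"));
        System.out.println("alias recognized by this JDK: " + legacyAliasStillRecognized());
    }
}
```

Code that actually wanted the default charset, rather than US-ASCII, would call Charset.defaultCharset() instead of resolve.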
Specification
Add the following row to the table in the Implementation Note of the java.lang.System#getProperties() method:
* <tr><th scope="row">{@systemProperty file.encoding}</th>
* <td>The name of the default charset, defaults to {@code UTF-8}.
* The property may be set on the command line to the value
* {@code UTF-8} or {@code COMPAT}. If set on the command line to
* the value {@code COMPAT} then the value is replaced with the
* value of the {@code native.encoding} property during startup.
* Setting the property to a value other than {@code UTF-8} or
* {@code COMPAT} leads to unspecified behavior.
* </td></tr>
Modify the following paragraph in the class description of the java.nio.charset.Charset class from:
* <p> Every instance of the Java virtual machine has a default charset, which
* may or may not be one of the standard charsets. The default charset is
* determined during virtual-machine startup and typically depends upon the
* locale and charset being used by the underlying operating system. </p>
to:
* <p> Every instance of the Java virtual machine has a default charset, which
* is {@code UTF-8} unless changed in an implementation specific manner. Refer to
* {@link #defaultCharset()} for more detail.
Modify the method description of java.nio.charset.Charset#defaultCharset() from:
/**
* Returns the default charset of this Java virtual machine.
*
* <p> The default charset is determined during virtual-machine startup and
* typically depends upon the locale and charset of the underlying
* operating system.
*
* @return A charset object for the default charset
*
* @since 1.5
*/
to:
/**
* Returns the default charset of this Java virtual machine.
*
* <p> The default charset is {@code UTF-8}, unless changed in an
* implementation specific manner.
*
* @implNote An implementation may override the default charset with
* the system property {@code file.encoding} on the command line. If the
* value is {@code COMPAT}, the default charset is derived from
* the {@code native.encoding} system property, which typically depends
* upon the locale and charset of the underlying operating system.
*
* @return A charset object for the default charset
* @see <a href="../../lang/System.html#file.encoding">file.encoding</a>
* @see <a href="../../lang/System.html#native.encoding">native.encoding</a>
*
* @since 1.5
*/
Remove "platform" from the default charset wording in the following method descriptions, e.g., change "the platform's default charset" to "the default charset":
java/io/ByteArrayOutputStream
java/io/FileReader
java/io/FileWriter
java/io/InputStreamReader
java/io/OutputStreamWriter
java/io/PrintStream
java/io/PrintWriter
java/net/URLDecoder
java/net/URLEncoder
java/util/Scanner
In addition, change the code example in the java/io/OutputStreamWriter class description from:
* <pre>
* Writer out
* = new BufferedWriter(new OutputStreamWriter(System.out));
* </pre>
to:
* <pre>
* Writer out
* = new BufferedWriter(new OutputStreamWriter(anOutputStream));
* </pre>
This is a leftover from the related CSR.
csr of JDK-8260265 UTF-8 by Default (Resolved)