JDK-8260266

UTF-8 by Default


    • Type: CSR
    • Resolution: Approved
    • Priority: P3
    • 18
    • core-libs
    • None
    • behavioral
    • medium
      There are no risks in some environments:

      - The default charset on macOS has been UTF-8 for several releases except for the POSIX C locale.
      - The default charset in many (but not all) Linux environments is UTF-8 so these environments will not observe a change.
      - Many server applications are started with `-Dfile.encoding=UTF-8` so they will also not observe any change.

      In other environments, the risk of changing the default charset to UTF-8 after 20+ years may be significant. We expect the main impact will be to users of Microsoft Windows in Asian locales and maybe some server environments in Asian/other locales.

      - Upgrading: e.g. an application has been running for years with SJIS as the default charset. When upgraded to a JDK release that uses UTF-8 as the default charset it experiences problems when reading files that are encoded as SJIS. For this example, the application could be changed to specify SJIS when opening the file. If the code cannot be changed then running with `-Dfile.encoding=COMPAT` will force the default charset to be SJIS until the application is updated or the file converted to UTF-8.

      - Environments where there are several JDK versions in use, e.g. one user using an older JDK release where SJIS is the default charset, another where UTF-8 is the default charset.

      - `Charset.forName("default")` will throw `UnsupportedCharsetException`. However, a search of the corpus found no direct calls, and the number of indirect calls that use `default` is very small. We might provide a compatibility option in the future, but there is no plan to do so for now.
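
The upgrade scenario above suggests changing the application to name the legacy charset when it opens the file. A minimal sketch of that change follows; the class name `LegacyFileReader` and the helper `readLegacy` are hypothetical. In the SJIS scenario the caller would pass `Charset.forName("Shift_JIS")`, while the demo in `main` uses ISO-8859-1 only so that it runs on any runtime.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a file with an explicitly named charset
// instead of relying on the default charset.
final class LegacyFileReader {

    // Decodes the whole file using the charset the file was written in.
    static String readLegacy(Path file, Charset legacy) throws IOException {
        return Files.readString(file, legacy);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the legacy charset; an SJIS application would use
        // Charset.forName("Shift_JIS") here (assumes that charset is available).
        Charset legacy = StandardCharsets.ISO_8859_1;
        Path tmp = Files.createTempFile("legacy", ".txt");
        try {
            Files.write(tmp, "caf\u00e9".getBytes(legacy));
            System.out.println(readLegacy(tmp, legacy));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the charset is passed explicitly, the result no longer depends on which JDK release or `file.encoding` setting the application runs with.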
    • Java API, System or security property
    • SE

      Summary

      Use UTF-8 as the default charset for the Java SE APIs, so that APIs which depend on the default charset behave consistently across all JDK implementations and independently of the user’s operating system, locale, and configuration.

      Problem

      APIs that use the default charset are a hazard for developers who are new to the Java platform. They are also a bugbear for experienced developers. Consider an application that creates a java.io.FileWriter with its 1-arg constructor and uses it to write some text to a file. Writing the text encodes it into a sequence of bytes using the default charset. Another application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its 1-arg constructor and uses it to read the text from the file. Reading the file decodes the bytes to a sequence of characters/text using the default charset. If the default charset is different when reading then the resulting text may be silently corrupted or incomplete (as these APIs replace erroneous input, they don't fail).
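
The hazard described above can be sketched without two machines by passing the writer's and reader's charsets explicitly, as stand-ins for two different default charsets. The class `CharsetMismatchDemo` and its method `writeThenRead` are hypothetical names; the `FileWriter(File, Charset)` and `FileReader(File, Charset)` constructors require JDK 11 or later.

```java
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the two charset arguments simulate the default
// charsets of the writing and the reading environment.
final class CharsetMismatchDemo {

    static String writeThenRead(String text, Charset writerCs, Charset readerCs)
            throws IOException {
        File f = File.createTempFile("charset-demo", ".txt");
        try {
            try (Writer w = new FileWriter(f, writerCs)) {
                w.write(text);                 // encode with the writer's charset
            }
            StringBuilder sb = new StringBuilder();
            try (Reader r = new FileReader(f, readerCs)) {
                int c;
                while ((c = r.read()) != -1) { // decode with the reader's charset
                    sb.append((char) c);
                }
            }
            return sb.toString();
        } finally {
            f.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9";
        // Same charset on both sides: the text survives.
        System.out.println(text.equals(
                writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Different charsets: no exception is thrown, but the text differs.
        System.out.println(text.equals(
                writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));
    }
}
```

Neither call throws; the mismatch only shows up as different text, which is what makes the problem easy to miss.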

      Developers that are familiar with the hazard may choose to use methods that specify the charset (either by charset name or Charset) but the resulting code is more verbose. Furthermore, using APIs that specify the charset may inhibit the use of some Java Language features (Method References in particular). Sometimes developers attempt to set the default charset by means of the system property file.encoding but this has never been a supported mechanism (and may not actually be effective, especially when changed after the Java virtual machine has been initialized).
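
As an illustration of the point about method references, consider encoding strings to bytes. This is only a sketch, and the class `MethodRefCharset` and its field names are hypothetical: the no-arg `String.getBytes()` fits a method reference but uses the default charset, whereas naming the charset requires a lambda.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical illustration of the verbosity trade-off.
final class MethodRefCharset {

    // Concise, but the bytes produced depend on the default charset.
    static final Function<String, byte[]> IMPLICIT = String::getBytes;

    // Charset is explicit, but a lambda is needed to supply the argument.
    static final Function<String, byte[]> EXPLICIT =
            s -> s.getBytes(StandardCharsets.UTF_8);

    public static void main(String[] args) {
        String s = "caf\u00e9";
        System.out.println("implicit length: " + IMPLICIT.apply(s).length);
        System.out.println("explicit length: " + EXPLICIT.apply(s).length);
    }
}
```

With UTF-8 as the default charset, both functions produce the same bytes, so the shorter form no longer carries the portability risk.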

      In JDK 17 and earlier, the name default is recognized as an alias for the US-ASCII charset. That is, Charset.forName("default") produces the same result as Charset.forName("US-ASCII"). The default alias was introduced in JDK 5 to ensure that legacy code which used sun.io converters could migrate to the java.nio.charset framework introduced in JDK 1.4.

      It would be extremely confusing for JDK 18 to preserve default as an alias for US-ASCII when the default charset is specified to be UTF-8. It would also be confusing for default to mean US-ASCII if the user configures the default charset to its pre-JDK 18 value by setting -Dfile.encoding=COMPAT on the command line. Redefining default in JDK 18 to be an alias for the default charset (whether UTF-8 or user-configured) would cause subtle behavioral changes in the (few) programs that call Charset.forName("default").

      Continuing to recognize default in JDK 18 would be prolonging a poor choice. It is not defined by the Java SE Platform, nor is it recognized by IANA as the name or alias of any character set. In fact, for ASCII-based network protocols, IANA encourages use of the canonical name US-ASCII rather than just ASCII or obscure aliases such as ANSI_X3.4-1968 -- plainly, use of the JDK-specific alias default goes counter to that advice. Java programs can use the enum constant StandardCharsets.US_ASCII to make their intent clear, rather than passing a string to Charset.forName(...).

      Solution

      The specification of the Charset.defaultCharset() API will be changed to specify that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. All APIs that use the default charset will link to Charset.defaultCharset() if they don't already do so. System.out and System.err are the exceptions: they continue to use the charset returned by Console.charset().

      To mitigate the compatibility impact, the file.encoding property will be documented (in an implementation note) so that it can be set on the command line to the value "COMPAT" (i.e. -Dfile.encoding=COMPAT). When started with this value, the default charset will be determined from the locale and default encoding, as in the long-standing behavior; this is the same encoding as the value of the native.encoding system property.

      In addition, the file.encoding property will also be documented to allow it to be set on the command line with the value "UTF-8", essentially a no-op.
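
One way to observe the effect of these settings is to print the default charset alongside the two properties. The sketch below uses a hypothetical class `CharsetReport`; `native.encoding` may be `null` on releases that do not define that property.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic: shows what the running JVM resolved at startup.
final class CharsetReport {

    static String report() {
        return "default charset = " + Charset.defaultCharset()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", native.encoding = " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it once with no options and once with `-Dfile.encoding=COMPAT` should show whether the two modes differ in a given environment.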

      With regard to the charset name default, Charset.forName("default") will throw an UnsupportedCharsetException in JDK 18. This will give developers a chance to detect use of the idiom and migrate to either US-ASCII or to the result of Charset.defaultCharset().
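
For code that receives charset names from configuration and may still see `default`, the migration could be handled in one place. The following is a sketch under stated assumptions: the class `CharsetNames` and the method `resolve` are hypothetical, and mapping `default` to US-ASCII preserves the pre-JDK 18 meaning; an application that actually wants the default charset would return `Charset.defaultCharset()` instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical migration helper for the legacy "default" alias.
final class CharsetNames {

    static Charset resolve(String name) {
        // Keep the old meaning of the alias explicitly rather than
        // relying on Charset.forName("default").
        if ("default".equalsIgnoreCase(name)) {
            return StandardCharsets.US_ASCII;
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        System.out.println(resolve("default"));
        System.out.println(resolve("UTF-8"));
    }
}
```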

      Specification

      Add the following row to the table in the implementation note of the java.lang.System#getProperties() method.

       * <tr><th scope="row">{@systemProperty file.encoding}</th>
       *     <td>The name of the default charset, defaults to {@code UTF-8}.
       *     The property may be set on the command line to the value
       *     {@code UTF-8} or {@code COMPAT}. If set on the command line to
       *     the value {@code COMPAT} then the value is replaced with the
       *     value of the {@code native.encoding} property during startup.
       *     Setting the property to a value other than {@code UTF-8} or
       *     {@code COMPAT} leads to unspecified behavior.
       *     </td></tr>

      Modify the following paragraph in the class description of java.nio.charset.Charset class from:

       * <p> Every instance of the Java virtual machine has a default charset, which
       * may or may not be one of the standard charsets.  The default charset is
       * determined during virtual-machine startup and typically depends upon the
       * locale and charset being used by the underlying operating system. </p>

      to:

       * <p> Every instance of the Java virtual machine has a default charset, which
       * is {@code UTF-8} unless changed in an implementation specific manner. Refer to
       * {@link #defaultCharset()} for more detail.

      Modify the method description of java.nio.charset.Charset#defaultCharset() from:

        /**
         * Returns the default charset of this Java virtual machine.
         *
         * <p> The default charset is determined during virtual-machine startup and
         * typically depends upon the locale and charset of the underlying
         * operating system.
         *
         * @return  A charset object for the default charset
         *
         * @since 1.5
         */

      to:

      /**
       * Returns the default charset of this Java virtual machine.
       *
       * <p> The default charset is {@code UTF-8}, unless changed in an
       * implementation specific manner.
       *
       * @implNote An implementation may override the default charset with
       * the system property {@code file.encoding} on the command line. If the
       * value is {@code COMPAT}, the default charset is derived from
       * the {@code native.encoding} system property, which typically depends
       * upon the locale and charset of the underlying operating system.
       *
       * @return  A charset object for the default charset
       * @see <a href="../../lang/System.html#file.encoding">file.encoding</a>
       * @see <a href="../../lang/System.html#native.encoding">native.encoding</a>
       *
       * @since 1.5
       */

      Remove the reference to the platform from the default charset wording in the method descriptions of the following classes, e.g., change "the platform's default charset" to "the default charset":

      • java/io/ByteArrayOutputStream
      • java/io/FileReader
      • java/io/FileWriter
      • java/io/InputStreamReader
      • java/io/OutputStreamWriter
      • java/io/PrintStream
      • java/io/PrintWriter
      • java/net/URLDecoder
      • java/net/URLEncoder
      • java/util/Scanner

      In addition, change the code example in java/io/OutputStreamWriter class description from:

       * <pre>
       * Writer out
       *   = new BufferedWriter(new OutputStreamWriter(System.out));
       * </pre>

      to:

       * <pre>
       * Writer out
       *   = new BufferedWriter(new OutputStreamWriter(anOutputStream));
       * </pre>

      This is a leftover from the related CSR.

            naoto Naoto Sato
            naoto Naoto Sato
            Alan Bateman