Deprecate UTF-16-only String Representation


    • Type: JEP
    • Resolution: Unresolved
    • Priority: P4
    • Component/s: core-libs
    • Stuart Marks
    • Feature
    • Open
    • JDK

      Summary

      When the Compact Strings feature was introduced, it included the ability to disable Compact Strings and run the system in UTF-16-only mode. This JEP deprecates for removal the ability of the system to run in UTF-16-only mode.

      Motivation

      Prior to the introduction of Compact Strings, String objects had a single internal representation: an array of char values. Each char value occupies two bytes (16 bits), and the character data stored in the char array was encoded in UTF-16. Even if character data could be encoded in one byte, it was encoded in UTF-16 and occupied two bytes per character. This mode of operation is referred to as UTF-16-only mode.

      The introduction of Compact Strings by JEP 254 changed the internal representation of String objects. (This change is strictly internal; there were no public API changes.) The new internal String representation allows two alternative forms: one byte per character (with characters encoded in ISO Latin 1) or two bytes per character (encoded in UTF-16). Many strings can be encoded in ISO Latin 1, so storing them using only one byte per character provides considerable space savings compared to UTF-16-only mode.
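
      To illustrate the space difference described above, here is a minimal sketch that estimates the character-data size of a string under each mode. It is not the JDK's implementation: the class and the helpers fitsLatin1 and payloadBytes are hypothetical names introduced for illustration, and object and array headers are ignored.

```java
// Illustrative sketch only; not the JDK's internal String code.
class CompactStringSketch {

    // Hypothetical helper: true if every char lies in the ISO Latin 1
    // range (U+0000..U+00FF) and so fits in a single byte.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: bytes of character data (headers excluded).
    // compactStrings == false models UTF-16-only mode.
    static int payloadBytes(String s, boolean compactStrings) {
        if (compactStrings && fitsLatin1(s)) {
            return s.length();      // one byte per character
        }
        return 2 * s.length();      // two bytes per character (UTF-16)
    }

    public static void main(String[] args) {
        String header = "Content-Type";   // 12 chars, all within Latin 1
        String cjk = "日本語";             // 3 chars, none within Latin 1
        System.out.println(payloadBytes(header, true));   // 12
        System.out.println(payloadBytes(header, false));  // 24
        System.out.println(payloadBytes(cjk, true));      // 6
        System.out.println(payloadBytes(cjk, false));     // 6
    }
}
```

      In this sketch, a string that fits in ISO Latin 1 needs half the character data under Compact Strings, while a string containing other characters occupies the same space in either mode.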

      The Compact Strings feature also introduced a command line option -XX:-CompactStrings that disables the new internal representation and restores UTF-16-only mode. This option was included as a contingency in case an application's workload includes string data for which Compact Strings would cause a performance regression. This option is documented in the Oracle Java Virtual Machine Guide.

      The JDK thus has three possible representations for Strings and consequently three possible code paths for every String operation:

      1. Compact Strings with the ISO Latin 1 coder;
      2. Compact Strings with the UTF-16 coder; and
      3. UTF-16 only.
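
      The three paths listed above can be pictured with a simplified sketch of a single operation, charAt, that must account for the mode and the coder. This is an assumption-laden illustration rather than the actual java.lang.String source: SketchString, its fields, and the path method are hypothetical, the mode is modeled as a constructor argument instead of a VM flag, and UTF-16 bytes are stored big-endian for simplicity.

```java
// Illustrative sketch only; names and layout are hypothetical.
final class SketchString {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final boolean compactStrings; // stands in for -XX:+/-CompactStrings
    private final byte[] value;           // character data
    private final byte coder;             // LATIN1 or UTF16

    SketchString(String s, boolean compactStrings) {
        this.compactStrings = compactStrings;
        boolean latin1 = compactStrings && s.chars().allMatch(c -> c <= 0xFF);
        if (latin1) {
            // One byte per character.
            value = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                value[i] = (byte) s.charAt(i);
            }
            coder = LATIN1;
        } else {
            // Two bytes per character, high byte first (a simplification).
            value = new byte[2 * s.length()];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8);
                value[2 * i + 1] = (byte) c;
            }
            coder = UTF16;
        }
    }

    // Names which of the three paths this instance exercises.
    String path() {
        if (!compactStrings) {
            return "UTF-16 only";
        }
        return coder == LATIN1 ? "compact/Latin-1" : "compact/UTF-16";
    }

    // Every operation has to consult both the mode and the coder.
    char charAt(int index) {
        if (compactStrings && coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```

      For example, new SketchString("abc", true) takes the Latin-1 path, new SketchString("日本", true) takes the compact UTF-16 path, and new SketchString("abc", false) takes the UTF-16-only path even though its data would fit in one byte per character.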

      Maintaining three code paths is costly. Since Compact Strings are enabled by default, many optimizations have been applied to their code paths, and those paths are tested thoroughly. By contrast, the UTF-16-only code paths have received less optimization and less testing. Indeed, several bugs have occurred only in UTF-16-only mode:

      insert list of bugs here

      Removing the UTF-16-only option and its code will simplify the String implementation and reduce the JDK's maintenance burden.

      Description

      Deprecate the UTF-16-only mode for removal in a future release.

      Use of the -XX:-CompactStrings command line option will issue a warning that this capability will be removed in the future. It will continue to disable Compact Strings and run the system in UTF-16-only mode.

      Update documentation to note that the option is deprecated and that UTF-16-only mode will eventually be removed.

      Risks

      Unlike the other JDK ports, the ARM32 port uses UTF-16-only mode. The ARM32 port has not been tested in Compact Strings mode. It is therefore likely that the ARM32 port will need some work to bring its Compact Strings implementation up to production quality. Alternatively, support for ARM32 could be dropped entirely from the JDK.

      Asian languages such as Chinese, Japanese, and Korean (CJK) use many characters that cannot be encoded in a single byte. For these languages, UTF-16 works reasonably well. With Compact Strings enabled, the system performs additional work to check every string to see whether it can be encoded in ISO Latin 1, and if it cannot, the string data will be encoded and stored in UTF-16. An application that processes CJK text may therefore consume extra CPU time checking for the possibility of encoding strings in ISO Latin 1 and will end up storing UTF-16 strings anyway. This may result in an increase in CPU time without any net space savings.

      A key design assumption underlying the Compact Strings feature is that, even in applications that process mainly CJK string data, there will be many other strings (such as class names, request headers, etc.) that can be encoded in ISO Latin 1, and that this will result in net space savings and reduced GC overhead. However, there may be some applications for which this assumption does not hold. Installations may choose to run such applications in production with Compact Strings disabled. The Compact Strings implementation has improved over time, so these regressions may have been mitigated. However, when UTF-16-only mode is eventually removed, these applications will need either to migrate to Compact Strings, possibly suffering a performance regression, or to stay on older JDK releases indefinitely.

      Under UTF-16-only mode, the JNI function GetStringCritical returns a direct pointer to the internal String char array. With Compact Strings, this function must make a copy of the array whenever the string is stored in ISO Latin 1, because callers expect UTF-16 data. Applications that make heavy use of GetStringCritical may see a performance regression if they switch from UTF-16-only mode to using Compact Strings. There is no obvious mitigation for this problem.

            Assignee:
            Stuart Marks
            Reporter:
            Stuart Marks