API to compute the byte length of a String encoded in a given Charset

    • Type: CSR
    • Resolution: Unresolved
    • Priority: P4
    • 27
    • Component/s: core-libs
    • None
    • behavioral
    • minimal
    • This is a new method, as such it doesn't affect existing clients.
    • Java API
    • SE

      Summary

      Add a method to String to return the byte length of a String encoded in a given Charset.

      Problem

      It is sometimes necessary to compute the byte length of a String encoded in a particular charset. One motivating use-case is encoding multiple large strings into a single array. Without an efficient way to get the encoded length, it's necessary to encode into a temporary array and pay the cost of resizing it (potentially multiple times).

      Using getBytes(cs).length is correct but inefficient, as it creates an intermediate array.
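      The motivating use-case above can be sketched as a two-pass routine: a sizing pass that sums the encoded lengths, then a fill pass that writes into one exactly-sized array. This is only an illustration. The class `PackStrings` and the helper `byteLength` are hypothetical names, and `byteLength` uses today's allocating `getBytes(cs).length` as a stand-in for the proposed method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

public final class PackStrings {
    private PackStrings() {}

    // Hypothetical stand-in for the proposed String#getByteLength(Charset).
    // Today this is correct but allocates a temporary array per call.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Encodes all strings back to back into a single, exactly-sized array.
    static byte[] pack(List<String> strings, Charset cs) {
        // Sizing pass: sum the encoded lengths, failing fast on int overflow.
        int total = 0;
        for (String s : strings) {
            total = Math.addExact(total, byteLength(s, cs));
        }
        // Fill pass: the output array never needs to be resized.
        byte[] out = new byte[total];
        int pos = 0;
        for (String s : strings) {
            byte[] encoded = s.getBytes(cs);
            System.arraycopy(encoded, 0, out, pos, encoded.length);
            pos += encoded.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack(List.of("abc", "\u20ac", ""), StandardCharsets.UTF_8);
        System.out.println(packed.length); // 3 ASCII bytes + 3 bytes for U+20AC
    }
}
```

      With a non-allocating length method, only the sizing pass changes; the fill pass and the single allocation of the output array stay the same.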

      Solution

      Computing the encoded length without allocating is possible with a non-JDK library method, but a JDK implementation could be more efficient. The JDK can optimize by using internal knowledge of the string representation. For certain combinations of string representation and charset, the JDK can compute the encoded length in constant time, for example when the string data is ASCII and the target charset is UTF-8. The JDK can also use intrinsics for some string operations.
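      For comparison, a non-JDK helper of the kind mentioned above might look like the following sketch for UTF-8 only. The names `Utf8Length` and `utf8Length` are hypothetical, and this is not the proposed JDK implementation. It walks the string one char at a time, so it is always linear in the string length and cannot take the constant-time shortcut available to the JDK. It assumes the `getBytes` behaviour of replacing an unpaired surrogate with a single `?` byte.

```java
import java.nio.charset.StandardCharsets;

public final class Utf8Length {
    private Utf8Length() {}

    // Hypothetical helper: returns the same value as
    // s.getBytes(StandardCharsets.UTF_8).length without allocating an array.
    // Throws ArithmeticException if the result does not fit in an int.
    public static int utf8Length(String s) {
        long bytes = 0;
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                 // ASCII
            } else if (c < 0x800) {
                bytes += 2;                 // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                 // valid surrogate pair
                i++;                        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                 // unpaired surrogate, replaced by '?'
            } else {
                bytes += 3;                 // rest of the BMP, e.g. U+20AC
            }
        }
        return Math.toIntExact(bytes);
    }

    public static void main(String[] args) {
        String[] samples = {"", "hello", "h\u00e9llo", "\u20ac", "\ud83d\ude00", "a\ud800b"};
        for (String s : samples) {
            int expected = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(utf8Length(s) + " (getBytes: " + expected + ")");
        }
    }
}
```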

      The proposed solution adds the method java.lang.String#getByteLength(Charset).

      Alternatives

      Method name: getBytesLength

      The method name getBytesLength even more closely aligns with getBytes, potentially making the connection to getBytes(...).length clearer, and ensuring it shows up beside getBytes in javadoc.

      getByteLength is more grammatical (as an analogy, "head count" is more idiomatic than "heads count"). getByteLength is still very closely aligned to getBytes, and there are no other methods that would separate them in the javadoc list.

      Return type: long

      In some cases, the encoded length of the string in bytes will be longer than Integer.MAX_VALUE. Consider a string of length (Integer.MAX_VALUE / 2) - 1 containing U+20AC characters, which will each take three bytes in the UTF-8 encoding of the string, so the encoded length in bytes is roughly 3 / 2 * Integer.MAX_VALUE. Returning a long would allow the implementation to accurately report the encoded length for this string, even though getBytes(Charset) would not be able to encode it due to array size limits.
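      The arithmetic in this example can be checked directly. The sketch below only computes the lengths and does not build the string; the class name `EncodedLengthOverflow` is hypothetical.

```java
public final class EncodedLengthOverflow {
    private EncodedLengthOverflow() {}

    // U+20AC encodes to three bytes in UTF-8, so a string made only of
    // that character has an encoded length of 3 * (number of chars).
    static long utf8BytesForEuroString(int chars) {
        return 3L * chars;
    }

    public static void main(String[] args) {
        int chars = (Integer.MAX_VALUE / 2) - 1;        // 1073741822
        long encoded = utf8BytesForEuroString(chars);   // 3221225466
        System.out.println(chars);
        System.out.println(encoded);
        System.out.println(encoded > Integer.MAX_VALUE); // true: does not fit in an int
    }
}
```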

      There is some precedent for accommodating very long string data in APIs, for example MemorySegment#getString(long offset, Charset charset, long byteLength) takes a long byteLength, and in the future may support decoding Strings where the encoded length exceeds Integer.MAX_VALUE.

      There are disadvantages to returning long from getByteLength. It forces all callers to think about a situation that is extremely rare in practice. If they don't need to handle strings of this length, it requires the ceremony of testing the return value and throwing an exception, or, worse, they may just cast to int and lose the protection that would be provided if getByteLength checked the result. Returning long makes it harder to migrate existing uses of getBytes(...).length, since the new API is not a drop-in replacement. Being able to specify the method as having exactly the same result as getBytes(...).length makes it easier to understand and use correctly.
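      The difference between the two caller-side conversions can be illustrated as follows. This is a sketch only: the class `LongLengthCeremony` and both helpers are hypothetical, and the long argument stands in for the result of a long-returning variant of the method.

```java
public final class LongLengthCeremony {
    private LongLengthCeremony() {}

    // The "ceremony": callers that need an int must check the range themselves.
    // Math.toIntExact throws ArithmeticException when the value does not fit.
    static int checked(long encodedLength) {
        return Math.toIntExact(encodedLength);
    }

    // The shortcut that loses protection: a bare cast silently wraps around.
    static int unchecked(long encodedLength) {
        return (int) encodedLength;
    }

    public static void main(String[] args) {
        long big = 3_221_225_466L; // encoded length from the U+20AC example
        System.out.println(unchecked(big)); // a negative number, with no error raised
        try {
            checked(big);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```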

      The introduction of Compact Strings previously changed the effective maximum size of Strings, as discussed in JDK-8190429. In practice it seems rare for Strings in the ecosystem to push the limits of length or encoded length.

      In the rare case that callers need the encoded length of Strings of this size, they could avoid the method and implement their own loop, or handle the exception thrown by getByteLength and fall back to their own less optimized implementation.

      Specification

      --- a/src/java.base/share/classes/java/lang/String.java
      +++ b/src/java.base/share/classes/java/lang/String.java
      ...
      +    /**
      +     * {@return the length in bytes of this {@code String} encoded with the given {@link Charset}}
      +     *
      +     * <p>The returned length accounts for the replacement of malformed-input and unmappable-character
      +     * sequences with the charset's default replacement byte array. The result will be the same value
      +     * as {@link #getBytes(Charset) getBytes(cs).length}.
      +     *
      +     * @apiNote This method provides equivalent or better performance than {@link #getBytes(Charset)
      +     *          getBytes(cs).length}. This method may allocate memory to compute the length for some charsets.
      +     *
      +     * @param cs The {@link Charset} used to compute the length
      +     * @since 27
      +     */
      +    public int getByteLength(Charset cs) {
      

            Assignee:
            Liam Miller-Cushon
            Reporter:
            Liam Miller-Cushon
            Alan Bateman, Naoto Sato, Roger Riggs