Type: CSR
Resolution: Approved
Priority: P4
Fix Version/s: 27
Component/s: core-libs
Labels: None
Subcomponent: java.lang
Compatibility Kind: behavioral
Compatibility Risk: minimal
Compatibility Risk Description: This is a new method; as such, it doesn't affect existing clients.
Interface Kind: Java API
Scope: SE

Summary

Add a method to String to return the byte length of a String encoded in a given Charset.

Problem

It is sometimes necessary to compute the byte length of a String encoded in a particular charset. One motivating use case is encoding multiple large strings into a single array. Without an efficient way to get the encoded length, callers must encode into a temporary array and pay the cost of resizing it, potentially multiple times.

Using getBytes(cs).length is correct but inefficient, as it creates an intermediate array.
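The use case above can be sketched with existing APIs. The class and method names (`ConcatEncodeSketch`, `encodeAll`) are hypothetical and only serve as illustration; the sizing pass relies on `getBytes(cs).length`, which is the intermediate allocation this proposal aims to avoid.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ConcatEncodeSketch {

    // Encodes all strings back to back into one exactly-sized array.
    static byte[] encodeAll(List<String> strings, Charset cs) {
        // Sizing pass: correct, but every call allocates a throwaway array.
        int total = 0;
        for (String s : strings) {
            total = Math.addExact(total, s.getBytes(cs).length);
        }
        // Copy pass: encode again and copy into the destination.
        byte[] out = new byte[total];
        int pos = 0;
        for (String s : strings) {
            byte[] encoded = s.getBytes(cs);
            System.arraycopy(encoded, 0, out, pos, encoded.length);
            pos += encoded.length;
        }
        return out;
    }
}
```

With a cheap length query, the sizing pass would no longer need to allocate one temporary array per string.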

Solution

Computing the encoded length without allocating is possible with a non-JDK library method, but a JDK implementation can be more efficient because it can use internal knowledge of the string representation. For certain combinations of string representation and charset, the JDK can compute the encoded length in constant time, for example when the string data is ASCII and the target charset is UTF-8. The JDK can also use intrinsics for some string operations.
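As a rough sketch of what such a non-JDK library method might look like for UTF-8 only, the loop below walks the UTF-16 code units and sums the encoded sizes. The names `Utf8LengthSketch` and `utf8Length` are hypothetical. It assumes the same replacement behavior as `getBytes(UTF_8)`, where an unpaired surrogate becomes a single `?` byte. Unlike a JDK implementation, it cannot see the internal string representation, so it always scans every character.

```java
// Hypothetical non-JDK helper, for illustration only.
final class Utf8LengthSketch {

    // Returns the number of bytes getBytes(UTF_8) would produce, without
    // allocating. The sum is kept in a long so it cannot overflow.
    static long utf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                 // ASCII
            } else if (c < 0x800) {
                bytes += 2;                 // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                 // valid pair: one supplementary code point
                i++;                        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                 // unpaired surrogate, replaced by '?'
            } else {
                bytes += 3;                 // rest of the BMP
            }
        }
        return bytes;
    }
}
```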

The proposed solution adds the method java.lang.String#encodedLength(Charset).

Alternatives

Method names: `getByteLength`, `getBytesLength`

getByteLength aligns with getBytes(charset).length and states the unit in the name. Some callers might expect encodedLength(UTF_16) to return a length in code units rather than bytes; there was related discussion in JDK-8372338 and in the "Pulling the (foreign) string" document.

getBytesLength aligns even more closely with getBytes, potentially making the connection to getBytes(...).length clearer and ensuring it shows up beside getBytes in javadoc. However, getBytesLength is less grammatical (as an analogy, "head count" is more idiomatic than "heads count").
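The difference between code units and bytes for UTF-16 can be shown with existing APIs. The class and method names below are hypothetical; the values come from `String.length()` and `getBytes`, not from the proposed method.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf16UnitsVsBytesSketch {

    // Length in UTF-16 code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Length in bytes when encoded with UTF_16, which writes a byte-order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    // Length in bytes when encoded with UTF_16BE, which writes no byte-order mark.
    static int utf16BeBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

For `"abc"`, these return 3 code units, 8 bytes (a 2-byte byte-order mark plus 6), and 6 bytes respectively.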

encodedLength was chosen because it is more evocative of the value being computed, and doesn't rely on the argument to hint at its primary function. A get prefix was considered but omitted, to be cleaner and consistent with more of the String API.

Return type: `long`

In some cases, the encoded length of the string in bytes will be greater than Integer.MAX_VALUE. Consider a string of length (Integer.MAX_VALUE / 2) - 1 consisting of U+20AC characters, each of which takes three bytes in UTF-8, so the encoded length is roughly 1.5 * Integer.MAX_VALUE bytes. Returning a long would allow the implementation to accurately report the encoded length for this string, even though getBytes(Charset) would not be able to encode it due to array size limits.
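The arithmetic for this example can be checked directly. The class below is a hypothetical illustration that only computes the numbers; it does not build the string.

```java
// Hypothetical helper, for illustration only.
final class EncodedLengthOverflowSketch {

    // Number of chars in the example string: (Integer.MAX_VALUE / 2) - 1.
    static long exampleCharCount() {
        return (Integer.MAX_VALUE / 2) - 1;      // 1_073_741_822
    }

    // UTF-8 byte length when every char is U+20AC (three bytes each).
    static long exampleUtf8Bytes() {
        return 3L * exampleCharCount();          // 3_221_225_466
    }

    // True when the byte length cannot be represented as an int.
    static boolean exceedsInt() {
        return exampleUtf8Bytes() > Integer.MAX_VALUE;
    }
}
```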

There is some precedent for accommodating very long string data in APIs, for example MemorySegment#getString(long offset, Charset charset, long byteLength) takes a long byteLength, and in the future may support decoding Strings where the encoded length exceeds Integer.MAX_VALUE.

There are disadvantages to returning long from encodedLength. It forces all callers to think about a situation that is extremely rare in practice. If they don't need to handle strings of this length, it requires the ceremony of testing the return value and throwing an exception. Worse, they may just cast to int and lose the protection that would be provided if encodedLength checked the result. Returning long also makes it harder to migrate existing uses of getBytes(...).length, since the new API would no longer be a drop-in replacement. Being able to specify the method as having exactly the same result as getBytes(...).length makes it easier to understand and use correctly.
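A sketch of the caller-side ceremony a long return type would require follows. `encodedLengthAsLong` is a hypothetical stand-in for a long-returning variant, implemented here with `getBytes` purely so the example runs; the other names are hypothetical as well.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class LongLengthCeremonySketch {

    // Stand-in for a hypothetical long-returning API (not the proposed method).
    static long encodedLengthAsLong(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Careful caller: narrows with a check before sizing an array.
    static byte[] allocateFor(String s, Charset cs) {
        long length = encodedLengthAsLong(s, cs);
        int checked = Math.toIntExact(length);   // throws ArithmeticException on overflow
        return new byte[checked];
    }

    // Careless caller: a plain cast silently wraps around.
    static int uncheckedNarrow(long length) {
        return (int) length;
    }
}
```

For the length 3_221_225_466 from the U+20AC example, `Math.toIntExact` throws, while the plain cast yields -1_073_741_830.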

The introduction of Compact Strings previously changed the effective maximum size of Strings, as discussed in JDK-8190429. In practice it seems rare for Strings in the ecosystem to push the limits of length or encoded length.

In the rare case that callers need the encoded length of Strings of this size, they could avoid the method and implement their own loop, or handle the exception thrown by encodedLength and fall back to their own less optimized implementation.

Location: CharsetEncoder

Moving the API to CharsetEncoder instead of java.lang.String was discussed:

    // Hypothetical API: CharsetEncoder has no getByteLength method today.
    try {
        int byteLength = StandardCharsets.UTF_8.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .onMalformedInput(CodingErrorAction.REPLACE)
                .getByteLength(stringData);
    } catch (CharacterCodingException e) {
        throw new IllegalStateException(e);
    }

Preserving the performance of the method would require a package-private method in String, shared with CharsetEncoder through JavaLangAccess. The fast path is only available for CodingErrorAction.REPLACE, which matches the replacement character handling of String.getBytes(Charset); other configurations would be slower.
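For comparison, a byte count can already be obtained from the existing CharsetEncoder API by encoding into a small reusable buffer and discarding the output. This is a minimal sketch with hypothetical names (`EncoderLengthSketch`, `byteLength`), not the alternative that was discussed. It uses REPLACE for both error actions to mirror `getBytes(Charset)`, and it still does the full encoding work.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, for illustration only.
final class EncoderLengthSketch {

    static long byteLength(CharSequence text, Charset cs) {
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer scratch = ByteBuffer.allocate(8192); // reused, contents discarded
        long total = 0;
        CoderResult result;
        // Encode in chunks; OVERFLOW means the scratch buffer filled up.
        do {
            result = encoder.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        // Flush any bytes the encoder emits at end of input.
        do {
            result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        return total;
    }
}
```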

Specification

--- a/src/java.base/share/classes/java/lang/String.java
+++ b/src/java.base/share/classes/java/lang/String.java
...
+    /**
+     * {@return the length in bytes of this {@code String} encoded with the given {@link Charset}}
+     *
+     * <p>The returned length accounts for the replacement of malformed-input and unmappable-character
+     * sequences with the charset's default replacement byte array. The result will be the same value
+     * as {@link #getBytes(Charset) getBytes(cs).length}.
+     *
+     * @apiNote This method provides equivalent or better performance than {@link #getBytes(Charset)
+     *          getBytes(cs).length}. This method may allocate memory to compute the length for some charsets.
+     *
+     * @param cs The {@link Charset} used to compute the length
+     * @since 27
+     */
+    public int encodedLength(Charset cs) {
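Because the result is specified to equal `getBytes(cs).length`, the replacement behavior can be illustrated with the existing method. The sketch below uses a hypothetical helper, `referenceLength`, that returns this reference value; it does not call `encodedLength`, which is unavailable before the release named in `@since`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ReplacementLengthSketch {

    // The value encodedLength(cs) is specified to match.
    static int referenceLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // U+20AC is unmappable in US-ASCII and is replaced by one '?' byte.
        System.out.println(referenceLength("\u20AC", StandardCharsets.US_ASCII)); // 1
        // An unpaired surrogate is malformed input and is also replaced by '?'.
        System.out.println(referenceLength("\uD800", StandardCharsets.UTF_8));    // 1
        // A mappable character keeps its normal encoded size.
        System.out.println(referenceLength("\u20AC", StandardCharsets.UTF_8));    // 3
    }
}
```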

CSR of: JDK-8372353 API to compute the byte length of a String encoded in a given Charset (Open)
