API to compute the byte length of a String encoded in a given Charset


    • Type: CSR
    • Resolution: Unresolved
    • Priority: P4
    • 27
    • Component/s: core-libs
    • None
    • behavioral
    • minimal
    • This is a new method, as such it doesn't affect existing clients.
    • Java API
    • SE

      Summary

      Introduce a new method to compute the byte length of a String encoded in a given Charset.

      Problem

      It is sometimes necessary to compute the byte length of a String encoded in a particular charset. One motivating use-case is encoding multiple large strings into a single array. Without an efficient way to get the encoded length, it's necessary to encode into a temporary array and pay the cost of resizing it (potentially multiple times).

      Using getBytes(cs).length is correct but inefficient, as it creates an intermediate array.
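
      As a rough illustration of the cost described above, the following sketch packs several strings into one exactly-sized array using only existing String APIs. The class and method names (EncodedLengthBaseline, encodedLength, pack) are hypothetical and not part of this proposal. Every string is encoded once just to learn its length and then again to copy its bytes.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the proposed API.
final class EncodedLengthBaseline {

    // Correct but wasteful: encodes the whole string into a temporary array
    // only to read that array's length.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Packs the encoded form of every string into a single array.
    static byte[] pack(List<String> parts, Charset cs) {
        // First pass: compute the total size (allocates one throwaway array per string).
        int total = 0;
        for (String part : parts) {
            total = Math.addExact(total, encodedLength(part, cs));
        }

        // Second pass: encode again and copy into the exactly-sized result.
        byte[] out = new byte[total];
        int pos = 0;
        for (String part : parts) {
            byte[] encoded = part.getBytes(cs);
            System.arraycopy(encoded, 0, out, pos, encoded.length);
            pos += encoded.length;
        }
        return out;
    }
}
```

      The first pass is where an allocation-free length computation would help: the temporary arrays created by encodedLength are discarded immediately.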

      Solution

      Computing the encoded length without allocating is possible with a non-JDK library method, but a JDK implementation could be more efficient because it can use internal knowledge of the string representation. For certain combinations of string representation and charset, the JDK can compute the encoded length in constant time, for example when the string data is ASCII and the target charset is UTF-8. The JDK can also use intrinsics for some string operations.
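
      For comparison, a non-JDK approach for UTF-8 might look like the minimal sketch below. The class and method names (Utf8Lengths, utf8Length) are hypothetical. The sketch assumes the replacement behavior of String#getBytes(Charset), which encodes an unpaired surrogate as the single-byte replacement '?'. It has to inspect every char, whereas a JDK implementation could skip that scan when it already knows the string data is ASCII.

```java
// Hypothetical library-style helper, for illustration only; not the proposed JDK implementation.
final class Utf8Lengths {

    // Returns the number of bytes s would occupy when encoded as UTF-8,
    // without allocating an intermediate byte array.
    static int utf8Length(String s) {
        long bytes = 0; // long, so the running total cannot overflow during the scan
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // valid surrogate pair: one supplementary code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // unpaired surrogate: getBytes substitutes '?'
            } else {
                bytes += 3;                       // remaining BMP characters
            }
        }
        // Throws ArithmeticException if the encoded form would not fit in an int.
        return Math.toIntExact(bytes);
    }
}
```

      This handles a single charset only. A general-purpose library method would need one such routine per charset, or would have to fall back to an encoder.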

      The proposed solution adds the method java.lang.String#getBytesLength(Charset).

      Specification

      --- a/src/java.base/share/classes/java/lang/String.java
      +++ b/src/java.base/share/classes/java/lang/String.java
      ...
      +    /**
      +     * {@return the length in bytes of this String encoded with the given {@link Charset}}
      +     *
      +     * <p>The result will be the same value as {@link #getBytes(Charset) getBytes(cs).length}.
      +     *
      +     * @apiNote This method provides equivalent or better performance than {@link #getBytes(Charset)
      +     *          getBytes(cs).length}. It may allocate memory to compute the length for some charsets.
      +     *
      +     * @param cs The {@link Charset} used to compute the length
      +     * @since 27
      +     */
      +    public int getBytesLength(Charset cs) {
      

            Assignee:
            Liam Miller-Cushon
            Reporter:
            Liam Miller-Cushon
            Naoto Sato, Roger Riggs