  1. JDK
  2. JDK-8282429

StringBuilder/StringBuffer.toString() skip compressing for UTF16 strings


    • Type: Enhancement
    • Resolution: Fixed
    • Priority: P4
    • 19
    • 19
    • core-libs
    • b17
    • generic
    • generic

      Both StringBuilder and StringBuffer are subclasses of AbstractStringBuilder. When clients append chars to an AbstractStringBuilder, it inflates the internal byte array if the incoming chars can't be encoded in LATIN1. The field coder records the result: it is either LATIN1 or UTF16.
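
      The inflation step described above can be illustrated with a small, self-contained model. This is only a sketch: the class CoderSketch, its fields, and the big-endian byte layout are assumptions made for illustration and are not the JDK source.

```java
import java.util.Arrays;

// Hypothetical, simplified model of the coder switch; not the JDK source.
final class CoderSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    byte[] value = new byte[16];
    byte coder = LATIN1;
    int count = 0;              // number of chars stored, not bytes

    // A char fits in LATIN1 only if its upper 8 bits are zero.
    static boolean canEncodeLatin1(char c) {
        return (c >>> 8) == 0;
    }

    void append(char c) {
        if (coder == LATIN1 && !canEncodeLatin1(c)) {
            inflate();          // first non-LATIN1 char forces the switch
        }
        ensureCapacity(count + 1);
        if (coder == LATIN1) {
            value[count] = (byte) c;
        } else {
            value[2 * count] = (byte) (c >>> 8);  // high byte (big-endian, assumed)
            value[2 * count + 1] = (byte) c;      // low byte
        }
        count++;
    }

    // Widen every stored LATIN1 byte to two bytes and switch the coder.
    private void inflate() {
        byte[] wide = new byte[value.length * 2];
        for (int i = 0; i < count; i++) {
            wide[2 * i + 1] = value[i];           // high byte stays zero
        }
        value = wide;
        coder = UTF16;
    }

    private void ensureCapacity(int chars) {
        int needed = (coder == LATIN1) ? chars : chars * 2;
        if (needed > value.length) {
            value = Arrays.copyOf(value, Math.max(needed, value.length * 2));
        }
    }
}
```

      Once coder has switched to UTF16 in this model, the builder already knows that at least one appended char did not fit in LATIN1.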

      Here is StringBuilder::toString(). The UTF16 path ignores a fact that AbstractStringBuilder already knows: value can't be encoded in LATIN1. StringBuffer::toString() is similar.

          public String toString() {
              // Create a copy, don't share the array
              return isLatin1() ? StringLatin1.newString(value, 0, count)
                                : StringUTF16.newString(value, 0, count);
          }

      As a result, StringUTF16.newString() attempts to compress value again if String.COMPACT_STRINGS is true. It allocates a new array of len bytes, but the compression can't succeed, so the array is discarded.

          public static byte[] compress(byte[] val, int off, int len) {
              byte[] ret = new byte[len];
              if (compress(val, off, ret, 0, len) == len) {
                  return ret;
              }
              return null;
          }
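
      To make the cost concrete, here is a standalone sketch of the same allocate-then-check pattern. The class CompressSketch, the allocatedBytes counter, the toUtf16 helper, and the big-endian layout are assumptions for illustration; the real method delegates to an intrinsic and differs in detail.

```java
// Simplified stand-in for the compress step; names and the big-endian
// byte layout are assumptions for illustration, not the JDK source.
final class CompressSketch {
    static int allocatedBytes = 0;   // bytes allocated by compress() so far

    // val holds UTF-16 code units, two bytes each; off and len are in chars.
    static byte[] compress(byte[] val, int off, int len) {
        byte[] ret = new byte[len];  // allocated before the outcome is known
        allocatedBytes += len;
        for (int i = 0; i < len; i++) {
            int hi = val[2 * (off + i)] & 0xFF;
            if (hi != 0) {
                return null;         // not LATIN1: ret is now garbage
            }
            ret[i] = val[2 * (off + i) + 1];
        }
        return ret;
    }

    // Encode a String as big-endian UTF-16 bytes (helper for the sketch).
    static byte[] toUtf16(String s) {
        byte[] out = new byte[s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >>> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }
}
```

      In this sketch a failing call still pays for the full len-byte array, and the failure is only detected when the scan reaches the offending char.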

      Here is an example of that case. In this StringBuilder, only the last char can't be encoded in LATIN1.

      import org.openjdk.jmh.annotations.*;

      @State(Scope.Benchmark)
      @Fork(3)
      @Warmup(iterations=10)
      @Measurement(iterations = 10)
      public class MyBenchmark {
          @Param({"1024"})
          public int SIZE;

          @Benchmark
          public String testMethod() {
              StringBuilder sb = new StringBuilder(SIZE);
              for (int i = 0; i < SIZE - 4; ++i) {
                  sb.append('a');
              }
              sb.append("あ"); // can't be encoded in latin-1
              return sb.toString();
          }
      }

      The initial capacity of the StringBuilder is SIZE bytes. When the last character 'あ' is appended, the builder inflates its internal array to 2 * SIZE bytes and changes its coder from LATIN1 to UTF16. sb.toString() then takes the !isLatin1() path, and StringUTF16::compress() fails. The allocation in compress() is wasted.
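
      The idea in the title, skipping the compression attempt when the builder already knows its content is UTF16, can be sketched as follows. All names here (ToStringSketch, tryCompress, newStringAlwaysCompress, newStringSkipCompress, encode) are hypothetical, and the code uses public String constructors rather than JDK internals, so it only illustrates the control flow and is not the actual change.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical comparison of the two toString() strategies; not JDK code.
final class ToStringSketch {
    static int compressAttempts = 0;

    // Try to narrow len big-endian UTF-16 chars to LATIN1; null on failure.
    static byte[] tryCompress(byte[] utf16, int len) {
        compressAttempts++;
        byte[] ret = new byte[len];
        for (int i = 0; i < len; i++) {
            if (utf16[2 * i] != 0) {
                return null;
            }
            ret[i] = utf16[2 * i + 1];
        }
        return ret;
    }

    // Path described in the issue: always attempts compression first.
    static String newStringAlwaysCompress(byte[] utf16, int len) {
        byte[] latin1 = tryCompress(utf16, len);
        if (latin1 != null) {
            return new String(latin1, StandardCharsets.ISO_8859_1);
        }
        return new String(utf16, 0, len * 2, StandardCharsets.UTF_16BE);
    }

    // Sketch of the proposal: the caller knows the coder is UTF16, so
    // the bytes are decoded directly with no compression attempt.
    static String newStringSkipCompress(byte[] utf16, int len) {
        return new String(utf16, 0, len * 2, StandardCharsets.UTF_16BE);
    }

    // Helper for the sketch: big-endian UTF-16 bytes without a BOM.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }
}
```

      Both methods produce the same String for the benchmark's input; only the first pays for the failed attempt. One caveat this sketch ignores: if the non-LATIN1 chars are later deleted from a builder, its content may become compressible again, so a real implementation would probably need to track that case rather than skip compression unconditionally.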


            xliu Xin Liu
            xliu Xin Liu