Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version/s: 19
Affects Version/s: 19
Component/s: core-libs
Labels:
- String

Subcomponent: java.lang
Resolved In Build: b17
CPU: generic
OS: generic

Both StringBuilder and StringBuffer are subclasses of AbstractStringBuilder. When clients append chars to an AbstractStringBuilder, it inflates the internal byte array if the incoming chars can't be encoded in LATIN1. The field coder records the outcome: it is either LATIN1 or UTF16.

Here is StringBuilder::toString(). The UTF16 path doesn't use the fact, already known to AbstractStringBuilder, that value can't be encoded in LATIN1. StringBuffer::toString() is similar.

    public String toString() {
        // Create a copy, don't share the array
        return isLatin1() ? StringLatin1.newString(value, 0, count)
                          : StringUTF16.newString(value, 0, count);
    }

As a result, StringUTF16.newString() attempts to compress value again if String.COMPACT_STRINGS is true. It allocates a new array of len bytes, but the compression can't succeed, so the array is discarded.

    public static byte[] compress(byte[] val, int off, int len) {
        byte[] ret = new byte[len];
        if (compress(val, off, ret, 0, len) == len) {
            return ret;
        }
        return null;
    }
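
The pattern above can be reproduced outside the JDK. The following is a minimal sketch, not the JDK source: the class CompressSketch and its compress method are hypothetical stand-ins, and it works on a char[] instead of the JDK's byte[] UTF16 layout to stay short. It shows that the output array is allocated before the first char is inspected, so a single non-LATIN1 char anywhere in the input turns that allocation into garbage.

```java
class CompressSketch {
    // Hypothetical stand-in for StringUTF16.compress(): narrows each
    // 16-bit char to one byte, or returns null if any char is > 0xFF.
    static byte[] compress(char[] val, int off, int len) {
        byte[] ret = new byte[len];        // allocated before any check
        for (int i = 0; i < len; i++) {
            char c = val[off + i];
            if (c > 0xFF) {
                return null;               // ret is discarded here
            }
            ret[i] = (byte) c;
        }
        return ret;
    }

    public static void main(String[] args) {
        char[] latin = "abc".toCharArray();
        char[] mixed = "ab\u3042".toCharArray(); // last char is U+3042
        System.out.println(compress(latin, 0, latin.length) != null); // true
        System.out.println(compress(mixed, 0, mixed.length) != null); // false
    }
}
```

In the second call the loop reaches the last char before giving up, which mirrors the benchmark case where only the final char is outside LATIN1.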

Here is an example of that case. In this StringBuilder, only the last char can't be encoded in LATIN1.

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@Fork(3)
@Warmup(iterations=10)
@Measurement(iterations = 10)
public class MyBenchmark {
    @Param({"1024"})
    public int SIZE;

    @Benchmark
    public String testMethod() {
        StringBuilder sb = new StringBuilder(SIZE);
        for (int i = 0; i < SIZE - 4; ++i) {
            sb.append('a');
        }
        sb.append("あ"); // can't be encoded in latin-1
        return sb.toString();
    }
}

The initial capacity of the StringBuilder is SIZE bytes. When it encounters the last character 'あ', the builder inflates its array to 2 * SIZE bytes and changes its coder from LATIN1 to UTF16. sb.toString() then takes the !isLatin1() path, and StringUTF16::compress() fails. The allocation in compress() is wasted.
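
The sequence described here can be traced with a toy model. This is a sketch under stated assumptions, not AbstractStringBuilder itself: CoderSketch, its fields, and its big-endian byte layout are all hypothetical, and it never grows its capacity beyond the one inflation step. It only illustrates how the coder field already records that a non-LATIN1 char was seen by the time toString() would run.

```java
class CoderSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    byte[] value;          // one byte per char in LATIN1, two in UTF16
    byte coder = LATIN1;
    int count;             // number of chars stored

    CoderSketch(int capacity) {
        value = new byte[capacity];
    }

    void append(char c) {
        if (coder == LATIN1 && c > 0xFF) {
            inflate();
        }
        // Toy limitation: no growth, so fail loudly when full.
        if (count >= (value.length >> coder)) {
            throw new IllegalStateException("toy builder is full");
        }
        if (coder == LATIN1) {
            value[count] = (byte) c;
        } else {
            value[2 * count] = (byte) (c >> 8);   // high byte
            value[2 * count + 1] = (byte) c;      // low byte
        }
        count++;
    }

    // Widen every stored char to two bytes and switch the coder.
    void inflate() {
        byte[] wide = new byte[value.length * 2];
        for (int i = 0; i < count; i++) {
            wide[2 * i + 1] = value[i];           // high byte stays 0
        }
        value = wide;
        coder = UTF16;
    }

    public static void main(String[] args) {
        int size = 1024;
        CoderSketch sb = new CoderSketch(size);
        for (int i = 0; i < size - 4; i++) {
            sb.append('a');
        }
        sb.append('\u3042');                      // the char used in the benchmark
        System.out.println(sb.coder);             // 1 (UTF16)
        System.out.println(sb.value.length);      // 2048
        System.out.println(sb.count);             // 1021
    }
}
```

With SIZE = 1024 the toy builder ends up holding 1021 chars in a 2048-byte array with coder UTF16, so a compress attempt at that point would allocate a 1021-byte array only to discard it.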

relates to
- JDK-8332282 AbstractStringBuilder.toString spec needs amendments for empty strings (Closed)
- JDK-8325730 StringBuilder.toString allocation for the empty String (Closed)

links to
- Commit openjdk/jdk/bab431cc
- Review openjdk/jdk/7671
- Review(master) openjdk/jdk/7671

Assignee: Xin Liu
Reporter: Xin Liu
Votes: 0
Watchers: 4
Created: 2022-02-27 14:56
Updated: 2025-06-30 13:40
Resolved: 2022-03-31 21:45