Both StringBuilder and StringBuffer are subclasses of AbstractStringBuilder. When clients append chars to AbstractStringBuilder, it inflates the internal byte array if the incoming chars can't be encoded in LATIN1. The result is recorded in the field coder, which is either LATIN1 or UTF16.
Here is StringBuilder::toString(). The UTF16 path doesn't use the fact, already known to AbstractStringBuilder, that value can't be encoded in LATIN1. StringBuffer's toString is similar.
    public String toString() {
        // Create a copy, don't share the array
        return isLatin1() ? StringLatin1.newString(value, 0, count)
                          : StringUTF16.newString(value, 0, count);
    }
As a result, StringUTF16.newString() attempts to compress value again if String.COMPACT_STRINGS is true. It allocates a new array of len bytes even though the compression can't succeed.
    public static byte[] compress(byte[] val, int off, int len) {
        byte[] ret = new byte[len];
        if (compress(val, off, ret, 0, len) == len) {
            return ret;
        }
        return null;
    }
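To make the cost visible outside the JDK, here is a minimal standalone sketch of the same pattern. It is a simplified model, not the JDK source: the class name CompressSketch is hypothetical, and it works on a char[] rather than the UTF16-encoded byte[] that StringUTF16 uses. The point is only that the result array is allocated before the outcome is known.

```java
public class CompressSketch {
    // Simplified stand-in for StringUTF16.compress(byte[], int, int).
    // Assumption: char[] input instead of the JDK's UTF16-encoded byte[].
    static byte[] compress(char[] val, int off, int len) {
        byte[] ret = new byte[len];      // allocated up front, whatever the outcome
        for (int i = 0; i < len; i++) {
            char c = val[off + i];
            if (c > 0xFF) {
                return null;             // not LATIN1: ret is dropped as garbage
            }
            ret[i] = (byte) c;
        }
        return ret;
    }

    public static void main(String[] args) {
        char[] latin1Only = {'a', 'b', 'c'};
        char[] lastIsWide = {'a', 'b', '\u3042'};   // '\u3042' is the hiragana letter A

        System.out.println(compress(latin1Only, 0, 3) != null);  // compression succeeds
        System.out.println(compress(lastIsWide, 0, 3) != null);  // fails on the last char
    }
}
```

In the second call the whole len-byte array has been allocated and partly filled before the non-LATIN1 char is reached, which is the worst case for this pattern.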
Here is an example of that case. The StringBuilder is filled with chars of which only the last can't be encoded in LATIN1.
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    @Fork(3)
    @Warmup(iterations = 10)
    @Measurement(iterations = 10)
    public class MyBenchmark {
        @Param({"1024"})
        public int SIZE;

        @Benchmark
        public String testMethod() {
            StringBuilder sb = new StringBuilder(SIZE);
            for (int i = 0; i < SIZE - 4; ++i) {
                sb.append('a');
            }
            sb.append("あ"); // can't be encoded in latin-1
            return sb.toString();
        }
    }
The StringBuilder is created with an initial capacity of SIZE, so its LATIN1 backing array is SIZE bytes long. When the last character 'あ' is appended, the builder inflates its array to 2 * SIZE bytes and changes its coder from LATIN1 to UTF16. sb.toString() then takes the !isLatin1() path, and StringUTF16::compress() fails. The array allocated in compress() is wasted.
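One way to picture the saving is a model in which the caller passes along what it already knows, so the doomed compression attempt is skipped. The sketch below is hypothetical throughout: the class KnownCoderSketch, the knownNonLatin1 parameter and the scratchArrays counter are illustrative names, not JDK API, and the char[] input is again a simplification. It also assumes the hint is reliable, i.e. that no chars were removed from the builder after it inflated.

```java
import java.util.Arrays;

public class KnownCoderSketch {
    // Counts how many scratch arrays compress() has allocated (illustration only).
    static int scratchArrays = 0;

    // Simplified model of the compression attempt; char[] based by assumption.
    static byte[] compress(char[] val, int len) {
        scratchArrays++;
        byte[] ret = new byte[len];
        for (int i = 0; i < len; i++) {
            if (val[i] > 0xFF) {
                return null;
            }
            ret[i] = (byte) val[i];
        }
        return ret;
    }

    // Models the path described above: always try to compress first.
    static boolean resultIsLatin1(char[] val, int len) {
        return compress(val, len) != null;
    }

    // Hypothetical variant: skip the attempt when the caller knows it must fail.
    static boolean resultIsLatin1(char[] val, int len, boolean knownNonLatin1) {
        if (knownNonLatin1) {
            return false;
        }
        return compress(val, len) != null;
    }

    public static void main(String[] args) {
        // Same shape as the benchmark: SIZE - 4 LATIN1 chars, then one wide char.
        char[] val = new char[1021];
        Arrays.fill(val, 'a');
        val[1020] = '\u3042';

        resultIsLatin1(val, val.length);
        System.out.println("after unconditional attempt: " + scratchArrays);

        resultIsLatin1(val, val.length, true);
        System.out.println("after hinted call: " + scratchArrays);
    }
}
```

The counter goes up on the unconditional call and stays unchanged on the hinted one. Whether such a hint is safe in AbstractStringBuilder depends on the coder staying accurate after operations that remove chars, which this sketch does not model.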
Relates to:
- JDK-8332282 AbstractStringBuilder.toString spec needs amendments for empty strings (New)
- JDK-8325730 StringBuilder.toString allocation for the empty String (Closed)