JEP 254: Compact Strings
- Resolution: Delivered
- P2
- Brent Christian
- Feature
- Open
- Implementation
- L
- XL
Summary
Adopt a more space-efficient internal representation for strings.
Goals
Improve the space efficiency of the String class and related classes while maintaining performance in most scenarios and preserving full compatibility for all related Java and native interfaces.
Non-Goals
It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings. A subsequent JEP may explore that approach.
Motivation
The current implementation of the String class stores characters in a char array, using two bytes (sixteen bits) for each character. Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused.
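The arithmetic behind the "half of the space" claim can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, and the figures count character payload only, ignoring object and array headers.

```java
/** Back-of-the-envelope storage arithmetic; class and method names are hypothetical. */
final class StringStorageEstimate {
    /** True if every char fits in one byte (code points U+0000 to U+00FF). */
    static boolean isLatin1Only(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Payload bytes when every char takes two bytes (headers ignored). */
    static int twoBytePayload(String s) {
        return 2 * s.length();
    }

    /** Payload bytes needed if Latin-1-only strings took one byte per char. */
    static int neededPayload(String s) {
        return isLatin1Only(s) ? s.length() : 2 * s.length();
    }
}
```

For the 12-character string "Hello, world", twoBytePayload returns 24 and neededPayload returns 12, so half of the payload is unused. A string containing even one character above U+00FF gets no saving under this estimate.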
Description
We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.
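As a rough illustration of a byte array paired with an encoding flag, the following Java sketch picks the encoding from the string's contents and decodes characters according to the flag. It is not the JDK's actual implementation: the class name, the field names, the flag values, and the big-endian byte order chosen for the two-byte case are all assumptions made for this example.

```java
/** Hypothetical sketch: characters stored in a byte array plus an encoding flag. */
final class CompactStringSketch {
    // Illustrative flag values; the names and values are assumptions.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value; // encoded characters
    final byte coder;   // which encoding 'value' uses

    CompactStringSketch(String s) {
        int n = s.length();
        boolean fits = true;
        for (int i = 0; i < n; i++) {
            if (s.charAt(i) > 0xFF) { // outside Latin-1
                fits = false;
                break;
            }
        }
        if (fits) {
            coder = LATIN1;
            value = new byte[n];
            for (int i = 0; i < n; i++) {
                value[i] = (byte) s.charAt(i); // one byte per character
            }
        } else {
            coder = UTF16;
            value = new byte[2 * n];
            for (int i = 0; i < n; i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8); // high byte first (assumed order)
                value[2 * i + 1] = (byte) c;    // low byte
            }
        }
    }

    /** Number of UTF-16 code units, matching String.length() semantics. */
    int length() {
        return coder == LATIN1 ? value.length : value.length >> 1;
    }

    /** Decodes one character according to the encoding flag. */
    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```

In this sketch a single character above U+00FF forces the whole string into the two-byte form, so the observable behavior of length() and charAt() stays the same in both encodings, which is the kind of compatibility the Goals section asks for.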
String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer will be updated to use the same representation, as will the HotSpot VM's intrinsic string operations.
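The text above does not say how a mutable builder would behave under this representation. One plausible approach, sketched below under that assumption, is to start with one byte per character and widen the buffer to two bytes per character the first time a character outside Latin-1 is appended. All names here are hypothetical, and the byte order of the two-byte form is an arbitrary choice for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch of a builder that widens its buffer on demand. */
final class CompactBuilderSketch {
    private byte[] value = new byte[16];
    private boolean latin1 = true; // assumed flag: one byte per char while true
    private int count;             // number of chars appended so far

    CompactBuilderSketch append(char c) {
        if (latin1 && c > 0xFF) {
            inflate(); // first char that does not fit in one byte
        }
        ensureCapacity(count + 1);
        if (latin1) {
            value[count] = (byte) c;
        } else {
            value[2 * count] = (byte) (c >> 8); // high byte first (assumed order)
            value[2 * count + 1] = (byte) c;
        }
        count++;
        return this;
    }

    /** Re-encodes the existing one-byte chars as two-byte chars. */
    private void inflate() {
        byte[] wide = new byte[value.length * 2];
        for (int i = 0; i < count; i++) {
            wide[2 * i + 1] = value[i]; // high byte stays zero
        }
        value = wide;
        latin1 = false;
    }

    private void ensureCapacity(int chars) {
        int needed = latin1 ? chars : chars * 2;
        if (needed > value.length) {
            value = Arrays.copyOf(value, Math.max(needed, value.length * 2));
        }
    }

    boolean isLatin1() {
        return latin1;
    }

    int length() {
        return count;
    }

    @Override
    public String toString() {
        char[] chars = new char[count];
        for (int i = 0; i < count; i++) {
            chars[i] = latin1
                    ? (char) (value[i] & 0xFF)
                    : (char) (((value[2 * i] & 0xFF) << 8) | (value[2 * i + 1] & 0xFF));
        }
        return new String(chars);
    }
}
```

In this sketch the widening step is a one-way transition: once the buffer holds two-byte characters it stays that way, which keeps append simple at the cost of not reclaiming space if the wide characters are later removed.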
This is purely an implementation change, with no changes to existing public interfaces. There are no plans to add any new public APIs or other interfaces.
The prototyping work done to date confirms the expected reduction in memory footprint, substantial reductions of GC activity, and minor performance regressions in some corner cases.
Alternatives
We tried a "compressed strings" feature in JDK 6 update releases, enabled by an -XX flag. When enabled, String.value was changed to an Object reference and would point either to a byte array, for strings containing only 7-bit US-ASCII characters, or else a char array. This implementation was not open-sourced, so it was difficult to maintain and keep in sync with the mainline JDK source. It has since been removed.
Testing
Thorough compatibility and regression testing will be essential for a change to such a fundamental part of the platform.
We will also need to confirm that we have fulfilled the performance goals of this project. Analysis of memory savings will need to be done. Performance testing should be done using a broad range of workloads, ranging from focused microbenchmarks to large-scale server workloads.
We will encourage the entire Java community to perform early testing with this change in order to identify any remaining issues.
Risks and Assumptions
Optimizing character storage for memory may well come with a trade-off in terms of run-time performance. We expect that this will be offset by reduced GC activity and that we will be able to maintain the throughput of typical server benchmarks. If not, we will investigate optimizations that can strike an acceptable balance between memory saving and run-time performance.
Other recent projects have already reduced the heap space used by strings, in particular JEP 192: String Deduplication in G1. Even with duplicates eliminated, the remaining string data can be made to consume less space if encoded more efficiently. We are assuming that this project will still provide a benefit commensurate with the effort required.
Is blocked by:
- JDK-8064810 JEP-JDK-8054307: Performance plan for More memory-efficient internal representation for Strings (Resolved)
Relates to:
- JDK-8146547 String copy intrinsics should zero array in case of tightly coupled allocation (Open)
- JDK-8155608 String intrinsic range checks are not strict enough (Resolved)
- JDK-8196995 java.lang.Character should not state UTF-16 encoding is used for strings (Closed)
- JDK-8162716 Doc tasks for JEP 254: Compact Strings (Resolved)
- JDK-8134758 Final String field values should be trusted as stable (Resolved)
- JDK-8144693 Intrinsify StringCoding.hasNegatives() on SPARC (Resolved)
- JDK-8046182 JEP 192: String Deduplication in G1 (Closed)
- JDK-8279833 Loop optimization issue in String.encodeUTF8_UTF16 (Resolved)
- JDK-8143553 StringBuffer.getByte(byte[], int, byte) should be package private (not protected) (Resolved)
- JDK-8144212 JDK 9 b93 breaks Apache Lucene due to compact strings (Resolved)
- JDK-8143219 AArch64 broken by 8141132: JEP 254: Compact Strings (Resolved)
- JDK-8140390 Char stores/loads accessing byte arrays must be marked as unmatched (Closed)
- JDK-8141443 jdk/test/java/util/regex/RegExTest.java fails: No match found (Closed)
- JDK-8142303 C2 compilation fails with "bad AD file" (Closed)
- JDK-8164612 NoSuchMethodException when method name contains NULL or Latin-1 supplement character (Closed)
- JDK-8144691 JEP 254: Compact Strings: endiannes mismatch in Java source code and intrinsic (Closed)
- JDK-6826329 (str) Fastpath for new String(bytes..) and String#getBytes(..) for ASCII + ISO-8859-1 (Open)
- JDK-8173585 Intrinsify StringLatin1.indexOf(char) (Resolved)
- JDK-8184943 AARCH64: Intrinsify hasNegatives (Resolved)
- JDK-6941938 Improve array equals intrinsic on SPARC (Resolved)
- JDK-8231717 Improve performance of charset decoding when charset is always compactable (Resolved)
- JDK-8059092 JEP 250: Store Interned Strings in CDS Archives (Closed)
- JDK-8085796 JEP 280: Indify String Concatenation (Closed)
- JDK-8139132 CompactStrings intrinsics should use ArrayCopyNode (Closed)
- JDK-8156861 AArch64: JEP 254: Partially-implemented intrinsics (Closed)