-
Enhancement
-
Resolution: Fixed
-
P4
-
None
-
b22
Summary
I hereby propose adding static Reader.of(CharSequence) method, providing a non-synchronized Reader implementation for all kinds of CharSequences, optimized for String, StringBuilder, StringBuffer and CharBuffer.
This idea was recently discussed on the core-libs-dev mailing list between Alan Bateman, myself, and others (see https://mail.openjdk.org/pipermail/core-libs-dev/2024-September/130468.html).
The following sections summarize and update the last state of that discussion.
Problem
Since Java 1.1 we have the StringReader class. Since Java 1.4 we have the CharSequence class. StringBuilder, StringBuffer and CharBuffer are first-class implementations of it in the JDK, and there might exist third-party implementations of non-String character streams. Until today, however, we do not have a Reader for CharSequences, but need to go costly detours.
To process non-String character streams, the typical detour today is to turn a CharSequence into a temporary String (hence duplicating its full contents), which needs time and memory (and eventually GC), for the sole sake of being processable by a StringReader. As StringReader is synchronized, each single access is synthetically slowed down. In many cases the synchronization has no use at all, as in real-world applications, least Readers are actually accessed concurrently. As a result, today the major benefit of StringBuilder over StringBuffer (being non-synchronized) vanishes as soon as a StringReader is used to access it. This means, "new StringReader(stringBuffer.toString());" imposes slower performance than essentially needed, in two ways: toString, synchronized.
Solution
In an attempt to improve performance of this rather typical use case, I like to contribute a pull request providing a non-synchronized anonymous Reader implementation using a new public Reader.of(CharSequence) method.
The implementation is a mostly unchanged copy the existing code of StringReader, but wraps CharSequence instead of String, and strips synchronization.
It adds optimized access for the String, StringBuffer, StringBuilder and CharBuffer sources in the sense of ::getChars(int,int,char[],int) and ::get(int,char[],int,int) to prevent a char-by-char loop in these cases.
The idea mostly is covered by Apache Commons IO's CharSequenceReader, which nicely serves as a PoC: https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/input/CharSequenceReader.java.
Alternatives:
- Applications could use Apache Commons IO's CharSequenceReader. As it is an open-source third-party dependency, some authors might not be allowed to use it, or may not want to carry this additional burden just for the sake of this single performance improvement. In addition, besides bug fixing, this library is not very actively maintained; its Java baseline still is Java 8 without public plans for stepping up anytime soon. There is no commercial support.
- Applications could write their own Reader implementation. Given the assumption that this is a rather common use case, this imposes unjustified additional work for the authors of thousands of applications. It is hard to justify why there is a StringReader but not a Reader for other CharSequence implementations.
- Instead of writing a new Reader class for CharSequence sources we could slightly modify StringReader, so it accepts CharSequences (not only Strings). This does not remove the synchronization overhead unless we decide to remove the synchronization from StringReader's implementation, and it would be confusing / surprising (in the negative sense) that a class named "StringReader" actually is a "CharSequenceReader".
- Instead of writing a factory for an anonymous class, we could introduce a new public CharSequenceReader class. This would pile up the stack of special-cases classes, and it would bind existing applications for all time to that alternative class name, while an anoymous class can be replaced later without recompilation of the application.
- Instead of the name "Reader.of()" (following the existing "Map.of()", "List.of()", etc.) we could go with "Reader.charSequenceReader()" (following the existing "Reader.nullReader()"). IMHO "of" feels more lighweight, modern and concise, and it allows to later adopt variants for additional source types easily, e. g. "Reader.of(char[])". "Reader.charSequenceReader()" IMHO feels rather clumsy and enforces a new lenghty name in the future for each additional source type, e. g. "Reader.charArrayReader(char[])".
Options:
- Instead of adding special cases for "String/StringBuilder/StringBuffer::getChars" and "CharBuffer::get", we could add "getChars(char[], int, int)" as a new default method to the CharSequence interface, essentially providing a char-by-char loop for all CharSequence implementations not already having an optimized getChars method. This makes the implementation of CharSequenceReader simpler, as "switch over instanceof" is not needed to benefit from optimized "getChars/get" implementations. As this option is not essentially needed, and as it implies deeper thought and discussion, this option is explicitly EXCLUDED from this current enhancement request. We can pick it up later in a subsequent enhancement request.
- Once we have CharSequence Reader, we could replace the full implementation of StringReader by synchronized calls to CharSequenceReader. This would reduce duplicate code. As code deduplication is beneficial, as the risk is low due to the unchanged nature of the the new implementation (it is a code migration, not a new code), and as the added risk can get addresses by additional tests, I propose to INCLUDE this option in this current enhancement request.
- We could adopt char[] to the list of accepted sources, as -just with CharSequence- there is no non-synchronized Reader for char[], but just the synchronized CharArrayReader. As a char[] is not a CharSequence, and as supporting it needs totally unrelated implementations and upfront discussions, support for char[] is deliberately EXCLUDED from this current enhancement request. We can address char[] in a subsequent enhancement request.
Specification:
The actual specification of the proposed new method "Reader.of(CharSequence)" is found in the accompanying Github Pull Request, as it is easier to discuss with the actual code change at hand.
I hereby propose adding static Reader.of(CharSequence) method, providing a non-synchronized Reader implementation for all kinds of CharSequences, optimized for String, StringBuilder, StringBuffer and CharBuffer.
This idea was recently discussed on the core-libs-dev mailing list between Alan Bateman, myself, and others (see https://mail.openjdk.org/pipermail/core-libs-dev/2024-September/130468.html).
The following sections summarize and update the last state of that discussion.
Problem
Since Java 1.1 we have the StringReader class. Since Java 1.4 we have the CharSequence class. StringBuilder, StringBuffer and CharBuffer are first-class implementations of it in the JDK, and there might exist third-party implementations of non-String character streams. Until today, however, we do not have a Reader for CharSequences, but need to go costly detours.
To process non-String character streams, the typical detour today is to turn a CharSequence into a temporary String (hence duplicating its full contents), which needs time and memory (and eventually GC), for the sole sake of being processable by a StringReader. As StringReader is synchronized, each single access is synthetically slowed down. In many cases the synchronization has no use at all, as in real-world applications, least Readers are actually accessed concurrently. As a result, today the major benefit of StringBuilder over StringBuffer (being non-synchronized) vanishes as soon as a StringReader is used to access it. This means, "new StringReader(stringBuffer.toString());" imposes slower performance than essentially needed, in two ways: toString, synchronized.
Solution
In an attempt to improve performance of this rather typical use case, I like to contribute a pull request providing a non-synchronized anonymous Reader implementation using a new public Reader.of(CharSequence) method.
The implementation is a mostly unchanged copy the existing code of StringReader, but wraps CharSequence instead of String, and strips synchronization.
It adds optimized access for the String, StringBuffer, StringBuilder and CharBuffer sources in the sense of ::getChars(int,int,char[],int) and ::get(int,char[],int,int) to prevent a char-by-char loop in these cases.
The idea mostly is covered by Apache Commons IO's CharSequenceReader, which nicely serves as a PoC: https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/input/CharSequenceReader.java.
Alternatives:
- Applications could use Apache Commons IO's CharSequenceReader. As it is an open-source third-party dependency, some authors might not be allowed to use it, or may not want to carry this additional burden just for the sake of this single performance improvement. In addition, besides bug fixing, this library is not very actively maintained; its Java baseline still is Java 8 without public plans for stepping up anytime soon. There is no commercial support.
- Applications could write their own Reader implementation. Given the assumption that this is a rather common use case, this imposes unjustified additional work for the authors of thousands of applications. It is hard to justify why there is a StringReader but not a Reader for other CharSequence implementations.
- Instead of writing a new Reader class for CharSequence sources we could slightly modify StringReader, so it accepts CharSequences (not only Strings). This does not remove the synchronization overhead unless we decide to remove the synchronization from StringReader's implementation, and it would be confusing / surprising (in the negative sense) that a class named "StringReader" actually is a "CharSequenceReader".
- Instead of writing a factory for an anonymous class, we could introduce a new public CharSequenceReader class. This would pile up the stack of special-cases classes, and it would bind existing applications for all time to that alternative class name, while an anoymous class can be replaced later without recompilation of the application.
- Instead of the name "Reader.of()" (following the existing "Map.of()", "List.of()", etc.) we could go with "Reader.charSequenceReader()" (following the existing "Reader.nullReader()"). IMHO "of" feels more lighweight, modern and concise, and it allows to later adopt variants for additional source types easily, e. g. "Reader.of(char[])". "Reader.charSequenceReader()" IMHO feels rather clumsy and enforces a new lenghty name in the future for each additional source type, e. g. "Reader.charArrayReader(char[])".
Options:
- Instead of adding special cases for "String/StringBuilder/StringBuffer::getChars" and "CharBuffer::get", we could add "getChars(char[], int, int)" as a new default method to the CharSequence interface, essentially providing a char-by-char loop for all CharSequence implementations not already having an optimized getChars method. This makes the implementation of CharSequenceReader simpler, as "switch over instanceof" is not needed to benefit from optimized "getChars/get" implementations. As this option is not essentially needed, and as it implies deeper thought and discussion, this option is explicitly EXCLUDED from this current enhancement request. We can pick it up later in a subsequent enhancement request.
- Once we have CharSequence Reader, we could replace the full implementation of StringReader by synchronized calls to CharSequenceReader. This would reduce duplicate code. As code deduplication is beneficial, as the risk is low due to the unchanged nature of the the new implementation (it is a code migration, not a new code), and as the added risk can get addresses by additional tests, I propose to INCLUDE this option in this current enhancement request.
- We could adopt char[] to the list of accepted sources, as -just with CharSequence- there is no non-synchronized Reader for char[], but just the synchronized CharArrayReader. As a char[] is not a CharSequence, and as supporting it needs totally unrelated implementations and upfront discussions, support for char[] is deliberately EXCLUDED from this current enhancement request. We can address char[] in a subsequent enhancement request.
Specification:
The actual specification of the proposed new method "Reader.of(CharSequence)" is found in the accompanying Github Pull Request, as it is easier to discuss with the actual code change at hand.
- csr for
-
JDK-8341596 Add Reader.of(CharSequence)
- Closed
- links to
-
Commit(master) openjdk/jdk/3c14c2ba
-
Review(master) openjdk/jdk/21371
There are no Sub-Tasks for this issue.