Bug
Resolution: Unresolved
P4
None
generic
generic
There is a performance regression in java.util.Scanner introduced by 8236201 (CPU 2020-04). JDK mainline, 11u, and 8u are affected, possibly along with other releases.
8236201 changed the regexp strings for Scanner::groupSeparator and Scanner::decimalSeparator from simple backslash escaping (\...) to the more complex hexadecimal form (\x{...}) [1][2].
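For illustration, the two escaping styles can be sketched as below. This is not the Scanner source; the class and method names (SeparatorEscapes, simpleEscape, hexEscape) are hypothetical, and the exact string construction used in Scanner is assumed from the description above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK: builds both escape forms
// for a separator character so they can be compared side by side.
final class SeparatorEscapes {

    // Old style (assumed): a backslash followed by the literal character.
    static String simpleEscape(char c) {
        return "\\" + c;
    }

    // New style (assumed): the character's code point in hexadecimal, \x{...}.
    static String hexEscape(char c) {
        return "\\x{" + Integer.toHexString(c) + "}";
    }

    public static void main(String[] args) {
        char sep = ',';
        System.out.println(simpleEscape(sep));
        System.out.println(hexEscape(sep));
        // Both forms match exactly the separator character.
        System.out.println(Pattern.matches(simpleEscape(sep), ","));
        System.out.println(Pattern.matches(hexEscape(sep), ","));
    }
}
```

Both strings describe a single literal character; they differ only in how much work the regexp parser does to get there.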
This comes at a cost in compilation time, because Pattern::x now has to be called [3]. The Pattern::Single automaton node that gets built is the same, so the Matcher's performance should not be affected [4]. Some of the compilations occur only once per Scanner instance or per Scanner::useLocale call, but others occur while scanning the input [5]. That would explain why the degradation is noticeable even with a single Scanner going through a large input.
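A rough way to observe the per-call compilation cost is sketched below. It is not the attached Main.java or the reproducer. It assumes the hot path has the shape token.replaceAll(separatorRegex, ""), which is an assumption about what [5] points to; String::replaceAll compiles its regexp argument on every call. The class name SeparatorCompileCost and the iteration count are hypothetical, and a proper measurement would use a harness such as JMH.

```java
// Hypothetical micro-benchmark sketch: times repeated String::replaceAll
// calls, each of which recompiles the separator regexp.
final class SeparatorCompileCost {

    // Written on every run so the JIT cannot discard the loop body.
    static volatile long sink;

    static long timeReplaceAll(String token, String separatorRegex, int iterations) {
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            total += token.replaceAll(separatorRegex, "").length();
        }
        long elapsed = System.nanoTime() - start;
        sink = total;
        return elapsed;
    }

    public static void main(String[] args) {
        String token = "1,234,567";
        int iterations = 1_000_000; // arbitrary, chosen for illustration

        // Warm-up pass so both variants are measured after JIT compilation.
        timeReplaceAll(token, "\\,", iterations);
        timeReplaceAll(token, "\\x{2c}", iterations);

        long simple = timeReplaceAll(token, "\\,", iterations);
        long hex = timeReplaceAll(token, "\\x{2c}", iterations);
        System.out.printf("simple escape: %d ms%n", simple / 1_000_000);
        System.out.printf("hex escape:    %d ms%n", hex / 1_000_000);
    }
}
```

Both variants produce the same result string, so any difference in the timings comes from compiling the regexp rather than from matching.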
We can optimize for the default or most common cases (some group and decimal separators are shared by many locales).
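One possible shape for such an optimization is sketched below: compile the separator pattern once per distinct character and reuse it. This is only an illustration under stated assumptions, not a proposed patch; the class SeparatorPatternCache and its methods are hypothetical and do not exist in the JDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical cache: one compiled Pattern per separator character, shared
// by all callers. The key space is bounded by the range of char.
final class SeparatorPatternCache {

    private static final Map<Integer, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns the compiled \x{...} pattern for the separator, compiling it
    // only the first time a given character is requested.
    static Pattern forSeparator(char sep) {
        return CACHE.computeIfAbsent((int) sep,
                cp -> Pattern.compile("\\x{" + Integer.toHexString(cp) + "}"));
    }

    // Removes every occurrence of the separator without recompiling.
    static String stripSeparator(String token, char sep) {
        return forSeparator(sep).matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripSeparator("1,234,567", ','));
        System.out.println(stripSeparator("1.234.567", '.'));
    }
}
```

Because many locales share the same separators, a small shared cache like this would be hit far more often than it is filled. A fixed set of precompiled constants for the default separators would be an alternative with no map lookup.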
Attached to this ticket you'll find:
* A reproducer (TestScannerNoIOManyStreams.java)
* A simple playground program for comparing regexp compilation times (Main.java)
--
[1] - https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/rev/a8f0a9ef1797#l1.34
[2] - https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/rev/a8f0a9ef1797#l1.35
[3] - https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/84c5676f140b/src/share/classes/java/util/regex/Pattern.java#l3209
[4] - https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/84c5676f140b/src/share/classes/java/util/regex/Pattern.java#l3831
[5] - https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/84c5676f140b/src/share/classes/java/util/Scanner.java#l2042