-
Bug
-
Resolution: Unresolved
-
P4
-
8, 11, 17, 18, 19
-
generic
-
generic
ADDITIONAL SYSTEM INFORMATION :
slackware64-current, gentoo, with an oracle build of openjdk 18 but it's about the same across all openjdk's i tried: 8,11,15,16,17,18. ryzen 3900x system.
A DESCRIPTION OF THE PROBLEM :
certain regular expressions run a orders of magnitude slower than expected.
It mostly seems to be tied to the complexity of the character class but is easy to breach. It might also be related to nested repeating patterns ((foo)\.*)
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a regex with a non-trivial character class and execute matcher(string).find() on a long string.
This was discovered when diagnosing a performance issue with netbeans' output window, it uses the following pattern to detect stack traces in output:
private static final String JIDENT = "[\\p{javaJavaIdentifierStart}][\\p{javaJavaIdentifierPart}]*";
private static final Pattern STACK_TRACE = Pattern.compile(
"(.*?((?:" + JIDENT + "[.])*)(" + JIDENT + ")[.](?:" + JIDENT + "|<init>|<clinit>)" +
"[(])(((?:"+JIDENT+"(?:\\."+JIDENT+")*/)?" + JIDENT + "[.]java):([0-9]+)|Unknown Source)([)].*)");
After some investigation i came up with a simpler example which appears to expose the same issue.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
All character class options in an otherwise identical pattern run in a reasonable time.
ACTUAL -
The Pattern SLOW with [a-z0-9] in it is ~60x slower than the one that only uses [a-z] in the same location.
---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SlowMatching {
static long stamp = System.nanoTime();
static void time(String what) {
System.out.printf(" %12.9f %s\n", (System.nanoTime() - stamp) * 1E-9, what);
stamp = System.nanoTime();
}
public static void main(String[] args) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 200; i++) {
sb.append(i);
sb.append("foo");
}
String s = sb.toString();
Pattern SLOW = Pattern.compile(".*((?:[a-z]*\\.)*[a-z0-9]*)\\(.*");
Pattern FAST = Pattern.compile(".*((?:[a-z]*\\.)*[a-z]*)\\(.*");
for (int i = 0; i < 10; i++) {
SLOW.matcher(s).find();
time("slow.find()");
FAST.matcher(s).find();
time("fast.find()");
SLOW.matcher(s).matches(); // also fast, much faster
time("slow.matches()");
}
}
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Depending on the application Matcher::matches() doesn't seem to suffer from the problem and might be usable as an alternative if the semantics of find() are not required, perhaps with a modified pattern. For example netbeans can just use matches() instead since the pattern above matches the whole string anyway.
FREQUENCY : always
slackware64-current, gentoo, with an oracle build of openjdk 18 but it's about the same across all openjdk's i tried: 8,11,15,16,17,18. ryzen 3900x system.
A DESCRIPTION OF THE PROBLEM :
certain regular expressions run a orders of magnitude slower than expected.
It mostly seems to be tied to the complexity of the character class but is easy to breach. It might also be related to nested repeating patterns ((foo)\.*)
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a regex with a non-trivial character class and execute matcher(string).find() on a long string.
This was discovered when diagnosing a performance issue with netbeans' output window, it uses the following pattern to detect stack traces in output:
private static final String JIDENT = "[\\p{javaJavaIdentifierStart}][\\p{javaJavaIdentifierPart}]*";
private static final Pattern STACK_TRACE = Pattern.compile(
"(.*?((?:" + JIDENT + "[.])*)(" + JIDENT + ")[.](?:" + JIDENT + "|<init>|<clinit>)" +
"[(])(((?:"+JIDENT+"(?:\\."+JIDENT+")*/)?" + JIDENT + "[.]java):([0-9]+)|Unknown Source)([)].*)");
After some investigation i came up with a simpler example which appears to expose the same issue.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
All character class options in an otherwise identical pattern run in a reasonable time.
ACTUAL -
The Pattern SLOW with [a-z0-9] in it is ~60x slower than the one that only uses [a-z] in the same location.
---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SlowMatching {
static long stamp = System.nanoTime();
static void time(String what) {
System.out.printf(" %12.9f %s\n", (System.nanoTime() - stamp) * 1E-9, what);
stamp = System.nanoTime();
}
public static void main(String[] args) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 200; i++) {
sb.append(i);
sb.append("foo");
}
String s = sb.toString();
Pattern SLOW = Pattern.compile(".*((?:[a-z]*\\.)*[a-z0-9]*)\\(.*");
Pattern FAST = Pattern.compile(".*((?:[a-z]*\\.)*[a-z]*)\\(.*");
for (int i = 0; i < 10; i++) {
SLOW.matcher(s).find();
time("slow.find()");
FAST.matcher(s).find();
time("fast.find()");
SLOW.matcher(s).matches(); // also fast, much faster
time("slow.matches()");
}
}
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Depending on the application Matcher::matches() doesn't seem to suffer from the problem and might be usable as an alternative if the semantics of find() are not required, perhaps with a modified pattern. For example netbeans can just use matches() instead since the pattern above matches the whole string anyway.
FREQUENCY : always