Type: Bug
Resolution: Unresolved
Priority: P4
Fix Version/s: tbd
Affects Version/s: 8, 11, 17, 18, 19
Component/s: core-libs
Labels:

Subcomponent: java.util.regex
CPU: generic
OS: generic

ADDITIONAL SYSTEM INFORMATION :
Slackware64-current and Gentoo, with an Oracle build of OpenJDK 18; the behavior is about the same across all OpenJDK versions I tried: 8, 11, 15, 16, 17, 18. Ryzen 3900X system.

A DESCRIPTION OF THE PROBLEM :
Certain regular expressions run orders of magnitude slower than expected.

The slowdown mostly seems to be tied to the complexity of the character class, and it is easy to trigger. It might also be related to nested repeating patterns such as ((foo)\.*)

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a regex with a non-trivial character class and execute matcher(string).find() on a long string.

This was discovered while diagnosing a performance issue with NetBeans' output window, which uses the following pattern to detect stack traces in output:

   private static final String JIDENT = "[\\p{javaJavaIdentifierStart}][\\p{javaJavaIdentifierPart}]*";
   private static final Pattern STACK_TRACE = Pattern.compile(
            "(.*?((?:" + JIDENT + "[.])*)(" + JIDENT + ")[.](?:" + JIDENT + "|<init>|<clinit>)" +
            "[(])(((?:"+JIDENT+"(?:\\."+JIDENT+")*/)?" + JIDENT + "[.]java):([0-9]+)|Unknown Source)([)].*)");

After some investigation I came up with a simpler example that appears to expose the same issue.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
All character class options in an otherwise identical pattern run in a reasonable time.

ACTUAL -
The pattern SLOW, which contains [a-z0-9], is ~60x slower than the pattern FAST, which uses only [a-z] in the same location.

---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlowMatching {
    static long stamp = System.nanoTime();
    static void time(String what) {
        System.out.printf(" %12.9f %s\n", (System.nanoTime() - stamp) * 1E-9, what);
        stamp = System.nanoTime();
    }

    public static void main(String[] args) {
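        // Build a long input ("0foo1foo...199foo") that contains no '(',
        // so neither pattern can match it.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append(i);
            sb.append("foo");
        }
        String s = sb.toString();

        // The two patterns differ only in the final character class.
        Pattern SLOW = Pattern.compile(".*((?:[a-z]*\\.)*[a-z0-9]*)\\(.*");
        Pattern FAST = Pattern.compile(".*((?:[a-z]*\\.)*[a-z]*)\\(.*");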

        for (int i = 0; i < 10; i++) {
            SLOW.matcher(s).find();
            time("slow.find()");
            FAST.matcher(s).find();
            time("fast.find()");
            SLOW.matcher(s).matches(); // matches() on SLOW is also fast, much faster than find()
            time("slow.matches()");
        }
    }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Depending on the application, Matcher::matches() doesn't seem to suffer from the problem and might be usable as an alternative if the semantics of find() are not required, perhaps with a modified pattern. For example, NetBeans can just use matches() instead, since the pattern above matches the whole string anyway.
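
A minimal sketch of this workaround, using the SLOW pattern from the test case rather than the NetBeans pattern. The class name MatchesWorkaround and both helper methods are hypothetical, made up for illustration. It assumes single-line input: "." does not match line terminators by default, so on multi-line input find() could succeed where matches() fails.

```java
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesWorkaround {
    // Same pattern as SLOW in the test case: it begins and ends with ".*",
    // so a successful match on a single-line input spans the whole input.
    static final Pattern SLOW = Pattern.compile(".*((?:[a-z]*\\.)*[a-z0-9]*)\\(.*");

    // Original approach: find() may retry the pattern at later start offsets.
    static String group1ViaFind(String line) {
        Matcher m = SLOW.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Workaround: matches() attempts a single match anchored to the whole input.
    static String group1ViaMatches(String line) {
        Matcher m = SLOW.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, short enough that both calls are quick.
        String hit = "at foo.bar.baz(Foo.java:12)";
        String miss = "no parenthesis here";

        // Both approaches should agree on single-line input.
        System.out.println(Objects.equals(group1ViaFind(hit), group1ViaMatches(hit)));
        System.out.println(Objects.equals(group1ViaFind(miss), group1ViaMatches(miss)));
    }
}
```

Whether the swap is safe for a given pattern depends on that pattern. It is only a drop-in replacement when the pattern already covers the entire input, as the leading and trailing ".*" do here.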

FREQUENCY : always

Attachment: SlowMatching.java (1.0 kB, 2022-05-26 05:51)

Assignee: Ian Graves
Reporter: Webbug Group
Votes: 0
Watchers: 2
Created: 2022-05-23 05:58
Updated: 2022-06-07 00:57
