There is an increasing focus on the performance of regular expression matching, and especially on its predictability:
- regular expression denial of service attacks are sufficiently well known in the industry that the term ReDoS has been coined.
- it is common for websites to provide a search service over large text corpora using user-provided regexes, and a reliably efficient (i.e. O(N)) regex evaluation engine may be a strict requirement. This was the impetus for Russ Cox's re2 library.
- it is surprisingly difficult to solve some common software engineering problems efficiently with just a regex. The prime example is detecting and removing trailing whitespace, which is explored in a microbenchmark and caused the famous Stack Overflow outage (the \s+$ incident). Even careful crafting of regexes by a skilled engineer, using e.g. possessive quantifiers, cannot reliably solve a particular regex performance problem.
- The use of Matcher#find (instead of Matcher#matches) is very convenient, but introduces an implicit O(N) loop over the input (equivalently, a non-possessive "^.*?" prefix in the regex). For the entire search operation to avoid O(N^2) behavior, most of the individual match attempts made while scanning the input must fail in O(1), which may require less-obvious constructs like lookbehind. Possessive quantifiers in the regex itself are sadly insufficient.
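The find-loop hazard and the lookbehind remedy can be sketched in a few lines. The "linear" variant below is one illustrative construction (a negative lookbehind plus a possessive quantifier), not an official recommendation; only its correctness, not its timing, is demonstrated here.

```java
import java.util.regex.Pattern;

public class TrailingWhitespace {
    // Quadratic under Matcher#find: after each failed attempt the scan
    // restarts one character later and re-reads the same whitespace run.
    public static final Pattern NAIVE = Pattern.compile("\\s+$");

    // Illustrative linear remedy: the negative lookbehind fails in O(1) at
    // every position inside a whitespace run, so each run is scanned at most
    // once from its start; the possessive \s++ prevents backtracking.
    public static final Pattern LINEAR = Pattern.compile("(?<!\\s)\\s++$");

    public static void main(String[] args) {
        String s = "hello world   \t ";
        System.out.println(NAIVE.matcher(s).replaceAll(""));  // "hello world"
        System.out.println(LINEAR.matcher(s).replaceAll("")); // "hello world"
    }
}
```

Both patterns strip the trailing run, but only the second one fails fast at the O(N) intermediate positions that Matcher#find visits.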
The current (JDK 16) implementation is an NFA-based backtracking engine. It mitigates, but does not eliminate, O(2^N) worst-case performance, and StackOverflowError is a risk.
Make it possible to build a ReDoS-safe "regex search engine" that can safely accept user-provided regexes, with O(N) runtime performance, O(1) stack space usage, and as close as possible to O(M) space usage for the compiled regex. It is acceptable to drop some regex features, such as back-references. Users can ask for safety or power; no one knows how to provide both. Of course, we maintain compatibility: the legacy API must retain its current (unsafe) behavior.
The re2 library strongly advocates building regex engines around DFA execution, which guarantees O(N) matching performance. But a DFA engine cannot handle some popular modern regex features, notably back-references. The re2 library simply chooses not to support such features, a reasonable restriction when regexes are user-supplied. One plausible evolution of the API is to add a flag to Pattern.compile(String, int) that specifically requests O(N) performance (or fails at compile time). Another plausible evolution is to allow alternate implementations of Pattern, and to ensure that most Java SE APIs that take a String regex also accept a Pattern.
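The "request O(N) or fail" idea could take roughly the following shape. Everything here is hypothetical: the entry point, its name, and the crude back-reference scan are stand-ins for illustration only; a real implementation would need a proper parser and an actual linear-time engine behind it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of a compile entry point that rejects regex features
// (here, just back-references) that preclude linear-time execution.
public final class LinearPatterns {
    // Back-references look like \1..\9 or \k<name>. This textual scan is a
    // deliberate simplification (it would also flag an escaped backslash
    // followed by a digit); a real check would parse the pattern.
    private static final Pattern BACKREF = Pattern.compile("\\\\([1-9]|k<\\w+>)");

    public static Pattern compileLinear(String regex) {
        if (BACKREF.matcher(regex).find()) {
            throw new PatternSyntaxException(
                "back-references preclude O(N) matching", regex, -1);
        }
        // Stand-in: a real implementation would hand off to a DFA engine here.
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        System.out.println(compileLinear("a+b").matcher("aaab").matches()); // true
        try {
            compileLinear("(a)\\1");
        } catch (PatternSyntaxException e) {
            System.out.println("rejected: " + e.getDescription());
        }
    }
}
```

The important property is that unsupported patterns fail loudly at compile time, rather than silently falling back to unpredictable backtracking.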
The re2 library has been ported from C++ to other languages:
- the standard regexp library for golang.
- re2j is a port to Java, but it is incomplete, perhaps because j.u.regex is a strong incumbent.
Some library implementations are a thin wrapper around native re2. Such wrappers work better in languages other than Java: there is more cultural acceptance of native wrappers, and the refcounting GC common in such languages is better at promptly releasing resources. No one wants to close() their Patterns when they are done with them, just to release the associated native memory!
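The lifecycle burden a native wrapper would impose can be made concrete. NativePattern below is a stand-in, not a real re2 binding; the point is only the shape of the API that explicit native-memory ownership forces on callers.

```java
// Hypothetical sketch: a thin Java wrapper over native re2 would own native
// memory, so the compiled pattern must be explicitly closed. The handle and
// match logic are stand-ins, not a real binding.
public class NativePattern implements AutoCloseable {
    private long handle = 0xCAFE; // stand-in for a pointer from native re2

    public boolean matches(String input) {
        if (handle == 0) throw new IllegalStateException("pattern already closed");
        return !input.isEmpty(); // stand-in for the native match call
    }

    @Override public void close() {
        handle = 0; // stand-in for freeing the native re2 object
    }

    public static void main(String[] args) {
        // The usage burden: try-with-resources on every compiled pattern,
        // which j.u.regex users have never had to think about.
        try (NativePattern p = new NativePattern()) {
            System.out.println(p.matches("abc")); // true
        }
    }
}
```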
GNU grep, which has invested heavily in performance, implements both DFA and NFA engines, falling back to NFA if the regex cannot be compiled to DFA.
OpenJDK addressed the problem of exponential O(2^N) performance in JDK-6328855 by memoizing previously failed match attempts. This mitigates most exponential performance problems caused by sloppy regexes, but:
- it's an incomplete solution, and users have no performance guarantees.
- the optimization is turned off in the presence of back-references.
- O(N^2) performance problems remain, as with Matcher#find.
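The pattern shape involved looks like this. The example only asserts the match result, not timing; per the notes above, the memoization added in JDK-6328855 is what keeps this case tolerable, and a back-reference anywhere in the pattern would switch that optimization off.

```java
import java.util.regex.Pattern;

public class Pathological {
    public static void main(String[] args) {
        // A classically exponential backtracking shape: a run of 'a's can be
        // split between the two branches in exponentially many ways, and a
        // naive backtracker explores all of them before failing.
        Pattern evil = Pattern.compile("(a|aa)+c");
        String almost = "a".repeat(30) + "b"; // long run, then a non-match

        // No 'c' anywhere, so the match must fail at every start position.
        System.out.println(evil.matcher(almost).find()); // false
    }
}
```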
Perl addressed the problem by adding Backtracking Control Verbs, which allow a highly skilled regex programmer to produce an efficient program, but at the cost of locking into the backtracking execution model and not solving the general problem.
Modern Java includes invokedynamic and the ability to build highly performant languages on the JVM. Those techniques can be applied to the regular expression language as well.
An optimized DFA or NFA engine may benefit from the value types being developed in Project Valhalla.
Even though the very first implementation of regular expressions, by Ken Thompson, used an efficient DFA, most current implementations use backtracking NFAs. It might be best to re-engineer all such implementations to return to DFAs (where possible), but that is an enormous engineering effort.
There are resources other than runtime cpu to be considered:
- Stack space is more constrained than heap space, so regex engines should not use O(N) stack space when matching a quantified group, but JDK-8260866 demonstrates a stack overflow.
- The theoretical result that gives O(N) runtime matching can require O(2^M) time for regex compilation, where M is the length of the regex. Real world DFA engines may have a compile time resource usage check or fail with memory exhaustion. We don't want an unused Pattern to consume gigabytes of heap. An ideal implementation may need to balance compile-time and run-time resources.
The very ambitious sregex project (also based on re2) claims that "no pathological regexes exist for this regex engine because it does not use a backtracking algorithm at all", but it appears to be unfinished and inactive.
Building a great engine is hard!