CSR
Resolution: Approved
P4
None
behavioral
low
Java API
SE
Summary
Regular expression pattern matching loses character class after intersection (&&) operator.
This fixes a bug in the regex compiler's handling of the intersection operator && so that it no longer drops certain character classes. The buggy behavior is long-standing: it has existed since at least JDK 7, and likely earlier.
Problem
When the right-hand side of an intersection operator && mixes character classes inside square brackets ([..]) with classes outside them, the compiler drops some of them from the matchers it produces. The resulting matchers are broken: they are missing character classes that the pattern specifies. Without a fix, the behavior stays broken. This is publicly documented and known (see the second paragraph in the "Intersection of Multiple Classes" subsection).
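A minimal probe of the problem, offered as a sketch rather than as the official regression test. The pattern `[A-Z&&[A-Z]0-9]` is chosen here for illustration, on the assumption that a bracketed class followed by an unbracketed range on the right-hand side is enough to trigger the dropped class. The class name `IntersectionProbe` and its helper methods are hypothetical. The probe compares what `java.util.regex.Pattern` reports against a hand-written reading of the pattern as (A-Z) intersected with the union of [A-Z] and 0-9; the `actual` column depends on whether the running JDK contains the fix, so no output is shown.

```java
import java.util.regex.Pattern;

final class IntersectionProbe {
    // Illustrative pattern (an assumption, not taken from the CSR text):
    // left side A-Z, right side mixes a bracketed class with a bare range.
    static final String PATTERN = "[A-Z&&[A-Z]0-9]";

    // Hand-written reference: (A-Z) && ([A-Z] union 0-9).
    static boolean expected(char c) {
        boolean left = c >= 'A' && c <= 'Z';
        boolean right = (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
        return left && right;
    }

    // What the regex engine of the running JDK actually reports.
    static boolean actual(char c) {
        return Pattern.compile(PATTERN).matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'Z', '5', 'a'}) {
            System.out.println(c + " expected=" + expected(c) + " actual=" + actual(c));
        }
    }
}
```

A mismatch between the two columns for `A` or `Z` would indicate that the running JDK still drops the bracketed class.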
Solution
The solution is to fix the regex compiler so that, instead of clobbering the matchers it constructs for the right-hand side of the intersection operation, it merges them with union operators. This brings the behavior in line with Ruby's regular expressions. Python's bundled re library doesn't support intersection. Perl and JavaScript do not support nested expressions inside square brackets the way Java and Ruby do.
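The difference between clobbering and merging can be modelled outside the regex compiler. The sketch below is not the code in Pattern.java: `RightSideMerge`, `Operand`, `range`, `mergeOld` and `mergeNew` are hypothetical names, and `java.util.function.IntPredicate` stands in for the compiler's internal matcher type. It only illustrates how the right-hand side is accumulated before it is intersected with the left-hand side.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

final class RightSideMerge {
    /** Hypothetical stand-in for one operand parsed after &&. */
    static final class Operand {
        final boolean bracketed;   // true for [..], false for a bare class
        final IntPredicate pred;

        Operand(boolean bracketed, IntPredicate pred) {
            this.bracketed = bracketed;
            this.pred = pred;
        }
    }

    static IntPredicate range(int lo, int hi) {
        return c -> c >= lo && c <= hi;
    }

    // Models the old behavior: a bare operand replaces whatever was accumulated.
    static IntPredicate mergeOld(List<Operand> ops) {
        IntPredicate right = null;
        for (Operand op : ops) {
            if (op.bracketed) {
                right = (right == null) ? op.pred : right.or(op.pred);
            } else {
                right = op.pred; // clobbers earlier operands
            }
        }
        return right;
    }

    // Models the fixed behavior: every operand is merged by union.
    static IntPredicate mergeNew(List<Operand> ops) {
        IntPredicate right = null;
        for (Operand op : ops) {
            right = (right == null) ? op.pred : right.or(op.pred);
        }
        return right;
    }

    public static void main(String[] args) {
        IntPredicate left = range('A', 'Z');
        List<Operand> ops = Arrays.asList(
                new Operand(true, range('A', 'Z')),    // [A-Z]
                new Operand(false, range('0', '9')));  // 0-9
        System.out.println("old: " + left.and(mergeOld(ops)).test('A')); // old: false
        System.out.println("new: " + left.and(mergeNew(ops)).test('A')); // new: true
    }
}
```

In this model the old merge leaves only 0-9 on the right-hand side, so intersecting with A-Z yields an empty class; the union-based merge keeps [A-Z] and the intersection matches `A` again.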
Specification
--- a/src/java.base/share/classes/java/util/regex/Pattern.java
+++ b/src/java.base/share/classes/java/util/regex/Pattern.java
@@ -2663,7 +2663,11 @@ loop: for(int x=0, offset=0; x<nCodePoints; x++, offset+=len) {
                                     right = right.union(clazz(true));
                             } else { // abc&&def
                                 unread();
-                                right = clazz(false);
+                                if(right == null) {
+                                    right = clazz(false);
+                                } else {
+                                    right = right.union(clazz(false));
+                                }
                             }
                             ch = peek();
                         }
csr of: JDK-8037397 RegEx pattern matching loses character class after intersection (&&) operator (Resolved)