CSR
Resolution: Approved
P4
None
behavioral
low
Java API
SE
Summary
The behavior of the \b (word boundary; zero-width match) metacharacter is not consistent with that of the \w (word character) metacharacter when the java.util.regex.Pattern.UNICODE_CHARACTER_CLASS flag is not set. This inconsistency appears to be a bug or oversight. This CSR proposes changing the behavior of \b so that it is consistent with \w whether or not the UNICODE_CHARACTER_CLASS flag is set.
Problem
Conceptually, the word boundary metacharacter \b is supposed to be a zero-width match at positions where a word character \w abuts a non-word character \W. If one makes the reasonable assumption that "word" in "word boundary" means the same as "word" in "word character," then the definition of \b should be equivalent to (?:(?<=\w)(?=\W)|(?<=\W)(?=\w)), which is a zero-width position where the character behind is a word character and the character ahead is a non-word character, or vice versa.
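As an illustration of the lookaround expression above, the following minimal sketch lists the positions at which \b and the explicit lookaround form match. The class and method names are made up for this example. The input is ASCII-only and padded with spaces, so the start and end of the input (where the lookaround form has no character to inspect on one side) play no part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryLookaround {
    // The lookaround form quoted in the text above.
    static final Pattern LOOKAROUND =
            Pattern.compile("(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))");
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b");

    // Collects the start index of every zero-width match of p in input.
    static List<Integer> positions(Pattern p, CharSequence input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = " ab cd ";
        System.out.println(positions(WORD_BOUNDARY, input)); // [1, 3, 4, 6]
        System.out.println(positions(LOOKAROUND, input));    // [1, 3, 4, 6]
    }
}
```

For ASCII-only input like this, both patterns report the same positions regardless of whether the change proposed here has been applied.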
In the absence of the UNICODE_CHARACTER_CLASS flag, the \w metacharacter matches [a-zA-Z_0-9], per the spec. The spec is vague on what the \b metacharacter will match. Inspecting the implementation shows that \b relies on j.l.Character.isLetterOrDigit along with a check for underscores. The issue is that isLetterOrDigit also matches Unicode letters and digits outside the range specified by \w. When the UNICODE_CHARACTER_CLASS flag is set, the behavior of \w and \b is consistent.
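The mismatch described above can be seen on a single character. The sketch below is only an illustration: the class and helper names are hypothetical, and legacyBoundaryWord merely restates the isLetterOrDigit-plus-underscore test described in the text; it is not the JDK's internal code.

```java
import java.util.regex.Pattern;

public class WordClassCheck {
    // True if \w, compiled with the given flags, matches the single code point.
    static boolean matchesWordClass(int codePoint, int flags) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // Restatement of the word test the text attributes to \b (an assumption
    // for illustration, not a call into java.util.regex internals).
    static boolean legacyBoundaryWord(int ch) {
        return ch == '_' || Character.isLetterOrDigit(ch);
    }

    public static void main(String[] args) {
        int eAcute = 0x00E9; // LATIN SMALL LETTER E WITH ACUTE

        // \w without the flag only covers [a-zA-Z_0-9].
        System.out.println(matchesWordClass(eAcute, 0)); // false

        // The isLetterOrDigit-based test accepts the same character.
        System.out.println(legacyBoundaryWord(eAcute)); // true

        // With the flag set, \w accepts it too, so the two agree.
        System.out.println(matchesWordClass(eAcute, Pattern.UNICODE_CHARACTER_CLASS)); // true
    }
}
```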
The spec is silent on this inconsistency. Indeed, in the implementation the \b metacharacter is sensitive to the UNICODE_CHARACTER_CLASS flag: a different code path is followed when it is set, which further strengthens the assumption that this is an oversight or a bug. The inconsistency between these two metacharacters would be very hard to explain or justify in the spec. In other regex implementations generally (Python, Ruby, Rust, for example), these two metacharacters complement one another in an ASCII-sensitive or Unicode-sensitive fashion.
Solution
The solution is to alter the implementation of the \b metacharacter so that its matcher uses the ASCII_WORD() predicate in java.util.regex.CharPredicates and therefore uses the same range of characters as \w when determining word boundaries. This makes the two consistent both when UNICODE_CHARACTER_CLASS is set and when it is not.
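To make the effect of swapping the predicate concrete, here is a simplified model of boundary detection with a pluggable word test. It is a sketch under stated assumptions, not the JDK implementation: CharPredicates is internal to java.util.regex, so asciiWord below is a hypothetical stand-in that hard-codes the \w range [a-zA-Z_0-9], and the model works on char indices and ignores supplementary code points and any other details of the real matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class BoundaryModel {
    // Hypothetical stand-in for the ASCII word predicate: the \w range [a-zA-Z_0-9].
    static boolean asciiWord(int ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                || (ch >= '0' && ch <= '9') || ch == '_';
    }

    // The isLetterOrDigit-plus-underscore test described in the Problem section.
    static boolean legacyWord(int ch) {
        return ch == '_' || Character.isLetterOrDigit(ch);
    }

    // A position is a boundary when exactly one neighbor is a word character.
    // Positions outside the string count as non-word.
    static List<Integer> boundaries(String s, IntPredicate isWord) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            boolean left = i > 0 && isWord.test(s.charAt(i - 1));
            boolean right = i < s.length() && isWord.test(s.charAt(i));
            if (left != right) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 au"; // "café au"
        System.out.println(boundaries(input, BoundaryModel::legacyWord)); // [0, 4, 5, 7]
        System.out.println(boundaries(input, BoundaryModel::asciiWord));  // [0, 3, 5, 7]
    }
}
```

In this model the only difference is the boundary around the accented letter: the ASCII test places it between "f" and "é" (index 3), which is where \w stops matching, while the older test places it after "é" (index 4).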
Specification
In src/java.base/share/classes/java/util/regex/Pattern.java:
@@ -5364,7 +5364,7 @@ loop: for(int x=0, offset=0; x<nCodePoints; x++, offset+=len) {
boolean isWord(int ch) {
return useUWORD ? CharPredicates.WORD().is(ch)
- : (ch == '_' || Character.isLetterOrDigit(ch));
+ : CharPredicates.ASCII_WORD().is(ch);
}
We also update the spec to state explicitly that the \b metacharacter is equivalent to (?:(?<=\w)(?=\W)|(?<=\W)(?=\w)). In the Predefined character classes section of the java.util.regex.Pattern spec, we note that the definitions of all metacharacters sensitive to the Pattern.UNICODE_CHARACTER_CLASS flag can be affected by it, with forward links to how they behave when that flag is set. Spec changes follow.
diff --git a/src/java.base/share/classes/java/util/regex/Pattern.java b/src/java.base/share/classes/java/util/regex/Pattern.java
index 7ef1dabadde..66c511b2ff4 100644
--- a/src/java.base/share/classes/java/util/regex/Pattern.java
+++ b/src/java.base/share/classes/java/util/regex/Pattern.java
@@ -158,7 +158,8 @@ import jdk.internal.util.ArraysSupport;
* <tr><th style="vertical-align:top; font-weight:normal" id="any">{@code .}</th>
* <td headers="matches predef any">Any character (may or may not match <a href="#lt">line terminators</a>)</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="digit">{@code \d}</th>
- * <td headers="matches predef digit">A digit: {@code [0-9]}</td></tr>
+ * <td headers="matches predef digit">A digit: {@code [0-9]} if if <a href="#UNICODE_CHARACTER_CLASS">
+ * * UNICODE_CHARACTER_CLASS</a> is not set. See <a href="#unicodesupport">Unicode Support</a>.</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="non_digit">{@code \D}</th>
* <td headers="matches predef non_digit">A non-digit: {@code [^0-9]}</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="horiz_white">{@code \h}</th>
@@ -167,7 +168,9 @@ import jdk.internal.util.ArraysSupport;
* <tr><th style="vertical-align:top; font-weight:normal" id="non_horiz_white">{@code \H}</th>
* <td headers="matches predef non_horiz_white">A non-horizontal whitespace character: {@code [^\h]}</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="white">{@code \s}</th>
- * <td headers="matches predef white">A whitespace character: {@code [ \t\n\x0B\f\r]}</td></tr>
+ * <td headers="matches predef white">A whitespace character: {@code [ \t\n\x0B\f\r]} if
+ * <a href="#UNICODE_CHARACTER_CLASS"> UNICODE_CHARACTER_CLASS</a> is not set. See
+ * <a href="#unicodesupport">Unicode Support</a>.</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="non_white">{@code \S}</th>
* <td headers="matches predef non_white">A non-whitespace character: {@code [^\s]}</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="vert_white">{@code \v}</th>
@@ -176,7 +179,8 @@ import jdk.internal.util.ArraysSupport;
* <tr><th style="vertical-align:top; font-weight:normal" id="non_vert_white">{@code \V}</th>
* <td headers="matches predef non_vert_white">A non-vertical whitespace character: {@code [^\v]}</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="word">{@code \w}</th>
- * <td headers="matches predef word">A word character: {@code [a-zA-Z_0-9]}</td></tr>
+ * <td headers="matches predef word">A word character: {@code [a-zA-Z_0-9]} if <a href="#UNICODE_CHARACTER_CLASS">
+ * UNICODE_CHARACTER_CLASS</a> is not set. See <a href="#unicodesupport">Unicode Support</a>. <a href="#Uni"</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="non_word">{@code \W}</th>
* <td headers="matches predef non_word">A non-word character: {@code [^\w]}</td></tr>
*
@@ -246,11 +250,11 @@ import jdk.internal.util.ArraysSupport;
* <tr><th style="vertical-align:top; font-weight:normal" id="end_line">{@code $}</th>
* <td headers="matches bounds end_line">The end of a line</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="word_boundary">{@code \b}</th>
- * <td headers="matches bounds word_boundary">A word boundary</td></tr>
+ * <td headers="matches bounds word_boundary">A word boundary: {@code (?:(?<=\w)(?=\W)|(?<=\W)(?=\w))} (the location
+ * where a non-word character abuts a word character)</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="grapheme_cluster_boundary">{@code \b{g}}</th>
* <td headers="matches bounds grapheme_cluster_boundary">A Unicode extended grapheme cluster boundary</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="non_word_boundary">{@code \B}</th>
- * <td headers="matches bounds non_word_boundary">A non-word boundary</td></tr>
+ * <td headers="matches bounds non_word_boundary">A non-word boundary: {@code [^\b]}</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="begin_input">{@code \A}</th>
* <td headers="matches bounds begin_input">The beginning of the input</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="end_prev_match">{@code \G}</th>
@@ -535,7 +539,7 @@ import jdk.internal.util.ArraysSupport;
* that do not capture text and do not count towards the group total, or
* <i>named-capturing</i> group.
*
- * <h2> Unicode support </h2>
+ * <h2 id="unicodesupport"> Unicode support </h2>
*
* <p> This class is in conformance with Level 1 of <a
* href="http://www.unicode.org/reports/tr18/"><i>Unicode Technical
--
diff --git a/src/java.base/share/classes/java/util/regex/Pattern.java b/src/java.base/share/classes/java/util/regex/Pattern.java
index 4cc4729e73a..220b1a83c95 100644
--- a/src/java.base/share/classes/java/util/regex/Pattern.java
+++ b/src/java.base/share/classes/java/util/regex/Pattern.java
@@ -158,7 +158,7 @@ import jdk.internal.util.ArraysSupport;
* <tr><th style="vertical-align:top; font-weight:normal" id="any">{@code .}</th>
* <td headers="matches predef any">Any character (may or may not match <a href="#lt">line terminators</a>)</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="digit">{@code \d}</th>
- * <td headers="matches predef digit">A digit: {@code [0-9]} if if <a href="#UNICODE_CHARACTER_CLASS">
+ * <td headers="matches predef digit">A digit: {@code [0-9]} if <a href="#UNICODE_CHARACTER_CLASS">
* * UNICODE_CHARACTER_CLASS</a> is not set. See <a href="#unicodesupport">Unicode Support</a>.</td></tr>
* <tr><th style="vertical-align:top; font-weight:normal" id="non_digit">{@code \D}</th>
* <td headers="matches predef non_digit">A non-digit: {@code [^0-9]}</td></tr>
--
CSR of: JDK-8264160 Regex \b is not consistent with \w without UNICODE_CHARACTER_CLASS (Closed)