JDK-8282129

Regex \b is not consistent with \w without UNICODE_CHARACTER_CLASS

    • CSR
    • Resolution: Approved
    • P4
    • 19
    • core-libs
    • None
    • behavioral
    • low
    • The existing behavior of the \b metacharacter in Java regex strings is longstanding and changing it may impact existing regular expressions that rely on this inconsistent (with respect to Unicode characters) behavior. However, the use of \b is less common and code that focuses on ASCII-encoded data or similar will be unaffected.
    • Java API
    • SE

      Summary

      The \b (word boundary; zero-width match) metacharacter's behavior is not consistent with the \w (word character) metacharacter's behavior when the java.util.regex.Pattern.UNICODE_CHARACTER_CLASS flag is not set. This inconsistency appears to be a bug or an oversight. This CSR proposes changing the behavior of \b so that it is consistent with \w whether or not UNICODE_CHARACTER_CLASS is set.

      Problem

      Conceptually, the word boundary \b metacharacter is supposed to be a zero-width match in cases where a word character \w abuts a non-word character \W. If one makes the reasonable assumption that "word" in "word boundary" is the same as "word" in "word character," then the definition of the \b metacharacter should be equivalent to (?:(?<=\w)(?=\W)|(?<=\W)(?=\w)) which is a zero-width position where the character behind is a word character and the character ahead is a non-word character or vice versa.
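
      The lookaround expansion above can be tried directly against the public java.util.regex API. The following is a minimal sketch, not part of the CSR: the class name LookaroundBoundaryDemo and the helper boundaries are hypothetical, and the sample input is padded with spaces so that every boundary has a character on both sides, which the lookarounds as written require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern/Matcher are real JDK API.
final class LookaroundBoundaryDemo {
    // The expansion quoted in the text, written as a Java string literal.
    static final Pattern EXPANDED =
            Pattern.compile("(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))");

    // Collects every index at which the zero-width expansion matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = EXPANDED.matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Boundaries fall on both sides of "foo" and of "bar".
        System.out.println(boundaries(" foo bar ")); // [1, 4, 5, 8]
    }
}
```

      Because the expansion is built only from \w and \W, its result depends solely on how \w is defined, which is the consistency property this CSR asks of \b.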

      Per the spec, the \w metacharacter matches [a-zA-Z_0-9] when UNICODE_CHARACTER_CLASS is not set. The spec is vague on what the \b metacharacter will match. Inspection of the implementation shows that \b relies on j.l.Character.isLetterOrDigit along with a check for underscores. The issue with this is that isLetterOrDigit also returns true for Unicode letters and digits outside the range specified by \w. When UNICODE_CHARACTER_CLASS is set, the behavior of \w and \b is consistent.
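
      The mismatch between the two notions of "word character" can be shown without relying on \b itself. The sketch below is illustrative only: the class WordPredicateDemo and its method names are hypothetical, and legacyBoundaryWord simply restates the check described in the paragraph above rather than calling any JDK-internal code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the predicates discussed in the text.
final class WordPredicateDemo {
    static final Pattern ASCII_W = Pattern.compile("\\w");
    static final Pattern UNICODE_W =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // True if \w matches the code point when UNICODE_CHARACTER_CLASS is not set.
    static boolean matchesAsciiW(int codePoint) {
        return ASCII_W.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // True if \w matches the code point when UNICODE_CHARACTER_CLASS is set.
    static boolean matchesUnicodeW(int codePoint) {
        return UNICODE_W.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // Restates the check the text attributes to the existing \b implementation.
    static boolean legacyBoundaryWord(int codePoint) {
        return codePoint == '_' || Character.isLetterOrDigit(codePoint);
    }

    public static void main(String[] args) {
        int eAcute = 0x00E9; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(matchesAsciiW(eAcute));      // false
        System.out.println(legacyBoundaryWord(eAcute)); // true
        System.out.println(matchesUnicodeW(eAcute));    // true
    }
}
```

      For U+00E9 the flag-less \w and the isLetterOrDigit-based check disagree, which is the inconsistency at issue; for ASCII letters, digits, and the underscore they agree.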

      The spec is silent on this inconsistency. In the implementation, the \b metacharacter is sensitive to the UNICODE_CHARACTER_CLASS flag: a different codepath is followed when the flag is set, which further strengthens the assumption that this is an oversight or a bug. The inconsistency between these two metacharacters would be very hard to explain or justify in the spec. Across other regex implementations (Python, Ruby, Rust, for example), these two metacharacters generally complement one another in an ASCII-sensitive or Unicode-sensitive fashion.
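
      The flag-sensitive case, where the text says \b and \w already agree, can be checked from the public API. This is a sketch under stated assumptions: the class UnicodeFlagBoundaryDemo and the helper positions are hypothetical, and the input uses a precomposed U+00E9 padded with spaces so that combining marks and input edges do not enter the picture.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; compares \b with its lookaround expansion
// when UNICODE_CHARACTER_CLASS is set on both patterns.
final class UnicodeFlagBoundaryDemo {
    static final int FLAG = Pattern.UNICODE_CHARACTER_CLASS;
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b", FLAG);
    static final Pattern EXPANDED =
            Pattern.compile("(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))", FLAG);

    // Collects the start index of every (zero-width) match of the pattern.
    static List<Integer> positions(Pattern pattern, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = " caf\u00E9 ";
        // With the flag set, both patterns treat U+00E9 as a word character.
        System.out.println(positions(WORD_BOUNDARY, input));
        System.out.println(positions(EXPANDED, input));
    }
}
```

      With the flag set, both lists should coincide for this input, since both patterns draw on the same Unicode-aware definition of a word character.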

      Solution

      The solution is to alter the implementation of the \b metacharacter so that its matcher uses the ASCII_WORD() predicate in java.util.regex.CharPredicates, and therefore the same range of characters as \w, when determining word boundaries. This makes the two metacharacters consistent both when UNICODE_CHARACTER_CLASS is set and when it is not.
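
      The effect of swapping the predicate can be sketched outside the JDK. The code below is an assumption-laden illustration, not the Pattern implementation: CharPredicates is internal to java.util.regex, so ASCII_WORD here is a hypothetical stand-in for the range [a-zA-Z_0-9], LEGACY_WORD restates the check being replaced, and the boundary scan is a simplified model that ignores details of the real matcher such as region bounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model of a word-boundary scan parameterized by the word predicate.
final class AsciiBoundarySketch {
    // Stand-in for the ASCII word class [a-zA-Z_0-9] (assumption, not JDK code).
    static final IntPredicate ASCII_WORD = ch ->
            (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9') || ch == '_';

    // Restates the check that the proposed change replaces.
    static final IntPredicate LEGACY_WORD = ch ->
            ch == '_' || Character.isLetterOrDigit(ch);

    // A position is a boundary when exactly one neighbor is a word character.
    static List<Integer> boundaries(String s, IntPredicate isWord) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (true) {
            boolean left = i > 0 && isWord.test(s.codePointBefore(i));
            boolean right = i < s.length() && isWord.test(s.codePointAt(i));
            if (left != right) {
                out.add(i);
            }
            if (i >= s.length()) {
                break;
            }
            i += Character.charCount(s.codePointAt(i)); // step by code point
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "caf\u00E9 bar";
        System.out.println(boundaries(input, LEGACY_WORD)); // [0, 4, 5, 8]
        System.out.println(boundaries(input, ASCII_WORD));  // [0, 3, 5, 8]
    }
}
```

      In this model the only position that moves is the one next to U+00E9: under the ASCII predicate the boundary falls between "f" and the accented letter, matching where the flag-less \w stops.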

      Specification

      In src/java.base/share/classes/java/util/regex/Pattern.java:

      @@ -5364,7 +5364,7 @@ loop:   for(int x=0, offset=0; x<nCodePoints; x++, offset+=len) {
      
               boolean isWord(int ch) {
                   return useUWORD ? CharPredicates.WORD().is(ch)
      -                            : (ch == '_' || Character.isLetterOrDigit(ch));
      +                            : CharPredicates.ASCII_WORD().is(ch);
               }

      We also update the spec to state explicitly that the \b metacharacter is equivalent to (?:(?<=\w)(?=\W)|(?<=\W)(?=\w)). In the Predefined character classes section of the java.util.regex.Pattern spec, we then note that every metacharacter sensitive to Pattern.UNICODE_CHARACTER_CLASS can have its definition affected by this flag, with a forward link to the description of how the behavior changes when the flag is set. Spec changes follow.

      diff --git a/src/java.base/share/classes/java/util/regex/Pattern.java b/src/java.base/share/classes/java/util/regex/Pattern.java
      index 7ef1dabadde..66c511b2ff4 100644
      --- a/src/java.base/share/classes/java/util/regex/Pattern.java
      +++ b/src/java.base/share/classes/java/util/regex/Pattern.java
      @@ -158,7 +158,8 @@ import jdk.internal.util.ArraysSupport;
        * <tr><th style="vertical-align:top; font-weight:normal" id="any">{@code .}</th>
        *     <td headers="matches predef any">Any character (may or may not match <a href="#lt">line terminators</a>)</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="digit">{@code \d}</th>
      - *     <td headers="matches predef digit">A digit: {@code [0-9]}</td></tr>
      + *     <td headers="matches predef digit">A digit: {@code [0-9]} if if <a href="#UNICODE_CHARACTER_CLASS">
      + *  *         UNICODE_CHARACTER_CLASS</a> is not set. See <a href="#unicodesupport">Unicode Support</a>.</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_digit">{@code \D}</th>
        *     <td headers="matches predef non_digit">A non-digit: {@code [^0-9]}</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="horiz_white">{@code \h}</th>
      @@ -167,7 +168,9 @@ import jdk.internal.util.ArraysSupport;
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_horiz_white">{@code \H}</th>
        *     <td headers="matches predef non_horiz_white">A non-horizontal whitespace character: {@code [^\h]}</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="white">{@code \s}</th>
      - *     <td headers="matches predef white">A whitespace character: {@code [ \t\n\x0B\f\r]}</td></tr>
      + *     <td headers="matches predef white">A whitespace character: {@code [ \t\n\x0B\f\r]} if
      + *     <a href="#UNICODE_CHARACTER_CLASS"> UNICODE_CHARACTER_CLASS</a> is not set. See
      + *     <a href="#unicodesupport">Unicode Support</a>.</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_white">{@code \S}</th>
        *     <td headers="matches predef non_white">A non-whitespace character: {@code [^\s]}</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="vert_white">{@code \v}</th>
      @@ -176,7 +179,8 @@ import jdk.internal.util.ArraysSupport;
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_vert_white">{@code \V}</th>
        *     <td headers="matches predef non_vert_white">A non-vertical whitespace character: {@code [^\v]}</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="word">{@code \w}</th>
      - *     <td headers="matches predef word">A word character: {@code [a-zA-Z_0-9]}</td></tr>
      + *     <td headers="matches predef word">A word character: {@code [a-zA-Z_0-9]} if <a href="#UNICODE_CHARACTER_CLASS">
      + *         UNICODE_CHARACTER_CLASS</a> is not set. See <a href="#unicodesupport">Unicode Support</a>. <a href="#Uni"</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_word">{@code \W}</th>
        *     <td headers="matches predef non_word">A non-word character: {@code [^\w]}</td></tr>
        *
      @@ -246,11 +250,11 @@ import jdk.internal.util.ArraysSupport;
        * <tr><th style="vertical-align:top; font-weight:normal" id="end_line">{@code $}</th>
        *     <td headers="matches bounds end_line">The end of a line</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="word_boundary">{@code \b}</th>
      - *     <td headers="matches bounds word_boundary">A word boundary</td></tr>
      + *     <td headers="matches bounds word_boundary">A word boundary: {@code (?:(?<=\w)(?=\W)|(?<=\W)(?=\w))} (the location
      + *     where a non-word character abuts a word character)</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="grapheme_cluster_boundary">{@code \b{g}}</th>
        *     <td headers="matches bounds grapheme_cluster_boundary">A Unicode extended grapheme cluster boundary</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_word_boundary">{@code \B}</th>
      - *     <td headers="matches bounds non_word_boundary">A non-word boundary</td></tr>
      + *     <td headers="matches bounds non_word_boundary">A non-word boundary: {@code [^\b]}</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="begin_input">{@code \A}</th>
        *     <td headers="matches bounds begin_input">The beginning of the input</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="end_prev_match">{@code \G}</th>
      @@ -535,7 +539,7 @@ import jdk.internal.util.ArraysSupport;
        * that do not capture text and do not count towards the group total, or
        * <i>named-capturing</i> group.
        *
      - * <h2> Unicode support </h2>
      + * <h2 id="unicodesupport"> Unicode support </h2>
        *
        * <p> This class is in conformance with Level 1 of <a
        * href="http://www.unicode.org/reports/tr18/"><i>Unicode Technical
      -- 
      
      diff --git a/src/java.base/share/classes/java/util/regex/Pattern.java b/src/java.base/share/classes/java/util/regex/Pattern.java
      index 4cc4729e73a..220b1a83c95 100644
      --- a/src/java.base/share/classes/java/util/regex/Pattern.java
      +++ b/src/java.base/share/classes/java/util/regex/Pattern.java
      @@ -158,7 +158,7 @@ import jdk.internal.util.ArraysSupport;
        * <tr><th style="vertical-align:top; font-weight:normal" id="any">{@code .}</th>
        *     <td headers="matches predef any">Any character (may or may not match <a href="#lt">line terminators</a>)</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="digit">{@code \d}</th>
      - *     <td headers="matches predef digit">A digit: {@code [0-9]} if if <a href="#UNICODE_CHARACTER_CLASS">
      + *     <td headers="matches predef digit">A digit: {@code [0-9]} if <a href="#UNICODE_CHARACTER_CLASS">
        *  *         UNICODE_CHARACTER_CLASS</a> is not set. See <a href="#unicodesupport">Unicode Support</a>.</td></tr>
        * <tr><th style="vertical-align:top; font-weight:normal" id="non_digit">{@code \D}</th>
        *     <td headers="matches predef non_digit">A non-digit: {@code [^0-9]}</td></tr>
      -- 

            igraves Ian Graves
            webbuggrp Webbug Group
            Stuart Marks