Type: Bug
Resolution: Fixed
Priority: P4
Fix Version/s: 19
Affects Version/s: 17
Component/s: core-libs
Labels:
Subcomponent: java.util.regex
Resolved In Build: b16
CPU: generic
OS: generic
Verification: Verified

ADDITIONAL SYSTEM INFORMATION :
Windows 10, although that is probably irrelevant.

Java 17 ea, but also reproducible on Java 8

> java -version
openjdk version "17-ea" 2021-09-14
OpenJDK Runtime Environment (build 17-ea+14-1110)
OpenJDK 64-Bit Server VM (build 17-ea+14-1110, mixed mode, sharing)

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)

The figures below are the ones obtained on JDK 17. Due to updates in the Unicode database, results differ between JDK versions, but the inconsistencies are always there.

A DESCRIPTION OF THE PROBLEM :
As already highlighted by https://bugs.openjdk.java.net/browse/JDK-6452709 and later https://bugs.openjdk.java.net/browse/JDK-8043727, the JavaDoc is too vague about the meaning of \b and \B in regular expressions. However, as the latter points out, it is usually understood that it should be consistent with \w and \W.

This is the case in Java regexes when the UNICODE_CHARACTER_CLASS flag is used, but not when the flag is absent. The set of characters that \b treats as word characters also differs with and without UNICODE_CHARACTER_CLASS, so this is not a case of \b always using the Unicode definitions.
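
As a minimal sketch of the comparison for a single code point, the class below probes U+00E9 (é) with the same `;\b.` technique used in the reproduction further down. The class name `BoundaryProbe` and its two helpers are hypothetical and exist only for illustration. The \b results depend on the JDK version, so no expected output is shown.

```java
import java.util.regex.Pattern;

public class BoundaryProbe {
    // Hypothetical helper: does \w, compiled with the given flags, match this code point?
    static boolean isWordChar(int cp, int flags) {
        String s = new String(Character.toChars(cp));
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // Hypothetical helper: does \b see a boundary between ';' and this code point?
    // ';' is not a word character, so a boundary here means \b treats cp as one.
    static boolean isWordCharForBoundary(int cp, int flags) {
        String s = ";" + new String(Character.toChars(cp));
        return Pattern.compile(";\\b.", flags | Pattern.DOTALL).matcher(s).matches();
    }

    public static void main(String[] args) {
        int cp = 0x00E9; // LATIN SMALL LETTER E WITH ACUTE
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;
        System.out.println("\\w, no flag:      " + isWordChar(cp, 0));
        System.out.println("\\b, no flag:      " + isWordCharForBoundary(cp, 0));
        System.out.println("\\w, Unicode flag: " + isWordChar(cp, unicode));
        System.out.println("\\b, Unicode flag: " + isWordCharForBoundary(cp, unicode));
    }
}
```

If the first two printed values differ, \b and \w disagree about this code point on the JDK in use.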

The inconsistency means that \b cannot be used reliably without UNICODE_CHARACTER_CLASS, because its behavior follows no intuitive or broadly accepted definition and is not documented. Therefore, I am submitting this as a bug report, rather than as a request for missing documentation like the issues above.

A workaround is to use the subpattern `(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))` instead.
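
A minimal sketch of how the workaround subpattern might be used on its own is shown here. The class name `WordBoundaryWorkaround` and the helper `countBoundaries` are hypothetical, introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryWorkaround {
    // The subpattern from this report: a position with \w on one side and \W on the other.
    static final String BOUNDARY = "(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))";

    // Hypothetical helper: count the positions in the input where the workaround matches.
    static int countBoundaries(String input) {
        Matcher m = Pattern.compile(BOUNDARY).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBoundaries("foo bar"));
        // Without UNICODE_CHARACTER_CLASS, \w is ASCII-only, so U+00E9 counts as \W here.
        System.out.println(countBoundaries("caf\u00e9;"));
    }
}
```

Both lookarounds need an actual character on each side, so unlike \b this subpattern never matches at the very start or end of the input. The reproduction avoids that case by always placing `;` before the character under test.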

The attached reproduction highlights the inconsistencies. My expectation is that \b (and \B) should be consistent with \w and \W, in all cases.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Using the test file test/Test.java provided below:

$ javac -d bin test/Test.java
$ java -cp bin test.Test

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
1. total: 0

2. total: 0

3. total: 0

...
4. total: ??? (many)
ACTUAL -
...
1. 31347 false true
1. 31348 false true
1. 31349 false true
1. 3134a false true
1. total: 131829

2. total: 0

3. total: 0

4. 300 false true
4. 301 false true
4. 302 false true
...
4. total: 2672

---------- BEGIN SOURCE ----------
package test;

import java.util.regex.*;

public class Test {
  private static Pattern basicWordCharPattern = Pattern.compile("\\w");
  private static Pattern basicWordCharForBoundaryPattern = Pattern.compile(";\\b.", Pattern.DOTALL);

  private static Pattern basicWordCharForBoundaryWithWorkaroundPattern = Pattern.compile(";(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)).", Pattern.DOTALL);

  private static Pattern unicodeWordCharPattern = Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);
  private static Pattern unicodeWordCharForBoundaryPattern = Pattern.compile(";\\b.", Pattern.UNICODE_CHARACTER_CLASS | Pattern.DOTALL);

  private static String cpToString(int cp) {
    if (Character.isBmpCodePoint(cp))
      return "" + ((char) cp);
    else
      return "" + Character.highSurrogate(cp) + Character.lowSurrogate(cp);
  }

  private static boolean isBasicWordChar(int cp) {
    return basicWordCharPattern.matcher(cpToString(cp)).matches();
  }

  private static boolean isBasicWordCharForBoundary(int cp) {
    return basicWordCharForBoundaryPattern.matcher(";" + cpToString(cp)).matches();
  }

  private static boolean isBasicWordCharForBoundaryWithWorkaround(int cp) {
    return basicWordCharForBoundaryWithWorkaroundPattern.matcher(";" + cpToString(cp)).matches();
  }

  private static boolean isUnicodeWordChar(int cp) {
    return unicodeWordCharPattern.matcher(cpToString(cp)).matches();
  }

  private static boolean isUnicodeWordCharForBoundary(int cp) {
    return unicodeWordCharForBoundaryPattern.matcher(";" + cpToString(cp)).matches();
  }

  public static void main(String[] args) {
    // Print code points for which \b is not consistent with \w without UNICODE_CHARACTER_CLASS.
    int total = 0;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      boolean basicWC = isBasicWordChar(cp);
      boolean basicBoundaryWC = isBasicWordCharForBoundary(cp);

      if (basicWC != basicBoundaryWC) {
        System.out.println("1. " + Integer.toHexString(cp) + " " + basicWC + " " + basicBoundaryWC);
        total++;
      }
    }
    System.out.println("1. total: " + total); // 131829, but should be 0

    System.out.println("");

    // Print code points for which the workaround is not consistent with \w without UNICODE_CHARACTER_CLASS.
    total = 0;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      boolean basicWC = isBasicWordChar(cp);
      boolean basicBoundaryWithWorkaroundWC = isBasicWordCharForBoundaryWithWorkaround(cp);

      if (basicWC != basicBoundaryWithWorkaroundWC) {
        System.out.println("2. " + Integer.toHexString(cp) + " " + basicWC + " " + basicBoundaryWithWorkaroundWC);
        total++;
      }
    }
    System.out.println("2. total: " + total); // 0

    System.out.println("");

    // Print code points for which \b is not consistent with \w *with* UNICODE_CHARACTER_CLASS.
    total = 0;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      boolean unicodeWC = isUnicodeWordChar(cp);
      boolean unicodeBoundaryWC = isUnicodeWordCharForBoundary(cp);

      if (unicodeWC != unicodeBoundaryWC) {
        System.out.println("3. " + Integer.toHexString(cp) + " " + unicodeWC + " " + unicodeBoundaryWC);
        total++;
      }
    }
    System.out.println("3. total: " + total); // 0 (correct; they are all consistent)

    System.out.println("");

    /* Print code points for which \b without UNICODE_CHARACTER_CLASS is inconsistent
     * with \b *with* UNICODE_CHARACTER_CLASS.
     */
    total = 0;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      boolean basicBoundaryWC = isBasicWordCharForBoundary(cp);
      boolean unicodeBoundaryWC = isUnicodeWordCharForBoundary(cp);

      if (basicBoundaryWC != unicodeBoundaryWC) {
        System.out.println("4. " + Integer.toHexString(cp) + " " + basicBoundaryWC + " " + unicodeBoundaryWC);
        total++;
      }
    }
    System.out.println("4. total: " + total); // 2672 (should be much higher)
  }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
A workaround is to use the subpattern `(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))` instead.

FREQUENCY : always

Issue Links:
csr for: JDK-8282129 Regex \b is not consistent with \w without UNICODE_CHARACTER_CLASS (Closed)
duplicates: JDK-8260221 java.util.Formatter throws wrong exception for mismatched flags in %% conversion (Closed)
links to: Commit openjdk/jdk/f01cce23
links to: Review openjdk/jdk/7539

Sub-Tasks:
1. Release Note: Regex \b Character Class now Matches ASCII Characters Only by Default (Resolved, Ian Graves)
