JDK-8043727

Behavior of regex \b (word boundary) is unclear; Description of \B is wrong

    • Enhancement
    • Resolution: Duplicate
    • P4
    • None
    • 7u45
    • core-libs

      A DESCRIPTION OF THE PROBLEM :
      The documentation for java.util.regex.Pattern says \b matches "A word boundary" and gives no further explanation.

      One can only infer that it behaves as it does in Perl, since the class documentation describes its syntax as similar to Perl's and lists its differences from Perl.

      Well, in Perl, \b is equivalent to (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)). Although the meaning of \w (word character) can be changed by some modifiers, \b is always in sync with \w about what constitutes a word character.
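
      The equivalence above can be checked on plain ASCII input, where \w and \b are not in dispute. The following is a minimal sketch; the class name PerlStyleBoundary and the helper positions are made up for this illustration and are not part of any JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerlStyleBoundary {
    // Perl's definition of \b written out with lookarounds (taken from the report).
    static final String EXPANDED = "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))";

    // Hypothetical helper: returns every index at which the zero-width regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "foo_bar, baz42"; // ASCII only
        System.out.println(positions("\\b", input));     // [0, 7, 9, 14]
        System.out.println(positions(EXPANDED, input));  // [0, 7, 9, 14]
    }
}
```

      On ASCII-only text the two patterns report the same positions, so any disagreement has to come from non-ASCII characters.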

      This nice property does not seem to hold in Java. \w is equivalent to [a-zA-Z0-9_], but digging in the source shows that \b treats the underscore, plus any character for which Character.isLetterOrDigit returns true, as a word character. That set includes non-ASCII Unicode letters and digits.
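
      A small probe makes the mismatch observable without reading the source. This is a sketch only: the class and method names are invented for the illustration, and the \b result depends on the JDK in use (the report is against 7u45), so no expected output is given for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCharMismatch {
    // True if the default (non-Unicode) \w class matches the whole of s.
    static boolean matchesWordClass(String s) {
        return Pattern.matches("\\w", s);
    }

    // Counts the positions in s at which the default \b matches.
    static int boundaryCount(String s) {
        Matcher m = Pattern.compile("\\b").matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // U+00E9, a non-ASCII letter
        System.out.println("\\w matches:      " + matchesWordClass(s));
        System.out.println("isLetterOrDigit: " + Character.isLetterOrDigit(s.charAt(0)));
        // 0 means \b agrees with \w here; 2 means \b treats the letter as a word character.
        System.out.println("\\b match count:  " + boundaryCount(s));
    }
}
```

      If the count is 2 while \w does not match, the runtime shows the inconsistency described in this report.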

      (If UNICODE_CHARACTER_CLASS is enabled for the Pattern, \w and \b both change, and now use the same definition of a word character.)
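
      With the flag set, \b and the lookaround expansion built from \w can be compared directly. A minimal sketch, again with an invented class name and helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeClassBoundary {
    // Perl-style expansion of \b in terms of \w.
    static final String EXPANDED = "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))";

    // Hypothetical helper: indices at which the zero-width regex matches under the given flags.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 ok"; // contains U+00E9
        int flags = Pattern.UNICODE_CHARACTER_CLASS;
        // Both lines should print the same positions when \b and \w share a definition.
        System.out.println(positions("\\b", flags, input));
        System.out.println(positions(EXPANDED, flags, input));
    }
}
```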

      If Java's \b behavior is correct, it should at least be documented, as it's currently impossible to reason confidently about what it's supposed to do.

      Also, the description of \B is wrong, or at least, open to misinterpretation if you don't already know what it does. It says it's "A non-word boundary" when what it ought to say is "Not a word boundary". Think about it: the location of the boundary of a "non-word" is the same as the location of a boundary of a word (except at the beginning/end of a string).
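
      The "Not a word boundary" reading is easy to confirm: on ASCII input, \B matches exactly the positions where \b does not. A minimal sketch (class and helper names are invented for the illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NotWordBoundary {
    // Hypothetical helper: returns every index at which the zero-width regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "ab cd"; // positions 0..5
        System.out.println(positions("\\b", input)); // [0, 2, 3, 5]
        System.out.println(positions("\\B", input)); // [1, 4]
    }
}
```

      Positions 1 and 4 lie inside the words "ab" and "cd", which is not what "the boundary of a non-word" would suggest.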


      URL OF FAULTY DOCUMENTATION :
      http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

            sherman Xueming Shen
            webbuggrp Webbug Group