JDK-8264547

RegEx pattern matching loses character class after intersection (&&) operator


    • Type: CSR
    • Resolution: Approved
    • Priority: P4
    • Fix Version: 17
    • Component: core-libs
    • Labels: None
    • Compatibility Kind: behavioral
    • Compatibility Risk: low
    • This proposal changes a long-standing behavior in the regex matcher. For example, patterns of the shape `nested&&[nested]unnested` currently do not match anything. A pattern of the shape `[a-z]&&[a-g]h-z` would now match the entire `a-z` range, because the matcher would properly reflect the full intersection.
    • Java API
    • SE

      Summary

      Regular expression pattern matching loses character classes after the intersection (&&) operator. This change fixes a bug in how the regex compiler handles the && operator, so that it no longer drops certain character classes. The buggy behavior is long-standing: it has existed since at least JDK 7, and likely earlier.

      Problem

      When the right-hand side of an intersection operator && mixes character classes inside square brackets ([..]) with classes outside them, the compiler drops some of those classes from the matchers it produces. The resulting matchers are broken: they are missing character classes that the pattern specifies. Without a fix, the behavior stays broken. It is publicly documented and known (see the second paragraph in the "Intersection of Multiple Classes" subsection).
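A minimal sketch of how the problem can be observed. It assumes the `[a-z]&&[a-g]h-z` shape from the compatibility risk description is wrapped in an outer pair of brackets. The class name `IntersectionProbe` and the helper `matchedChars` are illustrative only and are not part of the JDK. The program prints which lowercase letters each pattern accepts, so its output depends on whether the JDK running it includes the fix.

```java
import java.util.regex.Pattern;

class IntersectionProbe {
    // Collects the lowercase ASCII letters that the given character class accepts.
    static String matchedChars(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mixed right-hand side: a bracketed class followed by an unbracketed range.
        String mixed = "[[a-z]&&[a-g]h-z]";
        // Every class on the right-hand side is bracketed.
        String bracketed = "[[a-z]&&[a-g][h-z]]";
        System.out.println(mixed + " accepts: " + matchedChars(mixed));
        System.out.println(bracketed + " accepts: " + matchedChars(bracketed));
    }
}
```

On a JDK without the fix, the first line would be expected to list only h through z, because the unbracketed range replaces `[a-g]`. With the fix, both lines should list a through z.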

      Solution

      The solution is to fix a bug in which the regex compiler clobbers the matchers it has constructed for the right-hand side of the intersection operation when it should be merging them with a union. This brings the behavior in line with Ruby's regular expressions. Python's bundled re library doesn't support intersection. Perl and JavaScript do not support nested expressions inside square brackets the way Java and Ruby do.
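The difference between clobbering and merging can be modeled outside the regex engine. The sketch below is not the JDK implementation: it uses `java.util.function.IntPredicate` as a stand-in for the compiler's internal matchers, and the names `RightHandSideSketch`, `range`, `clobber`, and `merge` are hypothetical.

```java
import java.util.function.IntPredicate;

class RightHandSideSketch {
    // Stand-in for a character range such as a-g.
    static IntPredicate range(char lo, char hi) {
        return c -> c >= lo && c <= hi;
    }

    // Models the old behavior: an unbracketed class replaces what was collected so far.
    static IntPredicate clobber(IntPredicate right, IntPredicate next) {
        return next;
    }

    // Models the fixed behavior: an unbracketed class is unioned with what was collected.
    static IntPredicate merge(IntPredicate right, IntPredicate next) {
        return (right == null) ? next : right.or(next);
    }

    public static void main(String[] args) {
        IntPredicate left = range('a', 'z');     // left-hand side [a-z]
        IntPredicate nested = range('a', 'g');   // bracketed [a-g]
        IntPredicate unnested = range('h', 'z'); // unbracketed h-z

        IntPredicate before = left.and(clobber(nested, unnested));
        IntPredicate after = left.and(merge(nested, unnested));

        System.out.println("before, 'c': " + before.test('c')); // false
        System.out.println("after,  'c': " + after.test('c'));  // true
    }
}
```

In this model the clobbered right-hand side keeps only h-z, so the intersection loses a through g. The merged version keeps both ranges.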

      Specification

      --- a/src/java.base/share/classes/java/util/regex/Pattern.java
      +++ b/src/java.base/share/classes/java/util/regex/Pattern.java
      @@ -2663,7 +2663,11 @@ loop:   for(int x=0, offset=0; x<nCodePoints; x++, offset+=len) {
                                           right = right.union(clazz(true));
                                   } else { // abc&&def
                                       unread();
      -                                right = clazz(false);
      +                                if(right == null) {
      +                                    right = clazz(false);
      +                                } else {
      +                                    right = right.union(clazz(false));
      +                                }
                                   }
                                   ch = peek();
                               }

            igraves Ian Graves
            webbuggrp Webbug Group
            Roger Riggs