JDK-4480632

java.util.regex.Pattern lookahead operator definition is not very clear


    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • 1.4.0
    • 1.4.0
    • docs
    • beta2
    • x86
    • windows_nt


      ingrid.yao@Eng 2001-07-17

      J2SE Version (please include all output from java -version flag):
      -------------------------
      J:\borsotti\jtest>java -version
      java version "1.4.0-beta_refresh"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta_refresh-b70)
      Java HotSpot(TM) Client VM (build 1.4.0-beta_refresh-b70, mixed mode)

      Does this problem occur on J2SE 1.3? Yes / No (pick one)
      ------------------------------------
      Not Applicable

      Operating System Configuration Information (be specific):
      -------------------------------------------
      NT 4.0 service pack 5


      Hardware Configuration Information (be specific):
      -------------------------------------------------
      Compaq Deskpro

      Bug Description:
      --------------------
      Definition of lookahead operator in java.util.regex.Pattern is not very clear.

      Steps to Reproduce (be specific):
      -----------------------------------
      Consider the following program:

      import java.util.regex.*;
      public class RE {
          private static void test(String pat, String s, String exp){
              Pattern p = Pattern.compile(pat);
              Matcher m = p.matcher(s);
              boolean b = m.lookingAt();
              System.out.print((b ? "matched" : "not matched"));
              if (b) System.out.print(" " +
                  s.substring(m.start(),m.end()) + " " + m.start() +
                  " " + m.end());
              System.out.println(" expected: " + exp);
          }
          public static void main(String[] args){
              test("(abcd|abc)(?=d?)", "abcd","abcd");
              test("(abc|abcd)(?=d?)", "abcd","abcd");
              test("a(bc|bcd)(?=d?)", "abcd","abcd");
              test("a(bcd|bc)(?=d?)", "abcd","abcd");
              test("a*(?=(a?bc|bcd)d?)", "aaabcd","aaa");
              test("a*(?=(bcd|a?bc)d?)", "aaabcd","aaa");
              test("(a|ab)(?=(blip)?)", "ablip","ab");
              test("(a|ab)(?=(blip)?)", "ab","ab");
              test("(ab|a)(?=(blip)?)", "ablip","ab");
              test("(ab|a)(?=(blip)?)", "ab","ab");
              test("(a|ab)(?=blip)", "ablip","a");
              test("(a|ab)(?=blip)", "ab","no match");
              test("(ab|a)(?=blip)", "ablip","a");
              test("(ab|a)(?=blip)", "ab","no match");
          }
      }

      When run, it prints:

      J:\borsotti\jtest>java RE
      matched abcd 0 4 expected: abcd
      matched abc 0 3 expected: abcd
      matched abc 0 3 expected: abcd
      matched abcd 0 4 expected: abcd
      matched aaa 0 3 expected: aaa
      matched aaa 0 3 expected: aaa
      matched a 0 1 expected: ab
      matched a 0 1 expected: ab
      matched ab 0 2 expected: ab
      matched ab 0 2 expected: ab
      matched a 0 1 expected: a
      not matched expected: no match
      matched a 0 1 expected: a
      not matched expected: no match

      The API spec does not fully define how lookaheads are matched.
      The customer expects r1(?=r2) to match:

        the longest r1 such that r2 matches after the text matched by r1

      but it sounds like the matcher is meant to return the first match,
      not the longest one.

      Without an explicit definition, matching appears unpredictable.
      For example, in test no. 2 "abc" is matched. It is an instance of r1
      followed by an instance of r2, but it is shorter than "abcd", which
      would also be a correct match.
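
      The results above seem consistent with ordered (first-match)
      alternation combined with a zero-width lookahead. The sketch below
      illustrates that reading. The class name LookaheadDemo and the helper
      prefixMatch are hypothetical, and the last two patterns are additions
      that are not part of the original test program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Hypothetical helper: returns the text matched at the start of the
    // input, or null if the pattern does not match there.
    static String prefixMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // d? can match the empty string, so this lookahead never fails and
        // the first alternative that matches is kept.
        System.out.println(prefixMatch("(abc|abcd)(?=d?)", "abcd")); // abc
        // Listing the longer alternative first yields the longer match.
        System.out.println(prefixMatch("(abcd|abc)(?=d?)", "abcd")); // abcd
        // A lookahead that can fail forces backtracking into the alternation.
        System.out.println(prefixMatch("(abc|abcd)(?=$)", "abcd"));  // abcd
        System.out.println(prefixMatch("(abcd|abc)(?=d)", "abcd"));  // abc
    }
}
```

      Under this reading, the lookahead only constrains what follows r1. It
      does not make the engine prefer a longer r1, so the order of the
      alternatives decides the result whenever the lookahead can succeed
      for more than one of them.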

            mmcclosksunw Michael Mccloskey (Inactive)
            tyao Ting-Yun Ingrid Yao (Inactive)