ingrid.yao@Eng 2001-07-17
J2SE Version (please include all output from java -version flag):
-------------------------
J:\borsotti\jtest>java -version
java version "1.4.0-beta_refresh"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta_refresh-b70)
Java HotSpot(TM) Client VM (build 1.4.0-beta_refresh-b70, mixed mode)
Does this problem occur on J2SE 1.3? Yes / No (pick one)
------------------------------------
Not Applicable
Operating System Configuration Information (be specific):
-------------------------------------------
NT 4.0 service pack 5
Hardware Configuration Information (be specific):
-------------------------------------------------
Compaq Deskpro
Bug Description:
--------------------
The definition of the lookahead operator in java.util.regex.Pattern is unclear:
it does not say which match is returned when several are possible.
Steps to Reproduce (be specific):
-----------------------------------
Consider the following program:
import java.util.regex.*;

public class RE {
    // Matches pat against the start of s and prints the result next to
    // the result the submitter expected.
    private static void test(String pat, String s, String exp){
        Pattern p = Pattern.compile(pat);
        Matcher m = p.matcher(s);
        boolean b = m.lookingAt();
        System.out.print((b ? "matched" : "not matched"));
        if (b) System.out.print(" " +
            s.substring(m.start(),m.end()) + " " + m.start() +
            " " + m.end());
        System.out.println(" expected: " + exp);
    }

    public static void main(String[] args){
        test("(abcd|abc)(?=d?)", "abcd","abcd");
        test("(abc|abcd)(?=d?)", "abcd","abcd");
        test("a(bc|bcd)(?=d?)", "abcd","abcd");
        test("a(bcd|bc)(?=d?)", "abcd","abcd");
        test("a*(?=(a?bc|bcd)d?)", "aaabcd","aaa");
        test("a*(?=(bcd|a?bc)d?)", "aaabcd","aaa");
        test("(a|ab)(?=(blip)?)", "ablip","ab");
        test("(a|ab)(?=(blip)?)", "ab","ab");
        test("(ab|a)(?=(blip)?)", "ablip","ab");
        test("(ab|a)(?=(blip)?)", "ab","ab");
        test("(a|ab)(?=blip)", "ablip","a");
        test("(a|ab)(?=blip)", "ab","no match");
        test("(ab|a)(?=blip)", "ablip","a");
        test("(ab|a)(?=blip)", "ab","no match");
    }
}
When run, it prints:
J:\borsotti\jtest>java RE
matched abcd 0 4 expected: abcd
matched abc 0 3 expected: abcd
matched abc 0 3 expected: abcd
matched abcd 0 4 expected: abcd
matched aaa 0 3 expected: aaa
matched aaa 0 3 expected: aaa
matched a 0 1 expected: ab
matched a 0 1 expected: ab
matched ab 0 2 expected: ab
matched ab 0 2 expected: ab
matched a 0 1 expected: a
not matched expected: no match
matched a 0 1 expected: a
not matched expected: no match
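The rows that differ from the expectation all use a lookahead that can match the empty string, such as (?=d?), so the lookahead succeeds at any position and the result appears to depend only on the order of the alternatives. The following reduced sketch isolates this. The class name LookaheadDemo and the helper prefixMatch are made up for illustration and are not part of the original test program; the third pattern, with (?=$), is an added case whose lookahead cannot succeed right after "abc".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical helper: returns the prefix of s matched by lookingAt(),
    // or null if the pattern does not match at the start of s.
    public static String prefixMatch(String pat, String s) {
        Matcher m = Pattern.compile(pat).matcher(s);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // (?=d?) can match the empty string, so the first alternative
        // that matches is accepted.
        System.out.println(prefixMatch("(abc|abcd)(?=d?)", "abcd")); // abc
        System.out.println(prefixMatch("(abcd|abc)(?=d?)", "abcd")); // abcd

        // (?=$) fails after "abc" (a 'd' follows), so the engine
        // backtracks into the second alternative.
        System.out.println(prefixMatch("(abc|abcd)(?=$)", "abcd"));  // abcd
    }
}
```

Listing the longer alternative first, as in the second call, is one way to get the longer match with the current behavior.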
The matching of lookaheads is not entirely defined in the API spec.
What the customer expects is that r1(?=r2) matches:
the longest r1 such that r2 matches after the text matched by r1.
However, it appears that the engine is meant to return the first match,
not the longest one.
Unless the spec states which of the two applies, matching is unpredictable.
E.g. in test no. 2, "abc" is matched, which is an instance of r1
followed by one of r2, but it is shorter than "abcd", which is
also a correct match.
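For comparison, the semantics the customer expects (the longest r1 such that r2 matches right after it) can be sketched outside the regex engine by trying every split point from the longest down. This is only an illustration of that definition, not how java.util.regex works; the class LongestLookahead and the method longestEnd are hypothetical names. Because it matches on substrings, anchors and lookbehinds inside r1 or r2 would see the substring boundaries rather than the full input.

```java
import java.util.regex.Pattern;

class LongestLookahead {
    // Hypothetical helper: returns the end index of the longest prefix of s
    // that r1 matches in full and that is immediately followed by a match
    // of r2, or -1 if no such prefix exists.
    public static int longestEnd(String r1, String r2, String s) {
        Pattern head = Pattern.compile(r1);
        Pattern tail = Pattern.compile(r2);
        // Try the longest candidate prefix first, then shorter ones.
        for (int end = s.length(); end >= 0; end--) {
            boolean headOk = head.matcher(s.substring(0, end)).matches();
            boolean tailOk = tail.matcher(s.substring(end)).lookingAt();
            if (headOk && tailOk) {
                return end;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(longestEnd("abc|abcd", "d?", "abcd"));   // 4
        System.out.println(longestEnd("a|ab", "(blip)?", "ablip")); // 2
        System.out.println(longestEnd("a|ab", "blip", "ablip"));    // 1
        System.out.println(longestEnd("a|ab", "blip", "ab"));       // -1
    }
}
```

With this definition the result no longer depends on the order of the alternatives in r1, which is the property the expected column of the test output assumes.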