-
Bug
-
Resolution: Fixed
-
P3
-
5.0, 6
-
b81
-
generic, x86
-
generic, windows_xp
FULL PRODUCT VERSION :
A DESCRIPTION OF THE PROBLEM :
A lookbehind makes an assertion about the text immediately preceding the
current match position (that is, the position that has been reached in the
target text at the time the lookbehind gets evaulated, which I'll call the
CMP henceforth). That means the text matched by the lookbehind subexpression
should end precisely at the CMP. If the subexpression contains quantifiers,
it should employ backtracking as needed to achieve alignment with the CMP,
just like it would for '$', '\b', or any other anchor. But the way it's
implemented, the subexpression match is allowed to end wherever it happens
to, then that position is compared to the CMP. That comparison is what
determines whether the lookbehind succeeds or fails, even if backtracking
could have changed the result.
In the sample code, I'm trying to use lookbehind to find all matches for the
regex 'foo\d', but only if they're preceded by a percent sign and up to five
intervening characters (so it should match "foo1", "foo2" and "foo3"). The
obvious regex for this is '(?<=%.{0,5})foo\d', but it fails to match "foo1"
and "foo2". The problem is that '.{0,5}' greedily matches the maximum five
characters, taking it past the CMP. Then, because the end position of that
match is not the same as the CMP, the lookbehind fails. The third attempt
succeeds only because there are exactly five characters between the "%" and
"foo3".
In the second regex, '(?<=%.{0,5}?)foo\d', I use a reluctant quantifier, but
that doesn't work any better. The first attempt succeeds because there are
zero intervening characters in that line, but on the second and third tries,
the matcher never goes back to try matching more characters.
In the third regex, '(?<=%.{0,5}\b)foo\d', I try using my knowledge of the
data to help anchor the lookbehind. The first attempt still fails because
the word-boundary assertion matches at the end of the line, but on the second
attempt the '.{0,5}' is forced to backtrack to the correct position. This
partial success demonstrates how lookbehinds are supposed to work: they need
to be solidly anchored to the current match position, and the anchor has to
be integral to the lookbehind, not applied afterward as is currently done.
(The fourth regex demonstrates that negative lookbehinds are equally broken:
it should match only the last two lines, but it matches "foo1" and "foo2" as
well. The fifth regex is a workaround that I discuss below.)
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
{
String str = "%foo1\n" +
"%bar foo2\n" +
"%bar foo3\n" +
"%blahblah foo4\n" +
"foo5";
String[] rgxs = { "(?<=%.{0,5})foo\\d",
"(?<=%.{0,5}?)foo\\d",
"(?<=%.{0,5}\\b)foo\\d",
"(?<!%.{0,5})foo\\d",
"foo\\d(?<=%.{0,5}foo\\d)" };
for (int i = 0; i < rgxs.length; i++)
{
Pattern p = Pattern.compile(rgxs[i]);
Matcher m = p.matcher(str);
System.out.println();
System.out.println(p.pattern());
while (m.find())
{
System.out.println(m.group());
}
}
}
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Match the part you want to capture first, then use a lookbehind to check for
the prefix, as I did in my fifth sample regex: 'foo\d(?<=%.{0,5}foo\d)'. This
workaround seems pretty reliable, but its usefulness is very limited because
the whole regex has to have an obvious maximum length. That means it has to
be fairly simple and can't contain any '*', '+' or '{n,}' quantifiers.
###@###.### 2005-06-10 19:27:30 GMT
A DESCRIPTION OF THE PROBLEM :
A lookbehind makes an assertion about the text immediately preceding the
current match position (that is, the position that has been reached in the
target text at the time the lookbehind gets evaulated, which I'll call the
CMP henceforth). That means the text matched by the lookbehind subexpression
should end precisely at the CMP. If the subexpression contains quantifiers,
it should employ backtracking as needed to achieve alignment with the CMP,
just like it would for '$', '\b', or any other anchor. But the way it's
implemented, the subexpression match is allowed to end wherever it happens
to, then that position is compared to the CMP. That comparison is what
determines whether the lookbehind succeeds or fails, even if backtracking
could have changed the result.
In the sample code, I'm trying to use lookbehind to find all matches for the
regex 'foo\d', but only if they're preceded by a percent sign and up to five
intervening characters (so it should match "foo1", "foo2" and "foo3"). The
obvious regex for this is '(?<=%.{0,5})foo\d', but it fails to match "foo1"
and "foo2". The problem is that '.{0,5}' greedily matches the maximum five
characters, taking it past the CMP. Then, because the end position of that
match is not the same as the CMP, the lookbehind fails. The third attempt
succeeds only because there are exactly five characters between the "%" and
"foo3".
In the second regex, '(?<=%.{0,5}?)foo\d', I use a reluctant quantifier, but
that doesn't work any better. The first attempt succeeds because there are
zero intervening characters in that line, but on the second and third tries,
the matcher never goes back to try matching more characters.
In the third regex, '(?<=%.{0,5}\b)foo\d', I try using my knowledge of the
data to help anchor the lookbehind. The first attempt still fails because
the word-boundary assertion matches at the end of the line, but on the second
attempt the '.{0,5}' is forced to backtrack to the correct position. This
partial success demonstrates how lookbehinds are supposed to work: they need
to be solidly anchored to the current match position, and the anchor has to
be integral to the lookbehind, not applied afterward as is currently done.
(The fourth regex demonstrates that negative lookbehinds are equally broken:
it should match only the last two lines, but it matches "foo1" and "foo2" as
well. The fifth regex is a workaround that I discuss below.)
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
{
String str = "%foo1\n" +
"%bar foo2\n" +
"%bar foo3\n" +
"%blahblah foo4\n" +
"foo5";
String[] rgxs = { "(?<=%.{0,5})foo\\d",
"(?<=%.{0,5}?)foo\\d",
"(?<=%.{0,5}\\b)foo\\d",
"(?<!%.{0,5})foo\\d",
"foo\\d(?<=%.{0,5}foo\\d)" };
for (int i = 0; i < rgxs.length; i++)
{
Pattern p = Pattern.compile(rgxs[i]);
Matcher m = p.matcher(str);
System.out.println();
System.out.println(p.pattern());
while (m.find())
{
System.out.println(m.group());
}
}
}
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Match the part you want to capture first, then use a lookbehind to check for
the prefix, as I did in my fifth sample regex: 'foo\d(?<=%.{0,5}foo\d)'. This
workaround seems pretty reliable, but its usefulness is very limited because
the whole regex has to have an obvious maximum length. That means it has to
be fairly simple and can't contain any '*', '+' or '{n,}' quantifiers.
###@###.### 2005-06-10 19:27:30 GMT