- Enhancement
- Resolution: Fixed
- P5
- 1.4.0
- beta2
- generic
- generic
Name: bsC130419 Date: 07/12/2001
java version "1.4.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b65)
Java HotSpot(TM) Client VM (build 1.4.0-beta-b65, mixed mode)
The implementation of substitution variables in the regex
package suffers from several flaws, which I describe below
(note that I use the term "replacement spec" to mean the
string that is passed to the appendReplacement() and
replaceAll() methods, and "replacement text" to mean the
result after all replacements have been made). Below that
is some sample code and output illustrating these flaws,
followed by my proposal for a better implementation.
o There is no sanity-checking. When the matcher finds the
sequence "$(" in the replacement spec, it takes any characters
between that and the next ")" as being a reference to
a capturing group, without even looking at the characters
to see if they are digits. It just tries to convert them to
an integer, throwing an exception if it fails. And of
course, if it doesn't find a matching ")", it throws an
exception for that.
o There is no escaping. A "$(" is ALWAYS interpreted as the
beginning of a substitution variable; there is no way to
make that literal sequence appear in the replacement text.
o It's too noisy. If a substitution variable in the replacement
spec refers to a group that didn't match anything, the
matcher replaces the variable with a null, which rather
bizarrely appears as the word "null" in the replacement text.
And if there WAS no such capturing group in the regex, the
matcher throws an exception. (Perl, in both these situations,
will insert an 'undef' value, which prints out as an empty
string.)
o The syntax is cumbersome. I have a terrible time remembering
to add those parentheses, and I don't see why I should.
Perl's "$n" syntax has become a de facto standard--every Java
regex tool that I know of uses it. It isn't perfect, but it
shouldn't be discarded lightly.
------------------------ sample code ----------------------------
import java.util.regex.*;

public class ReplaceTest
{
    public static void main(String[] argv)
    {
        try
        {
            Pattern p = Pattern.compile(argv[0]);       // argv[0]: the regex
            Matcher m = p.matcher("xxxyyyzzz");
            System.out.println(m.replaceAll(argv[1]));  // argv[1]: the replacement spec
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
    }
}
-------------------------- output ------------------------------
$ java ReplaceTest '(y+)(a)*' '###$(1)###'
xxx###yyy###zzz
$ java ReplaceTest '(y+)(a)*' '###$(gronk)###'
java.lang.NumberFormatException: gronk
at java.lang.Integer.parseInt(Integer.java:429)
at java.lang.Integer.parseInt(Integer.java:479)
at java.util.regex.Matcher.appendReplacement(Matcher.java:507)
at java.util.regex.Matcher.replaceAll(Matcher.java:576)
at ReplaceTest.main(ReplaceTest.java:11)
$ java ReplaceTest '(y+)(a)*' '###$(###'
java.lang.IllegalArgumentException: Unbalanced parens in replacement
at java.util.regex.Matcher.appendReplacement(Matcher.java:502)
at java.util.regex.Matcher.replaceAll(Matcher.java:576)
at ReplaceTest.main(ReplaceTest.java:11)
$ java ReplaceTest '(y+)(a)*' '###\$(var)###'
java.lang.NumberFormatException: var
at java.lang.Integer.parseInt(Integer.java:429)
at java.lang.Integer.parseInt(Integer.java:479)
at java.util.regex.Matcher.appendReplacement(Matcher.java:507)
at java.util.regex.Matcher.replaceAll(Matcher.java:576)
at ReplaceTest.main(ReplaceTest.java:11)
$ java ReplaceTest '(y+)(a)*' '###$(2)###'
xxx###null###zzz
$ java ReplaceTest '(y+)(a)*' '###$(3)###'
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.group(Matcher.java:346)
at java.util.regex.Matcher.appendReplacement(Matcher.java:510)
at java.util.regex.Matcher.replaceAll(Matcher.java:576)
at ReplaceTest.main(ReplaceTest.java:11)
------------------------- proposal ------------------------------
I suggest that you take the same approach to substitution
variables as you did to backreferences. Within a regex, the
escapes "\1" through "\9" are always treated as backreferences--
except in character classes, of course. If the next character
also happens to be a digit, it's taken as an attempt to match
that digit, not as part of the backreference. If the regex
author needs to refer to group #10 or higher, he has to use the
alternate syntax, i.e., "\R10" through "\R99".
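The single-digit rule can be checked with a small sketch. It assumes a current JDK, whose handling of multi-digit backreferences may differ from the beta described here; the pattern below has only one capturing group, so "\11" can only be read as "\1" followed by a literal "1". The class and method names are illustrative, not part of the regex package.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // One capturing group, then the escape "\11": a backreference to
    // group 1 followed by an attempt to match the literal digit '1'.
    static final String REGEX = "(a)\\11";

    // True if the whole input matches REGEX.
    static boolean backrefThenDigit(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        // "aa1" is 'a', the same 'a' again via \1, then the digit '1'.
        System.out.println(backrefThenDigit("aa1"));
        // "aa" lacks the trailing digit, so it should not match.
        System.out.println(backrefThenDigit("aa"));
    }
}
```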
This approach will work just as well for substitution variables.
Within the replacement spec, a '$' followed by a digit would be
interpreted as a reference to a capturing group (or to the
whole match if the digit is zero), no matter what follows the
digit. Only when the author needed to refer to higher-numbered
groups would he have to use the "$(nn)" syntax. This way, the
regex author enjoys the ease of use of the "$n" syntax 99+% of
the time--in fact, most people would never have to use the
longer style.
I have made these changes in a local copy of the regex package.
The heuristic I used for processing the replacement spec is:
Upon finding a dollar sign,
o If it was preceded by a backslash, drop the backslash and
insert the dollar sign into the replacement text.
o If it's followed by a digit, or by a pair of parens enclosing
one or two digits, replace the whole sequence with the text
that was matched by the corresponding capturing group, if
possible. If the group doesn't exist, or it exists but it
didn't match anything, replace the sequence with an empty
string.
o Otherwise, just insert the dollar sign into the replacement
text.
I don't think this process should throw any exceptions--it
either yields the desired result or it doesn't. But that's
probably my Perl memes talking.
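The heuristic above could be sketched roughly as follows. This is not the modified regex package mentioned in the report, whose code is not shown; it is a standalone illustration built on the public Matcher API, and the names LenientReplace, expand and replaceAll are hypothetical. It assumes "digit" means an ASCII digit and that a backslash is special only when it directly precedes a dollar sign.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LenientReplace {

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Expands one replacement spec against the matcher's current match.
    // Never throws for a malformed or out-of-range group reference.
    static String expand(Matcher m, String spec) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < spec.length()) {
            char c = spec.charAt(i);
            if (c == '\\' && i + 1 < spec.length() && spec.charAt(i + 1) == '$') {
                out.append('$');            // "\$": drop the backslash, keep the dollar
                i += 2;
            } else if (c == '$') {
                int group = -1;             // -1 means "not a group reference"
                int next = i + 1;
                if (next < spec.length() && isDigit(spec.charAt(next))) {
                    group = spec.charAt(next) - '0';    // "$n", whatever follows n
                    next++;
                } else if (next < spec.length() && spec.charAt(next) == '(') {
                    int close = spec.indexOf(')', next);
                    int digits = close - next - 1;
                    boolean ok = close > next && (digits == 1 || digits == 2);
                    for (int k = next + 1; ok && k < close; k++) {
                        ok = isDigit(spec.charAt(k));
                    }
                    if (ok) {                           // "$(n)" or "$(nn)"
                        group = Integer.parseInt(spec.substring(next + 1, close));
                        next = close + 1;
                    }
                }
                if (group < 0) {
                    out.append('$');        // otherwise: a literal dollar sign
                    i++;
                } else {
                    // A missing or unmatched group contributes an empty string.
                    if (group <= m.groupCount() && m.group(group) != null) {
                        out.append(m.group(group));
                    }
                    i = next;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Replaces every match of regex in input using the lenient expansion.
    static String replaceAll(String regex, String input, String spec) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (m.find()) {
            result.append(input, last, m.start());
            result.append(expand(m, spec));
            last = m.end();
        }
        result.append(input, last, input.length());
        return result.toString();
    }

    public static void main(String[] args) {
        String[] specs = {
            "###$1###", "###$(1)###", "###$(gronk)###", "###$(###",
            "###\\$(var)###", "###$2###", "###$(3)###"
        };
        for (String spec : specs) {
            System.out.println(spec + " -> " + replaceAll("(y+)(a)*", "xxxyyyzzz", spec));
        }
    }
}
```

Run against the specs from the sample output, this sketch leaves "$(gronk)", "$(" and "\$(var)" in the result as literal text (minus the backslash in the last case), and both "$2" and "$(3)" expand to an empty string instead of "null" or an exception.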
(Review ID: 127931)
======================================================================