Type: Enhancement
Resolution: Fixed
Priority: P5
Fix Version/s: 1.4.0
Affects Version/s: 1.4.0
Component/s: core-libs
Labels: webbug
Subcomponent: java.util.regex
Resolved In Build: beta2
CPU: generic
OS: generic

Name: bsC130419 Date: 07/12/2001

java version "1.4.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b65)
Java HotSpot(TM) Client VM (build 1.4.0-beta-b65, mixed mode)

The implementation of substitution variables in the regex
package suffers from several flaws, which I describe below
(note that I use the term "replacement spec" to mean the
string that is passed to the appendReplacement() and
replaceAll() methods, and "replacement text" to mean the
result after all replacements have been made). Below that
is some sample code and output illustrating these flaws,
followed by my proposal for a better implementation.

o There is no sanity-checking. When the matcher finds the
  sequence "$(" in the replacement spec, it takes any characters
  between that and the next ")" as being a reference to
  a capturing group, without even looking at the characters
  to see if they are digits. It just tries to convert them to
  an integer, throwing an exception if it fails. And of
  course, if it doesn't find a matching ")", it throws an
  exception for that.

o There is no escaping. A "$(" is ALWAYS interpreted as the
  beginning of a substitution variable; there is no way to
  make that literal sequence appear in the replacement text.

o It's too noisy. If a substitution variable in the replacement
  spec refers to a group that didn't match anything, the
  matcher replaces the variable with a null, which rather
  bizarrely appears as the word "null" in the replacement text.
  And if there WAS no such capturing group in the regex, the
  matcher throws an exception. (Perl, in both these situations,
  will insert an 'undef' value, which prints out as an empty
  string.)

o The syntax is cumbersome. I have a terrible time remembering
  to add those parentheses, and I don't see why I should.
  Perl's "$n" syntax has become a de facto standard--every Java
  regex tool that I know of uses it. It isn't perfect, but it
  shouldn't be discarded lightly.

------------------------ sample code ----------------------------

import java.util.regex.*;

public class ReplaceTest
{
  public static void main(String[] argv)
  {
    try
    {
      Pattern p = Pattern.compile(argv[0]);
      Matcher m = p.matcher("xxxyyyzzz");
      System.out.println(m.replaceAll(argv[1]));
    }
    catch (Exception ex)
    {
      ex.printStackTrace();
    }
  }
}

-------------------------- output ------------------------------

$ java ReplaceTest '(y+)(a)*' '###$(1)###'
xxx###yyy###zzz

$ java ReplaceTest '(y+)(a)*' '###$(gronk)###'
java.lang.NumberFormatException: gronk
    at java.lang.Integer.parseInt(Integer.java:429)
    at java.lang.Integer.parseInt(Integer.java:479)
    at java.util.regex.Matcher.appendReplacement(Matcher.java:507)
    at java.util.regex.Matcher.replaceAll(Matcher.java:576)
    at ReplaceTest.main(ReplaceTest.java:11)

$ java ReplaceTest '(y+)(a)*' '###$(###'
java.lang.IllegalArgumentException: Unbalanced parens in replacement
    at java.util.regex.Matcher.appendReplacement(Matcher.java:502)
    at java.util.regex.Matcher.replaceAll(Matcher.java:576)
    at ReplaceTest.main(ReplaceTest.java:11)

$ java ReplaceTest '(y+)(a)*' '###\$(var)###'
java.lang.NumberFormatException: var
    at java.lang.Integer.parseInt(Integer.java:429)
    at java.lang.Integer.parseInt(Integer.java:479)
    at java.util.regex.Matcher.appendReplacement(Matcher.java:507)
    at java.util.regex.Matcher.replaceAll(Matcher.java:576)
    at ReplaceTest.main(ReplaceTest.java:11)

$ java ReplaceTest '(y+)(a)*' '###$(2)###'
xxx###null###zzz

$ java ReplaceTest '(y+)(a)*' '###$(3)###'
java.lang.IndexOutOfBoundsException: No group 3
    at java.util.regex.Matcher.group(Matcher.java:346)
    at java.util.regex.Matcher.appendReplacement(Matcher.java:510)
    at java.util.regex.Matcher.replaceAll(Matcher.java:576)
    at ReplaceTest.main(ReplaceTest.java:11)

------------------------- proposal ------------------------------

I suggest that you take the same approach to substitution
variables as you did to backreferences. Within a regex, the
escapes "\1" through "\9" are always treated as backreferences--
except in character classes, of course. If the next character
also happens to be a digit, it's taken as an attempt to match
that digit, not as part of the backreference. If the regex
author needs to refer to group #10 or higher, he has to use the
alternate syntax, i.e., "\R10" through "\R99".

This approach will work just as well for substitution variables.
Within the replacement spec, a '$' followed by a digit would be
interpreted as a reference to a capturing group (or to the
whole match if the digit is zero), no matter what follows the
digit. Only when the author needed to refer to higher-numbered
groups would he have to use the "$(nn)" syntax. This way, the
regex author enjoys the ease of use of the "$n" syntax 99+% of
the time--in fact, most people would never have to use the
longer style.

I have made these changes in a local copy of the regex package.
The heuristic I used for processing the replacement spec is:

Upon finding a dollar sign,

o If it was preceded by a backslash, drop the backslash and
  insert the dollar sign into the replacement text.

o If it's followed by a digit, or by a pair of parens enclosing
  one or two digits, replace the whole sequence with the text
  that was matched by the corresponding capturing group, if
  possible. If the group doesn't exist, or it exists but it
  didn't match anything, replace the sequence with an empty
  string.

o Otherwise, just insert the dollar sign into the replacement
  text.
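
The three rules above can be sketched in Java roughly as follows. This
is an illustration written against the public Matcher API, not the
local patch mentioned in this report; the class name LenientReplace
and the helpers expand(), replaceAll() and isDigit() are hypothetical
names chosen for the example. It assumes that only ASCII digits count
and that "$(nnn)" with three or more digits is left as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed heuristic; not part of java.util.regex.
class LenientReplace {

    // Assumption: only ASCII digits are treated as group numbers.
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Expands one replacement spec against the matcher's current match.
    static String expand(Matcher m, String spec) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < spec.length()) {
            char c = spec.charAt(i);
            // Rule 1: "\$" drops the backslash and yields a literal dollar sign.
            if (c == '\\' && i + 1 < spec.length() && spec.charAt(i + 1) == '$') {
                out.append('$');
                i += 2;
                continue;
            }
            if (c == '$') {
                int group = -1;      // -1 means "not a group reference"
                int next = i + 1;    // index just past the reference, if any
                if (next < spec.length() && isDigit(spec.charAt(next))) {
                    // "$n": exactly one digit, whatever follows it.
                    group = spec.charAt(next) - '0';
                    next++;
                } else if (next < spec.length() && spec.charAt(next) == '(') {
                    // "$(n)" or "$(nn)": one or two digits, then ")".
                    int j = next + 1;
                    int value = 0;
                    int digits = 0;
                    while (j < spec.length() && digits < 2 && isDigit(spec.charAt(j))) {
                        value = value * 10 + (spec.charAt(j) - '0');
                        j++;
                        digits++;
                    }
                    if (digits > 0 && j < spec.length() && spec.charAt(j) == ')') {
                        group = value;
                        next = j + 1;
                    }
                }
                if (group >= 0) {
                    // Rule 2: a missing or unmatched group contributes nothing.
                    if (group <= m.groupCount()) {
                        String text = m.group(group);
                        if (text != null) {
                            out.append(text);
                        }
                    }
                    i = next;
                    continue;
                }
                // Rule 3: fall through and copy the dollar sign literally.
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Replaces every match of p in input using the lenient spec rules.
    static String replaceAll(Pattern p, CharSequence input, String spec) {
        Matcher m = p.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(expand(m, spec));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(y+)(a)*");
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$1###"));        // xxx###yyy###zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$2###"));        // xxx######zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$3###"));        // xxx######zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###\\$(var)###"));  // xxx###$(var)###zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$(gronk)###"));  // xxx###$(gronk)###zzz
    }
}
```

Under these rules, none of the replacement specs from the sample
output section throws: references to group 2 and group 3 expand to an
empty string, and "$(gronk)" and an unbalanced "$(" are copied through
as literal text.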

I don't think this process should throw any exceptions--it
either yields the desired result or it doesn't. But that's
probably my Perl memes talking.
(Review ID: 127931)
======================================================================

Assignee: Michael Mccloskey (Inactive)
Reporter: Bill Strathearn (Inactive)
Votes: 0
Watchers: 0
Created: 2001-07-12 09:55
Updated: 2001-08-07 08:51
Resolved: 2001-08-07 08:51
Imported: 16/Sep/12 10:52 AM
Indexed: 18/Jul/12 5:51 AM
