  1. JDK
  2. JDK-4479128

RFE: Matcher should use Perl-style substitution variables


    • Enhancement
    • Resolution: Fixed
    • P5
    • 1.4.0
    • 1.4.0
    • core-libs
    • beta2
    • generic
    • generic



      Name: bsC130419 Date: 07/12/2001


      java version "1.4.0-beta"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b65)
      Java HotSpot(TM) Client VM (build 1.4.0-beta-b65, mixed mode)

      The implementation of substitution variables in the regex
      package suffers from several flaws, which I describe below
      (note that I use the term "replacement spec" to mean the
      string that is passed to the appendReplacement() and
      replaceAll() methods, and "replacement text" to mean the
      result after all replacements have been made). Below that
      is some sample code and output illustrating these flaws,
      followed by my proposal for a better implementation.

      o There is no sanity-checking. When the matcher finds the
        sequence "$(" in the replacement spec, it takes any
        characters between that and the next ")" as being a reference to
        a capturing group, without even looking at the characters
        to see if they are digits. It just tries to convert them to
        an integer, throwing an exception if it fails. And of
        course, if it doesn't find a matching ")", it throws an
        exception for that.

      o There is no escaping. A "$(" is ALWAYS interpreted as the
        beginning of a substitution variable; there is no way to
        make that literal sequence appear in the replacement text.

      o It's too noisy. If a substitution variable in the
        replacement spec refers to a group that didn't match anything, the
        matcher replaces the variable with a null, which rather
        bizarrely appears as the word "null" in the replacement text.
        And if there WAS no such capturing group in the regex, the
        matcher throws an exception. (Perl, in both these situations,
        will insert an 'undef' value, which prints out as an empty
        string.)

      o The syntax is cumbersome. I have a terrible time
        remembering to add those parentheses, and I don't see why I should.
        Perl's "$n" syntax has become a de facto standard--every Java
        regex tool that I know of uses it. It isn't perfect, but it
        shouldn't be discarded lightly.


      ------------------------ sample code ----------------------------

      import java.util.regex.*;

      public class ReplaceTest
      {
        public static void main(String[] argv)
        {
          try
          {
            Pattern p = Pattern.compile(argv[0]);
            Matcher m = p.matcher("xxxyyyzzz");
            System.out.println(m.replaceAll(argv[1]));
          }
          catch (Exception ex)
          {
            ex.printStackTrace();
          }
        }
      }


      -------------------------- output ------------------------------

      $ java ReplaceTest '(y+)(a)*' '###$(1)###'
      xxx###yyy###zzz


      $ java ReplaceTest '(y+)(a)*' '###$(gronk)###'
      java.lang.NumberFormatException: gronk
          at java.lang.Integer.parseInt(Integer.java:429)
          at java.lang.Integer.parseInt(Integer.java:479)
          at java.util.regex.Matcher.appendReplacement(Matcher.java:507)
          at java.util.regex.Matcher.replaceAll(Matcher.java:576)
          at ReplaceTest.main(ReplaceTest.java:11)


      $ java ReplaceTest '(y+)(a)*' '###$(###'
      java.lang.IllegalArgumentException: Unbalanced parens in replacement
          at java.util.regex.Matcher.appendReplacement(Matcher.java:502)
          at java.util.regex.Matcher.replaceAll(Matcher.java:576)
          at ReplaceTest.main(ReplaceTest.java:11)


      $ java ReplaceTest '(y+)(a)*' '###\$(var)###'
      java.lang.NumberFormatException: var
          at java.lang.Integer.parseInt(Integer.java:429)
          at java.lang.Integer.parseInt(Integer.java:479)
          at java.util.regex.Matcher.appendReplacement(Matcher.java:507)
          at java.util.regex.Matcher.replaceAll(Matcher.java:576)
          at ReplaceTest.main(ReplaceTest.java:11)


      $ java ReplaceTest '(y+)(a)*' '###$(2)###'
      xxx###null###zzz


      $ java ReplaceTest '(y+)(a)*' '###$(3)###'
      java.lang.IndexOutOfBoundsException: No group 3
          at java.util.regex.Matcher.group(Matcher.java:346)
          at java.util.regex.Matcher.appendReplacement(Matcher.java:510)
          at java.util.regex.Matcher.replaceAll(Matcher.java:576)
          at ReplaceTest.main(ReplaceTest.java:11)


      ------------------------- proposal ------------------------------

      I suggest that you take the same approach to substitution
      variables as you did to backreferences. Within a regex, the
      escapes "\1" through "\9" are always treated as backreferences--
      except in character classes, of course. If the next character
      also happens to be a digit, it's taken as an attempt to match
      that digit, not as part of the backreference. If the regex
      author needs to refer to group #10 or higher, he has to use the
      alternate syntax, i.e., "\R10" through "\R99".


      This approach will work just as well for substitution variables.
      Within the replacement spec, a '$' followed by a digit would be
      interpreted as a reference to a capturing group (or to the
      whole match if the digit is zero), no matter what follows the
      digit. Only when the author needed to refer to higher-numbered
      groups would he have to use the "$(nn)" syntax. This way, the
      regex author enjoys the ease of use of the "$n" syntax 99+% of
      the time--in fact, most people would never have to use the
      longer style.


      I have made these changes in a local copy of the regex package.
      The heuristic I used for processing the replacement spec is:

      Upon finding a dollar sign,

      o If it was preceded by a backslash, drop the backslash and
        insert the dollar sign into the replacement text.

      o If it's followed by a digit, or by a pair of parens enclosing
        one or two digits, replace the whole sequence with the text
        that was matched by the corresponding capturing group, if
        possible. If the group doesn't exist, or it exists but it
        didn't match anything, replace the sequence with an empty
        string.

      o Otherwise, just insert the dollar sign into the replacement
        text.
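
      A minimal sketch of the heuristic above, in Java. It is an
      illustration only, not the patch described in this report: the
      class ReplacementSketch and its expand() and replaceAll() methods
      are hypothetical names, and the sketch builds the replacement text
      by hand rather than through appendReplacement(). It also assumes
      two details the list leaves open: a backslash that is not followed
      by '$' is copied through unchanged, and a parenthesized form with
      no digits or more than two digits falls under the "otherwise" rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the proposed replacement-spec heuristic.
class ReplacementSketch {

    // ASCII digits only, so that (c - '0') is always a valid group digit.
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Expands one replacement spec against the matcher's current match.
    static String expand(Matcher m, String spec) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < spec.length()) {
            char c = spec.charAt(i);
            // Rule 1: "\$" becomes a literal '$'; the backslash is dropped.
            if (c == '\\' && i + 1 < spec.length() && spec.charAt(i + 1) == '$') {
                out.append('$');
                i += 2;
                continue;
            }
            if (c == '$') {
                int group = -1;
                int next = i + 1;
                if (next < spec.length() && isDigit(spec.charAt(next))) {
                    // "$n": exactly one digit, no matter what follows it.
                    group = spec.charAt(next) - '0';
                    next++;
                } else if (next < spec.length() && spec.charAt(next) == '(') {
                    // "$(n)" or "$(nn)": one or two digits inside parens.
                    int j = next + 1;
                    int value = 0;
                    while (j < spec.length() && j - next <= 2 && isDigit(spec.charAt(j))) {
                        value = value * 10 + (spec.charAt(j) - '0');
                        j++;
                    }
                    int digits = j - next - 1;
                    if (digits >= 1 && j < spec.length() && spec.charAt(j) == ')') {
                        group = value;
                        next = j + 1;
                    }
                }
                if (group >= 0) {
                    // Rule 2: a missing or unmatched group contributes nothing.
                    if (group <= m.groupCount()) {
                        String text = m.group(group);
                        if (text != null) {
                            out.append(text);
                        }
                    }
                    i = next;
                    continue;
                }
                // Rule 3: not a variable; fall through and copy the '$'.
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Replaces every match of p in input with the expanded spec.
    static String replaceAll(Pattern p, CharSequence input, String spec) {
        Matcher m = p.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(expand(m, spec));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(y+)(a)*");
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$1###"));       // xxx###yyy###zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$(2)###"));     // xxx######zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###$(3)###"));     // xxx######zzz
        System.out.println(replaceAll(p, "xxxyyyzzz", "###\\$(var)###")); // xxx###$(var)###zzz
    }
}
```

      Under this sketch none of the sample inputs from the output
      section above throws: an unmatched or nonexistent group yields an
      empty string, and "$(gronk)" is passed through as literal text.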


      I don't think this process should throw any exceptions--it
      either yields the desired result or it doesn't. But that's
      probably my Perl memes talking.
      (Review ID: 127931)
      ======================================================================

            mmcclosksunw Michael Mccloskey (Inactive)
            bstrathesunw Bill Strathearn (Inactive)