Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-5013651

java.util.regex.Pattern: Optional groups take too long to compile

XMLWordPrintable

    • b65
    • x86
    • solaris_2.5.1, windows_xp

        Name: jl125535 Date: 03/15/2004


        FULL PRODUCT VERSION :
        java version "1.5.0-beta"
        Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-beta-b32c)
        Java HotSpot(TM) Client VM (build 1.5.0-beta-b32c, mixed mode)

        ADDITIONAL OS VERSION INFORMATION :
        Microsoft Windows XP [Version 5.1.2600]

        A DESCRIPTION OF THE PROBLEM :
        This problem has been present since regex support was introduced
        in version 1.4.0, and may have led some users to conclude that
        the java.util.regex package is too slow. The problem, however,
        is not the matching speed, but the time it takes to *compile* a
        pattern that contains many optional groups.

        The accompanying sample code measures how long it takes to
        compile several patterns, each containing twenty-six optional
        groups. The first pattern, consisting of capturing groups with
        the default greedy question mark, takes 13.3 seconds to compile
        on my machine (Athlon XP 1700+ w/512M RAM, running WinXP Home).
        In the next pattern, I made the question marks lazy - "(x)??" -
        and it takes even longer to compile, about 15 seconds. But when
        I make the question marks possessive - "(x)?+" - the pattern
        compiles in no time at all.

        The fourth pattern shows that changing to non-capturing groups
        makes things slightly worse; that one takes 14.5 seconds to
        compile on my machine. The fifth pattern, however, is the most
        interesting; simply adding the '^' start anchor to the beginning
        of the pattern brings the compilation time down to nothing.
        Anchoring only at the end, as in the sixth pattern, actually
        increases the compilation time 16.4 seconds), but a pattern
        that's anchored at both ends compiles instantly.

        I realize this bug is not very likely to affect the average user, so I
        don't expect it to be assigned a high priority. It is a bug, though,
        and should be fixed eventually; it shouldn't be possible to write a
        regex that takes fifteen seconds to compile. In the meantime, you
        could add a section to the Pattern class doc containing tips to help
        speed up pattern compilation and/or matching. The first two of my
        suggested workarounds are good general advice anyway, the third one
        could be generalized to "use possessive closures or independent groups
        whenever you can", and the fourth could be listed as applying only to
        this edge case.


        REPRODUCIBILITY :
        This bug can be reproduced always.

        ---------- BEGIN SOURCE ----------
        import java.util.regex.*;


        public class Test
        {
          public static void main(String[] args)
          {
            String[] regexes = {
              "(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" +
              "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?",

              "(a)??(b)??(c)??(d)??(e)??(f)??(g)??(h)??(i)??(j)??" +
              "(k)??(l)??(m)??(n)??(o)??(p)??(q)??(r)??(s)??(t)??" +
              "(u)??(v)??(w)??(x)??(y)??(z)??",

              "(a)?+(b)?+(c)?+(d)?+(e)?+(f)?+(g)?+(h)?+(i)?+(j)?+" +
              "(k)?+(l)?+(m)?+(n)?+(o)?+(p)?+(q)?+(r)?+(s)?+(t)?+" +
              "(u)?+(v)?+(w)?+(x)?+(y)?+(z)?+",

              "(?:a)?(?:b)?(?:c)?(?:d)?(?:e)?(?:f)?(?:g)?(?:h)?(?:i)?" +
              "(?:j)?(?:k)?(?:l)?(?:m)?(?:n)?(?:o)?(?:p)?(?:q)?(?:r)?" +
              "(?:s)?(?:t)?(?:u)?(?:v)?(?:w)?(?:x)?(?:y)?(?:z)?",

              "^(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" +
              "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?",

              "(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" +
              "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?$",

              "^(?:a)?(?:b)?(?:c)?(?:d)?(?:e)?(?:f)?(?:g)?(?:h)?(?:i)?" +
              "(?:j)?(?:k)?(?:l)?(?:m)?(?:n)?(?:o)?(?:p)?(?:q)?(?:r)?" +
              "(?:s)?(?:t)?(?:u)?(?:v)?(?:w)?(?:x)?(?:y)?(?:z)?$"
            };

            for (int i = 0; i < regexes.length; i++)
            {
              long startTime = System.currentTimeMillis();
              Pattern p = Pattern.compile(regexes[i]);
              long msec = (System.currentTimeMillis() - startTime);
              System.out.println("Pattern " + i + " compiles in " +
                                 msec + " milliseconds");
            }
          }
        }

        ---------- END SOURCE ----------

        CUSTOMER SUBMITTED WORKAROUND :
        1) Don't compile patterns inside loops if you don't have to.

        2) Always anchor patterns at both ends if appropriate, even if
           you're using the matches() method.

        3) Use the possesive form of the '?' closure if you can.

        4) Avoid patterns with many optional groups in them.
        (Incident Review ID: 243126)
        ======================================================================
        ###@###.### 11/1/04 20:33 GMT

              sherman Xueming Shen
              jleesunw Jon Lee (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

                Created:
                Updated:
                Resolved:
                Imported:
                Indexed: