JDK-8321390

No maximum match although greedy optional-operator


      ADDITIONAL SYSTEM INFORMATION :
      This occurs in every case I tried.

      A DESCRIPTION OF THE PROBLEM :
      I have a concrete pattern:

      \A
      (%\s*!\s*T[eE]X program=(?<programMagic>[^} ]+)\R)?
      (\\(documentstyle|documentclass)\s*(\[[^]]*\])?\s*\{(?<docClass>[^} ]+)\})?

      and I want to match the following string:

      % !TEX program=lualatex
      \documentclass[a4paper,12pt]{book}

      As one can see, the goal is to match the beginning of a LaTeX root file and to extract the program and the document class.

      The \A restricts the match to the start of the file, and the last two lines of the pattern are optional, yet greedy (operator ?). This is not the whole truth, but the point is that I want to identify root files when either a class or a program is given.

      If I remove the two ?, the given file is matched and both program and class are identified correctly.

      If I add both ? as above, then only the second group, the class, is matched. If I use only one ?, then only the other, non-optional part is matched.

      If I understand the theory of regular expressions correctly, ? is greedy, so both class and program must be matched by the version of the pattern above.

      This is not the case. Is this a bug or do I misunderstand something?

      I found analogous issues with * in place of ?.
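For reference, the greedy behaviour the description relies on can be checked on a toy pattern. The following is only a minimal sketch and not part of the original report: the class name GreedyOptionalDemo, the helper optionalPart and the group name opt are hypothetical. It assumes nothing beyond java.util.regex. It also shows one general way in which an optional group can be null even though ? is greedy: when the group cannot match at the position where find() finally succeeds. Whether that is what happens with the reported pattern is not established here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyOptionalDemo {
    // Returns what the optional group "opt" captured in the first match,
    // or null if the group did not participate in that match.
    static String optionalPart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalStateException("no match for " + regex);
        }
        return m.group("opt");
    }

    public static void main(String[] args) {
        // Greedy ?: the optional group is tried first and kept, because the
        // rest of the pattern (a?b) can still match what follows.
        System.out.println(optionalPart("\\A(?<opt>a)?a?b", "ab"));  // a

        // Reluctant ??: skipping the group is tried first, and that already
        // succeeds because a?b matches "ab" on its own.
        System.out.println(optionalPart("\\A(?<opt>a)??a?b", "ab")); // null

        // Greedy, but the group cannot match anywhere: find() still succeeds
        // at index 1, where only the non-optional part matches.
        System.out.println(optionalPart("(?<opt>ax)?b", "ab"));      // null
    }
}
```

In the last case the match is still the longest one available at the position where it starts. Greedy only means that the optional part is attempted before it is skipped.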


      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Set up the pattern and the string to match as given.
      The pattern then matches,
      but in the cases indicated the match is not maximal, as would be expected for the greedy ? operator.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      I would expect both groups to be matched even in the presence of both ? operators.
      ACTUAL -
      Only the second group is matched.
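To narrow down where the match is lost, one option is to test each optional part on its own at the start of the input. The sketch below is not part of the original report: the class name OptionalGroupProbe, its constants and the helper groupAtStart are hypothetical. The regex pieces are copied from the pattern in the description, but written as ordinary string literals. In a Java text block, every line that does not end with \ contributes a line break to the string, so single-line literals rule that out as a factor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupProbe {
    // The two optional parts of the described pattern, as plain string literals.
    static final String PROGRAM_PART =
        "%\\s*!\\s*T[eE]X program=(?<programMagic>[^} ]+)\\R";
    static final String CLASS_PART =
        "\\\\(documentstyle|documentclass)\\s*(\\[[^]]*\\])?\\s*\\{(?<docClass>[^} ]+)\\}";
    // Both parts optional and anchored with \A, as in the description.
    static final String COMBINED =
        "\\A(" + PROGRAM_PART + ")?(" + CLASS_PART + ")?";

    // Returns the named group if the regex matches at the very start of the
    // input, or null if it does not match there or the group did not participate.
    static String groupAtStart(String regex, String input, String group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String input = "% !TEX program=lualatex\n\\documentclass[a4paper,12pt]{book}\n";
        // Each part on its own, tried at index 0 only.
        System.out.println(groupAtStart(PROGRAM_PART, input, "programMagic"));
        System.out.println(groupAtStart(CLASS_PART, input, "docClass"));
        // Both parts together, each with a greedy ?.
        System.out.println(groupAtStart(COMBINED, input, "programMagic"));
        System.out.println(groupAtStart(COMBINED, input, "docClass"));
    }
}
```

Comparing the output of the single parts with that of the combined pattern indicates whether the program part fails by itself or only in combination with the class part.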

      ---------- BEGIN SOURCE ----------
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      String regex =
      """
      \\A
      (%\\s*!\\s*T[eE]X program=(?<programMagic>[^} ]+)\\R)?\
      (\\\\(documentstyle|documentclass)\\s*(\\[[^]]*\\])?\\s*\\{(?<docClass>[^} ]+)\\})\
      """;
      Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
      String input =
      """
      % !TEX program=lualatex
      \\documentclass[a4paper,12pt]{book}
      """;
      Matcher matcher = pattern.matcher(input);
      boolean found = matcher.find();
      assert found;
      matcher.group("programMagic"); // reported: null, but should be "lualatex"
      matcher.group("docClass");     // reported: "book", as expected
      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      I have no workaround.

      FREQUENCY : always


            tongwan Andrew Wang
            webbuggrp Webbug Group