JDK-8247728

Regex behavior differs between Java 8 and Java 11 and is now wrong in 11


      ADDITIONAL SYSTEM INFORMATION :
      Tested on Windows and MacOS

      A DESCRIPTION OF THE PROBLEM :
      A regex used for natural language processing (tokenization) to find the boundaries between non-repeating punctuation and symbol characters now also finds breaks between different whitespace characters.

      REGRESSION : Last worked in version 8u251

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the test program below on Java 8 and on Java 11 and compare the output.
      The results for the 1st and 2nd regex patterns differ between the two runtimes.
      The 3rd and 4th patterns seem stable but should be equivalent.
      The intended match is a non-alphanumeric, non-whitespace character that is not followed by the same character.
      In Java 11, the 1st pattern now also matches whitespace characters, and the 2nd also matches alphanumeric characters.
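One way to narrow the difference down is to test the bare character classes on single characters, without the capturing group and the lookahead. The sketch below does that for the four classes from the report; the class name `CharClassProbe` and its helper methods are hypothetical and not part of the original test. Running it on each runtime should show which characters every class accepts there.

```java
import java.util.regex.Pattern;

public class CharClassProbe {

    // The four character classes from the report, without the capturing group and lookahead.
    static final String[] CLASSES = {
        "[^0-9a-z&&[^\\s]]",
        "[^0-9a-z&&[\\s]]",
        "[\\S&&[\\w]]",
        "[\\S&&[\\W]]"
    };

    // True if the character class, compiled with the report's flag, matches the single character c.
    static boolean accepts(String charClass, char c) {
        return Pattern.compile(charClass, Pattern.CASE_INSENSITIVE)
                .matcher(String.valueOf(c))
                .matches();
    }

    // Makes whitespace visible, in the same style as the report's printable() helper.
    static String label(char c) {
        switch (c) {
            case ' ':  return "<space>";
            case '\n': return "<newline>";
            case '\t': return "<tab>";
            default:   return String.valueOf(c);
        }
    }

    public static void main(String[] args) {
        // Sample characters chosen for illustration: letter, digit, whitespace, punctuation, underscore.
        char[] samples = { 'a', '5', ' ', '\n', '\t', '.', '_' };
        System.out.println("java.version: " + System.getProperty("java.version"));
        for (String charClass : CLASSES) {
            StringBuilder accepted = new StringBuilder();
            for (char c : samples) {
                if (accepts(charClass, c)) {
                    accepted.append(label(c)).append(' ');
                }
            }
            System.out.println(charClass + " accepts: " + accepted.toString().trim());
        }
    }
}
```

If the lists printed for the first two classes differ between runtimes, the change lies in how the negated intersection is interpreted rather than in the backreference or the lookahead.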


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Java HotSpot(TM) 64-Bit Server VM
      Oracle Corporation
      25.221-b11
      Running Test
      Text: aabcc<space><newline><newline><tab><tab><newline><space><newline><newline>..,;;

      Pattern: ([^0-9a-z&&[^\s]])(?!\1)
      Index: 15, 16 Group: . Next Char: ,
      Index: 16, 17 Group: , Next Char: ;
      Index: 18, 19 Group: ; Next Char:
      Pattern: ([^0-9a-z&&[\s]])(?!\1)
      Index: 5, 6 Group: <space> Next Char: <newline>
      Index: 7, 8 Group: <newline> Next Char: <tab>
      Index: 9, 10 Group: <tab> Next Char: <newline>
      Index: 10, 11 Group: <newline> Next Char: <space>
      Index: 11, 12 Group: <space> Next Char: <newline>
      Index: 13, 14 Group: <newline> Next Char: .
      Pattern: ([\S&&[\w]])(?!\1)
      Index: 1, 2 Group: a Next Char: b
      Index: 2, 3 Group: b Next Char: c
      Index: 4, 5 Group: c Next Char: <space>
      Pattern: ([\S&&[\W]])(?!\1)
      Index: 15, 16 Group: . Next Char: ,
      Index: 16, 17 Group: , Next Char: ;
      Index: 18, 19 Group: ; Next Char:

      ACTUAL -
      Java HotSpot(TM) 64-Bit Server VM
      Oracle Corporation
      11.0.7+8-LTS
      Running Test
      Text: aabcc<space><newline><newline><tab><tab><newline><space><newline><newline>..,;;

      Pattern: ([^0-9a-z&&[^\s]])(?!\1)
      Index: 5, 6 Group: <space> Next Char: <newline>
      Index: 7, 8 Group: <newline> Next Char: <tab>
      Index: 9, 10 Group: <tab> Next Char: <newline>
      Index: 10, 11 Group: <newline> Next Char: <space>
      Index: 11, 12 Group: <space> Next Char: <newline>
      Index: 13, 14 Group: <newline> Next Char: .
      Index: 15, 16 Group: . Next Char: ,
      Index: 16, 17 Group: , Next Char: ;
      Index: 18, 19 Group: ; Next Char:
      Pattern: ([^0-9a-z&&[\s]])(?!\1)
      Index: 1, 2 Group: a Next Char: b
      Index: 2, 3 Group: b Next Char: c
      Index: 4, 5 Group: c Next Char: <space>
      Index: 5, 6 Group: <space> Next Char: <newline>
      Index: 7, 8 Group: <newline> Next Char: <tab>
      Index: 9, 10 Group: <tab> Next Char: <newline>
      Index: 10, 11 Group: <newline> Next Char: <space>
      Index: 11, 12 Group: <space> Next Char: <newline>
      Index: 13, 14 Group: <newline> Next Char: .
      Index: 15, 16 Group: . Next Char: ,
      Index: 16, 17 Group: , Next Char: ;
      Index: 18, 19 Group: ; Next Char:
      Pattern: ([\S&&[\w]])(?!\1)
      Index: 1, 2 Group: a Next Char: b
      Index: 2, 3 Group: b Next Char: c
      Index: 4, 5 Group: c Next Char: <space>
      Pattern: ([\S&&[\W]])(?!\1)
      Index: 15, 16 Group: . Next Char: ,
      Index: 16, 17 Group: , Next Char: ;
      Index: 18, 19 Group: ; Next Char:


      ---------- BEGIN SOURCE ----------
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class TestRegex {

          public static void main(String[] args) {
              // Identify the runtime so that results can be compared across versions.
              System.out.println(System.getProperty("java.vm.name"));
              System.out.println(System.getProperty("java.vm.vendor"));
              System.out.println(System.getProperty("java.vm.version"));

              System.out.println("Running Test");
              String[] testRegex = new String[] {
                  "([^0-9a-z&&[^\\s]])(?!\\1)",
                  "([^0-9a-z&&[\\s]])(?!\\1)",
                  "([\\S&&[\\w]])(?!\\1)",
                  "([\\S&&[\\W]])(?!\\1)"
              };
              String text = "aabcc \n\n\t\t\n \n\n..,;;";
              System.out.println("Text: " + printable(text));
              System.out.println();
              for (String regex : testRegex) {
                  Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
                  System.out.println("Pattern: " + pattern.pattern());
                  Matcher m = pattern.matcher(text);
                  // Print every match together with the character that follows it.
                  while (m.find()) {
                      System.out.println("\tIndex: " + m.start() + ", " + m.end()
                              + " Group: " + printable(m.group())
                              + " Next Char: " + printable((m.end() < text.length() ? "" + text.charAt(m.end()) : "")));
                  }
              }
          }

          // Replace whitespace characters with visible placeholders.
          public static String printable(String text) {
              text = text.replaceAll("\t", "<tab>");
              text = text.replaceAll("\n", "<newline>");
              text = text.replaceAll(" ", "<space>");
              return text;
          }

      }
      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      The only workaround found is to use the \S and \W character classes. However, \w counts the underscore as a word character, so \W does not match it, which is a loss of precision.
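As a possible alternative to the workaround above, the intent "non-alphanumeric, non-whitespace" can also be written as a single negated class with no `&&` operator. The sketch below is an assumption, not something proposed in the report; the class name `NoIntersectionWorkaround` and the method `matchStarts` are hypothetical. Because it avoids the intersection entirely, it should not depend on how a negated intersection is interpreted, and it still treats the underscore as a symbol.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoIntersectionWorkaround {

    // Hypothetical alternative: one negated class listing everything to exclude
    // (digits, letters, whitespace) instead of intersecting two classes.
    static final Pattern NON_REPEATING_SYMBOL =
            Pattern.compile("([^0-9a-z\\s])(?!\\1)", Pattern.CASE_INSENSITIVE);

    // Returns the start index of every match in the given text.
    static List<Integer> matchStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = NON_REPEATING_SYMBOL.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Same text as in the report's test program.
        String text = "aabcc \n\n\t\t\n \n\n..,;;";
        System.out.println(matchStarts(text));   // [15, 16, 18]
        // The underscore is still matched, unlike with the \W-based workaround.
        System.out.println(matchStarts("a_b"));  // [1]
    }
}
```

The start indices 15, 16 and 18 are the same ones listed for the 1st pattern in the EXPECTED section.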

      FREQUENCY : always


            smarks Stuart Marks
            webbuggrp Webbug Group