JDK-8247546

Pattern matching does not skip correctly over supplementary characters

Details

    • Resolved In Build: b09
    • CPU: x86_64
    • OS: linux, windows_10
    • Verification: Verified

      Description

        A DESCRIPTION OF THE PROBLEM :
        The find method in java.util.regex.Matcher incorrectly skips only the first char of a supplementary code point when searching for an initial pattern match. The problematic code is in the java.util.regex.Pattern.Start node, which contains the following code:

                    for (; i ...

        SYSTEM / OS / JAVA RUNTIME INFORMATION :
        Tested on openjdk 14.0.1 and 11.0.5

        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        See the attached source code. The goal of the program is to replace invalid (lone) surrogate characters; properly encoded supplementary characters, like the example emoji, should be left unchanged.

        EXPECTED VERSUS ACTUAL BEHAVIOR :
        EXPECTED -
        The input string containing the emoji should not be matched or replaced by the pattern.
        ACTUAL -
        The pattern does not match at char index 0, but the search then steps forward by only one char (instead of one code point), which leads to a match on the second half of the supplementary code point. This second char is then matched and replaced. Output (the question mark is due to terminal encoding):

        ? d83d
        X 58
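
The stepping behavior described above can be illustrated outside the regex engine. The following is a minimal sketch, not the java.util.regex implementation; the class and method names (StepDemo, charSteps, codePointSteps) are hypothetical and exist only for this illustration. It lists the candidate start indices a scan visits when it advances by one char versus by one code point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of java.util.regex.
class StepDemo {
    // Start indices visited when the scan advances one char at a time.
    static List<Integer> charSteps(CharSequence s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            starts.add(i);
        }
        return starts;
    }

    // Start indices visited when the scan advances one code point at a time.
    static List<Integer> codePointSteps(CharSequence s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i += Character.charCount(Character.codePointAt(s, i))) {
            starts.add(i);
        }
        return starts;
    }

    public static void main(String[] args) {
        String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        System.out.println(charSteps(pileofpoo));      // [0, 1]
        System.out.println(codePointSteps(pileofpoo)); // [0]
    }
}
```

With char stepping, index 1 (the low surrogate of the emoji) becomes a candidate match position, which is consistent with the replacement seen in the output above.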

        ---------- BEGIN SOURCE ----------
        import java.util.regex.Pattern;

        public class ReplaceInvalidSurrogates {
            public static void main(String[] args) {
                String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
                System.out.println(pileofpoo);

                // match low and high surrogate ranges. should only match lone surrogates, not any correctly encoded supplementary characters
                Pattern surrogates = Pattern.compile("[\\x{D800}-\\x{DBFF}\\x{DC00}-\\x{DFFF}]");

                String result = surrogates.matcher(pileofpoo).replaceAll("X");

                System.out.println(result);
                System.out.println(result.charAt(0) + " " + Integer.toHexString(result.charAt(0)));
                System.out.println(result.charAt(1) + " " + Integer.toHexString(result.charAt(1)));
            }
        }

        ---------- END SOURCE ----------
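
On affected versions, the goal of the program above can also be reached without a Pattern. The following is a minimal sketch of one possible workaround and is not taken from the report; the class and method names (LoneSurrogateReplacer, replaceLoneSurrogates) are hypothetical. It walks the input by code point using Character.codePointAt, which returns the bare surrogate value when a surrogate is not part of a valid pair.

```java
// Hypothetical helper; a workaround sketch, not the fix applied in the JDK.
class LoneSurrogateReplacer {
    // Replaces every unpaired surrogate with the given char and leaves
    // correctly encoded supplementary characters untouched.
    static String replaceLoneSurrogates(CharSequence input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = Character.codePointAt(input, i);
            if (Character.isSupplementaryCodePoint(cp)) {
                // Valid surrogate pair: copy both chars.
                out.appendCodePoint(cp);
            } else if (Character.isSurrogate((char) cp)) {
                // Unpaired high or low surrogate.
                out.append(replacement);
            } else {
                out.append((char) cp);
            }
            // Advance by one code point (1 or 2 chars).
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        String result = replaceLoneSurrogates(pileofpoo, 'X');
        System.out.println(result.equals(pileofpoo)); // true
    }
}
```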

        FREQUENCY : always


              People

                naoto Naoto Sato
                webbuggrp Webbug Group