Type: Bug
Resolution: Fixed
Priority: P4
Fix Version/s: 16
Affects Version/s: 8, 11, 15
Component/s: core-libs
Labels:
Subcomponent: java.util.regex
Resolved In Build: b09
CPU: x86_64
OS: linux, windows_10
Verification: Verified

Issue	Fix Version	Assignee	Priority	Status	Resolution	Resolved In Build
JDK-8292486	11.0.17	Liam Miller-Cushon	P4	Resolved	Fixed	b03

A DESCRIPTION OF THE PROBLEM :
The find method in java.util.regex.Matcher incorrectly skips only the first char of a supplementary code point when searching for an initial pattern match. The problematic code is in the java.util.regex.Pattern.Start node, which contains the following code:

            for (; i ...

SYSTEM / OS / JAVA RUNTIME INFORMATION :
Tested on openjdk 14.0.1 and 11.0.5

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See the attached source code. The goal of the program is to replace invalid (lone) surrogate characters; properly encoded supplementary characters, such as the example emoji, should be left unchanged.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The input string containing the emoji should not be matched or replaced by the pattern.
ACTUAL -
The pattern does not match at char index 0, but the search then steps forward only one char (instead of one code point), so it matches the second half (the low surrogate) of the supplementary code point. This second char is then replaced. Output (the question mark is due to terminal encoding):

? d83d
X 58
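
The stepping difference described above can be illustrated with a small sketch. This is not the code from Pattern.Start; the class and method names (CodePointStepping, charSteps, codePointSteps) are hypothetical and only show which start indices a search would try when it advances by char versus by code point.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointStepping {
    // Start indices tried when the search advances one char at a time.
    static List<Integer> charSteps(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            starts.add(i);
        }
        return starts;
    }

    // Start indices tried when the search advances one code point at a time.
    static List<Integer> codePointSteps(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            starts.add(i);
        }
        return starts;
    }

    public static void main(String[] args) {
        // U+1F4A9 is encoded as the surrogate pair D83D DCA9, followed by 'X'.
        String s = new StringBuilder().appendCodePoint(0x1F4A9).append('X').toString();
        System.out.println(charSteps(s));      // [0, 1, 2]
        System.out.println(codePointSteps(s)); // [0, 2]
        // Index 1 is the low surrogate, which the surrogate character class matches on its own.
        System.out.println(Integer.toHexString(s.charAt(1))); // dca9
    }
}
```

Stepping by char visits index 1, the low surrogate, which is where the unwanted match occurs; stepping by code point never starts a match attempt inside the pair.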

---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;

public class ReplaceInvalidSurrogates {
    public static void main(String[] args) {
        String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        System.out.println(pileofpoo);

        // Match the high (D800-DBFF) and low (DC00-DFFF) surrogate ranges. This should match only lone surrogates, not correctly encoded supplementary characters.
        Pattern surrogates = Pattern.compile("[\\x{D800}-\\x{DBFF}\\x{DC00}-\\x{DFFF}]");

        String result = surrogates.matcher(pileofpoo).replaceAll("X");

        System.out.println(result);
        System.out.println(result.charAt(0) + " " + Integer.toHexString(result.charAt(0)));
        System.out.println(result.charAt(1) + " " + Integer.toHexString(result.charAt(1)));
    }
}

---------- END SOURCE ----------
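
On affected versions, the same goal can be reached without java.util.regex. The following is a minimal sketch under that assumption, not the fix applied in the JDK; the class and method names (LoneSurrogateReplacer, replaceLoneSurrogates) are hypothetical.

```java
public class LoneSurrogateReplacer {
    // Replaces unpaired surrogates with the given char; valid surrogate pairs are copied unchanged.
    static String replaceLoneSurrogates(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // Well-formed pair: keep both chars and skip the low surrogate.
                out.append(c).append(input.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // High surrogate without a following low one, or a stray low surrogate.
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        // The well-formed emoji is left unchanged.
        System.out.println(replaceLoneSurrogates(pileofpoo, 'X').equals(pileofpoo)); // true
        // A lone high surrogate is replaced.
        System.out.println(replaceLoneSurrogates("a" + (char) 0xD83D + "b", 'X')); // aXb
    }
}
```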

FREQUENCY : always

Backported by:
JDK-8292486 Pattern matching does not skip correctly over supplementary characters (Resolved)

Links to:
Commit openjdk/jdk11u-dev/38c632f1
Review openjdk/jdk11u-dev/1319

Assignee: Naoto Sato
Reporter: Webbug Group
Votes: 0
Watchers: 7
Created: 2020-06-11 15:14
Updated: 2024-11-13 12:00
Resolved: 2020-07-29 09:57
