Type: Enhancement
Resolution: Duplicate
Priority: P3
Fix Version/s: tbd
Affects Version/s: 10, 11
Component/s: core-libs
Labels:

Subcomponent: java.util.regex
CPU: x86_64
OS: windows_10

ADDITIONAL SYSTEM INFORMATION :
Windows 10, x64, version 1803 (OS Build 17134.228)

Oracle JRE and JDK 10.0.2
$ java -version
java version "10.0.2" 2018-07-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)

A DESCRIPTION OF THE PROBLEM :
A flag emoji is composed of two regional indicator (RI) symbols (U+1F1E6 through U+1F1FF). Multiple RI pairs can be placed adjacently to show multiple flags, e.g. U+1F1FA U+1F1F8 U+1F1EB U+1F1F7 for the US and French flags.
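
To make the encoding concrete, here is a minimal sketch of how a two-letter region code maps onto RI symbols. The class FlagSketch and its methods are hypothetical names used only for illustration; the one fact relied on is that U+1F1E6 is the RI symbol for the letter A and the remaining letters follow in order.

```java
import java.util.stream.Collectors;

class FlagSketch {
    // Hypothetical helper: maps an ASCII letter A-Z to its regional indicator symbol.
    static int regionalIndicator(char letter) {
        if (letter < 'A' || letter > 'Z') {
            throw new IllegalArgumentException("Expected A-Z but got: " + letter);
        }
        return 0x1F1E6 + (letter - 'A');
    }

    // Hypothetical helper: builds a flag sequence from a two-letter region code.
    static String flag(String regionCode) {
        if (regionCode.length() != 2) {
            throw new IllegalArgumentException("Expected a two-letter code: " + regionCode);
        }
        return new StringBuilder()
                .appendCodePoint(regionalIndicator(regionCode.charAt(0)))
                .appendCodePoint(regionalIndicator(regionCode.charAt(1)))
                .toString();
    }

    public static void main(String[] args) {
        String usFr = flag("US") + flag("FR");
        // Each RI symbol is one code point but two UTF-16 chars (a surrogate pair).
        System.out.println(usFr.codePointCount(0, usFr.length())); // prints 4
        System.out.println(usFr.codePoints()
                .mapToObj(cp -> String.format("U+%X", cp))
                .collect(Collectors.joining(" "))); // prints U+1F1FA U+1F1F8 U+1F1EB U+1F1F7
    }
}
```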

Unicode Standard Annex #29 (UAX #29), "Unicode Text Segmentation", determines how text is split into grapheme clusters; rules GB12 and GB13 handle RI pairs (https://unicode.org/reports/tr29/#GB12).

The report states: "Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point. [...] Otherwise, break everywhere." In other words, break between flags, i.e. wherever an even number of RI characters precedes the break point. (In a run with an odd number of RI characters, the final unpaired RI character forms a cluster of its own.)
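
The pairing behavior described by GB12 and GB13 can be sketched as follows. This is an illustration of the expected result only, not the JDK's implementation: the class RiPairSplitter and its methods are hypothetical, and the sketch assumes the input consists solely of RI characters, so it ignores every other grapheme rule in UAX #29.

```java
import java.util.ArrayList;
import java.util.List;

class RiPairSplitter {
    static boolean isRegionalIndicator(int codePoint) {
        return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
    }

    // Splits a run of RI characters into clusters of at most two RI characters,
    // breaking only where an even number of RI characters precedes the break point.
    static List<String> splitFlags(String riRun) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int riInCluster = 0;
        for (int i = 0; i < riRun.length(); ) {
            int cp = riRun.codePointAt(i);
            if (!isRegionalIndicator(cp)) {
                throw new IllegalArgumentException(
                        "Sketch only handles RI characters, got U+" + Integer.toHexString(cp));
            }
            if (riInCluster == 2) {
                // An even number of RI characters precedes this one: break here.
                clusters.add(current.toString());
                current.setLength(0);
                riInCluster = 0;
            }
            current.appendCodePoint(cp);
            riInCluster++;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            clusters.add(current.toString()); // a full pair, or a trailing unpaired RI
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Eight RI characters, i.e. four flags.
        String input = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6"
                + "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        System.out.println(splitFlags(input).size()); // prints 4
    }
}
```

With three RI characters the same method returns two clusters: one pair followed by a single unpaired RI character.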

However, the \b{g} grapheme cluster boundary matcher in java.util.regex.Pattern doesn't break at all within a run of RI characters; any number of consecutive RI characters is treated as one grapheme cluster.

Also, I had to spend quite some time fixing my answer because this form destroyed all the flag characters (and non-ASCII punctuation) when I failed the captcha.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
$ javac BoundaryRegex.java
$ java -ea BoundaryRegex

Both assertion conditions are false; with assertions enabled, the run aborts with an AssertionError at the first one. Note that you may need to add an -encoding argument to the javac command for correct compilation.

Note: Unicode codepoints have been escaped because this form does not appear to be Unicode-friendly, but the input string is the RI characters corresponding to

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Expected string "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" to split into 4 flag graphemes of 2 RI characters each, i.e. {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"}
ACTUAL -
Got one grapheme of all flags together, { "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" }, i.e. the input string unchanged.

---------- BEGIN SOURCE ----------
// BoundaryRegex.java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BoundaryRegex {
    public static void main(String[] args) {
        var graphemes = Pattern.compile("\\b{g}").split("\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7");
        assert graphemes.length == 4
                : "Input has 4 flags but only " + graphemes.length + " were found";
        assert Arrays.equals(graphemes, new String[] {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"})
                : "Flags split unexpectedly; " + Arrays.toString(graphemes);
    }
}
---------- END SOURCE ----------

FREQUENCY : always


JI9056766.java
2018-08-20 21:56
0.6 kB
Pallavi Sonal

duplicates JDK-8222978 "Upgrade the extended grapheme cluster support to the latest Unicode level." (Resolved)

Assignee: Xueming Shen
Reporter: Webbug Group
Votes: 0
Watchers: 4
Created: 2018-08-18 18:13
Updated: 2019-04-25 09:56
Resolved: 2019-04-25 09:56
