JDK-8209777

\b{g} in regexes fails to break between flag emoji

Details

    Description

      ADDITIONAL SYSTEM INFORMATION :
      Windows 10, x64, version 1803 (OS Build 17134.228)

      Oracle JRE and JDK 10.0.2
      $ java -version
      java version "10.0.2" 2018-07-17
      Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
      Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)

      A DESCRIPTION OF THE PROBLEM :
      A flag emoji is composed of two regional indicator (RI) symbols (U+1F1E6 through U+1F1FF); multiple RI pairs can be placed adjacently for multiple flags, e.g. U+1F1FA U+1F1F8 U+1F1EB U+1F1F7 for the US and the French flags.
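
      To make the encoding concrete, here is a minimal sketch of how a two-letter region code maps onto an RI pair. The class and method names (FlagEmoji, flag) are illustrative only and not part of any JDK API; the sketch assumes uppercase ASCII input and prints code points rather than raw emoji so that it does not depend on the console encoding.

```java
// Illustrative helper: FlagEmoji and flag(...) are hypothetical names, not a JDK API.
class FlagEmoji {
    // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; the 26 letters are contiguous.
    private static final int RI_A = 0x1F1E6;

    /** Maps a two-letter uppercase ASCII region code such as "US" to its RI pair. */
    static String flag(String regionCode) {
        if (regionCode.length() != 2) {
            throw new IllegalArgumentException("Expected two letters: " + regionCode);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2; i++) {
            char c = regionCode.charAt(i);
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("Expected A-Z: " + regionCode);
            }
            // Offset from 'A' selects the matching regional indicator symbol.
            sb.appendCodePoint(RI_A + (c - 'A'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String usFr = flag("US") + flag("FR");
        // Prints U+1F1FA, U+1F1F8, U+1F1EB, U+1F1F7, one per line.
        usFr.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}
```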

      UAX #29, "Unicode Text Segmentation", defines how text is split into grapheme clusters; rules GB12 and GB13 handle RI pairs (https://unicode.org/reports/tr29/#GB12).

      The report states: "Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point. [...] Otherwise, break everywhere." That is, a run of RI characters is paired off from its start, and breaks fall between the pairs, at even boundaries. (If the run has an odd length, the last RI character is left unpaired.)
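
      The pairing behaviour described by GB12 and GB13 can be sketched by hand. The following is a deliberately simplified illustration, not the JDK's implementation and not a full UAX #29 segmenter: it only pairs adjacent RI characters and treats every other code point as its own cluster (ignoring combining marks, ZWJ sequences and so on). RiPairSplitter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of GB12/GB13 only; all other UAX #29 rules are ignored (assumption).
class RiPairSplitter {
    static boolean isRegionalIndicator(int cp) {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    /** Splits s so that RI characters are grouped in pairs from the start of each run. */
    static List<String> split(String s) {
        List<String> clusters = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int end = i + Character.charCount(cp);
            // An RI absorbs one directly following RI; a third RI starts a new cluster.
            if (isRegionalIndicator(cp) && end < s.length()) {
                int next = s.codePointAt(end);
                if (isRegionalIndicator(next)) {
                    end += Character.charCount(next);
                }
            }
            clusters.add(s.substring(i, end));
            i = end;
        }
        return clusters;
    }

    public static void main(String[] args) {
        String fourFlags = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6"
                + "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        System.out.println(split(fourFlags).size()); // prints 4
    }
}
```

      With an odd-length run such as three RI characters, this sketch yields one pair followed by a single unpaired RI.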

      However, the \b{g} boundary matcher in java.util.regex.Pattern does not break at all within flag sequences; any number of consecutive RI characters is treated as one grapheme cluster.

      Also, I had to spend quite some time fixing my answer because this form destroyed all the flag characters (and non-ASCII punctuation) when I failed the captcha.

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      $ javac BoundaryRegex.java
      $ java -ea BoundaryRegex

      The first assertion fails with an AssertionError, and the second would fail as well if it were reached; note that you may need to add an -encoding argument to the javac command for correct compilation.

      Note: Unicode code points have been escaped because this form does not appear to be Unicode-friendly, but the input string is the RI characters corresponding to the region codes AG, GA, US and FR.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Expected string "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" to split into 4 flag graphemes of 2 RI characters each, i.e. {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"}
      ACTUAL -
      Got one grapheme of all flags together, { "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" }, i.e. the input string unchanged.
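
      Because the escaped surrogate pairs above are hard to read, a small decoder can translate each RI character back to its ASCII letter. This is only a reading aid, under the assumption that non-RI code points should be passed through unchanged; RiDecoder is a hypothetical name.

```java
// Reading aid only: RiDecoder is a hypothetical helper, not part of the JDK.
class RiDecoder {
    /** Replaces each regional indicator symbol with the ASCII letter it stands for. */
    static String regionLetters(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x1F1E6 && cp <= 0x1F1FF) {
                sb.append((char) ('A' + (cp - 0x1F1E6)));
            } else {
                sb.appendCodePoint(cp); // pass other code points through unchanged
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6"
                + "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        System.out.println(regionLetters(input)); // prints AGGAUSFR
    }
}
```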

      ---------- BEGIN SOURCE ----------
      // BoundaryRegex.java
      import java.util.Arrays;
      import java.util.regex.Pattern;

      public class BoundaryRegex {
          public static void main(String[] args) {
              var graphemes = Pattern.compile("\\b{g}").split("\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7");
              assert graphemes.length == 4
                      : "Input has 4 flags but only " + graphemes.length + " was found";
              assert Arrays.equals(graphemes, new String[] {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"})
                      : "Flags split unexpectedly; " + Arrays.toString(graphemes);
          }
      }
      ---------- END SOURCE ----------

      FREQUENCY : always


            People

              sherman Xueming Shen
              webbuggrp Webbug Group