JDK-8216332

Grapheme regex does not work with emoji sequences

Details

    • b15
    • x86_64
    • linux_ubuntu

    Description

      ADDITIONAL SYSTEM INFORMATION :
      openjdk version "12-ea" 2019-03-19
      OpenJDK Runtime Environment (build 12-ea+26)
      OpenJDK 64-Bit Server VM (build 12-ea+26, mixed mode, sharing)

      A DESCRIPTION OF THE PROBLEM :
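      Emoji sequences such as 👨🏾 (a base emoji followed by a skin-tone modifier) or 👨‍👩‍👦 (emoji joined by ZERO WIDTH JOINER, U+200D) are not kept together as single clusters by the regular expression boundary matcher \b{g} (a Unicode extended grapheme cluster boundary). A boundary is reported inside each sequence.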

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
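      import java.util.function.Function;
      import java.util.regex.Pattern;
      import java.util.stream.Collectors;

      // U+1F468 U+1F3FE (modifier sequence) followed by U+1F468 U+200D U+1F469 U+200D U+1F466 (ZWJ sequence)
      String stringmoji = new StringBuilder().appendCodePoint(0x1f468).appendCodePoint(0x1f3fe).appendCodePoint(0x1f468).appendCodePoint(0x200d).appendCodePoint(0x1f469).appendCodePoint(0x200d).appendCodePoint(0x1f466).toString();
      Pattern pattern = Pattern.compile("\\b{g}");
      // Format each piece produced by the split as a comma-separated list of hex code points
      Function<String, String> toCodePointNumber = (cp) -> cp.codePoints().mapToObj(c -> String.format("%04x", c)).collect(Collectors.joining(","));
      System.out.println(pattern.splitAsStream(stringmoji).map(toCodePointNumber).collect(Collectors.joining("][", "[", "]")));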

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      [1f468,1f3fe][1f468,200d,1f469,200d,1f466]
      ACTUAL -
      [1f468][1f3fe][1f468,200d][1f469,200d][1f466]
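
      To make the expected grouping concrete, here is a minimal sketch of a hand-rolled segmenter that keeps the two sequences together. It is not the JDK implementation and not a full extended grapheme cluster algorithm; the class name EmojiClusterSketch and its methods are hypothetical. It assumes only three simplified rules: no break before a skin-tone modifier (U+1F3FB..U+1F3FF), no break before ZWJ (U+200D), and no break after ZWJ. The Unicode text segmentation rules are stricter, for example they require pictographic characters on both sides of the ZWJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only: a simplified segmenter covering just the
// two emoji sequence kinds from this report, not the full Unicode rules.
class EmojiClusterSketch {
    static final int ZWJ = 0x200D;

    // Emoji skin-tone modifiers occupy U+1F3FB..U+1F3FF.
    static boolean isModifier(int cp) {
        return cp >= 0x1F3FB && cp <= 0x1F3FF;
    }

    // Split s into clusters, joining a code point to the previous one when it
    // is a modifier, is a ZWJ, or directly follows a ZWJ (simplifying assumption).
    static List<String> clusters(String s) {
        int[] cps = s.codePoints().toArray();
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < cps.length; i++) {
            if (i > 0) {
                boolean join = isModifier(cps[i]) || cps[i] == ZWJ || cps[i - 1] == ZWJ;
                if (!join) {
                    out.add(cur.toString());
                    cur.setLength(0);
                }
            }
            cur.appendCodePoint(cps[i]);
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    // Render clusters in the same bracketed hex format used in this report.
    static String format(List<String> clusters) {
        return clusters.stream()
                .map(c -> c.codePoints()
                        .mapToObj(cp -> String.format("%04x", cp))
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("][", "[", "]"));
    }

    public static void main(String[] args) {
        String stringmoji = new StringBuilder()
                .appendCodePoint(0x1f468).appendCodePoint(0x1f3fe)
                .appendCodePoint(0x1f468).appendCodePoint(0x200d)
                .appendCodePoint(0x1f469).appendCodePoint(0x200d)
                .appendCodePoint(0x1f466).toString();
        System.out.println(format(clusters(stringmoji)));
    }
}
```

      Under these assumptions the sketch groups the reproducer string the way the EXPECTED line shows, which can serve as a cross-check when comparing against the output of \b{g} on a given build.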

      FREQUENCY : always


          People

            naoto Naoto Sato
            webbuggrp Webbug Group