JDK-8292387

Grapheme support in BreakIterator


    • CSR
    • Resolution: Approved
    • P4
    • 20
    • core-libs
    • None
    • behavioral
    • low
    • Character breaks now behave differently. However, the differences result from the evolution of the Unicode specification and should not be treated as bugs. See the `Solution` section for more detail.
    • Java API
    • SE

      Summary

      Enhance the existing java.text.BreakIterator#getCharacterInstance() to support Grapheme Clusters.

      Problem

      BreakIterator was designed before the Unicode Consortium introduced the concept of Grapheme Clusters. The class has provided the getCharacterInstance() method for breaking text into "characters" (from the user's perspective), but it cannot handle the breaks defined in the Grapheme Cluster specification.

      Solution

      Enhance getCharacterInstance() to support Grapheme Clusters. This introduces intentional behavioral changes, because the old implementation simply breaks at code point boundaries for the vast majority of characters. For example, the following String contains the US flag and a grapheme cluster for a four-member family:

      "🇺🇸👨‍👩‍👧‍👦"

      This String will be broken into two graphemes with the new implementation:

      "🇺🇸", "👨‍👩‍👧‍👦"

      whereas the old implementation simply breaks at the code point boundaries:

      "🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"

      where (zwj) denotes ZERO WIDTH JOINER (U+200D).
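
      The comparison above can be reproduced with a small program. The following is only a minimal sketch: the class name GraphemeDemo and the helper methods splitCharacters and splitCodePoints are hypothetical names chosen for illustration, and the text is written with Unicode escapes so that the source file's encoding does not matter. The number of segments returned by splitCharacters depends on the JDK in use; per the description above, an implementation with Grapheme Cluster support should yield two segments, while the code point split always yields nine.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class GraphemeDemo {
    // US flag (two regional indicator symbols) followed by the
    // four-member family (four emoji joined by ZERO WIDTH JOINER).
    static final String TEXT =
        "\uD83C\uDDFA\uD83C\uDDF8"
        + "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

    // Splits text at the boundaries reported by the character BreakIterator.
    static List<String> splitCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    // Splits text at every code point boundary, for comparison.
    static List<String> splitCodePoints(String text) {
        List<String> parts = new ArrayList<>();
        text.codePoints().forEach(cp -> parts.add(new String(Character.toChars(cp))));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println("BreakIterator segments: " + splitCharacters(TEXT).size());
        System.out.println("Code points:            " + splitCodePoints(TEXT).size());
    }
}
```

      Passing Locale.ROOT is a choice made here to keep the sketch independent of the default locale; the no-argument getCharacterInstance() uses the default locale instead.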

      Specification

      Insert the following @implSpec after the character boundary analysis paragraph in the class description of the BreakIterator class:

      + * @implSpec The default implementation of the character boundary analysis
      + * conforms to the Unicode Consortium's Extended Grapheme Cluster breaks.
      + * For more detail, refer to
      + * <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">
      + * Grapheme Cluster Boundaries</a> section in the Unicode Standard Annex #29.

            naoto Naoto Sato
            naoto Naoto Sato
            Stuart Marks