JDK-8292387

Grapheme support in BreakIterator

Details

    • Type: CSR
    • Status: Closed
    • Priority: P4
    • Resolution: Approved
    • Fix Version/s: 20
    • Component/s: core-libs
    • Labels: None
    • Compatibility Kind: behavioral
    • Compatibility Risk: low
    • Compatibility Risk Description: Character breaks now behave differently. However, the differences result from the evolution of the Unicode specification, so they should not be treated as bugs. See the `Solution` section for more detail.
    • Interface Kind: Java API
    • Scope: SE

    Description

      Summary

      Enhance the existing java.text.BreakIterator#getCharacterInstance() to support grapheme clusters.

      Problem

      BreakIterator was designed before the Unicode Consortium introduced the concept of Grapheme Clusters. The class has provided the getCharacterInstance() method for breaking text into "characters" (from the user's perspective), but it cannot handle the breaks defined in the Grapheme Cluster specification.

      Solution

      Enhance getCharacterInstance() to support Grapheme Clusters. This introduces intentional behavioral changes, because the old implementation simply breaks at code point boundaries for the vast majority of characters. For example, the following String contains the US flag and a grapheme for a four-member family:

      "🇺🇸👨‍👩‍👧‍👦"

      This String will be broken into two graphemes with the new implementation:

      "🇺🇸", "👨‍👩‍👧‍👦"

      whereas the old implementation simply breaks at the code point boundaries:

      "🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"

      where (zwj) denotes ZERO WIDTH JOINER (U+200D).
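
      The difference described above can be observed with a short program. The following is a minimal sketch and not part of the CSR: the class name GraphemeDemo and the helper split are hypothetical, and the sample String is written with escape sequences so that the result does not depend on the source file encoding. On JDK 20 or later it is expected to report two segments; per the description above, earlier releases report more.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names here are illustrative only.
class GraphemeDemo {
    // US flag (two regional indicator symbols) followed by a family of four
    // (man, woman, girl, boy joined by ZERO WIDTH JOINER).
    static final String SAMPLE =
            "\uD83C\uDDFA\uD83C\uDDF8"
            + "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

    // Splits the text at every boundary reported by the character BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = split(SAMPLE);
        System.out.println("JDK feature release: " + Runtime.version().feature());
        System.out.println("segments: " + parts.size());
        // Print each segment as a list of code points so the output is readable
        // even on terminals that cannot render emoji.
        for (String part : parts) {
            StringBuilder sb = new StringBuilder();
            part.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
            System.out.println(sb.toString().trim());
        }
    }
}
```

      Code that assumed one segment per code point (for example, to count or truncate "characters") may see different results after this change, which is the behavioral compatibility risk noted in this CSR.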

      Specification

      Insert the following @implSpec after the character boundary analysis paragraph in the class description of the BreakIterator class:

      + * @implSpec The default implementation of the character boundary analysis
      + * conforms to the Unicode Consortium's Extended Grapheme Cluster breaks.
      + * For more detail, refer to
      + * <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">
      + * Grapheme Cluster Boundaries</a> section in the Unicode Standard Annex #29.
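
      As an illustration of what conformance to the Extended Grapheme Cluster rules implies, the sketch below counts the segments for a few simple inputs. It is not part of the CSR: the class name Uax29Samples and the helper countGraphemes are hypothetical, and the stated expectations (one cluster for a base letter plus combining mark, one cluster for CR LF) are taken from the rules in Unicode Standard Annex #29 and assume JDK 20 or later.

```java
import java.text.BreakIterator;

// Hypothetical demo class; the names here are illustrative only.
class Uax29Samples {
    // Counts the segments reported by the character BreakIterator.
    static int countGraphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Latin small letter e followed by COMBINING ACUTE ACCENT:
        // UAX #29 does not break before an Extend character.
        System.out.println("e + combining acute: " + countGraphemes("e\u0301"));
        // Carriage return followed by line feed: UAX #29 keeps CR LF together.
        System.out.println("CR LF: " + countGraphemes("\r\n"));
        // Plain ASCII letters: one cluster per letter.
        System.out.println("abc: " + countGraphemes("abc"));
    }
}
```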

            People

              naoto Naoto Sato
              Stuart Marks