Summary
Enhance the existing java.text.BreakIterator#getCharacterInstance() to support Grapheme Clusters.
Problem
BreakIterator was designed before the Unicode Consortium introduced the concept of Grapheme Clusters. The class provides the getCharacterInstance() method for breaking text into "characters" as a user perceives them, but it cannot handle the breaks defined in the Grapheme Cluster specification.
Solution
Enhance getCharacterInstance() to support Grapheme Clusters. This introduces an intentional behavioral change, because the old implementation simply breaks at code point boundaries for the vast majority of characters. For example, the following String contains the US flag and a grapheme for a four-member family:
"🇺🇸👨‍👩‍👧‍👦"
This String will be broken into two graphemes with the new implementation:
"🇺🇸", "👨‍👩‍👧‍👦"
whereas the old implementation simply breaks at the code point boundaries:
"🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"
where (zwj) denotes ZERO WIDTH JOINER (U+200D).
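The difference can be observed with a small program. The following is a minimal sketch, not part of the proposed change: the class name GraphemeBreakDemo and the helper split() are hypothetical, and the example string is written with Unicode escapes so that it survives any source encoding. It iterates over the boundaries reported by getCharacterInstance() and prints each segment together with its code point count. On a JDK that includes this change, two segments would be expected; on an earlier JDK, the string is split into more pieces.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the JDK or of this CSR.
class GraphemeBreakDemo {

    // Splits the text at every boundary reported by the
    // character-instance BreakIterator of the running JDK.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // US flag: two regional indicator symbols (U+1F1FA, U+1F1F8).
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        // Family: man, woman, girl, boy joined by ZERO WIDTH JOINER (U+200D).
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
        String text = flag + family;

        List<String> parts = split(text);
        System.out.println(parts.size() + " segment(s)");
        for (String part : parts) {
            int codePoints = part.codePointCount(0, part.length());
            System.out.println(part + " (" + codePoints + " code point(s))");
        }
    }
}
```

Because the result depends on the runtime, code that relied on the old code-point-level breaks should be re-tested after upgrading.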
Specification
Insert the following @implSpec after the character boundary analysis paragraph in the class description of the BreakIterator class:
+ * @implSpec The default implementation of the character boundary analysis
+ * conforms to the Unicode Consortium's Extended Grapheme Cluster breaks.
+ * For more detail, refer to
+ * <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">
+ * Grapheme Cluster Boundaries</a> section in the Unicode Standard Annex #29.
- csr of: JDK-8291660 Grapheme support in BreakIterator (Resolved)
- relates to: JDK-8294008 Grapheme implementation of setText() throws IndexOutOfBoundsException (Closed)