Enhance the existing java.text.BreakIterator#getCharacterInstance() to support Graphemes.
BreakIterator was designed before the Unicode Consortium introduced the concept of Grapheme Clusters. The class provides the getCharacterInstance() method for breaking text into "characters" (from the user's perspective), but it cannot handle the breaks defined in the Grapheme specification.
Enhance the existing getCharacterInstance() to support Grapheme Clusters. This introduces intentional behavioral changes, because the old implementation simply breaks at the code point boundaries for the vast majority of characters. For example, consider a String that contains the US flag and a grapheme for a four-member family:
"🇺🇸👨(zwj)👩(zwj)👧(zwj)👦"
The new implementation breaks this String into two graphemes:
"🇺🇸", "👨(zwj)👩(zwj)👧(zwj)👦"
whereas the old implementation simply breaks at the code point boundaries:
"🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"
(zwj) denotes ZERO WIDTH JOINER (U+200D).
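The difference can be observed with a small program. The following is a minimal sketch, not part of the proposed change: the class name GraphemeSplitSketch and the helpers split and codePoints are hypothetical names introduced here for illustration. Only BreakIterator.getCharacterInstance(), setText(), first(), next(), and BreakIterator.DONE are existing API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class GraphemeSplitSketch {

    // US flag (two regional indicators) followed by a ZWJ-joined family of four,
    // written with escapes so the ZERO WIDTH JOINERs (U+200D) are visible.
    static final String SAMPLE =
            "\uD83C\uDDFA\uD83C\uDDF8"
          + "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

    // Splits text at the boundaries reported by the character-instance BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    // Splits text into one String per code point, for comparison.
    static List<String> codePoints(String text) {
        List<String> parts = new ArrayList<>();
        text.codePoints().forEach(cp -> parts.add(new String(Character.toChars(cp))));
        return parts;
    }

    public static void main(String[] args) {
        // The segment count depends on which implementation the running JDK ships.
        System.out.println("BreakIterator segments: " + split(SAMPLE).size());
        System.out.println("Code points:            " + codePoints(SAMPLE).size());
    }
}
```

Going by the example above, the first count should be 2 on a JDK that includes this change, while an older JDK should report a larger number, since it breaks at the code point boundaries. The second line always reports the nine code points.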
Insert the following @implSpec after the character boundary analysis paragraph in the class description of java.text.BreakIterator:
+ * @implSpec The default implementation of the character boundary analysis
+ * conforms to the Unicode Consortium's Extended Grapheme Cluster breaks.
+ * For more detail, refer to
+ * <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">
+ * Grapheme Cluster Boundaries</a> section in the Unicode Standard Annex #29.