Summary
Add a no-argument codePointCount() method to CharSequence and String that returns the number of Unicode code points in the entire sequence.
Problem
Currently, String.codePointCount and the CharSequence overload of Character.codePointCount both require start and end indices. Developers often expect a no-argument overload that returns the code point count of the entire string or sequence. Without it, they resort to workarounds: codePoints().count(), which streams every code point and adds unnecessary overhead, or codePointCount(0, str.length()), which is more verbose, requires a temporary variable when the string is the result of an expression, and performs an extra bounds check.
A common use case is enforcing maximum character limits on user input, particularly for fields stored in databases such as MySQL or PostgreSQL. Both database systems can treat the declared length of a VARCHAR(n) column as a number of Unicode code points, not a number of char units or bytes, for character sets such as UTF-8 (utf8mb4 in MySQL). Counting code points correctly is essential for supporting internationalized input, emoji, and other non-BMP characters. For example, the NIST SP 800-63B guideline specifies that password length should be checked in terms of the number of Unicode code points.
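The workarounds described above can be sketched with APIs that already exist today. The class and method names below (CodePointWorkarounds, viaRange, viaStream, viaCharacter) are hypothetical and serve only as illustration:

```java
// Hypothetical helper class illustrating today's workarounds.
final class CodePointWorkarounds {

    // Range overload spanning the whole string.
    static int viaRange(String s) {
        return s.codePointCount(0, s.length());
    }

    // Streams every code point and counts them; note the long result.
    static long viaStream(CharSequence cs) {
        return cs.codePoints().count();
    }

    // Static helper in Character that accepts any CharSequence.
    static int viaCharacter(CharSequence cs) {
        return Character.codePointCount(cs, 0, cs.length());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one code point stored as two char units), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                            // 4 char units
        System.out.println(viaRange(s));                           // 3 code points
        System.out.println(viaStream(s));                          // 3
        System.out.println(viaCharacter(new StringBuilder(s)));    // 3
    }
}
```

All three approaches agree on the result; they differ only in verbosity and in how much work they do to get there.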
Solution
Introduce a no-argument codePointCount() method in both the CharSequence interface and the String class. The new method returns the number of Unicode code points in the entire character sequence. It is equivalent to invoking codePointCount(0, length()) on a String, or Character.codePointCount(seq, 0, seq.length()) on an arbitrary CharSequence, but it is more readable and avoids unnecessary overhead. The CharSequence method is a default method, while String provides an explicit override for potential performance optimization.
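As a rough illustration of how such a default method could behave, the sketch below declares it on a hypothetical interface, CountableSequence, rather than on CharSequence itself. The interface, the Text class, and the delegation to Character.codePointCount are assumptions made for this example and are not taken from the actual JDK change:

```java
// Hypothetical stand-in for CharSequence; NOT the real JDK declaration.
interface CountableSequence extends CharSequence {

    // Assumed body: delegate to the existing static helper over the full range.
    default int codePointCount() {
        return Character.codePointCount(this, 0, length());
    }
}

// Minimal hypothetical implementation backed by a String.
final class Text implements CountableSequence {
    private final String value;

    Text(String value) {
        this.value = value;
    }

    @Override
    public int length() {
        return value.length();
    }

    @Override
    public char charAt(int index) {
        return value.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return value.subSequence(start, end);
    }

    @Override
    public String toString() {
        return value;
    }
}
```

With this sketch, new Text("a\uD83D\uDE00").codePointCount() yields 2 even though length() reports 3. An implementing class remains free to override the default with something faster, which is the role the String override plays in this proposal.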
Example use cases:
// For user names stored in MySQL (or PostgreSQL) VARCHAR(20), which counts code points:
if (userName.codePointCount() > 20) {
    IO.println("The user name is too long to store in VARCHAR(20) in utf8mb4 MySQL/PostgreSQL!");
}
// Password policy: require at least 8 Unicode characters (code points) as per NIST SP 800-63B:
if (password.codePointCount() < 8) {
    IO.println("Password is too short!");
}
Alternatives
Adding index-free overloads of codePointCount to java.lang.Character for CharSequence and char[] was considered and rejected. Users can call CharSequence.codePointCount() directly, or CharBuffer.wrap(array).codePointCount() for a char[], instead.
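For comparison, here is a minimal sketch of counting code points in a char[] using only existing APIs. The class and method names are hypothetical. CharBuffer.wrap creates a view over the array without copying it, and because CharBuffer implements CharSequence it would pick up the proposed method automatically:

```java
import java.nio.CharBuffer;

// Hypothetical helper class; both methods use only existing JDK APIs.
final class CharArrayCounts {

    // Existing static overload: (array, offset, count).
    static int viaCharacter(char[] chars) {
        return Character.codePointCount(chars, 0, chars.length);
    }

    // Wraps the array (no copy) and counts through the CharSequence overload.
    static int viaCharBuffer(char[] chars) {
        CharBuffer view = CharBuffer.wrap(chars);
        return Character.codePointCount(view, 0, view.length());
    }

    public static void main(String[] args) {
        char[] chars = {'a', '\uD83D', '\uDE00', 'b'}; // 'a', U+1F600, 'b'
        System.out.println(viaCharacter(chars));  // 3
        System.out.println(viaCharBuffer(chars)); // 3
    }
}
```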
Specification
Add to java.lang.CharSequence interface:
(This is inherited by StringBuilder and StringBuffer with @since 27)
/**
* {@return the number of Unicode code points in this character sequence}
* Isolated surrogate code units count as one code point each.
*
* @since 27
*/
default int codePointCount() {
}
Add to java.lang.String class:
/**
* {@return the number of Unicode code points in this String}
* Isolated surrogate code units count as one code point each.
*
* @since 27
*/
@Override
public int codePointCount() {
}
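The treatment of isolated surrogates stated in the Javadoc above matches what the existing range-based Character.codePointCount already does, so it can be observed today. The following sketch uses a hypothetical helper, count, built on that existing method:

```java
// Hypothetical demo class; count() relies on the existing static helper.
final class IsolatedSurrogates {

    static int count(CharSequence cs) {
        return Character.codePointCount(cs, 0, cs.length());
    }

    public static void main(String[] args) {
        System.out.println(count("\uD83D\uDE00")); // 1: a valid surrogate pair
        System.out.println(count("\uD83D"));       // 1: a lone high surrogate
        System.out.println(count("\uDE00\uD83D")); // 2: low then high, so both are isolated
        System.out.println(count("a\uD83Db"));     // 3: 'a', lone high surrogate, 'b'
    }
}
```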
- csr of JDK-8364007 Add no-argument codePointCount method to CharSequence and String
- Open