Add no-argument codePointCount method to CharSequence and String


    • Type: CSR
    • Resolution: Unresolved
    • Priority: P4
    • 27
    • Component/s: core-libs
    • None
    • source, behavioral
    • medium
    • A search on grep.app for declarations of int codePointCount() found no CharSequence implementation that declares such a method. However, given how widely CharSequence is implemented, directly or indirectly, there is still a chance of a clash, for example if an existing implementation declares a method with the same name and parameter types but a different return type, such as long.
    • Java API
    • SE

      Summary

      Add a no-argument codePointCount() method to CharSequence to count the number of Unicode code points in the entire sequence.

      Problem

      Currently, String.codePointCount and the Character.codePointCount overload that takes a CharSequence both require start and end indices. Developers often expect a no-argument overload that returns the code point count of the entire string or sequence. Without it, they resort to verbose or less efficient workarounds, such as codePoints().count() (which yields every code point, adding unnecessary overhead) or codePointCount(0, str.length()) (which is more verbose, requires a temporary variable, and performs an extra bounds check).
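For reference, the two workarounds described above can be written with existing APIs roughly as follows. This is only an illustrative sketch; the class and method names (CodePointWorkarounds, countViaStream, countViaRange) are made up for this example and are not part of the proposal.

```java
final class CodePointWorkarounds {
    // Workaround 1: stream every code point and count them (returns long).
    static long countViaStream(CharSequence seq) {
        return seq.codePoints().count();
    }

    // Workaround 2: call the ranged overload over the whole string.
    static int countViaRange(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as one surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());        // 4 char units
        System.out.println(countViaStream(s)); // 3 code points
        System.out.println(countViaRange(s));  // 3 code points
    }
}
```

Both workarounds agree on the result; they differ in return type (long versus int) and in how much work is done to obtain it.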

      A common use case involves enforcing maximum character limits on user input, particularly for fields stored in databases such as MySQL or PostgreSQL. Both database systems can interpret the declared length of a VARCHAR(n) column as a number of Unicode code points, rather than a number of char units or bytes, for character sets like UTF-8 (utf8mb4 in MySQL). Correctly counting code points is essential for supporting internationalized input, emoji, and non-BMP characters. For example, the NIST SP 800-63B guideline specifies that password length should be checked in terms of the number of Unicode code points.

      References:

      Solution

      Introduce a no-argument codePointCount() method in both the CharSequence interface and the String class. The new method returns the number of Unicode code points in the entire character sequence, equivalent to invoking codePointCount(0, length()), but provides better readability and avoids unnecessary overhead. The implementation in CharSequence is a default method, while String provides an explicit override for potential performance optimization.
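Until such a method is available, the stated equivalence can be approximated on current JDKs with a small static helper. The sketch below is an assumption-laden illustration, not the proposed implementation: CodePointLimits, codePointCount and fitsVarchar are hypothetical names, and the helper simply delegates to the existing Character.codePointCount(CharSequence, int, int).

```java
final class CodePointLimits {
    // Hypothetical stand-in for the proposed no-argument method:
    // counts code points over the whole sequence using the existing ranged API.
    static int codePointCount(CharSequence seq) {
        return Character.codePointCount(seq, 0, seq.length());
    }

    // Hypothetical helper: true if the value fits a column whose declared
    // length is measured in code points.
    static boolean fitsVarchar(CharSequence value, int maxCodePoints) {
        return codePointCount(value) <= maxCodePoints;
    }

    public static void main(String[] args) {
        // Twenty emoji: 40 char units, but only 20 code points.
        String name = "\uD83D\uDE00".repeat(20);
        System.out.println(name.length());                       // 40
        System.out.println(codePointCount(name));                // 20
        System.out.println(fitsVarchar(name, 20));               // true
        System.out.println(codePointCount(new StringBuilder("abc"))); // 3
    }
}
```

A length check based on length() would wrongly reject the 20-emoji name here, which is the situation the no-argument method is meant to make easy to avoid.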

      Example use cases:

      // For user names stored in MySQL (or PostgreSQL) VARCHAR(20), which counts code points:
      if (userName.codePointCount() > 20) {
          IO.println("The user name is too long to store in VARCHAR(20) in utf8mb4 MySQL/PostgreSQL!");
      }
      // Password policy: require at least 8 Unicode characters (code points) as per NIST SP 800-63B:
      if (password.codePointCount() < 8) {
          IO.println("Password is too short!");
      }
      

      Alternatives

      Adding no-index overloads of codePointCount to java.lang.Character for CharSequence and char[] was considered and rejected. Users can instead call CharSequence.codePointCount() or CharBuffer.wrap(array).codePointCount().
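As a point of comparison, a char[] can already be counted today either with the existing ranged Character.codePointCount(char[], int, int) overload or by viewing the array as a CharSequence through CharBuffer.wrap. The following sketch uses only existing APIs; the class and method names are illustrative assumptions.

```java
import java.nio.CharBuffer;

final class CharArrayCodePoints {
    // Existing ranged overload for char[]: (array, offset, count).
    static int countDirect(char[] chars) {
        return Character.codePointCount(chars, 0, chars.length);
    }

    // Wrap the array (no copy) and count through the CharSequence overload.
    static int countViaBuffer(char[] chars) {
        CharBuffer view = CharBuffer.wrap(chars);
        return Character.codePointCount(view, 0, view.length());
    }

    public static void main(String[] args) {
        // 'x' followed by U+1F600 as a surrogate pair
        char[] chars = {'x', '\uD83D', '\uDE00'};
        System.out.println(countDirect(chars));    // 2
        System.out.println(countViaBuffer(chars)); // 2
    }
}
```

With the proposed method, the second helper would reduce to the CharBuffer.wrap(array).codePointCount() call mentioned above.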

      Specification

      Add to the java.lang.CharSequence interface (this is inherited by StringBuilder and StringBuffer with @since 27):

      /**
       * {@return the number of Unicode code points in this character sequence}
       * Isolated surrogate code units count as one code point each.
       *
       * @since 27
       */
      default int codePointCount() {
      }
      

      Add to java.lang.String class:

      /**
       * {@return the number of Unicode code points in this String}
       * Isolated surrogate code units count as one code point each.
       *
       * @since 27
       */
      @Override
      public int codePointCount() {
      }
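
The rule that isolated surrogate code units count as one code point each matches the behavior of the existing ranged overload, which can be observed on current JDKs. A minimal sketch, assuming a hypothetical helper named count that stands in for the proposed method:

```java
final class IsolatedSurrogates {
    // Stand-in for the proposed method, built on the existing ranged overload.
    static int count(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String pair = "\uD83D\uDE00";     // well-formed surrogate pair
        String loneHigh = "\uD83D";       // high surrogate with no partner
        String reversed = "\uDE00\uD83D"; // low then high: two isolated units
        String highThenX = "\uD83Dx";     // isolated high surrogate, then 'x'

        System.out.println(count(pair));      // 1
        System.out.println(count(loneHigh));  // 1
        System.out.println(count(reversed));  // 2
        System.out.println(count(highThenX)); // 2
    }
}
```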
      

            Assignee:
            Chen Liang
            Reporter:
            Webbug Group
            Votes:
            1
            Watchers:
            3

              Created:
              Updated: