JDK-8338709

[JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths

    • CSR
    • Resolution: Approved
    • P4
    • 24
    • hotspot
    • None
    • source, binary, behavioral
    • minimal
    • The deprecated function is simply updated to reflect what the HotSpot JNI implementation already does, so there is no change with regard to compatibility. Adding a new function has no compatibility concerns.
    • Java API
    • SE

      Summary

      Deprecate the existing JNI function GetStringUTFLength, noting that it may return a truncated length, and add a new JNI function, GetStringUTFLengthAsLong, that returns the string length as a jlong value.

      Problem

      The GetStringUTFLength function returns the length as a jint (jsize) value and so is limited to returning at most Integer.MAX_VALUE. But a Java string can itself consist of Integer.MAX_VALUE characters, each of which may require more than one byte in modified UTF-8 format.** It follows that this function cannot return the correct answer for all String values, yet the specification makes no mention of this, nor of any error to report if this situation is encountered.

      **The modified UTF-8 format used by the VM can require up to six bytes to represent one Unicode character, but six-byte characters are stored as UTF-16 surrogate pairs, with each surrogate encoded in three bytes. Hence the most bytes per Java char is 3, and so the maximum length is 3*Integer.MAX_VALUE. With compact strings this reduces to 2*Integer.MAX_VALUE.
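The arithmetic in this footnote can be illustrated with a minimal C sketch. It is not the HotSpot implementation; the function names `mutf8_unit_bytes`, `mutf8_length`, and `mutf8_worst_case_length` are hypothetical, and the string is modelled as a plain array of UTF-16 code units rather than a `jstring`. The per-unit byte counts follow the modified UTF-8 rules (U+0000 takes two bytes, each surrogate takes three).

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one UTF-16 code unit in modified UTF-8.
 * U+0000 takes two bytes (0xC0 0x80); surrogates are encoded
 * individually, three bytes each. */
static int64_t mutf8_unit_bytes(uint16_t unit) {
    if (unit >= 0x0001 && unit <= 0x007F) return 1;
    if (unit <= 0x07FF) return 2;   /* also covers U+0000 */
    return 3;
}

/* Hypothetical helper: length in bytes of the modified UTF-8 form of a
 * UTF-16 sequence, accumulated in 64 bits so the sum cannot wrap. */
int64_t mutf8_length(const uint16_t *units, size_t count) {
    int64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += mutf8_unit_bytes(units[i]);
    }
    return total;
}

/* Worst case from the footnote: Integer.MAX_VALUE chars, three bytes each. */
int64_t mutf8_worst_case_length(void) {
    return 3 * (int64_t)INT32_MAX;
}
```

The worst-case value does not fit in a 32-bit `jsize`, which is why a 64-bit return type is needed to report it.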

      Solution

      Deprecate the existing JNI function GetStringUTFLength, noting that it may return a truncated length, and add a new JNI function, GetStringUTFLengthAsLong, that returns the string length as a jlong value.

      Note that "Deprecation" is not something that has previously been applied to the JNI specification, and we do not have "deprecation warnings" or the like, nor are we ever likely to remove the existing method. If programmers are sure they are dealing with suitably length-limited strings then they can continue to use the old method. Otherwise, they should switch to the new method.

      This does not solve all problems related to extremely large strings within a program. For example, if the application logic wanted to create a Java byte array for the UTF-8 version of the string, it would still be unable to do so. But at least with the new API the programmer will know that they have hit a limitation.

      Note that the JNI_VERSION number will have to be increased due to the new method.
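Native code that must also run on older runtimes could gate its use of the new function on the reported JNI version. The sketch below is illustrative only: `has_utf_length_as_long` is a hypothetical helper, and the numeric value used for `JNI_VERSION_24` (major version in the upper 16 bits, as with earlier `JNI_VERSION_*` constants) is an assumption, since this text does not state it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of JNI_VERSION_24; the text above names the constant
 * but does not give its numeric value. */
#define ASSUMED_JNI_VERSION_24 0x00180000

/* Hypothetical helper: true if a version number, as returned by
 * GetVersion, is new enough for GetStringUTFLengthAsLong to exist. */
bool has_utf_length_as_long(int32_t jni_version) {
    return jni_version >= ASSUMED_JNI_VERSION_24;
}
```

A caller would pass the result of `GetVersion` and fall back to `GetStringUTFLength` when the check fails, accepting that the older function may return a truncated length.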

      Note that GetStringUTFRegion is still using an int length so can't be used to obtain a giant region, but we don't expect this to be a practical concern.

      Specification

      We deprecate GetStringUTFLength and describe what happens if the integer length limit is reached:

      +### GetStringUTFLength (Deprecated)
      
       `jsize GetStringUTFLength(JNIEnv *env, jstring string);`
      
       Returns the length in bytes of the modified UTF-8 representation of a string.
      
      +As the capacity of a `jsize` variable is not sufficient to hold the length of
      +all possible modified UTF-8 string representations (due to multi-byte encodings),
      +this function is deprecated in favor of [`GetStringUTFLengthAsLong()`](#getstringutflengthaslong).
      +If the modified UTF-8 representation of `string` has a length that exceeds the capacity
      +of a `jsize` variable, then this function returns the number of bytes up to and including the
      +last character that could be fully encoded without exceeding that capacity.
      ...
      <h4>RETURNS:</h4>
      
      -Returns the UTF-8 length of the string.
      +Returns the UTF-8 length of the string, as restricted by the capacity of a `jsize` variable.
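The truncation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the HotSpot code: `mutf8_length_capped` and `mutf8_unit_bytes` are hypothetical names, the string is modelled as an array of UTF-16 code units, and the `capacity` parameter stands in for the capacity of a `jsize` so the behavior can be shown with small values.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one UTF-16 code unit in modified UTF-8. */
static int64_t mutf8_unit_bytes(uint16_t unit) {
    if (unit >= 0x0001 && unit <= 0x007F) return 1;
    if (unit <= 0x07FF) return 2;   /* also covers U+0000 */
    return 3;
}

/* Hypothetical helper: sums the encoded bytes but stops at the first
 * character whose full encoding would push the total past `capacity`
 * (assumed non-negative). The result therefore counts only characters
 * that could be fully encoded within the limit. */
int64_t mutf8_length_capped(const uint16_t *units, size_t count,
                            int64_t capacity) {
    int64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        int64_t n = mutf8_unit_bytes(units[i]);
        if (total > capacity - n) {
            break;  /* this character would not fit completely */
        }
        total += n;
    }
    return total;
}
```

With three three-byte characters and a capacity of 7, the sketch reports 6 rather than 7 or 9, because the third character cannot be encoded completely within the limit.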
      

      We add a new function GetStringUTFLengthAsLong

      +### GetStringUTFLengthAsLong
      +
      +`jlong GetStringUTFLengthAsLong(JNIEnv *env, jstring string);`
      +
      +Returns the length in bytes of the modified UTF-8 representation of a string.
      +
      +#### LINKAGE:
      +
      +Index 235 in the JNIEnv interface function table.
      +
      +#### PARAMETERS:
      +
      +`env`: the JNI interface pointer, must not be `NULL`.
      +
      +`string`: a Java string object, must not be `NULL`.
      +
      +#### RETURNS:
      +
      +Returns the UTF-8 length of the string.
      +
      +#### SINCE
      +
      +JDK 24

      which of course is also added to the interface table:

      -    IsVirtualThread
      +    IsVirtualThread,
      +
      +    GetStringUTFLengthAsLong
         };

      In addition, we tweak the wording of GetStringUTFChars so that it refers to a byte sequence instead of a byte array (to avoid suggesting the returned sequence is limited by the capacity of a Java array):

      -Returns a pointer to an array of bytes representing the string in modified
      -UTF-8 encoding. This array is valid until it is released by
      +Returns a pointer to a sequence of bytes representing the string in modified
      +UTF-8 encoding. This sequence is valid until it is released by
       `ReleaseStringUTFChars()`.

      and we tweak the wording of GetStringUTFRegion so that it refers to the new GetStringUTFLengthAsLong function instead of the deprecated GetStringUTFLength:

      The `len` argument specifies the number of *unicode characters*. The resulting
      -number modified UTF-8 encoding characters may be greater than the given `len`
      -argument. `GetStringUTFLength()` may be used to determine the maximum size of
      +number of modified UTF-8 encoding characters may be greater than the given `len`
      +argument. `GetStringUTFLengthAsLong()` may be used to determine the maximum size of
       the required character buffer.

      The JNI version will also be bumped for this API addition.

       `jint GetVersion(JNIEnv *env);`
      
      -Returns the version of the native method interface. For Java SE Platform 21 and
      -later, it returns `JNI_VERSION_21`. The following table gives the version of JNI
      +Returns the version of the native method interface. For Java SE Platform 24 and
      +later, it returns `JNI_VERSION_24`. The following table gives the version of JNI
       included in each release of the Java SE Platform (for older versions of JNI, the
       JDK release is used instead of the Java SE Platform):

            dholmes David Holmes
            Roger Riggs, Thomas Stuefe