Summary
Deprecate the existing JNI `GetStringUTFLength` method, noting that it may return a truncated length, and add a new method, JNI `GetStringUTFLengthAsLong`, that returns the string length as a `jlong` value.
Problem
The `GetStringUTFLength` function returns the length as a `jint` (`jsize`) value and so is limited to returning at most `Integer.MAX_VALUE`. But a Java string can itself consist of `Integer.MAX_VALUE` characters, each of which may require more than one byte to represent in modified UTF-8 format.** It follows that this function cannot return the correct answer for all `String` values, and yet the specification makes no mention of this, nor of any error to report if this situation is encountered.
**The modified UTF-8 format used by the VM can require up to six bytes to represent one Unicode character, but six-byte characters are stored as UTF-16 surrogate pairs, each half of which is encoded in three bytes. Hence the most bytes per `char` is 3, and so the maximum length is `3*Integer.MAX_VALUE`. With compact strings this reduces to `2*Integer.MAX_VALUE`.
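The arithmetic above can be sketched in plain C. This is only an illustration of the modified UTF-8 size rules, not VM code: the names `mutf8_unit_bytes`, `mutf8_length`, and `mutf8_worst_case` are hypothetical, and the input is assumed to be an array of UTF-16 code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one UTF-16 code unit in modified UTF-8. */
static int mutf8_unit_bytes(uint16_t c) {
    if (c >= 0x0001 && c <= 0x007F) return 1;
    if (c <= 0x07FF) return 2;  /* includes U+0000, encoded as two bytes */
    return 3;                   /* includes each half of a surrogate pair */
}

/* Hypothetical helper: full modified UTF-8 length, accumulated in 64 bits. */
static int64_t mutf8_length(const uint16_t *units, size_t n) {
    assert(units != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += mutf8_unit_bytes(units[i]);
    }
    return total;
}

/* Worst case for a string of `chars` code units: three bytes each. */
static int64_t mutf8_worst_case(int32_t chars) {
    return 3 * (int64_t)chars;
}
```

For a string of `INT32_MAX` code units the worst case is 6442450941 bytes, which does not fit in a 32-bit `jsize`.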
Solution
Deprecate the existing JNI `GetStringUTFLength` method, noting that it may return a truncated length, and add a new method, JNI `GetStringUTFLengthAsLong`, that returns the string length as a `jlong` value.
Note that "Deprecation" is not something that has previously been applied to the JNI specification, and we do not have "deprecation warnings" or the like, nor are we ever likely to remove the existing method. If programmers are sure they are dealing with suitably length-limited strings then they can continue to use the old method. Otherwise, they should switch to the new method.
This does not solve all problems related to extremely large strings within a program. For example, if the application logic wanted to create a Java byte array for the UTF-8 version of the string, it would be unable to do so. But at least with the new API the programmer will know that they have hit a limitation.
Note that the JNI_VERSION number will have to be increased due to the new method.
Note that `GetStringUTFRegion` still uses an `int` length and so can't be used to obtain a giant region, but we don't expect this to be a practical concern.
Specification
We deprecate `GetStringUTFLength` and describe what happens if the integer length limit is reached:
+### GetStringUTFLength (Deprecated)
`jsize GetStringUTFLength(JNIEnv *env, jstring string);`
Returns the length in bytes of the modified UTF-8 representation of a string.
+As the capacity of a `jsize` variable is not sufficient to hold the length of
+all possible modified UTF-8 string representations (due to multi-byte encodings),
+this function is deprecated in favor of [`GetStringUTFLengthAsLong()`](#getstringutflengthaslong).
+If the modified UTF-8 representation of `string` has a length that exceeds the capacity
+of a `jsize` variable, then this function returns the number of bytes up to and including the
+last character that could be fully encoded without exceeding that capacity.
...
<h4>RETURNS:</h4>
-Returns the UTF-8 length of the string.
+Returns the UTF-8 length of the string, as restricted by the capacity of a `jsize` variable.
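The truncation rule in the revised wording can be read as follows. This is a minimal sketch, not the VM's implementation: `unit_bytes` and `mutf8_length_capped` are hypothetical names, each UTF-16 code unit is assumed to count as one "character", and the cap is a parameter (it would be `INT32_MAX` in the real case) so that the behavior can be seen on short inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one UTF-16 code unit in modified UTF-8. */
static int unit_bytes(uint16_t c) {
    if (c >= 0x0001 && c <= 0x007F) return 1;
    return (c <= 0x07FF) ? 2 : 3;
}

/* Hypothetical helper: length in bytes, stopping before the first
   character whose encoding would push the total past `cap`. */
static int32_t mutf8_length_capped(const uint16_t *units, size_t n, int32_t cap) {
    assert(cap >= 0 && (units != NULL || n == 0));
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        int b = unit_bytes(units[i]);
        if (total + b > cap) {
            break;  /* this character cannot be fully encoded within cap */
        }
        total += b;
    }
    return (int32_t)total;  /* safe: total never exceeds cap */
}
```

With a cap of 3 and the input `A`, `€`, `B` (1, 3, and 1 bytes), the result is 1 rather than 3, because the three-byte `€` is never counted partially.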
We add a new function, `GetStringUTFLengthAsLong`:
+### GetStringUTFLengthAsLong
+
+`jlong GetStringUTFLengthAsLong(JNIEnv *env, jstring string);`
+
+Returns the length in bytes of the modified UTF-8 representation of a string.
+
+#### LINKAGE:
+
+Index 235 in the JNIEnv interface function table.
+
+#### PARAMETERS:
+
+`env`: the JNI interface pointer, must not be `NULL`.
+
+`string`: a Java string object, must not be `NULL`.
+
+#### RETURNS:
+
+Returns the UTF-8 length of the string.
+
+#### SINCE
+
+JDK 24
which of course is also added to the interface table:
- IsVirtualThread
+ IsVirtualThread,
+
+ GetStringUTFLengthAsLong
};
In addition, we tweak the wording of `GetStringUTFChars` so that it refers to a byte sequence instead of a byte array (to avoid suggesting the returned sequence is limited by the capacity of a Java array):
-Returns a pointer to an array of bytes representing the string in modified
-UTF-8 encoding. This array is valid until it is released by
+Returns a pointer to a sequence of bytes representing the string in modified
+UTF-8 encoding. This sequence is valid until it is released by
`ReleaseStringUTFChars()`.
and we tweak the wording of `GetStringUTFRegion` so that it refers to the new `GetStringUTFLengthAsLong` function instead of the deprecated `GetStringUTFLength`:
The `len` argument specifies the number of *unicode characters*. The resulting
-number modified UTF-8 encoding characters may be greater than the given `len`
-argument. `GetStringUTFLength()` may be used to determine the maximum size of
+number of modified UTF-8 encoding characters may be greater than the given `len`
+argument. `GetStringUTFLengthAsLong()` may be used to determine the maximum size of
the required character buffer.
The JNI version will also be bumped for this API addition.
`jint GetVersion(JNIEnv *env);`
-Returns the version of the native method interface. For Java SE Platform 21 and
-later, it returns `JNI_VERSION_21`. The following table gives the version of JNI
+Returns the version of the native method interface. For Java SE Platform 24 and
+later, it returns `JNI_VERSION_24`. The following table gives the version of JNI
included in each release of the Java SE Platform (for older versions of JNI, the
JDK release is used instead of the Java SE Platform):
CSR of JDK-8328877: [JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths (Resolved)