- Type: CSR
- Resolution: Unresolved
- Priority: P4
- Component/s: core-libs
-
None
-
behavioral
-
minimal
-
This is a new method, as such it doesn't affect existing clients.
-
Java API
-
SE
Summary
Introduce new methods to support more efficient interoperability between strings and memory segments.
Problem
The existing FFM methods that read and write strings to and from memory segments, and that allocate memory segments from existing Java strings, assume that strings are zero-terminated.
There are cases where clients would like to read strings without having to look for a terminator (as they already know the size), or where they would like to write a Java string (or a portion of it) onto some destination memory segment.
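To make the problem concrete, here is a minimal sketch against the current API, assuming JDK 22 or later (where the FFM API is final). The class `KnownLengthRead` and the helper `readKnownLength` are hypothetical names. The helper uses the manual byte-array copy that the existing `getString` javadoc suggests for data with a known byte length.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class KnownLengthRead {

    // Hypothetical helper: decodes exactly `length` bytes starting at `offset`,
    // without scanning for a terminator.
    static String readKnownLength(MemorySegment segment, long offset, int length, Charset charset) {
        byte[] bytes = new byte[length];
        MemorySegment.copy(segment, ValueLayout.JAVA_BYTE, offset, bytes, 0, length);
        return new String(bytes, charset);
    }

    // Returns { result of the terminator-based read, result of the sized read }.
    static String[] demo() {
        byte[] payload = "ab\0cd".getBytes(StandardCharsets.UTF_8); // 5 bytes, embedded NUL
        try (Arena arena = Arena.ofConfined()) {
            // Arena allocations are zero-initialized; the extra byte acts as a terminator.
            MemorySegment segment = arena.allocate(payload.length + 1);
            MemorySegment.copy(payload, 0, segment, ValueLayout.JAVA_BYTE, 0, payload.length);

            String terminated = segment.getString(0); // stops at the first NUL byte
            String sized = readKnownLength(segment, 0, payload.length, StandardCharsets.UTF_8);
            return new String[] { terminated, sized };
        }
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println(result[0].length()); // 2
        System.out.println(result[1].length()); // 5
    }
}
```

The sized read needs an intermediate `byte[]` and a second copy, which is the overhead a direct known-length read could avoid.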
For more background, see Maurizio's document Pulling the (foreign) string.
Solution
This change adds three new methods to support efficient handling of strings that are not null-terminated:
- MemorySegment#getString(long offset, Charset charset, long length)
- MemorySegment#copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, long dstOffset, int numChars)
- SegmentAllocator#allocateFrom(String str, Charset charset, int srcIndex, int numChars)
Several alternatives and variations on these APIs were considered (see document above).
For getString the length of the character data is specified in bytes. During the design phase, three options were identified for expressing the length of the underlying read operation:
1. length in bytes
2. number of code units
3. the number of characters in the resulting string
Option (3) was rejected because, for variable-length encodings, converting a character count into the byte count needed for a bulk copy requires a decoding step. This leaves (1) and (2) as candidates. Since the conversion between the two is a trivial scaling factor, either would have been a viable choice. Code units might be more natural for native strings encoded as an array of code units. A byte length was chosen because it supports arbitrary charsets, and not all charsets may have a concept of a code unit.
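The difference between the three length units can be sketched with plain `String` APIs. The class `LengthUnits` and its helpers are hypothetical and only illustrate the arithmetic: bytes and code units differ by a fixed factor, while the character count of the decoded string cannot be derived from the byte count without decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LengthUnits {

    // Size, in bytes, of the string encoded with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Bytes to code units is a fixed scaling factor (1 for UTF-8, 2 for UTF-16).
    static int codeUnits(int byteLength, int codeUnitSize) {
        return byteLength / codeUnitSize;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \uD83D\uDE00"; // "héllo" plus an emoji outside the BMP

        int utf8Bytes = byteLength(s, StandardCharsets.UTF_8);      // 11
        int utf16Bytes = byteLength(s, StandardCharsets.UTF_16LE);  // 16 (no byte-order mark)

        System.out.println("chars in the Java string: " + s.length());      // 8
        System.out.println("UTF-8 bytes:  " + utf8Bytes
                + ", code units: " + codeUnits(utf8Bytes, 1));
        System.out.println("UTF-16 bytes: " + utf16Bytes
                + ", code units: " + codeUnits(utf16Bytes, 2));
    }
}
```

Here 11 UTF-8 bytes decode to 8 Java chars, and that ratio depends on the content of the string, which is why a character count is a poor fit for a bulk read.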
For copy and allocateFrom, the srcIndex and numChars are expressed in terms of character offsets into the string. This is the only practical choice here, since the client already has a Java string, and computing an offset in bytes or code units would require additional computation.
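As a rough illustration of character-based ranges, the sketch below encodes a sub-range of a string selected by `srcIndex` and `numChars`, much like the fallback path in the proposed default implementation (`substring` followed by `getBytes`). The class `CharRange` and the method `encodeRange` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class CharRange {

    // Encodes numChars characters of str, starting at character index srcIndex.
    static byte[] encodeRange(String str, Charset charset, int srcIndex, int numChars) {
        Objects.requireNonNull(str);
        Objects.requireNonNull(charset);
        // Throws IndexOutOfBoundsException for negative values or ranges past the end.
        Objects.checkFromIndexSize(srcIndex, numChars, str.length());
        return str.substring(srcIndex, srcIndex + numChars).getBytes(charset);
    }

    public static void main(String[] args) {
        String s = "na\u00efve caf\u00e9"; // 10 chars: "naïve café"

        // The same character range yields different byte sizes per charset.
        System.out.println(encodeRange(s, StandardCharsets.UTF_8, 6, 4).length);      // 5
        System.out.println(encodeRange(s, StandardCharsets.ISO_8859_1, 6, 4).length); // 4
    }
}
```

The caller only needs indices it already has from the Java string, and the encoded size falls out of the encoding step.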
The new copy method is the dual of the new getString, and allows writing strings to a target memory segment without a terminator. There is a potential analogy to the existing MemorySegment#setString methods, but those write strings with null terminators. The operation has more in common with the other copy overloads, with a String as the source of data (as opposed to e.g. an array).
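The intended behavior of such a copy can be approximated on the current API, assuming JDK 22 or later. The helper `copyString` below is a hypothetical stand-in with the same parameter order as the proposed method. It goes through an intermediate `byte[]`, so it is a sketch of the semantics only, not of the proposed implementation.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class WriteWithoutTerminator {

    // Hypothetical stand-in: encodes a character range of src and writes the
    // bytes at dstOffset, with no terminator. Returns the number of bytes written.
    static long copyString(String src, Charset dstEncoding, int srcIndex,
                           MemorySegment dst, long dstOffset, int numChars) {
        Objects.checkFromIndexSize(srcIndex, numChars, src.length());
        byte[] bytes = src.substring(srcIndex, srcIndex + numChars).getBytes(dstEncoding);
        MemorySegment.copy(bytes, 0, dst, ValueLayout.JAVA_BYTE, dstOffset, bytes.length);
        return bytes.length;
    }

    static String demo() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment dst = arena.allocate(16);
            dst.fill((byte) '-'); // marker bytes, to show what the copy leaves untouched

            long written = copyString("hello world", StandardCharsets.UTF_8, 6, dst, 2, 5);

            byte[] all = dst.toArray(ValueLayout.JAVA_BYTE);
            return written + ":" + new String(all, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 5:--world---------
    }
}
```

The byte following "world" keeps its marker value, whereas a `setString` call would have overwritten it with a terminator.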
Specification
diff --git a/src/java.base/share/classes/java/lang/foreign/MemorySegment.java b/src/java.base/share/classes/java/lang/foreign/MemorySegment.java
index 196f44d1abe..195955b1a92 100644
--- a/src/java.base/share/classes/java/lang/foreign/MemorySegment.java
+++ b/src/java.base/share/classes/java/lang/foreign/MemorySegment.java
@@ -1296,12 +1296,7 @@ MemorySegment reinterpret(long newSize,
* over the decoding process is required.
* <p>
* Getting a string from a segment with a known byte offset and
- * known byte length can be done like so:
- * {@snippet lang=java :
- * byte[] bytes = new byte[length];
- * MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length);
- * return new String(bytes, charset);
- * }
+ * known byte length can be done using {@link #getString(long, Charset, long)}.
*
* @param offset offset in bytes (relative to this segment address) at which this
* access operation will occur
@@ -1328,6 +1323,40 @@ MemorySegment reinterpret(long newSize,
*/
String getString(long offset, Charset charset);
+ /**
+ * Reads a string from this segment at the given offset, using the provided length
+ * and charset.
+ * <p>
+ * This method always replaces malformed-input and unmappable-character
+ * sequences with this charset's default replacement string. The {@link
+ * java.nio.charset.CharsetDecoder} class should be used when more control
+ * over the decoding process is required.
+ * <p>
+ * If the string contains any {@code '\0'} characters, they will be read as well.
+ * This differs from {@link #getString(long, Charset)}, which will only read up
+ * to the first {@code '\0'}, resulting in truncation for string data that contains
+ * the {@code '\0'} character.
+ *
+ * @param offset offset in bytes (relative to this segment address) at which this
+ * access operation will occur
+ * @param charset the charset used to {@linkplain Charset#newDecoder() decode} the
+ * string bytes
+ * @param length length, in bytes, of the region of memory to read and decode into
+ * a string
+ * @return a Java string constructed from the bytes read from the given starting
+ * address up to the given length
+ * @throws IllegalArgumentException if the size of the string is greater than the
+ * largest string supported by the platform
+ * @throws IndexOutOfBoundsException if {@code offset < 0}
+ * @throws IndexOutOfBoundsException if {@code offset > byteSize() - length}
+ * @throws IllegalStateException if the {@linkplain #scope() scope} associated with
+ * this segment is not {@linkplain Scope#isAlive() alive}
+ * @throws WrongThreadException if this method is called from a thread {@code T},
+ * such that {@code isAccessibleBy(T) == false}
+ * @throws IllegalArgumentException if {@code length < 0}
+ */
+ String getString(long offset, Charset charset, long length);
+
/**
* Writes the given string into this segment at the given offset, converting it to
* a null-terminated byte sequence using the {@linkplain StandardCharsets#UTF_8 UTF-8}
@@ -1366,7 +1395,8 @@ MemorySegment reinterpret(long newSize,
* If the given string contains any {@code '\0'} characters, they will be
* copied as well. This means that, depending on the method used to read
* the string, such as {@link MemorySegment#getString(long)}, the string
- * will appear truncated when read again.
+ * will appear truncated when read again. The string can be read without
+ * truncation using {@link #getString(long, Charset, long)}.
*
* @param offset offset in bytes (relative to this segment address) at which this
* access operation will occur, the final address of this write
@@ -2606,6 +2636,50 @@ static void copy(Object srcArray, int srcIndex,
elementCount);
}
+ /**
+ * Copies the byte sequence of the given string encoded using the provided charset
+ * to the destination segment.
+ * <p>
+ * This method always replaces malformed-input and unmappable-character
+ * sequences with this charset's default replacement string. The {@link
+ * java.nio.charset.CharsetDecoder} class should be used when more control
+ * over the decoding process is required.
+ * <p>
+ * If the given string contains any {@code '\0'} characters, they will be
+ * copied as well. This means that, depending on the method used to read
+ * the string, such as {@link MemorySegment#getString(long)}, the string
+ * will appear truncated when read again. The string can be read without
+ * truncation using {@link #getString(long, Charset, long)}.
+ *
+ * @param src the Java string to be written into the destination segment
+ * @param dstEncoding the charset used to {@linkplain Charset#newEncoder() encode}
+ * the string bytes.
+ * @param srcIndex the starting character index of the source string
+ * @param dst the destination segment
+ * @param dstOffset the starting offset, in bytes, of the destination segment
+ * @param numChars the number of characters to be copied
+ * @throws IllegalStateException if the {@linkplain #scope() scope} associated with
+ * {@code dst} is not {@linkplain Scope#isAlive() alive}
+ * @throws WrongThreadException if this method is called from a thread {@code T},
+ * such that {@code dst.isAccessibleBy(T) == false}
+ * @throws IndexOutOfBoundsException if either {@code srcIndex}, {@code numChars}, or {@code dstOffset}
+ * are {@code < 0}
+ * @throws IndexOutOfBoundsException if {@code srcIndex > src.length() - numChars}
+ * @throws IllegalArgumentException if {@code dst} is {@linkplain #isReadOnly() read-only}
+ * @throws IndexOutOfBoundsException if {@code dstOffset > dstSegment.byteSize() - B} where {@code B} is the size,
+ * in bytes, of the substring of {@code src} encoded using the given charset
+ * @return the number of copied bytes.
+ */
+ @ForceInline
+ static long copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, long dstOffset, int numChars) {
+ Objects.requireNonNull(src);
+ Objects.requireNonNull(dstEncoding);
+ Objects.requireNonNull(dst);
+ Objects.checkFromIndexSize(srcIndex, numChars, src.length());
+
+ return AbstractMemorySegmentImpl.copy(src, dstEncoding, srcIndex, dst, dstOffset, numChars);
+ }
+
/**
* Finds and returns the relative offset, in bytes, of the first mismatch between the
* source and the destination segments. More specifically, the bytes at offset
diff --git a/src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java b/src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java
index 1297406dcf1..5b213af544f 100644
--- a/src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java
+++ b/src/java.base/share/classes/java/lang/foreign/SegmentAllocator.java
@@ -111,7 +111,8 @@ default MemorySegment allocateFrom(String str) {
* If the given string contains any {@code '\0'} characters, they will be
* copied as well. This means that, depending on the method used to read
* the string, such as {@link MemorySegment#getString(long)}, the string
- * will appear truncated when read again.
+ * will appear truncated when read again. The string can be read without
+ * truncation using {@link MemorySegment#getString(long, Charset, long)}.
*
* @param str the Java string to be converted into a C string
* @param charset the charset used to {@linkplain Charset#newEncoder() encode} the
@@ -137,10 +138,10 @@ default MemorySegment allocateFrom(String str, Charset charset) {
int termCharSize = StringSupport.CharsetKind.of(charset).terminatorCharSize();
MemorySegment segment;
int length;
- if (StringSupport.bytesCompatible(str, charset)) {
+ if (StringSupport.bytesCompatible(str, charset, 0, str.length())) {
length = str.length();
segment = allocateNoInit((long) length + termCharSize);
- StringSupport.copyToSegmentRaw(str, segment, 0);
+ StringSupport.copyToSegmentRaw(str, segment, 0, 0, str.length());
} else {
byte[] bytes = str.getBytes(charset);
length = bytes.length;
@@ -153,6 +154,53 @@ default MemorySegment allocateFrom(String str, Charset charset) {
return segment;
}
+ /**
+ * Encodes a Java string using the provided charset and stores the resulting
+ * byte array into a memory segment.
+ * <p>
+ * This method always replaces malformed-input and unmappable-character
+ * sequences with this charset's default replacement byte array. The
+ * {@link java.nio.charset.CharsetEncoder} class should be used when more
+ * control over the encoding process is required.
+ * <p>
+ * If the given string contains any {@code '\0'} characters, they will be
+ * copied as well. This means that, depending on the method used to read
+ * the string, such as {@link MemorySegment#getString(long)}, the string
+ * will appear truncated when read again. The string can be read without
+ * truncation using {@link MemorySegment#getString(long, Charset, long)}.
+ *
+ * @param str the Java string to be encoded
+ * @param charset the charset used to {@linkplain Charset#newEncoder() encode} the
+ * string bytes
+ * @param srcIndex the starting index of the source string
+ * @param numChars the number of characters to be copied
+ * @return a new native segment containing the encoded string
+ * @throws IndexOutOfBoundsException if either {@code srcIndex} or {@code numChars} are {@code < 0}
+ * @throws IndexOutOfBoundsException if {@code srcIndex > str.length() - numChars}
+ *
+ * @implSpec The default implementation for this method copies the contents of the
+ * provided Java string into a new memory segment obtained by calling
+ * {@code this.allocate(B)}, where {@code B} is the size, in bytes, of
+ * the string encoded using the provided charset
+ * (e.g. {@code str.getBytes(charset).length});
+ */
+ @ForceInline
+ default MemorySegment allocateFrom(String str, Charset charset, int srcIndex, int numChars) {
+ Objects.requireNonNull(charset);
+ Objects.requireNonNull(str);
+ Objects.checkFromIndexSize(srcIndex, numChars, str.length());
+ MemorySegment segment;
+ if (StringSupport.bytesCompatible(str, charset, srcIndex, numChars)) {
+ segment = allocateNoInit(numChars);
+ StringSupport.copyToSegmentRaw(str, segment, 0, srcIndex, numChars);
+ } else {
+ byte[] bytes = str.substring(srcIndex, srcIndex + numChars).getBytes(charset);
+ segment = allocateNoInit(bytes.length);
+ MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, 0, bytes.length);
+ }
+ return segment;
+ }
+
/**
* {@return a new memory segment initialized with the provided byte value}
* <p>
- csr of: JDK-8369564 Provide a MemorySegment API to read strings with known lengths (New)