- Type: Bug
- Resolution: Fixed
- Priority: P4
- Affected Versions: 8, 11, 17, 18
- Resolved In Build: b03
- OS: generic
- CPU: generic

Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8279734 | 18.0.1 | Naoto Sato | P4 | Resolved | Fixed | b02 |
JDK-8278959 | 18 | Naoto Sato | P4 | Resolved | Fixed | b29 |
A DESCRIPTION OF THE PROBLEM :
The documentation at https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/StringTokenizer.html#%3Cinit%3E(java.lang.String,java.lang.String,boolean) says: "Each delimiter is returned as a string of length one." This is incorrect when a delimiter is a character encoded as a valid Unicode surrogate pair: the returned string has length two, because the delimiter is represented by two code units.
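To make the code unit versus code point distinction concrete, here is a minimal sketch that does not involve StringTokenizer. The class name `SurrogateLengths` and the helper `codeUnits` are hypothetical and exist only for illustration; the calls used are standard `String` and `Character` methods.

```java
public class SurrogateLengths {
    // Hypothetical helper: number of UTF-16 code units (char values)
    // needed to represent a single Unicode code point.
    static int codeUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String comma = ",";                                   // BMP character
        String face = new String(Character.toChars(0x1F600)); // Grinning Face

        // String.length() counts code units, not code points.
        System.out.println(comma.length());                         // 1
        System.out.println(face.length());                          // 2
        System.out.println(face.codePointCount(0, face.length()));  // 1

        // The two char values of the emoji form a valid surrogate pair.
        System.out.println(Character.isSurrogatePair(face.charAt(0), face.charAt(1))); // true
    }
}
```

A delimiter such as U+1F600 is therefore one character in the code point sense but a string of length two when measured with `length()`.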
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
"Each delimiter is returned as a string of the code unit(s) of the delimiter."
Alternatively, remove "Each delimiter is returned as a string of length one." and clarify that "characters" in the StringTokenizer documentation refers to Unicode code points, as it does in other documentation, e.g., that of String: "The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values)." - https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/String.html.
ACTUAL -
"Each delimiter is returned as a string of length one."
---------- BEGIN SOURCE ----------
import java.util.StringTokenizer;

public class StringTokenizerPlayground {
    public static void main(String[] args) {
        final var s = "\uD83D\uDE00"; // Grinning Face (U+1F600), a surrogate pair
        // Use the same string as input and as delimiter set; return delimiters as tokens.
        final var tokenizer = new StringTokenizer(s, s, true);
        final var tokenCount = tokenizer.countTokens();
        if (tokenCount != 1) {
            throw new AssertionError();
        }
        final var token = tokenizer.nextToken();
        // The delimiter token has length two, contradicting the documented "length one".
        if (token.length() != 2) {
            throw new AssertionError();
        }
        if (!token.equals(s)) {
            throw new AssertionError();
        }
    }
}
---------- END SOURCE ----------
FREQUENCY : always
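As a complement to the reproducer, the following sketch tokenizes a mixed input in which one delimiter is a BMP character and the other is a supplementary character, so both delimiter lengths can be compared side by side. The class name `DelimiterLengths` and the helper `tokens` are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterLengths {
    // Hypothetical helper: collects every token, delimiters included,
    // by constructing the tokenizer with returnDelims = true.
    static List<String> tokens(String input, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String face = "\uD83D\uDE00"; // Grinning Face (U+1F600)
        // Delimiter set: a comma (one code unit) and the emoji (two code units).
        for (String t : tokens("a,b" + face + "c", "," + face)) {
            System.out.println(t.length() + " code unit(s), "
                    + t.codePointCount(0, t.length()) + " code point(s)");
        }
    }
}
```

In this sketch the comma comes back as a string of length one, while the emoji delimiter comes back as a string of length two that still contains a single code point, which is the case the quoted sentence does not cover.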

- backported by
  - JDK-8278959 StringTokenizer(String, String, boolean) documentation bug (Resolved)
  - JDK-8279734 StringTokenizer(String, String, boolean) documentation bug (Resolved)
- csr for
  - JDK-8278814 StringTokenizer(String, String, boolean) documentation bug (Closed)
- links to
  - Commit openjdk/jdk18/9cd70906
  - Commit openjdk/jdk/8f5fdd86
  - Review openjdk/jdk18/43
  - Review openjdk/jdk/6836