Enhancement
Resolution: Unresolved
P4
None
6
Fix Understood
x86
linux
A DESCRIPTION OF THE REQUEST :
Modified UTF-8 is used in many places, including JNI, Data*Stream, JAR. The Charset class in nio is a good way to encapsulate dealing with encodings, without reimplementing the encoding process over and over again.
It would be good to have access to this encoding as a Charset class as well. Preferably it would be a required standard character set (accessible through a constant, as requested in 4884238). The name could be "x-modified-UTF-8", "x-CESU-8-nullfree", or something similar.
The current UTF-8 decoder implementation already handles all variations of this encoding. Simple modifications to the encoder would easily provide a class for these encodings as well.
Bug 4641026 made clear to me that the encoding is more closely related to CESU-8, although the encoding of \u0000 is different. It might make sense to make the normal CESU-8 encoding available as well.
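To illustrate the differences mentioned above, here is a minimal sketch that compares the bytes produced by DataOutputStream.writeUTF (which writes modified UTF-8 after a two-byte length prefix) with the bytes produced by the standard UTF-8 charset. The class name ModifiedUtf8Demo and its helper methods are hypothetical names chosen for this example, and the code assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK.
final class ModifiedUtf8Demo {

    /** Encodes s via DataOutputStream.writeUTF and strips the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\u0000";
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        System.out.println("NUL, standard UTF-8: "
                + hex(nul.getBytes(StandardCharsets.UTF_8)));
        System.out.println("NUL, modified UTF-8: " + hex(modifiedUtf8(nul)));
        System.out.println("U+1F600, standard UTF-8: "
                + hex(supplementary.getBytes(StandardCharsets.UTF_8)));
        System.out.println("U+1F600, modified UTF-8: "
                + hex(modifiedUtf8(supplementary)));
    }
}
```

Standard UTF-8 encodes \u0000 as the single byte 00 and U+1F600 as four bytes, whereas modified UTF-8 encodes \u0000 as C0 80 and encodes each surrogate separately as three bytes, which is the CESU-8 behaviour.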
JUSTIFICATION :
I'm thinking about how to solve the much-voted-for bug 4244499.
My approach would be to move the encoding from native code into the Java classes and let a Charset object control which encoding to use. To remain backward compatible, however, I would have to use modified UTF-8 by default.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Charset.forName("x-modified-UTF-8")
returns a Charset that encodes to modified UTF-8 and decodes any UTF-8 variant.
ACTUAL -
UnsupportedCharsetException thrown.
---------- BEGIN SOURCE ----------
import java.nio.charset.Charset;
import java.util.Map;

public class CharTest {
    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, Charset> p :
                Charset.availableCharsets().entrySet()) {
            System.out.println(p.getKey());
            for (String a : p.getValue().aliases()) {
                System.out.println("\t" + a);
            }
        }
        System.out.println(Charset.forName("x-modified-UTF-8").name());
    }
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
It is of course possible to implement the modified UTF-8 encoding manually, as has been done with Data*Stream or Zip*Stream. It would even be possible to have a Charset field to choose the charset, and have the special value null denote modified UTF-8. Every encoding or decoding would then check if the Charset object is present, and otherwise fall back to the manual implementation.
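The workaround described above could look roughly like the following sketch. The class StringCodec and its methods are hypothetical names invented for this example and are not existing JDK APIs; only the encoding direction is shown, and a null Charset is taken to mean modified UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper illustrating the workaround; not a JDK class.
final class StringCodec {

    /** The charset to use, or null to denote modified UTF-8. */
    private final Charset charset;

    StringCodec(Charset charset) {
        this.charset = charset;
    }

    /** Encodes with the configured charset, or manually if none is set. */
    byte[] encode(String s) {
        if (charset != null) {
            return s.getBytes(charset);
        }
        return encodeModifiedUtf8(s);
    }

    /** Manual modified UTF-8 encoder, working on individual UTF-16 chars. */
    static byte[] encodeModifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // Plain ASCII except NUL: one byte.
                out.write(c);
            } else if (c <= 0x07FF) {
                // Two bytes; this branch also covers \u0000, giving C0 80.
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Three bytes; surrogates are encoded individually.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] viaFallback = new StringCodec(null).encode("\u0000");
        byte[] viaCharset = new StringCodec(Charset.forName("UTF-8")).encode("\u0000");
        System.out.println("fallback length: " + viaFallback.length);
        System.out.println("charset length: " + viaCharset.length);
    }
}
```

The drawback this sketch makes visible is that every caller carries its own null check and its own copy of the manual encoder, which is exactly the duplication a dedicated Charset would remove.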
- relates to
  JDK-6862139 (bf) Add put/getBoolean and put/getUTF methods
  - Closed