-
Bug
-
Resolution: Duplicate
-
P4
-
None
-
8, 9
-
generic
-
generic
FULL PRODUCT VERSION :
A DESCRIPTION OF THE PROBLEM :
sun.nio.cs.UnicodeDecoder incorrectly rejects U+FFFE.
The check at http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/cs/UnicodeDecoder.java#94 should be removed because, contrary to the comment on line 95, a reversed BOM *can* occur in the middle of a stream. The BOM and reversed BOM are special only at the start of a stream, where they distinguish UTF-16BE from UTF-16LE.
From the unicode.org FAQ (http://www.unicode.org/faq/private_use.html#sentinel6):
Q: I read somewhere that U+FFFE and U+FFFF were illegal in Unicode, and could be used as sentinels. Is that true?
A: Well, the short answer is no, that is not true—at least, not entirely true. U+FFFE and U+FFFF are noncharacters just like the other 64 noncharacters in the standard, and are valid in Unicode strings.
"Unicode 2.0 dropped the explicit prohibition against transmission or storage of U+FFFE and U+FFFF"
Unicode 3.0: "To ensure that round-trip transcoding is possible, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include U+FFFE, U+FFFF, and unpaired surrogates."
Unicode 4.0: "To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences."
Mapping multiple code points to '\uFFFD', as sun.nio.cs.UnicodeDecoder currently does, means the encoding is not one-to-one.
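The one-to-one concern can be checked directly. The following is a minimal sketch, not part of the original report: the class name Utf16Injectivity and its helper methods are made up for illustration, and only standard java.nio.charset APIs are used. It decodes two different big-endian UTF-16 byte sequences (each prefixed with a BOM) and reports whether they collapse to the same string. The result depends on the JDK in use, so no output is claimed here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the JDK.
class Utf16Injectivity {

    // Decodes the given byte values with the "UTF-16" charset.
    // new String(byte[], Charset) substitutes U+FFFD for malformed input.
    static String decode(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_16);
    }

    // True if BOM + U+FFFE and BOM + U+FFFD decode to the same string,
    // i.e. the decoder is not one-to-one for these two inputs.
    static boolean collides() {
        String fromFffe = decode(0xFE, 0xFF, 0xFF, 0xFE);
        String fromFffd = decode(0xFE, 0xFF, 0xFF, 0xFD);
        return fromFffe.equals(fromFffd);
    }

    public static void main(String[] args) {
        System.out.println("U+FFFE and U+FFFD decode to the same string: " + collides());
    }
}
```

On a decoder that rejects a mid-stream U+FFFE as described above, collides() would be expected to return true.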
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
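String a = "\uFFFE";
// equals() compares contents; == would compare references and always be false here.
new String(a.getBytes("UTF-16"), "UTF-16").equals(a); // expected true, but the decoder yields "\uFFFD"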
REPRODUCIBILITY :
This bug can be reproduced always.
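A self-contained version of the reproduction steps might look like the sketch below. The class name Utf16RoundTrip and its helper methods are hypothetical and were not part of the original report. It prints the encoded bytes and whether the round trip preserved the string. The outcome depends on the JDK version, so the code reports the result rather than assuming it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction harness; names are illustrative only.
class Utf16RoundTrip {

    // Encodes with the "UTF-16" charset, which writes a big-endian BOM first.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Encodes and then decodes again with the same charset.
    static String roundTrip(String s) {
        return new String(encode(s), StandardCharsets.UTF_16);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\uFFFE";
        String b = roundTrip(a);
        System.out.println("encoded bytes: " + hex(encode(a)));
        System.out.println("round trip preserved: " + a.equals(b));
        System.out.printf("decoded code unit: U+%04X%n", (int) b.charAt(0));
    }
}
```

On the affected versions listed in this report (8 and 9), the second line would be expected to print false, with U+FFFD as the decoded code unit.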
- relates to
-
JDK-8150449 "A 'reversed byte-order mark' cannot occur within middle of stream" is not correct
- Closed