Bug
Resolution: Not an Issue
P4
None
6
x86
windows_vista
FULL PRODUCT VERSION :
java version "1.6.0_19"
Java(TM) SE Runtime Environment (build 1.6.0_19-b04)
Java HotSpot(TM) Client VM (build 16.2-b04, mixed mode, sharing)
ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows [Version 6.0.6002]
A DESCRIPTION OF THE PROBLEM :
The UTF-8 decoder accepts directly encoded trail surrogates.
For example, the byte sequence 0xED 0xBA 0xAB is decoded to "\uDEAB".
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
I expect the behavior specified by CodingErrorAction (REPLACE, IGNORE, or REPORT) to be triggered in this case. I do not expect a trail surrogate, because it is not legal according to Unicode:
When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.
Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.
ACTUAL -
The decoder instead accepts the invalid byte sequence, regardless of the CodingErrorAction setting, and converts it to a trail surrogate.
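To make the expected behavior concrete, here is a minimal sketch that decodes the same three bytes with a CharsetDecoder configured to REPORT errors. The class name StrictUtf8Check and the helper decodesStrictly are hypothetical names chosen for illustration. A decoder that follows the Unicode rule quoted above would be expected to throw and so return false here; the build described in this report reportedly accepts the bytes instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class StrictUtf8Check {
    // Hypothetical helper: returns true only if the bytes decode without error
    // when the decoder is told to REPORT malformed and unmappable input.
    static boolean decodesStrictly(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when the decoder treats the input as ill-formed.
            return false;
        }
    }

    public static void main(String[] args) {
        // Three-byte form that would map to U+DEAB, a trail surrogate.
        byte[] invalid = { (byte) 0xed, (byte) 0xba, (byte) 0xab };
        System.out.println("accepted: " + decodesStrictly(invalid));
    }
}
```

Using the decoder directly, rather than the String constructor, matters for this check: the String constructor always substitutes a replacement string for malformed input and never reports it.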
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
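// Decoding this should not yield the trail surrogate itself.
// Assumes a JUnit test class in which assertFalse is available.
public void test() throws Exception {
    // 0xED 0xBA 0xAB is the three-byte form that would map to U+DEAB.
    byte[] invalid = new byte[] { (byte) 0xed, (byte) 0xba, (byte) 0xab };
    assertFalse(new String(invalid, 0, invalid.length, "UTF-8").equals("\uDEAB"));
}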
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Write your own decoder.
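A lighter alternative to a full custom decoder is to scan the bytes for encoded surrogates before decoding. The sketch below is only a pre-check under stated assumptions, not a complete UTF-8 validator; SurrogateScan and containsEncodedSurrogate are hypothetical names. It relies on the fact that a code point in D800..DFFF, if encoded in three bytes, starts with 0xED followed by a byte in 0xA0..0xBF, while well-formed UTF-8 allows only 0x80..0x9F after 0xED.

```java
final class SurrogateScan {
    // Hypothetical helper: true if the array contains the two-byte prefix
    // 0xED, 0xA0..0xBF, which only occurs in an encoded surrogate
    // (or in otherwise ill-formed input).
    static boolean containsEncodedSurrogate(byte[] bytes) {
        for (int i = 0; i + 1 < bytes.length; i++) {
            int lead = bytes[i] & 0xff;     // mask to get the unsigned value
            int next = bytes[i + 1] & 0xff;
            if (lead == 0xed && next >= 0xa0 && next <= 0xbf) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] invalid = { (byte) 0xed, (byte) 0xba, (byte) 0xab };
        System.out.println("contains encoded surrogate: " + containsEncodedSurrogate(invalid));
    }
}
```

A caller could reject the input, or apply its own replacement policy, whenever the scan returns true, and pass the bytes to the platform decoder otherwise. Because 0xED is never a valid continuation byte, the scan does not need to track character boundaries.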