-
Bug
-
Resolution: Not an Issue
-
P4
-
None
-
5.0
-
x86
-
windows_xp
FULL PRODUCT VERSION :
A DESCRIPTION OF THE PROBLEM :
Unicode code points > \uFFFF are represented in the JVM by 2 chars, called surrogates.
The 1st char, called the high surrogate, is in the range \uD800..\uDBFF, and
the 2nd char, called the low surrogate, is in the range \uDC00..\uDFFF.
1.) If the 1st char is erroneously in the range \uDC00..\uDFFF, the sun.nio.cs encoders return CoderResult.malformedForLength(1). OK.
2.) If the 1st char is correctly in the range \uD800..\uDBFF, but the 2nd char is erroneously NOT in the range \uDC00..\uDFFF, most sun.nio.cs encoders (I have not tested all of them) also return CoderResult.malformedForLength(1).
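The two cases above can be probed with a small program. This is only a sketch: the class name SurrogateProbe and the choice of the UTF-8 encoder are assumptions made for illustration, and other sun.nio.cs encoders may behave differently. It prints whatever CoderResult the encoder reports, so the length can be checked on the JDK at hand.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper class, not part of the JDK or of the original report.
public class SurrogateProbe {

    // Encodes the given chars with a fresh UTF-8 encoder and returns the first
    // CoderResult. The encoder's default action is REPORT, so malformed input
    // comes back as an error result instead of being replaced.
    static CoderResult encode(char[] chars) {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        CharBuffer in = CharBuffer.wrap(chars);
        ByteBuffer out = ByteBuffer.allocate(16);
        return enc.encode(in, out, true);
    }

    public static void main(String[] args) {
        // Case 1: the 1st char is a low surrogate.
        System.out.println("case 1: " + encode(new char[] {'\uDC00', 'A'}));
        // Case 2: a high surrogate followed by a char that is not a low surrogate.
        System.out.println("case 2: " + encode(new char[] {'\uD800', 'A'}));
    }
}
```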
Javadoc says:
A malformed-input error is reported when a sequence of input units is not well-formed. Such errors are described by instances of this class whose isMalformed method returns true and whose length method returns the length of the malformed sequence. ...
IMO for case 2, the _malformed sequence_ is of length 2, so the encoders should return CoderResult.malformedForLength(2), because the code point that is wrong consists of 2 chars.
Additionally, it would then be much easier to skip the wrong code point in the java.nio.CharBuffer concerned, just by using CoderResult.length().
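Skipping by CoderResult.length() could look roughly like the loop below. It is a sketch, not a recommended implementation: the class SkipMalformed and the method encodeSkippingMalformed are hypothetical names, and the loop advances the CharBuffer by whatever length the encoder reports. If the encoder reports length 1 for a high surrogate followed by an ordinary char, only the high surrogate is dropped and the following char is still encoded.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper class, for illustration only.
public class SkipMalformed {

    // Encodes text with the given charset, dropping every input sequence the
    // encoder reports as malformed or unmappable.
    static byte[] encodeSkippingMalformed(CharSequence text, Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        // Sized for the worst case, so an overflow is not expected here.
        int capacity = (int) Math.ceil(in.remaining() * enc.maxBytesPerChar()) + 8;
        ByteBuffer out = ByteBuffer.allocate(capacity);

        while (true) {
            CoderResult r = enc.encode(in, out, true);
            if (r.isUnderflow()) {
                break; // all input consumed
            }
            if (r.isError()) {
                // Skip exactly the chars the encoder flagged, then continue.
                in.position(in.position() + r.length());
                continue;
            }
            throw new IllegalStateException("unexpected result: " + r);
        }
        enc.flush(out);

        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = encodeSkippingMalformed("a\uDC00b", Charset.forName("UTF-8"));
        System.out.println(new String(bytes, "UTF-8"));
    }
}
```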
Somebody may object that only the 1st char may be corrupted, and the 2nd char could be valid.
But consider multibyte coders such as UTF-8. If, in a byte sequence, the 1st byte is > 0x7F and the 2nd byte is invalid in conjunction with the 1st byte, CoderResult.length() is > 1. In this case, too, only 1 byte may be corrupted, and the next byte could be valid.
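For comparison, the decoder side can be probed the same way. Again only a sketch with assumed names (Utf8MalformedLength is hypothetical), and the byte sequence is just one example of a lead byte followed by an incomplete sequence. The reported length may depend on the JDK version and on where in the sequence the invalid byte occurs, so the program prints it rather than assuming a value.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical helper class, for illustration only.
public class Utf8MalformedLength {

    // Decodes the bytes with a fresh UTF-8 decoder (default action: REPORT)
    // and returns the first CoderResult.
    static CoderResult decode(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
        return dec.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(8), true);
    }

    public static void main(String[] args) {
        // 0xE2 announces a 3-byte sequence, 0x82 is a valid continuation byte,
        // 0x41 ('A') is not a continuation byte.
        byte[] input = {(byte) 0xE2, (byte) 0x82, 0x41};
        CoderResult r = decode(input);
        System.out.println(r);
        if (r.isError()) {
            System.out.println("reported length: " + r.length());
        }
    }
}
```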
See also: http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CoderResult.html
REPRODUCIBILITY :
This bug can be reproduced always.