-
Enhancement
-
Resolution: Fixed
-
P3
-
7
-
b14
-
generic
-
generic
-
Verified
The Unicode Standard added "Additional Constraints on Conversion of Ill-formed UTF-8"
in version 5.1 [1] and updated them in 6.0 with a further "clarification" [2] of
how a conformant implementation should handle ill-formed UTF-8 byte
sequences. In short, it says that
(1) the conversion process should not interpret any ill-formed code unit sequence;
(2) the process must not treat any adjacent well-formed code unit sequences
as being part of those ill-formed code unit sequences;
(3) replacing each "maximal valid subpart" is the recommended "best practice".
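To make points (2) and (3) concrete, here is a minimal sketch (not taken from the JDK sources) that decodes a truncated 4-byte sequence followed by a well-formed ASCII byte. The class and method names are made up for illustration. On a decoder that follows the "maximal valid subpart" practice, the truncated prefix F0 90 80 should become a single U+FFFD and the adjacent 'A' should survive untouched; older runtimes may behave differently.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class MaximalSubpartDemo {
    // Builds a byte[] from int literals so the hex values stay readable.
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    // new String(bytes, charset) always replaces malformed input with U+FFFD.
    static String decodeWithReplacement(byte[] input) {
        return new String(input, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // F0 90 80 is a 4-byte sequence missing its last byte; 41 is 'A'.
        String s = decodeWithReplacement(bytes(0xF0, 0x90, 0x80, 0x41));
        // Print each decoded char as a code unit for inspection.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```

The point to check in the output is that the well-formed byte after the broken sequence is decoded on its own rather than being swallowed by the replacement.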
The new UTF-8 charset implementation we put into JDK 7 (and have since back-ported
to earlier releases) follows the new constraints in most cases, except:
(1) The decoder still accepts the "historical" 3-byte surrogates and 6-byte surrogate
pairs (the encoder never outputs such sequences). The Unicode Standard "tightened"
the UTF-8 definition in version 3.2 [3]:
"Most notable among the corrigenda to the Standard is a further tightening
of the definition of UTF-8, to eliminate irregular UTF-8 and to bring the
Unicode specification of UTF-8 more completely into line with other
specifications of UTF-8."
So 3-byte/6-byte surrogates are now defined as "ill-formed" code unit
sequences, rather than "irregular" [5] as in version 3.1.
(2) While it no longer accepts the "historical" 5-byte and 6-byte UTF-8 sequences,
the decoder treats each such sequence as ONE malformed unit. As a result
the bytes are replaced by a single replacement character when "replace on malformed
input" is in effect (as in new String(bytes), for example). According to the latest
Unicode Standard [2], however, because the leading byte of these 5/6-byte sequences
is no longer a legal byte in UTF-8, these bytes should be treated as 5 or 6
individual ill-formed bytes.
(3) Corner cases such as the ill-formed byte sequence ED 31 are not handled
correctly/consistently, as described in #7082884 [6].
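The three exceptions above can be probed with a small test program. This is a sketch under assumptions, not the JDK's own regression test: the class and helper names are hypothetical, and the results depend on which JDK runs it. A decoder that follows the standard strictly should refuse ED A0 80 (a 3-byte surrogate), turn each byte of F8 88 80 80 80 into its own U+FFFD, and keep the '1' (0x31) that follows a dangling ED.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical probe class; the names are illustrative only.
class StrictUtf8Check {
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    // True only if the decoder consumes the whole input without reporting an error.
    static boolean isAccepted(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes in "replace" mode, as new String(bytes, charset) does.
    static String decodeWithReplacement(byte[] input) {
        return new String(input, StandardCharsets.UTF_8);
    }

    // Counts U+FFFD replacement characters produced in "replace" mode.
    static int countReplacements(byte[] input) {
        String s = decodeWithReplacement(input);
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // (1) 3-byte surrogate U+D800 in "UTF-8" form.
        System.out.println("ED A0 80 accepted: " + isAccepted(bytes(0xED, 0xA0, 0x80)));
        // (2) historical 5-byte sequence.
        System.out.println("F8 88 80 80 80 replacements: "
                + countReplacements(bytes(0xF8, 0x88, 0x80, 0x80, 0x80)));
        // (3) ED followed by a non-continuation byte, padded with ASCII on both sides.
        System.out.println("41 ED 31 42 decodes to: "
                + decodeWithReplacement(bytes(0x41, 0xED, 0x31, 0x42)));
    }
}
```

The ED 31 input is padded with ASCII on both sides so that the check does not depend on how a decoder treats malformed bytes at the very end of the input.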
The reason behind (1) and (2) is mostly a compatibility concern. As acknowledged
in TR#26 [4] (which defines CESU-8, a separate encoding scheme that uses a 6-byte
sequence, i.e. a pair of 3-byte surrogates, for each supplementary character instead
of the 4-byte sequence of UTF-8), there are apps/data out there that do use surrogate
pairs in "UTF-8" form. Changing the UTF-8 charset to follow the standard will
obviously break someone's code when they migrate/upgrade from JDK/JRE N to N+1,
something we try really hard to avoid.
That said, given that almost a decade has passed and we are now at Unicode 6, I think
the possibility of breaking someone's code/data by upgrading UTF-8 to do the "right
thing" is small. So I propose here
(1) to upgrade the JDK 8 UTF-8 implementation to strictly follow the standard and
a) reject 3-byte surrogates and 6-byte surrogate pairs
b) treat 5/6-byte sequences as individual ill-formed bytes;
(2) to add a CESU-8 charset to the JDK/JRE's charset repository (for those who still
prefer/work with 3-byte surrogates and 6-byte surrogate pairs in "UTF-8" form).
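For item (2), the difference between the two encodings can be seen by encoding one supplementary character both ways. The sketch below assumes the runtime ships a charset registered under the name "CESU-8", which is what this proposal would add; the class and helper names are hypothetical. Per TR#26 [4], U+10400 is encoded as ED A0 81 ED B0 80 in CESU-8, versus F0 90 90 80 in UTF-8.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
class Cesu8Demo {
    // Encodes text with the named charset; throws if the runtime lacks it.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Formats bytes as space-separated upper-case hex, e.g. "F0 90 90 80".
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 is a supplementary character (one surrogate pair in UTF-16).
        String text = new String(Character.toChars(0x10400));
        System.out.println("UTF-8:  " + toHex(encode(text, "UTF-8")));
        // Assumption: "CESU-8" is only available where the proposed charset exists.
        if (Charset.isSupported("CESU-8")) {
            System.out.println("CESU-8: " + toHex(encode(text, "CESU-8")));
        } else {
            System.out.println("CESU-8 charset not available on this runtime");
        }
    }
}
```

Data written this way should then be read back with the CESU-8 charset rather than with UTF-8, since a strict UTF-8 decoder would reject the surrogate bytes.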
[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://www.unicode.org/reports/tr28/tr28-3.html
[4] http://unicode.org/reports/tr26/
[5] http://unicode.org/versions/corrigendum1.html
[6] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-September/007722.html
relates to
- JDK-8039751 UTF-8 decoder fails to handle some edge cases correctly (Closed)
- JDK-8067102 CharsetDecoder decode in JDK8 results in different behavior compared to JDK7 (Closed)