Name: poR10007 Date: 09/23/2002
The java.nio.charset.Charset specification states that an initial byte order mark (BOM)
should be omitted when decoding any UTF-encoded byte sequence:
"In any case, when a byte-order mark is read at the beginning of a decoding operation
it is omitted from the resulting sequence of characters."
However, according to the Unicode Standard, in the UTF-16BE and UTF-16LE character-encoding
schemes an initial byte order mark should be interpreted as a ZERO WIDTH NO-BREAK SPACE.
The Unicode Standard, Version 3.0, Section 3.8 "Transformations" reads:
D33 UTF-16BE is the Unicode Transformation Format that serializes a Unicode value as
a sequence of two bytes, in big-endian format. An initial sequence corresponding
to U+FEFF is interpreted as a ZERO WIDTH NO-BREAK SPACE.
D34 UTF-16LE is the Unicode Transformation Format that serializes a Unicode value as
a sequence of two bytes, in little-endian format. An initial sequence corresponding
to U+FEFF is interpreted as a ZERO WIDTH NO-BREAK SPACE.
A byte order mark does not make sense for the UTF-8 encoding either, so in this case an
initial U+FEFF should also be interpreted as a ZERO WIDTH NO-BREAK SPACE.
JDK 1.4.2-beta-b02 meets the Unicode Standard requirements. It omits the initial BOM when
decoding a UTF-16 byte sequence and interprets it as a ZERO WIDTH NO-BREAK SPACE when
decoding UTF-8, UTF-16BE, and UTF-16LE.
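
The decoding behavior described above can be checked with a small program. The
following is a minimal sketch and not part of the original report: the class name
BomDecodeDemo and the helpers decode and codeUnits are hypothetical, and it assumes
a Java 6 or later runtime (for new String(byte[], Charset) and String.format). It
decodes the letter "A" preceded by a BOM in each encoding and prints the resulting
UTF-16 code units, so that an otherwise invisible U+FEFF shows up in the output.

```java
import java.nio.charset.Charset;

public class BomDecodeDemo {

    // Decodes the bytes with the named charset (hypothetical helper).
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Renders each char as U+XXXX, separated by spaces (hypothetical helper).
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each array is U+FEFF followed by "A" in the given encoding form.
        byte[] bigEndian    = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] utf8         = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};

        System.out.println("UTF-16   : " + codeUnits(decode(bigEndian, "UTF-16")));
        System.out.println("UTF-16BE : " + codeUnits(decode(bigEndian, "UTF-16BE")));
        System.out.println("UTF-16LE : " + codeUnits(decode(littleEndian, "UTF-16LE")));
        System.out.println("UTF-8    : " + codeUnits(decode(utf8, "UTF-8")));
    }
}
```

If the runtime behaves as this report describes for JDK 1.4.2-beta-b02, the UTF-16
line shows only U+0041, while the UTF-16BE, UTF-16LE, and UTF-8 lines show
U+FEFF U+0041.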
======================================================================