JDK-4752069

(cs spec) BOM should not be ignored in UTF-16 charsets

    • b31
    • generic
    • generic
    • Not verified



      Name: poR10007 Date: 09/23/2002



      The java.nio.charset.Charset specification states that an initial byte order mark (BOM)
      is omitted when decoding any UTF-encoded byte sequence:

        "In any case, when a byte-order mark is read at the beginning of a decoding operation
         it is omitted from the resulting sequence of characters."
         
      However, according to the Unicode Standard, in the UTF-16BE and UTF-16LE character-encoding
      schemes an initial byte order mark should be interpreted as a ZERO WIDTH NO-BREAK SPACE.

      The Unicode Standard, Version 3.0, Section 3.8 "Transformations" reads:

       D33 UTF-16BE is the Unicode Transformation Format that serializes a Unicode value as
           a sequence of two bytes, in big-endian format. An initial sequence corresponding
           to U+FEFF is interpreted as a ZERO WIDTH NO-BREAK SPACE.
           
       D34 UTF-16LE is the Unicode Transformation Format that serializes a Unicode value as
           a sequence of two bytes, in little-endian format. An initial sequence corresponding
           to U+FEFF is interpreted as a ZERO WIDTH NO-BREAK SPACE.

      A byte order mark does not make sense for the UTF-8 encoding either, so in this case an
      initial U+FEFF should also be interpreted as a ZERO WIDTH NO-BREAK SPACE.

      JDK 1.4.2-beta-b02 meets the Unicode Standard requirements. It omits the initial BOM when
      decoding a UTF-16 byte sequence and interprets it as a ZERO WIDTH NO-BREAK SPACE when
      decoding UTF-8, UTF-16BE, or UTF-16LE.
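
The behavior described above can be checked with a small program. This is only a sketch, not part of the original report: the class name BomDecodeDemo and its helpers decode and codePoints are hypothetical, and it uses java.nio.charset.StandardCharsets, which was added in releases later than the one discussed here. It assumes the running JDK decodes as described in this report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Hypothetical helper: decode a byte sequence with the given charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Hypothetical helper: render each char of the string as U+XXXX.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FEFF followed by 'A' (U+0041) in each encoding form.
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};

        // UTF-16: the initial U+FEFF selects the byte order and is omitted.
        System.out.println("UTF-16   (BE BOM): "
                + codePoints(decode(bigEndian, StandardCharsets.UTF_16)));
        System.out.println("UTF-16   (LE BOM): "
                + codePoints(decode(littleEndian, StandardCharsets.UTF_16)));

        // UTF-16BE, UTF-16LE, UTF-8: the initial U+FEFF is kept as a character.
        System.out.println("UTF-16BE         : "
                + codePoints(decode(bigEndian, StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE         : "
                + codePoints(decode(littleEndian, StandardCharsets.UTF_16LE)));
        System.out.println("UTF-8            : "
                + codePoints(decode(utf8, StandardCharsets.UTF_8)));
    }
}
```

If the decoders behave as reported, only the two UTF-16 lines should show a single code point; the others should retain a leading U+FEFF, which contradicts the "in any case" wording of the specification.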

      ======================================================================

            sherman Xueming Shen
            passunw Pas Pas (Inactive)