JDK-6798514

Charset UTF-8 accepts CESU-8 codings


    • Type: Bug
    • Resolution: Won't Fix
    • Priority: P4
    • None
    • 6
    • core-libs

      FULL PRODUCT VERSION :
      C:\Programme\Java\jdk1.6.0_03\bin>java -version
      java version "1.6.0_03"
      Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
      Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode)


      ADDITIONAL OS VERSION INFORMATION :
      Windows XP SR-2

      A DESCRIPTION OF THE PROBLEM :
      RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."

      The current implementation of the UTF-8 decoder does not reject the invalid sequences "ED A0 80" through "ED BF BF". Instead, it decodes each of them to a surrogate code unit (U+D800 through U+DFFF), as a CESU-8 decoder would.
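To illustrate why exactly this byte range is affected, the following sketch combines the payload bits of the three-byte UTF-8 pattern by hand, with no validity checks. The class and method names are hypothetical and chosen for illustration only; this is not the JDK's decoder code.

```java
final class Utf8ThreeByteSketch {

    // Combines the payload bits of the three-byte pattern
    // 1110xxxx 10xxxxxx 10xxxxxx without checking validity.
    static int combineThreeBytes(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // RFC 3629 excludes the surrogate range D800..DFFF from UTF-8.
    static boolean isSurrogateValue(int value) {
        return value >= 0xD800 && value <= 0xDFFF;
    }

    public static void main(String[] args) {
        int first = combineThreeBytes(0xED, 0xA0, 0x80);
        int last = combineThreeBytes(0xED, 0xBF, 0xBF);
        System.out.printf("ED A0 80 -> U+%04X, surrogate: %b%n",
                first, isSurrogateValue(first));
        System.out.printf("ED BF BF -> U+%04X, surrogate: %b%n",
                last, isSurrogateValue(last));
    }
}
```

A decoder that follows RFC 3629 would need a range check like `isSurrogateValue` after combining the bits and would report the input as malformed when it returns true.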

      This may be by design. If so, it should at least be documented prominently, and any surrogate pairs the decoder produces should be valid.


      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      1.) Decode the following byte sequence with the UTF-8 decoder: "ED A0 80 ED BF BF"
      2.) Decode the following byte sequence with the UTF-8 decoder: "ED BF BF ED A0 80"
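The two steps above could be exercised with a small program along the following lines. This is a minimal sketch, not the reporter's original test case: the class name and the `decode` helper are hypothetical, and the result depends on the JDK version the program runs on.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

final class Utf8SurrogateRepro {

    // Decodes the bytes with a UTF-8 decoder configured to report
    // malformed input instead of replacing or ignoring it.
    static CoderResult decode(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return decoder.decode(ByteBuffer.wrap(bytes), out, true);
    }

    public static void main(String[] args) {
        byte[] step1 = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                        (byte) 0xED, (byte) 0xBF, (byte) 0xBF};
        byte[] step2 = {(byte) 0xED, (byte) 0xBF, (byte) 0xBF,
                        (byte) 0xED, (byte) 0xA0, (byte) 0x80};
        // The report expects isMalformed() to be true for both inputs.
        System.out.println("step 1 malformed: " + decode(step1).isMalformed());
        System.out.println("step 2 malformed: " + decode(step2).isMalformed());
    }
}
```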


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      1.) CoderResult.isMalformed()
      2.) CoderResult.isMalformed()

      ACTUAL -
      1.) valid surrogate pair: U+D800 + U+DFFF
      2.) invalid surrogate pair: U+DFFF + U+D800
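The difference between the two results can be checked with the standard `Character` helpers. The sketch below uses a hypothetical class name and only shows why the order matters: a pair is valid only when a high surrogate (U+D800 to U+DBFF) is followed by a low surrogate (U+DC00 to U+DFFF).

```java
final class SurrogateOrderSketch {

    // True only for a high surrogate followed by a low surrogate.
    static boolean isValidPair(char first, char second) {
        return Character.isSurrogatePair(first, second);
    }

    public static void main(String[] args) {
        char high = (char) 0xD800;
        char low = (char) 0xDFFF;
        System.out.println("D800 then DFFF valid: " + isValidPair(high, low));
        System.out.println("DFFF then D800 valid: " + isValidPair(low, high));
        // A valid pair maps to one supplementary code point.
        System.out.printf("code point: U+%X%n", Character.toCodePoint(high, low));
    }
}
```

In the first case the two chars combine to the supplementary code point U+103FF; in the second case no code point corresponds to the char sequence at all.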


      REPRODUCIBILITY :
      This bug can always be reproduced.

            Assignee: Unassigned
            Reporter: Nelson Dcosta (ndcosta, Inactive)