JDK-4344267

Broken UTF-8 conversion of split surrogate-pair


    • Type: Bug
    • Resolution: Fixed
    • Priority: P4
    • 1.4.0
    • 1.3.0
    • core-libs



      Name: rlT66838 Date: 06/08/2000


      SCSL JDK 1.3 Beta source code (Sep 1999)


      This is from the September 1999 JDK 1.3 source release; there is a (faint)
      chance that this may have been found and fixed already...

      Surrogate pairs are handled correctly if both the high half and the low half
      are in the same input[] buffer. However, if a surrogate pair straddles two
      input buffers, the converter hits two bugs:

      First, there is code that does

                  inputChar = highHalfZoneCode;
                  highHalfZoneCode = 0;
                  if (input[inOff] >= 0xdc00 && input[inOff] <= 0xdfff) {
                      // This is legal UTF16 sequence.
                      int ucs4 = (highHalfZoneCode - 0xd800) * 0x400
                          + (input[inOff] - 0xdc00) + 0x10000;

      The ucs4 calculation assumes that highHalfZoneCode still contains the first
      half of the surrogate pair, but highHalfZoneCode has been zapped to 0.

        Fix: the ucs4 calculation should use inputChar instead of highHalfZoneCode.
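
      As a minimal sketch of the arithmetic behind this fix (the class and
      method names are hypothetical and not taken from the JDK source), the
      saved high half is copied into a local before the state field is
      cleared, and the local is what feeds the calculation:

```java
class SurrogateCombine {
    // Hypothetical helper: combines a high surrogate (0xd800-0xdbff) and a
    // low surrogate (0xdc00-0xdfff) into a single code point.
    static int combine(char high, char low) {
        return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
    }

    public static void main(String[] args) {
        // State carried over from the previous input buffer.
        char highHalfZoneCode = '\ud83d';
        char[] input = { '\ude00' };
        int inOff = 0;

        // Save the high half first, then clear the state.
        char inputChar = highHalfZoneCode;
        highHalfZoneCode = 0;

        // Use the saved copy, not the field that was just cleared.
        int ucs4 = combine(inputChar, input[inOff]);
        System.out.println(Integer.toHexString(ucs4)); // prints "1f600"
    }
}
```

      Passing the cleared highHalfZoneCode instead of inputChar would yield a
      negative value here, which is the first bug described above.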

      Next, it tries to output the ucs4 value:

                      output[0] = (byte)(0xf0 | ((ucs4 >> 18)) & 0x07);
                      output[1] = (byte)(0x80 | ((ucs4 >> 12) & 0x3f));
                      output[2] = (byte)(0x80 | ((ucs4 >> 6) & 0x3f));
                      output[3] = (byte)(0x80 | (ucs4 & 0x3f));
                      charOff++;

      This should *not* write to output[] directly. It should fill outputByte[],
      set outputSize = 4, and then fall through to the logic that occurs further down:

                  if (byteOff + outputSize > outEnd) {
                      throw new ConversionBufferFullException();
                  }
                  for (int i = 0; i < outputSize; i++) {
                      output[byteOff++] = outputByte[i];
                  }

      It might also be good for consistency if it set inputSize = 1 and then
      did "charOff += inputSize", rather than the current "charOff++", but
      that's probably a judgment call.
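
      The two fixes together can be sketched as a small stateful converter.
      This is an illustration under stated assumptions, not the actual JDK
      class: SplitSurrogateSketch and its convert() signature are hypothetical,
      java.nio.BufferOverflowException stands in for
      ConversionBufferFullException, and malformed input simply raises
      IllegalArgumentException.

```java
import java.nio.BufferOverflowException;

class SplitSurrogateSketch {
    private char highHalfZoneCode;                 // pending high half, 0 if none
    private final byte[] outputByte = new byte[4]; // staging area for one character

    // Converts input[inOff..inEnd) into output[outOff..outEnd) as UTF-8.
    // Returns the number of bytes written. A trailing high surrogate is
    // remembered and completed by the next call.
    int convert(char[] input, int inOff, int inEnd,
                byte[] output, int outOff, int outEnd) {
        int charOff = inOff;
        int byteOff = outOff;
        while (charOff < inEnd) {
            char c = input[charOff];
            int inputSize = 1;
            int outputSize;
            if (highHalfZoneCode != 0) {
                char inputChar = highHalfZoneCode;  // save before clearing
                highHalfZoneCode = 0;
                if (c < 0xdc00 || c > 0xdfff) {
                    throw new IllegalArgumentException("unpaired high surrogate");
                }
                int ucs4 = (inputChar - 0xd800) * 0x400 + (c - 0xdc00) + 0x10000;
                // Stage the bytes instead of writing output[0..3] directly.
                outputByte[0] = (byte) (0xf0 | ((ucs4 >> 18) & 0x07));
                outputByte[1] = (byte) (0x80 | ((ucs4 >> 12) & 0x3f));
                outputByte[2] = (byte) (0x80 | ((ucs4 >> 6) & 0x3f));
                outputByte[3] = (byte) (0x80 | (ucs4 & 0x3f));
                outputSize = 4;
            } else if (c >= 0xd800 && c <= 0xdbff) {
                highHalfZoneCode = c;               // wait for the low half
                outputSize = 0;
            } else if (c >= 0xdc00 && c <= 0xdfff) {
                throw new IllegalArgumentException("unpaired low surrogate");
            } else if (c < 0x80) {
                outputByte[0] = (byte) c;
                outputSize = 1;
            } else if (c < 0x800) {
                outputByte[0] = (byte) (0xc0 | (c >> 6));
                outputByte[1] = (byte) (0x80 | (c & 0x3f));
                outputSize = 2;
            } else {
                outputByte[0] = (byte) (0xe0 | (c >> 12));
                outputByte[1] = (byte) (0x80 | ((c >> 6) & 0x3f));
                outputByte[2] = (byte) (0x80 | (c & 0x3f));
                outputSize = 3;
            }
            // Shared capacity check and copy. This sketch does not try to
            // restore the pending surrogate if the check fails.
            if (byteOff + outputSize > outEnd) {
                throw new BufferOverflowException();
            }
            for (int i = 0; i < outputSize; i++) {
                output[byteOff++] = outputByte[i];
            }
            charOff += inputSize;
        }
        return byteOff - outOff;
    }
}
```

      Feeding the pair in two calls, for example { 'a', '\ud83d' } followed by
      { '\ude00', 'b' }, writes the four-byte sequence at the current byteOff
      rather than at the start of output[], and every character goes through
      the same buffer-full check.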

      Also, highHalfZoneCode is redundantly set to 0 again. Not bad, but looks funny.
      (Review ID: 105886)
      ======================================================================

            ilittlesunw Ian Little (Inactive)
            rlewis Roger Lewis (Inactive)