JDK-4391895: UTF8 Decoder Broken

    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • Fix Version: 1.4.0
    • Affects Version: 1.3.0
    • Component: core-libs



      Name: yyT116575 Date: 11/22/2000


      java version "1.3.0"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
      Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)

      This bug is related to the following bugs:
        4251997 - UTF-8 Surrogate Decoding is Broken
        4297837 - Silent Recovery from bad UTF-8
        4344267 - Broken UTF-8 conversion of split surrogate
      but I've included a test case to highlight the problem.

      Although I've filed this bug against 1.3, it affects
      all previous versions as well.

      The problem with the UTF8 decoder is that it does not
      properly handle surrogate characters. The documentation
      states that surrogates are not supported yet, but
      1) they should be, and 2) they seem to be supported
      anyway (or at least partially). It's alright not to
      support them at this time, but the support should be
      consistent and the behavior should be defined.

      For example, when bytes are passed to the String
      constructor with an encoding name of "UTF8", the
      surrogate characters are decoded correctly. However, if
      the same bytes are read from a byte stream, the
      surrogates are silently skipped! Strange. I would at
      least have thought that both methods would use the same
      underlying decoder code.
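
      For reference, the four bytes used in the test case below
      (F0 90 80 80) encode U+10000, which UTF-16 represents as
      the surrogate pair D800 DC00. The following is only a
      sketch of that arithmetic, not the JDK's decoder; the class
      and method names are made up for illustration, and the
      input is assumed to be a well-formed 4-byte sequence.

```java
final class SurrogateMath {

    // Decodes one 4-byte UTF-8 sequence into a code point.
    // Assumes the input is well-formed; no validation is done.
    static int decodeFourByte(byte[] b) {
        return ((b[0] & 0x07) << 18)
             | ((b[1] & 0x3F) << 12)
             | ((b[2] & 0x3F) << 6)
             |  (b[3] & 0x3F);
    }

    // Splits a supplementary code point (U+10000..U+10FFFF)
    // into a UTF-16 high/low surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;
        char high = (char) (0xD800 + (v >> 10));
        char low  = (char) (0xDC00 + (v & 0x3FF));
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        int cp = decodeFourByte(bytes);
        char[] pair = toSurrogates(cp);
        System.out.println("code point: 0x" + Integer.toHexString(cp));   // 0x10000
        System.out.println("high: 0x" + Integer.toHexString(pair[0]));    // 0xd800
        System.out.println("low: 0x" + Integer.toHexString(pair[1]));     // 0xdc00
    }
}
```

      A decoder that handles surrogates consistently would be
      expected to produce these two chars on both the String
      and the Reader paths.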

      Also, an InputStreamReader decoding UTF8 silently skips
      surrogates in the input stream. If the decision NOT to
      support surrogates stands, then perhaps the reader
      should throw some kind of exception to signal the error.
      Passing over them silently can cause problems for the
      application.
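
      As an illustration of the "throw instead of skip" behavior
      requested above, here is a sketch that relies on APIs not
      available in 1.3: java.nio.charset (added in 1.4) and
      StandardCharsets (added in Java 7). It is not a description
      of how this bug was fixed. The class and method names are
      hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {

    // Wraps a stream in a reader whose decoder reports malformed
    // input as an exception instead of skipping or replacing it.
    static Reader strictReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Returns true if reading the bytes to EOF raises a
    // CharacterCodingException, false if they decode cleanly.
    static boolean failsToDecode(byte[] bytes) throws IOException {
        Reader reader = strictReader(new ByteArrayInputStream(bytes));
        try {
            while (reader.read() != -1) {
                // Drain the reader; only the error behavior matters here.
            }
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] complete  = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        byte[] truncated = { (byte) 0xF0, (byte) 0x90, (byte) 0x80 };
        System.out.println("complete fails:  " + failsToDecode(complete));
        System.out.println("truncated fails: " + failsToDecode(truncated));
    }
}
```

      With CodingErrorAction.REPORT the application sees the bad
      input as an IOException subclass rather than losing
      characters without notice.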

      In addition, I believe that the UTF8 decoder also does
      not support reading a UTF8 byte-order-mark (BOM) at the
      beginning of the input. (It *does* occur in the real
      world -- e.g. Microsoft adds UTF8 BOMs to a lot of their
      documents.) It's not strictly disallowed; it's just weird
      and the decoder should be able to handle it.
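
      Until the decoder handles the BOM itself, an application
      can strip it before decoding. The following is a minimal
      sketch of such a workaround, assuming the decoder would
      otherwise not consume the BOM; the class and method names
      are hypothetical and not part of the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class Utf8BomSkipper {

    // Returns a stream positioned just past a leading UTF-8 BOM
    // (EF BB BF) if one is present; otherwise the returned stream
    // yields exactly the same bytes as the original.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read may return fewer.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: push the bytes back so nothing is lost.
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0x41 };
        InputStream in = skipBom(new ByteArrayInputStream(withBom));
        System.out.println("first byte: 0x" + Integer.toHexString(in.read())); // 0x41
    }
}
```

      The returned stream can then be passed to
      new InputStreamReader(stream, "UTF8") as in the test case.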

      /* Test case. Doesn't test ability to detect BOM. */
      import java.io.ByteArrayInputStream;
      import java.io.FilterInputStream;
      import java.io.InputStream;
      import java.io.InputStreamReader;
      import java.io.IOException;
      import java.io.Reader;

      public class BrokenUTF8 {

          // MAIN

          public static void main(String[] argv) throws Exception {
              System.out.println("#");
              System.out.println("# Byte array");
              System.out.println("#");
              final byte[] bytes = {
                  (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80
              };
              for (int i = 0; i < bytes.length; i++) {
                  int c = bytes[i] & 0x00FF;
                  System.out.println("byte["+i+"]: 0x"+Integer.toHexString(c));
              }
              System.out.println("#");
              System.out.println("# Converting bytes: new String(bytes, \"UTF8\")");
              System.out.println("#");
              String s = new String(bytes, "UTF8");
              int slen = s.length();
              for (int i = 0; i < slen; i++) {
                  int c = s.charAt(i);
                  System.out.println("s.charAt("+i+"): 0x"+Integer.toHexString(c));
              }
              System.out.println("#");
              System.out.println("# Converting bytes: new InputStreamReader(stream,\"UTF8\")");
              System.out.println("#");
              InputStream stream = new ByteArrayInputStream(bytes);
              InputStream streamReporter = new InputStreamReporter(stream);
              Reader reader = new InputStreamReader(streamReporter, "UTF8");
              int c = -1;
              do {
                  c = reader.read();
                  String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
                  System.out.println("Reader.read(): "+cs);
              } while (c != -1);
              System.out.println("#");
              System.out.println("# Done.");
              System.out.println("#");
          }

          // Classes

          static class InputStreamReporter extends FilterInputStream {

              // Constructors

              public InputStreamReporter(InputStream stream) {
                  super(stream);
              }

              // InputStream methods

              public int read() throws IOException {
                  int c = in.read();
                  // Print the "0x" prefix only for real byte values, not for EOF.
                  String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
                  System.out.println("InputStream.read(): "+cs);
                  return c;
              }

              public int read(byte[] buffer, int offset, int length) throws IOException {
                  int count = super.in.read(buffer, offset, length);
                  System.out.println("InputStream.read(byte[],"+offset+','+length+"): "+count);
                  return count;
              }

          } // class InputStreamReporter

      } // class BrokenUTF8
      (Review ID: 112649)
      ======================================================================

            Assignee: ilittlesunw Ian Little (Inactive)
            Reporter: yyoungsunw Yung-ching Young (Inactive)