Type: Bug
Resolution: Fixed
Priority: P3
Fix Version/s: 1.4.0
Affects Version/s: 1.3.0
Component/s: core-libs
Labels: webbug
Subcomponent: java.nio.charsets
Resolved In Build: beta
CPU: x86
OS: windows_nt

Name: yyT116575 Date: 11/22/2000

java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)

This bug is related to the following bugs:
  4251997 - UTF-8 Surrogate Decoding is Broken
  4297837 - Silent Recovery from bad UTF-8
  4344267 - Broken UTF-8 conversion of split surrogate
but I've included a test case to highlight the problem.

While I've stated that this bug affects 1.3, it also
affects all previous versions as well.

The problem with the UTF8 decoder is that it does not
properly handle surrogate characters. It is stated in
the documentation that surrogates are not supported as
of yet but 1) they should be, and 2) they seem to be
supported anyway (or at least partially). It's alright
to not support them at this time but the support should
be consistent and the behavior should be defined.

For example, when bytes are passed to the String
constructor with an encoding name of "UTF8", the
surrogate characters are decoded correctly. However, if
the surrogates appear in a byte stream, the surrogates
are silently skipped! Strange. I would at least have
thought that both methods would use the same underlying
decoder code.
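
For reference, here is a small sketch of what "decoded correctly" means for
the four bytes used in the test case below. It decodes the sequence by hand
and splits the resulting code point into a UTF-16 surrogate pair. The class
and method names (SurrogateMath, decodeFourByte, and so on) are made up for
illustration and are not JDK APIs; the input is assumed to be a well-formed
4-byte sequence, so no validation is done.

```java
// Hypothetical helper, not part of the JDK: hand-decodes one well-formed
// 4-byte UTF-8 sequence and derives the UTF-16 surrogate pair for it.
class SurrogateMath {

    // Combine the payload bits of a 4-byte UTF-8 sequence (11110xxx 10xxxxxx x3).
    static int decodeFourByte(byte[] b) {
        return ((b[0] & 0x07) << 18)
             | ((b[1] & 0x3F) << 12)
             | ((b[2] & 0x3F) << 6)
             |  (b[3] & 0x3F);
    }

    // High (leading) surrogate: 0xD800 plus the top 10 bits of (cp - 0x10000).
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of (cp - 0x10000).
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80 };
        int cp = decodeFourByte(bytes);
        System.out.println("code point: 0x" + Integer.toHexString(cp));                // 0x10000
        System.out.println("high:       0x" + Integer.toHexString(highSurrogate(cp))); // 0xd800
        System.out.println("low:        0x" + Integer.toHexString(lowSurrogate(cp)));  // 0xdc00
    }
}
```

So a decoder that handles this input should yield the two chars 0xD800 and
0xDC00, whichever API the bytes pass through.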

Also, the InputStreamReader decoding UTF8 silently skips
surrogates in the input stream. If the decision to NOT
support surrogates stands as is, then perhaps the reader
should throw some kind of exception to signal the error.
Passing over them silently can cause problems for the
application.
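
As a rough sketch of the "throw some kind of exception" behavior, the
following uses the java.nio.charset API. That API is an assumption here: it
does not exist in 1.3.0 (it arrived in 1.4, and StandardCharsets in Java 7),
so this shows what a caller can ask for on later releases, not a workaround
for the affected versions. StrictUtf8 and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a UTF-8 reader that reports bad input instead of
// silently skipping or replacing it.
class StrictUtf8 {

    // Build a Reader whose decoder is configured to REPORT errors.
    static Reader strictReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Drain the input; return false if the decoder reports a coding error.
    static boolean decodesCleanly(byte[] bytes) throws IOException {
        try (Reader r = strictReader(new ByteArrayInputStream(bytes))) {
            while (r.read() != -1) {
                // discard decoded chars; only the error behavior matters here
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] valid = { (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80 };
        byte[] bad   = { 0x41, (byte)0x80 };  // 'A' followed by a stray continuation byte
        System.out.println("valid input decodes cleanly: " + decodesCleanly(valid));
        System.out.println("bad input decodes cleanly:   " + decodesCleanly(bad));
    }
}
```

The point of the design is that the error policy is chosen by the caller, so
an application that cannot tolerate lost characters gets an IOException
subclass rather than a shortened string.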

In addition, I believe that the UTF8 decoder does not support
reading a UTF8 byte-order-mark (BOM) at the beginning of
the input. (It *does* occur in the real-world -- e.g.
Microsoft adds UTF8 BOMs to a lot of their documents.)
It's not strictly disallowed; it's just weird and the
decoder should be able to handle it.
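
Until the decoder handles it, an application can strip the BOM itself before
handing the stream to a reader. The sketch below is one way to do that with
java.io only; Utf8BomSkipper and skipBom are hypothetical names, and the
approach (peek at three bytes, push them back if they are not EF BB BF) is
my assumption about a reasonable workaround, not how the JDK does or will
do it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

// Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) if present.
class Utf8BomSkipper {

    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() call may return fewer.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: put back whatever was read so the caller sees it.
        if (!bom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = { (byte)0xEF, (byte)0xBB, (byte)0xBF, 0x41 };
        Reader reader = new InputStreamReader(
            skipBom(new ByteArrayInputStream(withBom)), "UTF8");
        System.out.println("first char: 0x" + Integer.toHexString(reader.read()));
    }
}
```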

/* Test case. Doesn't test ability to detect BOM. */
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.IOException;
import java.io.Reader;

public class BrokenUTF8 {

    // MAIN

    public static void main(String[] argv) throws Exception {
        System.out.println("#");
        System.out.println("# Byte array");
        System.out.println("#");
        final byte[] bytes = {
            (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80
        };
        for (int i = 0; i < bytes.length; i++) {
            int c = bytes[i] & 0x00FF;
            System.out.println("byte["+i+"]: 0x"+Integer.toHexString(c));
        }
        System.out.println("#");
        System.out.println("# Converting bytes: new String(bytes, \"UTF8\")");
        System.out.println("#");
        String s = new String(bytes, "UTF8");
        int slen = s.length();
        for (int i = 0; i < slen; i++) {
            int c = s.charAt(i);
            System.out.println("s.charAt("+i+"): 0x"+Integer.toHexString(c));
        }
        System.out.println("#");
        System.out.println("# Converting bytes: new InputStreamReader(stream,\"UTF8\")");
        System.out.println("#");
        InputStream stream = new ByteArrayInputStream(bytes);
        InputStream streamReporter = new InputStreamReporter(stream);
        Reader reader = new InputStreamReader(streamReporter, "UTF8");
        int c = -1;
        do {
            c = reader.read();
            String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
            System.out.println("Reader.read(): "+cs);
        } while (c != -1);
        System.out.println("#");
        System.out.println("# Done.");
        System.out.println("#");
    }

    // Classes

    static class InputStreamReporter extends FilterInputStream {

        // Constructors

        public InputStreamReporter(InputStream stream) {
            super(stream);
        }

        // InputStream methods

        public int read() throws IOException {
            int c = in.read();
            // Print the "0x" prefix only for real byte values, not for EOF.
            if (c != -1) {
                System.out.println("InputStream.read(): 0x"+Integer.toHexString(c));
            }
            else {
                System.out.println("InputStream.read(): EOF");
            }
            return c;
        }

        public int read(byte[] buffer, int offset, int length) throws IOException {
            int count = super.in.read(buffer, offset, length);
            System.out.println("InputStream.read(byte[],"+offset+','+length+"): "+count);
            return count;
        }

    } // class InputStreamReporter

} // class BrokenUTF8
(Review ID: 112649)
======================================================================

relates to

JDK-4328816 Unicode 2.0 surrogate support

Resolved

JDK-4251997 UTF-8 Surrogate Decoding is Broken

Resolved

JDK-4344267 Broken UTF-8 conversion of split surrogate-pair

Resolved

JDK-4297837 Silent Recovery from bad UTF-8

Closed

Assignee: Ian Little (Inactive)
Reporter: Yung-ching Young (Inactive)
Votes: 0
Watchers: 1
Created: 2000-11-22 11:06
Updated: 2000-12-19 16:00
Resolved: 2000-12-19 16:00
Imported: 15/Sep/12 1:15 PM
Indexed: 17/Jul/12 10:49 AM
