Bug
Resolution: Fixed
P3
1.3.0
beta
x86
windows_nt
Name: yyT116575 Date: 11/22/2000
java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)
This bug is related to the following bugs:
4251997 - UTF-8 Surrogate Decoding is Broken
4297837 - Silent Recovery from bad UTF-8
4344267 - Broken UTF-8 conversion of split surrogate
but I've included a test case to highlight the problem.
While I've stated that this bug affects 1.3, it
affects all previous versions as well.
The problem with the UTF8 decoder is that it does not
properly handle surrogate characters. It is stated in
the documentation that surrogates are not supported as
of yet but 1) they should be, and 2) they seem to be
supported anyway (or at least partially). It's alright
to not support them at this time but the support should
be consistent and the behavior should be defined.
For example, when bytes are passed to the String
constructor with an encoding name of "UTF8", the
surrogate characters are decoded correctly. However, if
the surrogates appear in a byte stream, the surrogates
are silently skipped! Strange. I would at least have
thought that both methods would use the same underlying
decoder code.
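For reference, the arithmetic a decoder would need for the test bytes can be
sketched as follows. This is only an illustration, not the JDK's decoder code;
the class and method names (SurrogateSketch, decodeFourByte, highSurrogate,
lowSurrogate) are made up for this example. It shows that F0 90 80 80 is the
code point U+10000, which a char-based API has to deliver as the surrogate
pair D800 DC00.

```java
public class SurrogateSketch {
    // Hypothetical helper: decodes one well-formed 4-byte UTF-8 sequence
    // into its code point. No validation is done; this is only the arithmetic.
    static int decodeFourByte(byte[] b) {
        return ((b[0] & 0x07) << 18)
             | ((b[1] & 0x3F) << 12)
             | ((b[2] & 0x3F) << 6)
             |  (b[3] & 0x3F);
    }

    // High (leading) surrogate for a code point in U+10000..U+10FFFF.
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate for a code point in U+10000..U+10FFFF.
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        int cp = decodeFourByte(bytes);
        System.out.println("code point: 0x" + Integer.toHexString(cp));
        System.out.println("high: 0x" + Integer.toHexString(highSurrogate(cp)));
        System.out.println("low:  0x" + Integer.toHexString(lowSurrogate(cp)));
    }
}
```

Both decoding paths should produce these same two chars for the same four bytes.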
Also, the InputStreamReader decoding UTF8 silently skips
surrogates in the input stream. If the decision to NOT
support surrogates stands as is, then perhaps the reader
should throw some kind of exception to signal the error.
Passing over them silently can cause problems for the
application.
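As an illustration of the "signal the error" behavior, here is a minimal
sketch. It assumes a later JDK that provides the java.nio.charset package,
which is not available in the 1.3.0 release this report was filed against.
The class and method names (StrictUtf8Sketch, isWellFormedUtf8) are
hypothetical. The decoder is configured to report bad input instead of
skipping or replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Sketch {
    // Hypothetical helper: returns true only if the whole array decodes
    // as UTF-8 without any malformed or unmappable input.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // decode() throws instead of silently dropping bad bytes.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] complete  = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        byte[] truncated = { (byte) 0xF0, (byte) 0x90, (byte) 0x80 };
        System.out.println("complete:  " + isWellFormedUtf8(complete));
        System.out.println("truncated: " + isWellFormedUtf8(truncated));
    }
}
```

A reader built on a decoder configured this way would surface the problem to
the application as an exception rather than passing over it.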
In addition, I believe that the UTF8 decoder also does not
support reading a UTF8 byte-order-mark (BOM) at the beginning
of the input. (It *does* occur in the real world -- e.g.
Microsoft adds UTF8 BOMs to a lot of their documents.)
It's not strictly disallowed; it's just weird and the
decoder should be able to handle it.
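Until the decoder handles this itself, an application could work around it
along these lines. This is a rough sketch, not a proposed fix for the
decoder; the names BomSkipSketch, skipUtf8Bom and decodeSkippingBom are
invented for the example, and StandardCharsets and UncheckedIOException
assume a much later JDK than 1.3.0. The idea is to peek at the first three
bytes and push them back unless they are EF BB BF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkipSketch {
    // Hypothetical helper: returns a stream positioned after a leading
    // UTF-8 BOM (EF BB BF) if one is present, else at the original start.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean bom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pin;
    }

    // Usage helper: decodes a byte array as UTF-8, ignoring a leading BOM.
    static String decodeSkippingBom(byte[] data) {
        try {
            Reader reader = new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0x41 };
        System.out.println(decodeSkippingBom(withBom).length());
    }
}
```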
/* Test case. Doesn't test ability to detect BOM. */
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.IOException;
import java.io.Reader;

public class BrokenUTF8 {

    // MAIN
    public static void main(String[] argv) throws Exception {
        System.out.println("#");
        System.out.println("# Byte array");
        System.out.println("#");
        // UTF-8 encoding of U+10000, the first character outside the BMP
        final byte[] bytes = {
            (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80
        };
        for (int i = 0; i < bytes.length; i++) {
            int c = bytes[i] & 0x00FF;
            System.out.println("byte["+i+"]: 0x"+Integer.toHexString(c));
        }

        // Path 1: decode through the String constructor
        System.out.println("#");
        System.out.println("# Converting bytes: new String(bytes, \"UTF8\")");
        System.out.println("#");
        String s = new String(bytes, "UTF8");
        int slen = s.length();
        for (int i = 0; i < slen; i++) {
            int c = s.charAt(i);
            System.out.println("s.charAt("+i+"): 0x"+Integer.toHexString(c));
        }

        // Path 2: decode the same bytes through an InputStreamReader
        System.out.println("#");
        System.out.println("# Converting bytes: new InputStreamReader(stream,\"UTF8\")");
        System.out.println("#");
        InputStream stream = new ByteArrayInputStream(bytes);
        InputStream streamReporter = new InputStreamReporter(stream);
        Reader reader = new InputStreamReader(streamReporter, "UTF8");
        int c = -1;
        do {
            c = reader.read();
            String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
            System.out.println("Reader.read(): "+cs);
        } while (c != -1);
        System.out.println("#");
        System.out.println("# Done.");
        System.out.println("#");
    }

    // Classes

    /** Logs every read so the bytes consumed by the decoder are visible. */
    static class InputStreamReporter extends FilterInputStream {

        // Constructors
        public InputStreamReporter(InputStream stream) {
            super(stream);
        }

        // InputStream methods
        public int read() throws IOException {
            int c = in.read();
            // Print "EOF" without a "0x" prefix at end of stream
            String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
            System.out.println("InputStream.read(): "+cs);
            return c;
        }

        public int read(byte[] buffer, int offset, int length) throws IOException {
            int count = in.read(buffer, offset, length);
            System.out.println("InputStream.read(byte[],"+offset+','+length+"): "+count);
            return count;
        }
    } // class InputStreamReporter
} // class BrokenUTF8
(Review ID: 112649)
======================================================================
Relates to:
JDK-4328816 Unicode 2.0 surrogate support (Resolved)
JDK-4251997 UTF-8 Surrogate Decoding is Broken (Resolved)
JDK-4344267 Broken UTF-8 conversion of split surrogate-pair (Resolved)
JDK-4297837 Silent Recovery from bad UTF-8 (Closed)