Bug
Resolution: Fixed
P3
7u51, 8u20
b10
x86
windows_2008
Verified
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8045662 | 8u25 | Xueming Shen | P3 | Resolved | Fixed | b01 |
JDK-8043096 | 8u20 | Sean Coffey | P3 | Resolved | Fixed | b17 |
JDK-8053608 | emb-8u26 | Xueming Shen | P3 | Resolved | Fixed | b17 |
FULL PRODUCT VERSION :
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
EXTRA RELEVANT SYSTEM CONFIGURATION :
None required.
A DESCRIPTION OF THE PROBLEM :
The Apache Tomcat team has put together a test case [1] that demonstrates multiple
UTF-8 decoding bugs. You'll see from the change history of that file that the UTF-8
decoder in Java 8 is a significant improvement over the Java 7
implementation but a number of bugs still remain.
I thought it would be helpful to walk through one of the test case
examples. The code at line 99 onwards of the test case is as follows:
99 // JVM decoder does not report error until all 4 bytes are available
100 TEST_CASES.add(new Utf8TestCase(
101 "Invalid code point - out of range",
102 new int[] {0x41, 0xF4, 0x90, 0x80, 0x80, 0x41},
103 2,
104 "A\uFFFD\uFFFD\uFFFD\uFFFDA").addForJvm(ERROR_POS_PLUS2));
It is the ".addForJvm(...)" part that indicates that the standard Java
decoder does not handle this case correctly. The parameter to that
method call (or calls) indicates the problem (or problems). In this case
the invalid UTF-8 sequence is detected, but 2 bytes later than it
should have been.
The first byte is correctly decoded to 'A'.
The second byte is correctly interpreted as marking the start of a 4
byte UTF-8 sequence. Recall that a 4 byte UTF-8 sequence takes the form:
11110aaa 10bbbbbb 10cccccc 10dddddd
Recall also that the code point associated with the above four byte
sequence is:
000aaabb bbbbcccc ccdddddd
Therefore, if the first byte of the 4 byte sequence is 0xF4 then the
code point must be:
000100bb bbbbcccc ccdddddd
Recall that the valid range of Unicode code points is zero to 0x10FFFF or
in binary:
00010000 11111111 11111111
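As a minimal sketch of the bit layout above, the following assembles a code point from four bytes and compares it with the upper limit. The class and method names (FourByteCodePoint, assemble) are hypothetical and chosen only for illustration; no validation of the input bytes is attempted.

```java
final class FourByteCodePoint {

    // Assemble the code point from a 4 byte sequence laid out as
    // 11110aaa 10bbbbbb 10cccccc 10dddddd. No validation is performed.
    static int assemble(int b1, int b2, int b3, int b4) {
        return ((b1 & 0x07) << 18)   // aaa
             | ((b2 & 0x3F) << 12)   // bbbbbb
             | ((b3 & 0x3F) << 6)    // cccccc
             |  (b4 & 0x3F);         // dddddd
    }

    public static void main(String[] args) {
        // Highest valid code point
        int max = assemble(0xF4, 0x8F, 0xBF, 0xBF);
        System.out.println(Integer.toHexString(max));          // 10ffff

        // The sequence from the test case
        int bad = assemble(0xF4, 0x90, 0x80, 0x80);
        System.out.println(Integer.toHexString(bad));          // 110000
        System.out.println(bad > Character.MAX_CODE_POINT);    // true
    }
}
```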
When the third byte (0x90) is read this maps to the second byte in the
4 byte sequence as follows:
10010000
10bbbbbb
This provides 6 more bits for the code point which gives:
00010001 0000cccc ccdddddd
At this point it is known that whatever the values of the third and fourth
bytes in the sequence, the code point is going to be at least 0x110000,
which is greater than 0x10FFFF, and therefore it can - and should - be
rejected as invalid at this point. The standard Java decoder does not do this.
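To illustrate how the rejection could happen as soon as the second byte of the sequence arrives, here is a rough sketch of such a check. It is not the JDK or Tomcat implementation; the class and method names are hypothetical, and it assumes the lead byte has already been recognised as the start of a 4 byte sequence (0xF0 to 0xF4). The byte ranges follow the well-formed UTF-8 byte sequence table in chapter 3 of the Unicode specification.

```java
final class EarlyRangeCheck {

    // Returns true if 'second' can never be the second byte of a valid
    // 4 byte sequence that starts with 'lead'.
    // Assumption: 'lead' is in the range 0xF0..0xF4.
    static boolean isInvalidSecondByte(int lead, int second) {
        if ((second & 0xC0) != 0x80) {
            return true;              // not a continuation byte (10xxxxxx)
        }
        if (lead == 0xF0) {
            return second < 0x90;     // would encode a value below 0x10000 (overlong)
        }
        if (lead == 0xF4) {
            return second > 0x8F;     // would encode a value above 0x10FFFF
        }
        return false;                 // 0xF1..0xF3 accept any continuation byte
    }

    public static void main(String[] args) {
        // The pair from the test case: rejectable without reading further bytes
        System.out.println(isInvalidSecondByte(0xF4, 0x90));   // true
        // The largest second byte that can still follow 0xF4
        System.out.println(isInvalidSecondByte(0xF4, 0x8F));   // false
    }
}
```

With a check of this kind the error position for the test input would be index 2, which is what the test case expects.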
The requirement to reject the invalid sequence and the importance of
doing so as soon as possible - particularly when the decoder has been
configured to use replacement characters - is discussed in the Unicode
specification 6.2, chapter 3, page 96 "Constraints on Conversion Processes".
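The point at which the error is reported matters in replacement mode because it determines how many bytes are consumed by each replacement character. The sketch below, with a hypothetical class name, decodes the test case input with CodingErrorAction.REPLACE and prints each resulting char so the number of U+FFFD characters can be compared with the four that the test case expects. The output depends on the JDK version in use, so none is shown here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ReplacementCount {

    // Decode the whole input in one call, substituting U+FFFD for malformed input.
    static String decodeWithReplacement(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = {0x41, (byte) 0xF4, (byte) 0x90, (byte) 0x80, (byte) 0x80, 0x41};
        String decoded = decodeWithReplacement(input);
        // Print one line per char so replacement characters are easy to count
        for (int i = 0; i < decoded.length(); i++) {
            System.out.printf("U+%04X%n", (int) decoded.charAt(i));
        }
    }
}
```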
The other test cases in the unit test exercise various edge cases for a
UTF-8 decoder.
The issues with the standard Java decoder may be summarised as:
- not always detecting an invalid sequence early enough
- sometimes incorrectly swallowing a valid byte as part of a preceding
invalid byte sequence
- sometimes incorrectly swallowing an invalid byte as part of a preceding
invalid byte sequence
The nature of these errors is such that they often appear in combination
for a particular test case.
In order to avoid any potential security issue with the incorrect
decoding of a UTF-8 sequence - particularly in URLs - Tomcat has had to
implement its own UTF-8 decoder. I am aware that Jetty has also had to
take this approach and I assume other Servlet containers have as well.
It would be great to see these bugs in the UTF-8 decoder fixed so that
Tomcat (and the other containers that have had to implement their own
decoders) can drop that code and use the standard Java decoder.
[1] http://svn.apache.org/viewvc/tomcat/trunk/test/org/apache/tomcat/util/buf/TestUtf8.java?view=markup
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Decode the following byte sequence one byte at a time with the standard UTF-8 decoder:
0x41, 0xF4, 0x90, 0x80, 0x80, 0x41
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
An error should be thrown after processing the third byte (0x90).
ACTUAL -
An error is thrown after processing the fifth byte.
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
package org.apache.markt;

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Bug {
    public static void main(String[] args) {
        int[] input = new int[] { 0x41, 0xF4, 0x90, 0x80, 0x80, 0x41};
        int len = input.length;
        ByteBuffer bb = ByteBuffer.allocate(len);
        CharBuffer cb = CharBuffer.allocate(len);

        // Configure decoder to fail on an error
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

        // Add each byte one at a time. The decoder should fail as soon as
        // an invalid sequence has been provided
        for (int i = 0; i < len; i++) {
            bb.put((byte) input[i]);
            bb.flip();
            CoderResult cr = decoder.decode(bb, cb, false);
            if (cr.isError()) {
                if (i == 2) {
                    break;
                }
                throw new IllegalStateException("Error first detected at index " +
                        i + " rather than at index 2");
            }
            bb.compact();
        }
    }
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Use a custom UTF-8 decoder:
http://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/tomcat/util/buf/Utf8Decoder.java?view=annotate
backported by:
- JDK-8043096 UTF-8 decoder fails to handle some edge cases correctly (Resolved)
- JDK-8045662 UTF-8 decoder fails to handle some edge cases correctly (Resolved)
- JDK-8053608 UTF-8 decoder fails to handle some edge cases correctly (Resolved)
relates to:
- JDK-7096080 UTF8 update and new CESU-8 charset (Closed)