
Type: Bug
Resolution: Fixed
Priority: P3
Fix Version/s: 9
Affects Version/s: 7u51, 8u20
Component/s: core-libs
Labels:
- apache-tomcat-found
- webbug

Subcomponent: java.nio.charsets
Resolved In Build: b10
CPU: x86
OS: windows_2008
Verification: Verified

Issue	Fix Version	Assignee	Priority	Status	Resolution	Resolved In Build
JDK-8045662	8u25	Xueming Shen	P3	Resolved	Fixed	b01
JDK-8043096	8u20	Sean Coffey	P3	Resolved	Fixed	b17
JDK-8053608	emb-8u26	Xueming Shen	P3	Resolved	Fixed	b17

FULL PRODUCT VERSION :
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

EXTRA RELEVANT SYSTEM CONFIGURATION :
None required.

A DESCRIPTION OF THE PROBLEM :
The Apache Tomcat team has put together a test case [1] that demonstrates multiple
UTF-8 decoding bugs. You'll see from the change history of that file that the UTF-8
decoder in Java 8 is a significant improvement over the Java 7
implementation but a number of bugs still remain.

I thought it would be helpful to walk through one of the test case
examples. The code at line 99 onwards of the test case is as follows:

 99 // JVM decoder does not report error until all 4 bytes are available
100 TEST_CASES.add(new Utf8TestCase(
101         "Invalid code point - out of range",
102         new int[] {0x41, 0xF4, 0x90, 0x80, 0x80, 0x41},
103         2,
104         "A\uFFFD\uFFFD\uFFFD\uFFFDA").addForJvm(ERROR_POS_PLUS2));

It is the ".addForJvm(...)" part that indicates that the standard Java
decoder does not handle this case correctly. The parameter (or
parameters) to that method call indicates the problem (or problems). In
this case the invalid UTF-8 sequence is detected, but 2 bytes later than
it should have been.

The first byte is correctly decoded to 'A'.

The second byte is correctly interpreted as marking the start of a 4
byte UTF-8 sequence. Recall that a 4 byte UTF-8 sequence takes the form:

11110aaa 10bbbbbb 10cccccc 10dddddd

Recall also that the code point associated with the above four byte
sequence is:

000aaabb bbbbcccc ccdddddd

Therefore, if the first byte of the 4 byte sequence is 0xF4 then the
code point must be:

000100bb bbbbcccc ccdddddd

Recall that the valid range of Unicode code points is zero to 0x10FFFF or
in binary:
00010000 11111111 11111111
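
As a rough illustration of the layout and range above, the bits can be
assembled as in the following sketch. The class and method names are
hypothetical and are not taken from Tomcat or the JDK. The method assumes
its arguments already have the 11110aaa / 10xxxxxx form and performs no
validation.

```java
public class FourByteLayout {

    // Assemble the code point 000aaabb bbbbcccc ccdddddd from the 4 byte
    // sequence 11110aaa 10bbbbbb 10cccccc 10dddddd. No validation is done.
    public static int assemble(int b1, int b2, int b3, int b4) {
        return ((b1 & 0x07) << 18)
                | ((b2 & 0x3F) << 12)
                | ((b3 & 0x3F) << 6)
                | (b4 & 0x3F);
    }

    public static void main(String[] args) {
        // Largest valid code point, encoded as F4 8F BF BF.
        int max = assemble(0xF4, 0x8F, 0xBF, 0xBF);
        System.out.println(Integer.toHexString(max)); // prints 10ffff

        // The 4 byte sequence from the test case.
        int bad = assemble(0xF4, 0x90, 0x80, 0x80);
        System.out.println(Integer.toHexString(bad)); // prints 110000
        System.out.println(bad > Character.MAX_CODE_POINT); // prints true
    }
}
```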

When the third byte of the input (0x90) is read, it maps to the second
byte of the 4 byte sequence as follows:
10010000
10bbbbbb

This provides 6 more bits for the code point, which gives:

00010001 0000cccc ccdddddd

At this point it is known that whatever the values of the third and fourth
bytes in the sequence, the code point is going to be greater than
0x10FFFF and therefore it can - and should - be rejected as invalid at
this point. The standard Java decoder does not do this.
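
The same conclusion can be checked mechanically. The sketch below
(hypothetical names, not the JDK or Tomcat implementation) computes the
smallest code point that a 4 byte sequence could encode given only its
first two bytes, by treating the 12 bits not yet seen as zero. If even
that lower bound exceeds 0x10FFFF, the sequence can be rejected without
reading any further bytes.

```java
public class PrefixBound {

    // Smallest code point that any 4 byte sequence starting with the given
    // lead byte and second byte could encode. The remaining 12 bits
    // (cccccc dddddd) are taken as zero.
    public static int minCodePoint(int lead, int second) {
        return ((lead & 0x07) << 18) | ((second & 0x3F) << 12);
    }

    // True if the two byte prefix already rules out every valid code point.
    public static boolean alreadyOutOfRange(int lead, int second) {
        return minCodePoint(lead, second) > Character.MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(minCodePoint(0xF4, 0x90))); // prints 110000
        System.out.println(alreadyOutOfRange(0xF4, 0x90)); // prints true
        System.out.println(alreadyOutOfRange(0xF4, 0x8F)); // prints false
    }
}
```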

The requirement to reject the invalid sequence and the importance of
doing it as soon as possible - particularly when the decoder has been
configured to use replacement characters - is discussed in the Unicode
specification 6.2, chapter 3, page 96 "Constraints on Conversion Processes".
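
One way a decoder can reject such sequences as early as possible is to
restrict the second byte according to the lead byte, using the ranges from
the table of well-formed UTF-8 byte sequences in chapter 3 of the Unicode
standard. The following is only a minimal sketch under that assumption.
The class and method names are made up for illustration and the code is
not taken from the JDK or Tomcat decoders.

```java
public class SecondByteCheck {

    // Returns true if 'second' is an allowed second byte after 'lead'.
    // Only multi-byte lead bytes (0xC2..0xF4) can return true.
    public static boolean isValidSecondByte(int lead, int second) {
        if (lead < 0xC2 || lead > 0xF4) {
            // ASCII, a continuation byte, or a byte that is never a valid lead
            return false;
        }
        int lo = 0x80;
        int hi = 0xBF;
        switch (lead) {
            case 0xE0: lo = 0xA0; break; // excludes overlong 3 byte forms
            case 0xED: hi = 0x9F; break; // excludes surrogates
            case 0xF0: lo = 0x90; break; // excludes overlong 4 byte forms
            case 0xF4: hi = 0x8F; break; // excludes code points above 0x10FFFF
            default: break;
        }
        return second >= lo && second <= hi;
    }

    public static void main(String[] args) {
        // The prefix from the test case is rejected on its second byte.
        System.out.println(isValidSecondByte(0xF4, 0x90)); // prints false
        System.out.println(isValidSecondByte(0xF4, 0x8F)); // prints true
    }
}
```

With a check of this kind the error for the input 0x41, 0xF4, 0x90, ... is
raised when 0x90 arrives, which is the position the test case expects.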

The other test cases in the unit test cover various edge cases for a
UTF-8 decoder.

The issues with the standard Java decoder may be summarised as:
- not always detecting an invalid sequence early enough
- sometimes incorrectly swallowing a valid byte as part of a preceding
  invalid byte sequence
- sometimes incorrectly swallowing an invalid byte as part of a preceding
  invalid byte sequence

The nature of these errors is such that they often appear in combination
for a particular test case.

In order to avoid any potential security issue with the incorrect
decoding of a UTF-8 sequence - particularly in URLs - Tomcat has had to
implement its own UTF-8 decoder. I am aware that Jetty has also had to
take this approach and I assume other Servlet containers have as well.

It would be great to see these bugs in the UTF-8 decoder fixed so that
Tomcat (and the other containers that have had to implement their own
decoders) can drop that code and use the standard Java decoder.

[1] http://svn.apache.org/viewvc/tomcat/trunk/test/org/apache/tomcat/util/buf/TestUtf8.java?view=markup

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Decode the following byte sequence one byte at a time with the standard UTF-8 decoder:

0x41, 0xF4, 0x90, 0x80, 0x80, 0x41

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
An error should be reported after processing the third byte (0x90).
ACTUAL -
An error is not reported until the fifth byte has been processed.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
package org.apache.markt;

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Bug {

    public static void main(String[] args) {
        int[] input = new int[] { 0x41, 0xF4, 0x90, 0x80, 0x80, 0x41};

        int len = input.length;

        ByteBuffer bb = ByteBuffer.allocate(len);
        CharBuffer cb = CharBuffer.allocate(len);

        // Configure decoder to fail on an error
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

        // Add each byte one at a time. The decoder should fail as soon as
        // an invalid sequence has been provided
        for (int i = 0; i < len; i++) {
            bb.put((byte) input[i]);
            bb.flip();
            CoderResult cr = decoder.decode(bb, cb, false);
            if (cr.isError()) {
                if (i == 2) {
                    break;
                }
                throw new IllegalStateException("Error first detected at index " +
                        i + " rather than at index 2");
            }
            bb.compact();
        }
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use a custom UTF-8 decoder:
http://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/tomcat/util/buf/Utf8Decoder.java?view=annotate

backported by
JDK-8043096 UTF-8 decoder fails to handle some edge cases correctly (Resolved)
JDK-8045662 UTF-8 decoder fails to handle some edge cases correctly (Resolved)
JDK-8053608 UTF-8 decoder fails to handle some edge cases correctly (Resolved)

relates to
JDK-7096080 UTF8 update and new CESU-8 charset (Closed)
