JDK-8039751

UTF-8 decoder fails to handle some edge cases correctly


    • b10
    • x86
    • windows_2008
    • Verified

        FULL PRODUCT VERSION :
        java version "1.7.0_51"
        Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
        Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

        java version "1.8.0"
        Java(TM) SE Runtime Environment (build 1.8.0-b132)
        Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

        EXTRA RELEVANT SYSTEM CONFIGURATION :
        None required.

        A DESCRIPTION OF THE PROBLEM :
        The Apache Tomcat team has put together a test case [1] that demonstrates multiple
        UTF-8 decoding bugs. You'll see from the change history of that file that the UTF-8
        decoder in Java 8 is a significant improvement over the Java 7
        implementation but a number of bugs still remain.

        I thought it would be helpful to walk through one of the test case
        examples. The code at line 99 onwards of the test case is as follows:

        99 // JVM decoder does not report error until all 4 bytes are available
        100 TEST_CASES.add(new Utf8TestCase(
        101 "Invalid code point - out of range",
        102 new int[] {0x41, 0xF4, 0x90, 0x80, 0x80, 0x41},
        103 2,
        104 "A\uFFFD\uFFFD\uFFFD\uFFFDA").addForJvm(ERROR_POS_PLUS2));

        It is the ".addForJvm(...)" part that indicates that the standard Java
        decoder does not handle this case correctly. The parameter to that
        method call (or calls) indicates the problem (or problems). In this case
        the invalid UTF-8 sequence is detected, but 2 bytes later than it
        should have been.

        The first byte is correctly decoded to 'A'.

        The second byte is correctly interpreted as marking the start of a 4
        byte UTF-8 sequence. Recall that a 4 byte UTF-8 sequence takes the form:

        11110aaa 10bbbbbb 10cccccc 10dddddd

        Recall also that the code point associated with the above four byte
        sequence is:

        000aaabb bbbbcccc ccdddddd


        Therefore, if the first byte of the 4 byte sequence is 0xF4 then the
        code point must be:

        000100bb bbbbcccc ccdddddd

        Recall that the valid range of Unicode code points is zero to 0x10FFFF
        or in binary:
        00010000 11111111 11111111

        When the third byte of the input (0x90) is read, it maps to the second
        byte in the 4 byte sequence as follows:
        10010000
        10bbbbbb

        This provides 6 more bits for the code point, which gives:

        00010001 0000cccc ccdddddd

        At this point it is known that, whatever the values of the third and
        fourth bytes in the sequence, the code point is going to be at least
        0x110000, which is greater than 0x10FFFF, and therefore it can - and
        should - be rejected as invalid at this point. The standard Java
        decoder does not do this.
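
        The arithmetic above can be checked with a few lines of Java. This is
        only a minimal sketch: the class and method names (Utf8PrefixRange,
        minCodePoint, rejectableAfterTwoBytes) are hypothetical and are not
        taken from the JDK or from Tomcat. It computes the smallest code point
        that a 4 byte sequence can encode once its first two bytes are known,
        by treating the 12 bits still to come as zero.

```java
public class Utf8PrefixRange {

    public static final int MAX_CODE_POINT = 0x10FFFF;

    /** Smallest code point a 4 byte sequence can encode given its first two bytes. */
    public static int minCodePoint(int b1, int b2) {
        // 11110aaa contributes 3 bits and 10bbbbbb contributes 6 bits.
        // The remaining 12 bits (cccccc dddddd) are taken to be all zero.
        return ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12);
    }

    /** True if no third and fourth byte can bring the code point into range. */
    public static boolean rejectableAfterTwoBytes(int b1, int b2) {
        return minCodePoint(b1, b2) > MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(minCodePoint(0xF4, 0x90))); // 110000
        System.out.println(rejectableAfterTwoBytes(0xF4, 0x90));           // true
        System.out.println(rejectableAfterTwoBytes(0xF4, 0x8F));           // false
    }
}
```

        For 0xF4 0x90 the minimum is already 0x110000, so the sequence can be
        rejected without waiting for the last two bytes. For 0xF4 0x8F the
        minimum is 0x10F000, which is still in range.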

        The requirement to reject the invalid sequence and the importance of
        doing so as soon as possible - particularly when the decoder has been
        configured to use replacement characters - is discussed in the Unicode
        specification 6.2, chapter 3, page 96 "Constraints on Conversion Processes".
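
        One way to detect such sequences at the earliest possible byte is to
        restrict the second byte according to the lead byte, following the
        ranges in the Unicode standard's table of well-formed UTF-8 byte
        sequences (Table 3-7 in chapter 3). The sketch below illustrates that
        idea only. The class and method names are hypothetical, and it is not
        the code used by the JDK decoder or by Tomcat's Utf8Decoder.

```java
public class Utf8SecondByteCheck {

    /**
     * Returns true if b2 may follow lead byte b1 in well-formed UTF-8.
     * Both arguments are unsigned byte values in the range 0x00 to 0xFF.
     */
    public static boolean isValidSecondByte(int b1, int b2) {
        if (b1 < 0xC2 || b1 > 0xF4) {
            return false; // not the lead byte of a multi-byte sequence
        }
        int lo = 0x80;
        int hi = 0xBF;
        switch (b1) {
            case 0xE0: lo = 0xA0; break; // excludes overlong 3 byte forms
            case 0xED: hi = 0x9F; break; // excludes surrogates U+D800..U+DFFF
            case 0xF0: lo = 0x90; break; // excludes overlong 4 byte forms
            case 0xF4: hi = 0x8F; break; // excludes code points above U+10FFFF
            default: break;
        }
        return b2 >= lo && b2 <= hi;
    }

    public static void main(String[] args) {
        System.out.println(isValidSecondByte(0xF4, 0x90)); // false
        System.out.println(isValidSecondByte(0xF4, 0x8F)); // true
        System.out.println(isValidSecondByte(0xED, 0xA0)); // false
    }
}
```

        With a check of this kind, the pair 0xF4 0x90 from the test case is
        flagged as soon as 0x90 arrives, which is the behaviour the test case
        expects.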


        The other test cases in the unit test exercise various edge cases for a
        UTF-8 decoder.

        The issues with the standard Java decoder may be summarised as:
        - not always detecting an invalid sequence early enough
        - sometimes incorrectly swallowing a valid byte as part of a preceding
          invalid byte sequence
        - sometimes incorrectly swallowing an invalid byte as part of a preceding
          invalid byte sequence

        The nature of these errors is such that they often appear in combination
        for a particular test case.

        In order to avoid any potential security issue with the incorrect
        decoding of a UTF-8 sequence - particularly in URLs - Tomcat has had to
        implement its own UTF-8 decoder. I am aware that Jetty has also had to
        take this approach and I assume other Servlet containers have as well.

        It would be great to see these bugs in the UTF-8 decoder fixed so that
        Tomcat (and the other containers that have had to implement their own
        decoders) can drop that code and use the standard Java decoder.


        [1] http://svn.apache.org/viewvc/tomcat/trunk/test/org/apache/tomcat/util/buf/TestUtf8.java?view=markup


        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        Decode the following byte sequence one byte at a time with the standard UTF-8 decoder:

        0x41, 0xF4, 0x90, 0x80, 0x80, 0x41

        EXPECTED VERSUS ACTUAL BEHAVIOR :
        EXPECTED -
        An error should be thrown after processing the third byte (0x90).
        ACTUAL -
        An error is thrown after processing the fifth byte.

        REPRODUCIBILITY :
        This bug can be reproduced always.

        ---------- BEGIN SOURCE ----------
        package org.apache.markt;

        import java.nio.ByteBuffer;
        import java.nio.CharBuffer;
        import java.nio.charset.CharsetDecoder;
        import java.nio.charset.CoderResult;
        import java.nio.charset.CodingErrorAction;
        import java.nio.charset.StandardCharsets;

        public class Utf8Bug {

            public static void main(String[] args) {
                int[] input = new int[] { 0x41, 0xF4, 0x90, 0x80, 0x80, 0x41};

                int len = input.length;

                ByteBuffer bb = ByteBuffer.allocate(len);
                CharBuffer cb = CharBuffer.allocate(len);

                // Configure decoder to fail on an error
                CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
                decoder.onMalformedInput(CodingErrorAction.REPORT);
                decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

                // Add each byte one at a time. The decoder should fail as soon as
                // an invalid sequence has been provided
                for (int i = 0; i < len; i++) {
                    bb.put((byte) input[i]);
                    bb.flip();
                    CoderResult cr = decoder.decode(bb, cb, false);
                    if (cr.isError()) {
                        if (i == 2) {
                            break;
                        }
                        throw new IllegalStateException("Error first detected at index " +
                                i + " rather than at index 2");
                    }
                    bb.compact();
                }
            }
        }
        ---------- END SOURCE ----------

        CUSTOMER SUBMITTED WORKAROUND :
        Use a custom UTF-8 decoder:
        http://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/tomcat/util/buf/Utf8Decoder.java?view=annotate

              sherman Xueming Shen
              webbuggrp Webbug Group