JDK-8072081

Supplementary characters are rejected in comments

    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • 9
    • 8, 9
    • xml

        FULL PRODUCT VERSION :
        Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
        Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

        ADDITIONAL OS VERSION INFORMATION :
        Darwin boolean.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64

        A DESCRIPTION OF THE PROBLEM :
        SAX/Xerces rejects Unicode supplementary characters (code points >= U+10000) in XML 1.0/1.1 comments. This was a bug in the original Apache codebase (XMLScanner) that was fixed in revision 319636 (2003-12-16).

        Parsing the XML snippet below results in:
        SAXParseException; systemId: <omitted>; lineNumber: 1; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.

        After the comment is removed, the <literal> element, which contains the same Unicode character as the comment, is parsed without error.

        <!-- Entry for Kanji: 𠀋 -->
        <character>
        <literal>𠀋</literal>
        <codepoint>
        <cp_value cp_type="ucs">2000B</cp_value>
        <cp_value cp_type="jis213">1-14-2</cp_value>
        </codepoint>
        </character>
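
        The value 0xd840 in the error message is the high surrogate of U+2000B in UTF-16, not a character of its own. The following minimal sketch works through that arithmetic; the class and method names are illustrative only and are not part of the JDK or Xerces.

```java
final class SurrogatePairDemo {
    /** Splits a supplementary code point into its UTF-16 high and low surrogates. */
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2000B);
        // prints "U+2000B -> 0xd840 0xdc0b"
        System.out.printf("U+2000B -> 0x%04x 0x%04x%n", (int) pair[0], (int) pair[1]);
    }
}
```

        The result agrees with what Character.toChars(0x2000B) returns, so a scanner that inspects one UTF-16 code unit at a time sees 0xd840 first.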

        The problem is in XMLScanner.scanComment(XMLStringBuffer). After a surrogate pair has been detected and successfully scanned, an additional check, isInvalidLiteral(c), is still performed on the current character c, which at that point still holds the high surrogate. This check has to move into an else branch following the XMLChar.isHighSurrogate(c) test.

        The following diff from the Apache codebase summarizes the necessary change:

        Index: src/org/apache/xerces/impl/XMLScanner.java
        ===================================================================
        --- src/org/apache/xerces/impl/XMLScanner.java (revision 319635)
        +++ src/org/apache/xerces/impl/XMLScanner.java (revision 319636)
        @@ -757,7 +757,7 @@
                         if (XMLChar.isHighSurrogate(c)) {
                             scanSurrogates(text);
                         }
        -                if (isInvalidLiteral(c)) {
        +                else if (isInvalidLiteral(c)) {
                             reportFatalError("InvalidCharInComment",
                                              new Object[] { Integer.toHexString(c) });
                             fEntityScanner.scanChar();
        @@ -951,6 +951,7 @@
                             }
                         }
                         else if (c != -1 && XMLChar.isHighSurrogate(c)) {
        +                    fStringBuffer3.clear();
                             if (scanSurrogates(fStringBuffer3)) {
                                 fStringBuffer.append(fStringBuffer3);
                                 if (entityDepth == fEntityDepth) {
        @@ -1354,6 +1355,14 @@
                 return (XMLChar.isNameStart(value));
             } // isValidNameStartChar(int): boolean
             
        +    // returns true if the given character is
        +    // a valid high surrogate for a nameStartChar
        +    // with respect to the version of XML understood
        +    // by this scanner.
        +    protected boolean isValidNameStartHighSurrogate(int value) {
        +        return false;
        +    } // isValidNameStartHighSurrogate(int): boolean
        +
             protected boolean versionSupported(String version ) {
                 return version.equals("1.0");
             } // version Supported
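
        The effect of the first hunk can be illustrated with a simplified model of the control flow. This is only a sketch and not the Xerces code: CommentScanSketch, isInvalidUnit, scanBuggy and scanFixed are hypothetical names, isInvalidUnit is a rough stand-in for isInvalidLiteral, and the input is assumed to contain only well-formed surrogate pairs.

```java
final class CommentScanSketch {
    // Rough stand-in for a per-code-unit validity test: a surrogate seen on
    // its own, 0xFFFE/0xFFFF and most control characters count as invalid.
    static boolean isInvalidUnit(char c) {
        return Character.isSurrogate(c) || c == 0xFFFE || c == 0xFFFF
                || (c < 0x20 && c != 0x9 && c != 0xA && c != 0xD);
    }

    /** Plain `if`: the validity check also runs after a pair was consumed. */
    static int scanBuggy(String s) {
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                i += 2;               // models scanSurrogates(text)
            }
            if (isInvalidUnit(c)) {   // c is still the stale high surrogate
                return c;
            }
            i++;                      // only reached for ordinary characters
        }
        return -1;
    }

    /** `else if`: the validity check is skipped once the pair was handled. */
    static int scanFixed(String s) {
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                i += 2;               // models scanSurrogates(text)
            } else if (isInvalidUnit(c)) {
                return c;
            } else {
                i++;
            }
        }
        return -1;
    }

    static String describe(int result) {
        return result < 0 ? "accepted" : "rejected 0x" + Integer.toHexString(result);
    }

    public static void main(String[] args) {
        String comment = " \uD840\uDC0B ";
        System.out.println("plain if: " + describe(scanBuggy(comment)));
        System.out.println("else if:  " + describe(scanFixed(comment)));
    }
}
```

        In this model the plain `if` variant reports 0xd840 for the comment text, matching the exception above, while the `else if` variant accepts it and still rejects a low surrogate that appears on its own.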

        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        Parse XML that contains supplementary characters inside a comment. See the test case below.

        EXPECTED VERSUS ACTUAL BEHAVIOR :
        EXPECTED -
        Since supplementary characters are valid in XML 1.0/1.1 comments, such XML should parse successfully with SAX/Xerces.
        ACTUAL -
        SAXParseException: An invalid XML character (Unicode: <omitted>) was found in the comment.
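
        The expectation follows from the Char production of XML 1.0, which is defined over code points and includes the range #x10000-#x10FFFF. A minimal sketch of that check, evaluated per code point rather than per UTF-16 code unit, could look as follows; XmlCharCheck and its methods are illustrative names and not JDK API.

```java
final class XmlCharCheck {
    /** XML 1.0 production [2] Char, evaluated on a full code point. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if every code point of the text matches the Char production. */
    static boolean allValid(String text) {
        // codePoints() joins well-formed surrogate pairs into one value and
        // yields an unpaired surrogate as-is, which then fails the check.
        return text.codePoints().allMatch(XmlCharCheck::isXml10Char);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x2000B));      // the code point as a whole
        System.out.println(isXml10Char(0xD840));       // its high surrogate alone
        System.out.println(allValid(" \uD840\uDC0B "));
    }
}
```

        Checked as a code point, U+2000B is valid; only its high surrogate, taken alone, falls outside the production, which is the value the scanner complained about.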


        ERROR MESSAGES/STACK TRACES THAT OCCUR :
        Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:/Users/dehmer/Development/carbon/kanjidic2.small.xml; lineNumber: 1; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
        at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
        at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanComment(XMLScanner.java:789)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanComment(XMLDocumentFragmentScannerImpl.java:1038)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:904)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)


        REPRODUCIBILITY :
        This bug can be reproduced always.

        ---------- BEGIN SOURCE ----------
        import org.xml.sax.SAXParseException;
        import org.xml.sax.helpers.DefaultHandler;

        import javax.xml.parsers.SAXParser;
        import javax.xml.parsers.SAXParserFactory;
        import java.io.ByteArrayInputStream;

        /**
         * Unicode U+2000B should be valid in XML 1.0/1.1 comments.
         * The corresponding surrogate pair is (high) 0xd840, (low) 0xdc0b.
         *
         * org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 8;
         * An invalid XML character (Unicode: 0xd840) was found in the comment.
         */
        public class XMLScannerSupplementalCharactersInComment {
            private static final String[] XML = {
                    "<tag>\uD840\uDC0B</tag>", // passes, since char is not in comment position.
                    "<!-- \uD840\uDC0B --><dontCare/>" // fails => SAXParseException
            };

            public static void main(String[] args) throws Exception {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();

                for(String xml : XML) {
                    try (ByteArrayInputStream stream = new ByteArrayInputStream(xml.getBytes("UTF-8"))) {
                        System.out.print("parsing: '" + xml + "'... ");
                        parser.parse(stream, new DefaultHandler());
                        System.out.println("passed.");
                    }
                    catch(SAXParseException unexpected) {
                        System.out.println("failed. " + unexpected.getMessage());
                    }
                }
            }
        }

        ---------- END SOURCE ----------
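
        To confirm on a JDK that contains the fix that the comment text actually reaches the application, the test case can be extended with a SAX LexicalHandler. The sketch below is one possible way to do that, assuming the parser recognizes the standard lexical-handler property; CommentCapture and commentsOf are hypothetical names.

```java
import org.xml.sax.ext.DefaultHandler2;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

final class CommentCapture {
    /** Parses the XML and returns the concatenated text of all comments seen. */
    static String commentsOf(String xml) throws Exception {
        StringBuilder seen = new StringBuilder();
        // DefaultHandler2 implements LexicalHandler, so it receives comments.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = commentsOf("<!-- \uD840\uDC0B --><dontCare/>");
        // Print the code points rather than the raw text to stay console-safe.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

        On an affected JDK 8 build, commentsOf is expected to fail with the SAXParseException shown above instead of returning the comment text.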

              joehw Joe Wang
              webbuggrp Webbug Group