-
Bug
-
Resolution: Duplicate
-
P4
-
None
-
8u45
-
generic
-
generic
FULL PRODUCT VERSION :
1.8.0_25. Also reproduced with 1.6.0_27
ADDITIONAL OS VERSION INFORMATION :
OS X 10.10.5
EXTRA RELEVANT SYSTEM CONFIGURATION :
Tested in multiple configurations
A DESCRIPTION OF THE PROBLEM :
When the JDK parser reads a UTF-8 encoded XML document containing a single element with a single attribute whose value consists of two non-BMP characters (U+1D6A4 twice), the attribute value reported to the SAX ContentHandler contains three non-BMP characters (U+1D6A4 three times).
The problem does not occur with Apache Xerces.
The problem occurs with all known versions of the JDK XML parser.
We have been aware of occasional corruptions of XML attribute values for years but this is the first time a client has provided such a simple demonstration of the problem.
Reported (incorrectly) as a bug on the Saxon product here: https://saxonica.plan.io/issues/2533
ADDITIONAL REGRESSION INFORMATION:
1.8.0_25
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Ensure that Apache Xerces is on the classpath. Run the attached program supplying the name of the attached XML file as the only argument:
java commands.JDKParserBug input.xml
The program gives output for the JDK parser and for the Apache Xerces parser. The Apache output is correct; the JDK output is incorrect.
(Alternatively, comment out the reference to Apache Xerces. It is only there to provide additional verification.)
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
SAXParser: ....
ELEMENT: name
ATTRIBUTE sortable: d835 dea4 d835 dea4
ACTUAL -
SAXParser: com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl
ELEMENT: name
ATTRIBUTE sortable: d835 dea4 d835 dea4 d835 dea4
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
package commands;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.*;

public class JDKParserBug {

    public static void main(String[] args) {
        try {
            System.err.println(System.getProperty("java.version"));
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name sortable=\"\uD835\uDEA4\uD835\uDEA4\"/>";
            for (String factoryName : new String[]{
                    "org.apache.xerces.jaxp.SAXParserFactoryImpl",
                    "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl"}) {
                SAXParserFactory factory = SAXParserFactory.newInstance(factoryName, "".getClass().getClassLoader());
                factory.setNamespaceAware(true);
                SAXParser parser = factory.newSAXParser();
                System.err.println("SAXParser: " + parser.getClass().getName());
                XMLReader reader = parser.getXMLReader();
                reader.setContentHandler(new XMLFilterImpl() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
                        System.err.println("ELEMENT: " + localName);
                        for (int i = 0; i < atts.getLength(); i++) {
                            System.err.println(" ATTRIBUTE " + atts.getLocalName(i) + ": " +
                                    showString(atts.getValue(i)));
                        }
                    }
                });
                reader.parse(new InputSource(new StringReader(xml)));
            }
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static String showString(String s) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            result.append(Integer.toHexString(c)).append(" ");
        }
        return result.toString();
    }
}
---------- END SOURCE ----------
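The showString method above prints UTF-16 code units, so each non-BMP character appears as a surrogate pair (d835 dea4). As a complementary check, the following sketch prints whole code points instead, which makes a duplicated character easier to spot. The class name CodePointDump and the method showCodePoints are hypothetical and not part of the submitted test case.

```java
public class CodePointDump {

    /** Formats each Unicode code point of s as U+XXXX, separated by single spaces. */
    static String showCodePoints(String s) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(String.format("U+%04X", cp));
            // Advance by 2 for a surrogate pair, by 1 for a BMP character.
            i += Character.charCount(cp);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The attribute value used in the test case: U+1D6A4 twice.
        String expected = "\uD835\uDEA4\uD835\uDEA4";
        System.out.println("code units:  " + expected.length());
        System.out.println("code points: " + expected.codePointCount(0, expected.length()));
        System.out.println(showCodePoints(expected));
    }
}
```

Calling showCodePoints(atts.getValue(i)) inside startElement would report two code points for a correct parse and three for the corrupted value described in this report.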
CUSTOMER SUBMITTED WORKAROUND :
Use Apache Xerces in place of the JDK parser. (I have been advising my clients to do this for years, largely because of this bug.)
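One possible way to apply this workaround, sketched under the assumption that the Apache Xerces jar may or may not be on the classpath: request the Xerces factory by class name and fall back to the platform default if it cannot be loaded. The class name ParserSelection and the method newPreferredFactory are hypothetical; the factory class name is the one used in the test case above.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;

public class ParserSelection {

    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    /** Returns a namespace-aware factory, preferring Apache Xerces when it is available. */
    static SAXParserFactory newPreferredFactory() {
        SAXParserFactory factory;
        try {
            factory = SAXParserFactory.newInstance(
                    XERCES_FACTORY, ParserSelection.class.getClassLoader());
        } catch (FactoryConfigurationError e) {
            // Xerces is not on the classpath: fall back to the default (JDK) parser,
            // which is the one affected by this bug.
            factory = SAXParserFactory.newInstance();
        }
        factory.setNamespaceAware(true);
        return factory;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = newPreferredFactory();
        System.err.println("SAXParser: " + factory.newSAXParser().getClass().getName());
    }
}
```

The silent fallback is a design choice for illustration only; an application that must avoid the corruption may prefer to let the FactoryConfigurationError propagate instead.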
- duplicates
-
JDK-8058175 [XML 1.0/1.1] - Attribute values with supplemental characters are being corrupted.
- Resolved