-
Bug
-
Resolution: Not an Issue
-
P4
-
None
-
8u171
-
x86_64
-
linux_ubuntu
ADDITIONAL SYSTEM INFORMATION :
Ubuntu 18.04, Oracle Java 1.8.0_171
A DESCRIPTION OF THE PROBLEM :
When serializing XML whose text contains a character encoded as a Unicode surrogate pair ("\uD840\uDC0B"), I tried several approaches and none worked. The XML Transformer escapes the two halves of the surrogate pair separately, which makes the resulting XML unparseable, e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character.
Output of my test:
Character: 𠀋
EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
EXPECTED PARSED CHAR 𠀋
This seems to be the same issue as https://stackoverflow.com/questions/41636186/xml-support-for-new-utf-8-like-smileys
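The numbers involved follow directly from the surrogate pair. The sketch below is not part of the original test; the class and method names (SurrogatePairDemo, toCharRef, toPerUnitCharRefs) are hypothetical and chosen only for illustration. It shows how "\uD840\uDC0B" maps to one code point (U+2000B) and what results when each UTF-16 code unit is escaped on its own, which is the shape of output described above.

```java
final class SurrogatePairDemo {

    /** Formats the first code point of s as one decimal character reference. */
    static String toCharRef(String s) {
        return "&#" + s.codePointAt(0) + ";";
    }

    /** Formats every UTF-16 code unit of s as its own character reference. */
    static String toPerUnitCharRefs(String s) {
        StringBuilder sb = new StringBuilder();
        for (char unit : s.toCharArray()) {
            sb.append("&#").append((int) unit).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "\uD840\uDC0B";
        System.out.println(toCharRef(value));          // prints &#131083;
        System.out.println(toPerUnitCharRefs(value));  // prints &#55360;&#56331;
    }
}
```

The first form refers to a single legal XML character. The second refers to two surrogate code units, which are not legal XML characters when referenced individually.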
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Serialize XML containing a character that consists of a high surrogate followed by a low surrogate, e.g. "\uD840\uDC0B".
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
𠀋
ACTUAL -
&#55360;&#56331;
---------- BEGIN SOURCE ----------
String value = "\uD840\uDC0B";
System.out.println("Character: " + value);
System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");
StringWriter writer = new StringWriter();
final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document dom = documentBuilder.newDocument();
final Element rootEl = dom.createElement("a");
rootEl.setTextContent(value);
dom.appendChild(rootEl);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
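// NOTE: called after transform(), so this setting does not affect the output written above (the declaration still reads encoding="UTF-8").
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");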
String xml = writer.toString();
System.out.println(" ACTUAL: " + xml);
InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(xml));
System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
---------- END SOURCE ----------
FREQUENCY : always
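The parse failure described in this report can also be checked in isolation, without the Transformer. The following is a minimal sketch under stated assumptions: the class and method names (CharRefParseCheck, parseRootText, isRejected) are hypothetical, and the two input strings are hand-written stand-ins for the expected and actual serializer output. It uses only the standard JAXP DocumentBuilder API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class CharRefParseCheck {

    /** Parses xml and returns the text content of its root element. */
    static String parseRootText(String xml) throws SAXException {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with a default factory.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the parser refuses xml as not well-formed. */
    static boolean isRejected(String xml) {
        try {
            parseRootText(xml);
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) throws SAXException {
        String combined = "<a>&#131083;</a>";       // one reference to U+2000B
        String split = "<a>&#55360;&#56331;</a>";   // one reference per surrogate

        System.out.println(parseRootText(combined).codePointAt(0)); // prints 131083
        System.out.println(isRejected(split));                      // prints true
    }
}
```

The default error handler may additionally print a "[Fatal Error]" line to standard error for the rejected input.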