JDK-8203810

XML Transformer produces separately escaped surrogate pair instead of codepoint

      ADDITIONAL SYSTEM INFORMATION :
      Ubuntu 18.04, Oracle Java 1.8.0_171

      A DESCRIPTION OF THE PROBLEM :
      When serializing XML that contains a character represented by a Unicode surrogate pair, such as "\uD840\uDC0B", the XML Transformer escapes each surrogate separately instead of emitting a single character reference for the code point. I tried several approaches and none worked. The resulting XML string cannot be parsed, e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character.


      Output of my test:
      Character: 𠀋
      EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
        ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
      EXPECTED PARSED CHAR 𠀋
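
      The numbers in the output above are related by the standard UTF-16 surrogate arithmetic: 55360 (0xD840) and 56331 (0xDC0B) are the two UTF-16 code units, and 131083 (0x2000B) is the code point they encode together. The following is a minimal sketch of that relationship; the class name SurrogatePairDemo and the helper combine are hypothetical and not part of the original test.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: combine a UTF-16 surrogate pair into one code point
    // using the standard formula (the same result as Character.toCodePoint).
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        String value = "\uD840\uDC0B";
        char high = value.charAt(0);
        char low = value.charAt(1);

        // The two code units that the serializer escaped separately.
        System.out.println("high=" + (int) high + " low=" + (int) low); // high=55360 low=56331

        // The single code point that a character reference should carry.
        System.out.println("manual=" + combine(high, low));                  // manual=131083
        System.out.println("library=" + Character.toCodePoint(high, low));   // library=131083
        System.out.println("codePointAt=" + value.codePointAt(0));           // codePointAt=131083
    }
}
```

      Surrogate code units (0xD800 to 0xDFFF) are not valid XML characters on their own, which is why a reference such as &#55360; is rejected by the parser.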

      This seems to be the same issue as https://stackoverflow.com/questions/41636186/xml-support-for-new-utf-8-like-smileys

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Serialize XML containing a character that consists of a high surrogate followed by a low surrogate, e.g. "\uD840\uDC0B".

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      &#131083;
      ACTUAL -
      &#55360;&#56331;
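
      To illustrate the expected behavior, here is a minimal sketch of escaping that works on code points rather than on individual UTF-16 code units. It is not the JDK serializer's implementation; the class CodePointEscaper and its escape method are hypothetical, and escaping every non-ASCII character is an assumption made only to keep the example short.

```java
public class CodePointEscaper {
    // Hypothetical helper: escape text for XML element content, emitting one
    // numeric character reference per code point (not per UTF-16 code unit).
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                // codePointAt returns a bare surrogate only when it is unpaired.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp < 0x80) {
                out.append((char) cp);
            } else {
                // Assumption for this sketch: reference every non-ASCII code point.
                out.append("&#").append(cp).append(';');
            }
            // Advance by 2 code units for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("<a>" + escape("\uD840\uDC0B") + "</a>"); // <a>&#131083;</a>
    }
}
```

      The step that matters is advancing the index by Character.charCount(cp), so that a surrogate pair is consumed as one unit and produces one reference.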

      ---------- BEGIN SOURCE ----------
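              import java.io.StringReader;
              import java.io.StringWriter;
              import javax.xml.parsers.DocumentBuilder;
              import javax.xml.parsers.DocumentBuilderFactory;
              import javax.xml.transform.OutputKeys;
              import javax.xml.transform.Transformer;
              import javax.xml.transform.TransformerFactory;
              import javax.xml.transform.dom.DOMSource;
              import javax.xml.transform.stream.StreamResult;
              import org.w3c.dom.Document;
              import org.w3c.dom.Element;
              import org.xml.sax.InputSource;

              public class SurrogateEscapeTest { // wrapper class name is arbitrary
                  public static void main(String[] args) throws Exception {
                      String value = "\uD840\uDC0B";
                      System.out.println("Character: " + value);
                      System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");
                      StringWriter writer = new StringWriter();
                      // Build a DOM whose root element <a> holds the supplementary character.
                      final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                      Document dom = documentBuilder.newDocument();
                      final Element rootEl = dom.createElement("a");
                      rootEl.setTextContent(value);
                      dom.appendChild(rootEl);

                      // Serialize the DOM to a string with the default Transformer.
                      Transformer transformer = TransformerFactory.newInstance().newTransformer();
                      transformer.transform(new DOMSource(dom), new StreamResult(writer));
                      // Set after transform(), so this has no effect on the output captured above.
                      transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
                      String xml = writer.toString();
                      System.out.println(" ACTUAL: " + xml);

                      // Parse the serialized output back; this is where the SAXParseException occurs.
                      InputSource inputSource = new InputSource();
                      inputSource.setCharacterStream(new StringReader(xml));
                      System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
                  }
              }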
      ---------- END SOURCE ----------

      FREQUENCY : always


            psonal Pallavi Sonal (Inactive)
            webbuggrp Webbug Group