JDK-8242927

LSSerializer produces character references in UTF-8 encoding

    • x86
    • linux

      ADDITIONAL SYSTEM INFORMATION :
      Latest OpenJDK 11 on 64bit Linux, can also reproduce with latest OpenJDK 8.

      A DESCRIPTION OF THE PROBLEM :
      Quoting the Javadoc for LSSerializer: "[...] Any characters that cannot be represented directly in the output character encoding are serialized as numeric character references [...]"

      Serializing a text node that contains a code point encoded as 4 bytes in UTF-8 (for example <Tag>🚩</Tag>) produces a numeric character reference. According to the Javadoc, however, this code point is directly representable in the output encoding and should therefore not be serialized as a numeric character reference. I haven't tested this with 2-byte code points yet.
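The claim that the code point is directly representable can be checked independently of the XML stack. The following is a minimal sketch, not part of the original report; the class name `Utf8Check` and its helper methods are made up for illustration. It asks the JDK's own UTF-8 encoder whether it can encode U+1F6A9 (the flag character used in the example) and how many bytes the result occupies.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // U+1F6A9, the flag from the example, written as a surrogate pair so that
    // the encoding of this source file does not matter.
    static final String FLAG = "\uD83D\uDEA9";

    // True if every character in s can be encoded directly in UTF-8.
    static boolean encodableInUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Number of bytes s occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("code point: U+"
                + Integer.toHexString(FLAG.codePointAt(0)).toUpperCase());
        System.out.println("encodable:  " + encodableInUtf8(FLAG));
        System.out.println("UTF-8 size: " + utf8Length(FLAG) + " bytes");
    }
}
```

If the encoder reports the character as encodable, the Javadoc wording quoted above suggests the serializer has no reason to fall back to a character reference for it.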

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Parse an XML document with a text node containing a 4-byte UTF-8 code point into a DOM document, then serialize it back.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Text nodes should look exactly the same.
      ACTUAL -
      The original text node has the code point as-is; the serialized text node has the corresponding numeric character reference.
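For a quick check without input and output files, the same round trip can be done in memory. This is a hedged sketch rather than the reporter's test case: the class name `RoundTrip` and its helpers are illustrative, the input is a one-element document built from a string, and the program only reports whether the serialized text contains the literal code point or a `&#` reference, since the exact output may differ between JDK builds.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

public class RoundTrip {
    // U+1F6A9 as a surrogate pair, independent of the source file encoding.
    static final String FLAG = "\uD83D\uDEA9";

    // Look up the DOM Load and Save implementation, as in the report's source.
    static DOMImplementationLS ls() throws Exception {
        return (DOMImplementationLS) DOMImplementationRegistry.newInstance()
                .getDOMImplementation("LS");
    }

    // Parse an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        DOMImplementationLS impl = ls();
        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        return parser.parse(input);
    }

    // Serialize the document to UTF-8 bytes and decode them back to a String.
    static String serialize(Document doc) throws Exception {
        DOMImplementationLS impl = ls();
        LSOutput output = impl.createLSOutput();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        output.setByteStream(bytes);
        output.setEncoding("UTF-8");
        LSSerializer serializer = impl.createLSSerializer();
        serializer.write(doc, output);
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String result = serialize(parse("<Tag>" + FLAG + "</Tag>"));
        System.out.println(result);
        System.out.println("contains literal code point:  " + result.contains(FLAG));
        System.out.println("contains character reference: " + result.contains("&#"));
    }
}
```

On a build that shows the reported behavior, the second check would presumably print `true` and the first `false`; on a build that follows the Javadoc wording, the reverse.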

      ---------- BEGIN SOURCE ----------
      import org.w3c.dom.bootstrap.DOMImplementationRegistry;
      import org.w3c.dom.ls.*;
      import org.w3c.dom.Document;
      import java.io.*;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;

      public class Main
      {
        public static void main(String[] args) throws Exception
        {
          if (args.length == 0) {
            System.out.println("Please enter a file name");
            return;
          }
          
          Path filenameIn = Paths.get(args[0]);
          // Write to "<input>.out"; resolve(".out") would treat the input file as a directory.
          Path filenameOut = Paths.get(args[0] + ".out");
          
          try (InputStream in = Files.newInputStream(filenameIn);
               OutputStream out = Files.newOutputStream(filenameOut)) {
            
            DOMImplementationLS impl = (DOMImplementationLS) DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSInput input = impl.createLSInput();
            LSOutput output = impl.createLSOutput();
            LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            LSSerializer serializer = impl.createLSSerializer();
            
            input.setByteStream(in);
            output.setByteStream(out);
            output.setEncoding("UTF-8");
            
            Document doc = parser.parse(input);
            
            serializer.write(doc, output);
          }
        }
      }
      ---------- END SOURCE ----------

      FREQUENCY : always


        1. image.png
          3 kB
          swati sharma
        2. issue.java
          1 kB
          swati sharma
        3. issue2.java
          8 kB
          swati sharma
        4. out
          0.1 kB
          swati sharma

            joehw Joe Wang
            webbuggrp Webbug Group