Type: Bug
Resolution: Unresolved
Priority: P4
Fix Version: None
Affects Version: 11.0.6
CPU: x86
OS: linux
ADDITIONAL SYSTEM INFORMATION :
Latest OpenJDK 11 on 64-bit Linux; also reproducible with the latest OpenJDK 8.
A DESCRIPTION OF THE PROBLEM :
Quoting the Javadoc for LSSerializer: "[...] Any characters that cannot be represented directly in the output character encoding are serialized as numeric character references [...]"
Serializing text nodes that contain a code point encoded as 4 bytes in UTF-8 (for example, a <Tag> element whose text is a single character outside the Basic Multilingual Plane) produces a numeric character reference. According to the Javadoc, however, this code point is directly representable in the output encoding and should therefore not be serialized as a numeric character reference. I haven't tested this with 2-byte code points yet.
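To make the "directly representable" claim concrete, here is a minimal sketch that checks how a supplementary code point is stored in Java and whether a UTF-8 encoder accepts it. The class name, the helper methods, and the sample code point U+1F600 are assumptions for illustration; U+1F600 is not necessarily the character from the original report.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original report.
final class SupplementaryCodePointDemo {

    /** Number of bytes the given code point occupies when encoded as UTF-8. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if a UTF-8 encoder can encode the code point directly. */
    static boolean directlyRepresentable(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // assumed sample code point outside the BMP
        // In a Java String the code point is stored as a surrogate pair.
        System.out.println("UTF-16 code units: " + Character.charCount(cp));
        System.out.println("UTF-8 bytes: " + utf8Length(cp));
        System.out.println("Encodable in UTF-8: " + directlyRepresentable(cp));
    }
}
```

Because the code point is held as a surrogate pair in memory, a serializer has to treat both code units together when deciding whether the output encoding can represent the character.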
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Parse an XML document whose text node contains a 4-byte UTF-8 code point into a DOM document, then serialize it back.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The text nodes of the input and the output should be identical.
ACTUAL -
The original text node contains the code point as-is; the serialized text node contains the corresponding numeric character reference.
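When comparing input and output files, it can help to detect numeric character references mechanically. The following is a rough diagnostic sketch; the class `CharRefs` and its methods are hypothetical names, and the regular expression only covers numeric references of the form `&#N;` and `&#xH;`. It is not a full XML-aware comparison: decoding a reference to a markup character such as `<` would change the document structure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of the original report.
final class CharRefs {

    // Matches decimal (&#128512;) and hexadecimal (&#x1F600;) references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** True if the text contains at least one numeric character reference. */
    static boolean containsNumericReference(String xml) {
        return NUMERIC_REF.matcher(xml).find();
    }

    /** Replaces every numeric character reference with the character it denotes. */
    static String decode(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer sb = new StringBuffer(); // StringBuffer keeps this Java 8 compatible
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String literal = new String(Character.toChars(cp));
            m.appendReplacement(sb, Matcher.quoteReplacement(literal));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

If `decode(actual)` equals the expected text while `containsNumericReference(actual)` is true, the two documents differ only in how the character was escaped.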
---------- BEGIN SOURCE ----------
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.*;
import org.w3c.dom.Document;
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Main
{
    public static void main(String[] args) throws Exception
    {
        if (args.length == 0) {
            System.out.println("Please enter a file name");
            return;
        }
        Path filenameIn = Paths.get(args[0]);
        // Write next to the input file as "<name>.out"; resolve(".out") would
        // treat the input file as a directory.
        Path filenameOut = filenameIn.resolveSibling(filenameIn.getFileName() + ".out");
        try (InputStream in = Files.newInputStream(filenameIn);
             OutputStream out = Files.newOutputStream(filenameOut)) {
            // Obtain the DOM Load and Save implementation.
            DOMImplementationLS impl = (DOMImplementationLS) DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSInput input = impl.createLSInput();
            LSOutput output = impl.createLSOutput();
            LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            LSSerializer serializer = impl.createLSSerializer();
            input.setByteStream(in);
            output.setByteStream(out);
            output.setEncoding("UTF-8");
            // Parse the input and serialize it back unchanged.
            Document doc = parser.parse(input);
            serializer.write(doc, output);
        }
    }
}
---------- END SOURCE ----------
FREQUENCY : always
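The file-based reproducer can also be reduced to an in-memory round trip, which avoids depending on the encoding of an input file. This is a sketch under assumptions: the class name `InMemoryRoundTrip`, the method `roundTrip`, and the sample code point U+1F600 are illustrative and not taken from the report. Whether the character comes back literally or as a character reference depends on the JDK build, so the program only reports which case occurred.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical in-memory variant of the reproducer.
final class InMemoryRoundTrip {

    /** Parses the XML string into a DOM document and serializes it back as UTF-8. */
    static String roundTrip(String xml) throws Exception {
        DOMImplementationLS impl = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");

        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        Document doc = parser.parse(input);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        LSOutput output = impl.createLSOutput();
        output.setByteStream(bytes);
        output.setEncoding("UTF-8");
        LSSerializer serializer = impl.createLSSerializer();
        serializer.write(doc, output);

        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String ch = new String(Character.toChars(0x1F600)); // assumed sample character
        String result = roundTrip("<Tag>" + ch + "</Tag>");
        System.out.println(result.contains(ch)
                ? "character serialized as-is"
                : "character not serialized as-is (likely a character reference)");
    }
}
```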
Relates to: JDK-8249643 Clarify DOM documentation (Resolved)