Type: Bug
Resolution: Unresolved
Priority: P4
Affects Version(s): 11, 17, 19, 20
OS: generic
CPU: generic
ADDITIONAL SYSTEM INFORMATION :
Manjaro Linux kernel 6.1.1
OpenJDK:
19.0.2
15.0.2
Oracle JDK:
19.0.2
Temurin JDK:
17.0.5
17.0.2
A DESCRIPTION OF THE PROBLEM :
When parsing large files with large embedded content (e.g. JSON), the content is corrupted and cannot be parsed. This only happens if the XML is declared as version 1.1. In an XML file that contains thousands of identical records, only a few records are corrupted. Looking through the source code, the bug appears to be related to `com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl`.
There is a variable `fTempString` that sometimes contains content left over from a previous state, which is then returned by `getCharacterData()`. I believe this content should have been cleared, but in some instances it seems to remain, and I can't see why. Every record in my test XML file is identical, yet only some records end up with corrupted content, so I assume this is due to buffer handling somewhere.
I have tested with multiple JDK versions and flavours, and all have the same issue. Similar issues have been raised in the past but were closed.
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a very large XML version 1.1 file with thousands of identical records, e.g.
<?xml version="1.1" encoding="UTF-8"?>
<root>
<row>LARGE JSON CONTENT</row>
... thousands of identical rows ...
</root>
If each row contains an identical large JSON string, some of the rows will end up with corrupt JSON that cannot be parsed.
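The steps above can be sketched as a self-contained reproducer. The StAX `XMLStreamReader` API is used here because the reported `getCharacterData()` sits on that code path, but the attached `TestXmlParser` source is not shown, so the row count, payload size, and API choice are illustrative assumptions:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class Repro {
    // Generates an XML file with `rows` identical <row> elements, each holding
    // a JSON-like payload of roughly `payloadSize` characters, then parses it
    // back and compares every row against the original payload.
    // Returns {rowsParsed, rowsCorrupted}.
    static int[] run(String xmlVersion, int rows, int payloadSize) throws Exception {
        StringBuilder sb = new StringBuilder("{\"key\":\"");
        for (int i = 0; i < payloadSize; i++) sb.append('x');
        String payload = sb.append("\"}").toString();

        File xml = File.createTempFile("bug", ".xml");
        xml.deleteOnExit();
        try (Writer w = new FileWriter(xml)) {
            w.write("<?xml version=\"" + xmlVersion + "\" encoding=\"UTF-8\"?>\n<root>\n");
            for (int i = 0; i < rows; i++) w.write("<row>" + payload + "</row>\n");
            w.write("</root>\n");
        }

        int parsed = 0, corrupted = 0;
        try (FileInputStream in = new FileInputStream(xml)) {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            StringBuilder text = null;
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("row".equals(r.getLocalName())) text = new StringBuilder();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // CHARACTERS may fire several times per text node,
                        // so accumulate rather than compare directly.
                        if (text != null) text.append(r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("row".equals(r.getLocalName())) {
                            parsed++;
                            if (!payload.contentEquals(text)) corrupted++;
                            text = null;
                        }
                        break;
                }
            }
            r.close();
        }
        return new int[] { parsed, corrupted };
    }

    public static void main(String[] args) throws Exception {
        int[] res = run("1.1", 2000, 2000);
        System.out.println("rows parsed: " + res[0] + ", corrupted: " + res[1]);
    }
}
```

With an affected JDK the corrupted count is expected to be non-zero for version "1.1" and zero for "1.0"; on a fixed JDK both runs should report zero.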
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
I expect all tests in `TestXmlParser` to pass
ACTUAL -
The tests pass for XML version 1.0 but not for XML version 1.1.
---------- BEGIN SOURCE ----------
See attachment.
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Using the standalone Apache Xerces library 12.0.2 (or any other version) instead of the JDK's internal copy fixes the issue.
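As a sketch of the workaround, the standalone Xerces implementation can be selected over the JDK-internal copy via the standard JAXP factory system property. The class name `org.apache.xerces.jaxp.SAXParserFactoryImpl` is the standalone library's SAX factory; the override is shown commented out so the snippet also runs without the Xerces jar on the classpath:

```java
import javax.xml.parsers.SAXParserFactory;

public class PickParser {
    public static void main(String[] args) {
        // With the standalone Xerces jar on the classpath, uncommenting this
        // selects it instead of the JDK-internal com.sun.org.apache.xerces copy:
        // System.setProperty("javax.xml.parsers.SAXParserFactory",
        //         "org.apache.xerces.jaxp.SAXParserFactoryImpl");

        SAXParserFactory f = SAXParserFactory.newInstance();
        System.out.println("SAXParserFactory in use: " + f.getClass().getName());
    }
}
```

Printing the factory class name confirms which implementation is actually in use before re-running the failing parse.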
FREQUENCY : always