JDK-6782001

Can't parse some Unicode characters (surrogate pair)


      FULL PRODUCT VERSION :
      java version "1.6.0_10"
      Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
      Java HotSpot(TM) Client VM (build 11.0-b15, mixed mode, sharing)

      ADDITIONAL OS VERSION INFORMATION :
      Windows XP

      A DESCRIPTION OF THE PROBLEM :
      If a document contains a Traditional Chinese character (a 4-byte UTF-8 character) after a numeric character reference, the resulting DOM contains garbage characters.
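
For context, the character used in the test case (code point 65766) lies outside the Basic Multilingual Plane, so Java stores it as a surrogate pair and UTF-8 encodes it in 4 bytes. The following is a minimal sketch that checks this; the class name `SurrogatePairDemo` and its helper methods are hypothetical and not part of the original report.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, for illustration only.
public class SurrogatePairDemo {

    // Number of UTF-16 code units (Java chars) needed for the code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // Number of bytes needed to encode the code point in UTF-8.
    static int utf8Bytes(int codePoint) {
        try {
            return new String(Character.toChars(codePoint)).getBytes("UTF-8").length;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int cp = 65766; // same code point as CHINESE_STR in the test case
        System.out.println("supplementary: " + Character.isSupplementaryCodePoint(cp));
        System.out.println("UTF-16 units:  " + utf16Units(cp));
        System.out.println("UTF-8 bytes:   " + utf8Bytes(cp));
    }
}
```

The string-based `getBytes("UTF-8")` overload is used so that the sketch also compiles on the Java 6 runtime named in this report.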



      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run the test case.


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      All tests should be successful.

      ACTUAL -
      testCharRefAndRawChineseChar() fails.
      The characters of the numeric reference itself ("80" in the test case) are inserted before the unescaped Chinese character.
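
To see exactly which characters end up in the DOM, it may help to print the parsed attribute value as a list of code points rather than relying on console rendering. The following is a minimal standalone sketch (no JUnit) under that assumption; the class `CodePointDump` and its methods `dump` and `parseValue` are hypothetical names introduced here, and the parsing steps mirror `checkXMLParsing` from the test case.

```java
import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

// Hypothetical diagnostic helper, for illustration only.
public class CodePointDump {

    // Formats each code point of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    // Parses <truc value="..."/> from UTF-8 bytes and returns the attribute value.
    static String parseValue(String encodedValue) throws Exception {
        String xml = "<truc value=\"" + encodedValue + "\" />";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return doc.getDocumentElement().getAttribute("value");
    }

    public static void main(String[] args) throws Exception {
        String raw = new String(Character.toChars(65766));
        // Numeric character reference followed by the raw supplementary character.
        System.out.println(dump(parseValue("&#80;" + raw)));
    }
}
```

On a parser that handles this input correctly, the output should list only the two expected code points (U+0050 followed by U+100E6); any additional entries would correspond to the garbage characters described above.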

      REPRODUCIBILITY :
      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------
      import java.io.ByteArrayInputStream;

      import javax.xml.parsers.DocumentBuilder;
      import javax.xml.parsers.DocumentBuilderFactory;

      import junit.framework.TestCase;

      import org.w3c.dom.Document;



      public class XMLChineseTest extends TestCase {

          
          // Code point 65766 is outside the BMP: a surrogate pair in Java, 4 bytes in UTF-8.
          static final String CHINESE_STR = new String(Character.toChars(65766));
         
          
          public void testRawChineseChar() throws Exception {
              
              checkXMLParsing(CHINESE_STR, CHINESE_STR);
          }

          
          public void testCharRefAndEscapedChineseChar() throws Exception {
              
              checkXMLParsing("&#80;&#65766;", (char)(80) + CHINESE_STR);
          }
          
          
          public void testCharRefAndRawChineseChar() throws Exception {

              checkXMLParsing("&#80;" + CHINESE_STR, (char)(80) + CHINESE_STR);
          }
          
          
          private void checkXMLParsing(String encodedValue, String expectedDOMValue) throws Exception {
              
              String xml = "<truc value=\"" + encodedValue + "\" />";
              System.out.println("xml input: " + xml);
              byte[] xmlBytes = xml.getBytes("UTF-8");
              
              DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
              Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
              
              String readValue = doc.getDocumentElement().getAttribute("value");
              System.out.println("Read value: " + readValue);
              assertEquals(expectedDOMValue, readValue);
          }
      }
      ---------- END SOURCE ----------


      Release Regression From : 5.0u12
      The above release value was the last known release where this
      bug was not reproducible. Since then there has been a regression.

            joehw Joe Wang
            ndcosta Nelson Dcosta (Inactive)