JDK-8141097

StAX: data corruption when reading Unicode SMP characters in UTF-8 XML

    • Type: Bug
    • Resolution: Duplicate
    • Priority: P3
    • None
    • 7, 8u25, 8u66
    • xml
    • generic
    • generic

      FULL PRODUCT VERSION :
      java version "1.8.0_25"
      Java(TM) SE Runtime Environment (build 1.8.0_25-b18)
      Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

      ADDITIONAL OS VERSION INFORMATION :
      Microsoft Windows [version 6.3.9600]

      A DESCRIPTION OF THE PROBLEM :
      This is an upstream bug for:
      https://josm.openstreetmap.de/ticket/3290

      The attached small XML file contains a Chinese character (U+24B62) and the first Gothic character (U+10330: http://www.unicode.org/charts/PDF/U10330.pdf). Both lie outside the Basic Multilingual Plane.
      When this file is parsed using StAX, the attribute value containing the Gothic character is corrupted: it also contains the Chinese character from the previous attribute.
      See the console output:
      From XML chinese:[-16, -92, -83, -94]
      Expected chinese:[-16, -92, -83, -94]
      From XML gothic:[-16, -92, -83, -94, -16, -112, -116, -80]
      Expected gothic:[-16, -112, -116, -80]
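
      The byte arrays above are the signed-byte view of the UTF-8 encodings of the two code points. As a minimal sketch (the class name SmpUtf8Demo and the helper utf8Bytes are illustrative and not part of the attached program), the following shows how each supplementary code point becomes a surrogate pair in a Java String and four bytes in UTF-8:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original report.
final class SmpUtf8Demo {

    // Returns the UTF-8 encoding of a single code point, printed as signed bytes.
    static String utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Arrays.toString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        int[] codePoints = {0x24B62, 0x10330};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            // A supplementary code point occupies two chars (a surrogate pair) in a String.
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " chars=" + s.length()
                    + " codePoints=" + s.codePointCount(0, s.length())
                    + " utf8=" + utf8Bytes(cp));
        }
    }
}
```

      In the corrupted result, the four bytes for U+24B62 appear in front of the four bytes for U+10330, which is why the "From XML gothic" array has eight entries instead of four.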

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run attached program

      EXPECTED VERSUS ACTUAL BEHAVIOR :
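      EXPECTED -
      No error: the value read for "name:got" contains only the Gothic character U+10330.
      ACTUAL -
      Characters are corrupted: the value read for "name:got" is prefixed with the Chinese character from the preceding attribute.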

      REPRODUCIBILITY :
      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------
      import java.io.FileInputStream;
      import java.io.InputStreamReader;
      import java.util.Arrays;
      import java.util.HashMap;
      import java.util.Map;

      import javax.xml.stream.XMLInputFactory;
      import javax.xml.stream.XMLStreamConstants;
      import javax.xml.stream.XMLStreamReader;

      public class Test {

          public static void main(String[] args) {
              
              Map<String, String> map = new HashMap<String, String>();
              
              try {
                  InputStreamReader ir = new InputStreamReader(new FileInputStream("D:\\Users\\Vincent\\Desktop\\JOSM_work\\gottic.osm"), "UTF-8");
                  XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(ir);
                  
                  int event = parser.getEventType();
                  while (true) {
                      if (event == XMLStreamConstants.START_ELEMENT) {
                          String key = parser.getAttributeValue(null, "k");
                          String value = parser.getAttributeValue(null, "v");
                          if (key != null && value != null) {
                              map.put(key.intern(), value.intern());
                          }
                      }
                      if (parser.hasNext()) {
                          event = parser.next();
                      } else {
                          break;
                      }
                  }
                  parser.close();
                  
                  String value = map.get("name:ch");
                  System.out.println("From XML chinese:" + Arrays.toString(value.getBytes("UTF-8")));

                  value = new String(Character.toChars(0x24B62));
                  System.out.println("Expected chinese:" + Arrays.toString(value.getBytes("UTF-8")));

                  value = map.get("name:got");
                  System.out.println("From XML gothic:" + Arrays.toString(value.getBytes("UTF-8")));
                  
                  value = new String(Character.toChars(0x10330));
                  System.out.println("Expected gothic:" + Arrays.toString(value.getBytes("UTF-8")));
                  
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      }

      ---------- END SOURCE ----------
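
      The attached XML file is not reproduced in this report. The sketch below is a self-contained variant that builds a small document in memory instead of reading gottic.osm. The document layout (an OSM-style node with two tag elements) is an assumption inferred from the keys "name:ch" and "name:got" and the attributes "k" and "v" used above; the class and method names are hypothetical. Whether such a short input triggers the corruption on a given JDK build has not been verified here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical in-memory variant of the reproducer; not the attached program.
final class InMemoryRepro {

    // Assumed file layout: two <tag> elements carrying one SMP character each.
    static String sampleXml() {
        String cjk = new String(Character.toChars(0x24B62));
        String gothic = new String(Character.toChars(0x10330));
        return "<?xml version='1.0' encoding='UTF-8'?>\n"
                + "<osm version='0.6'>\n"
                + "  <node id='1' lat='0.0' lon='0.0'>\n"
                + "    <tag k='name:ch' v='" + cjk + "'/>\n"
                + "    <tag k='name:got' v='" + gothic + "'/>\n"
                + "  </node>\n"
                + "</osm>\n";
    }

    // Collects every k/v attribute pair found on start elements.
    static Map<String, String> readTags(String xml) {
        Map<String, String> tags = new HashMap<>();
        try {
            // Round-trip through UTF-8 bytes, as the original program does with a file.
            Reader in = new InputStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    StandardCharsets.UTF_8);
            XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                while (parser.hasNext()) {
                    if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                        String key = parser.getAttributeValue(null, "k");
                        String value = parser.getAttributeValue(null, "v");
                        if (key != null && value != null) {
                            tags.put(key, value);
                        }
                    }
                }
            } finally {
                parser.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> tags = readTags(sampleXml());
        String expected = new String(Character.toChars(0x10330));
        // true means the Gothic value was read back intact; false indicates corruption.
        System.out.println("gothic intact: " + expected.equals(tags.get("name:got")));
    }
}
```

      Comparing the parsed value against a string built from the code point avoids depending on the console encoding, which matters on Windows where SMP characters may not print correctly.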

            Assignee: Aleksej Efimov (aefimov)
            Reporter: Webbug Group (webbuggrp)