JDK-6428672

Add Charset support for "modified UTF-8"


      A DESCRIPTION OF THE REQUEST :
      Modified UTF-8 is used in many places, including JNI, Data*Stream, JAR. The Charset class in nio is a good way to encapsulate dealing with encodings, without reimplementing the encoding process over and over again.

      It would be good to have access to this encoding as a Charset class as well. Preferably it would be a required standard character set (accessible through a constant, as requested in 4884238). The name could be "x-modified-UTF-8", "x-CESU-8-nullfree", or something similar.

      The current UTF-8 decoder implementation already handles all variations of this encoding. Simple modifications to the encoder would easily provide a class for these encodings as well.

      Bug 4641026 made clear to me that this encoding is more closely related to CESU-8, although the encoding of \u0000 differs. It might make sense to make the normal CESU-8 encoding available as well.
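
      To make the difference between the encodings concrete, the following sketch compares the bytes produced by standard UTF-8 with those written by DataOutputStream.writeUTF, which emits modified UTF-8 after a two-byte length prefix. The class name ModifiedUtf8Demo and its helper methods are illustrative only and are not part of the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    /** Returns the modified UTF-8 bytes of s, as written by DataOutputStream.writeUTF. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // Drop the two-byte length prefix that writeUTF puts in front of the data.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String supplementary = new String(Character.toChars(0x1F600));

        // U+0000: one zero byte in standard UTF-8, two bytes in modified UTF-8.
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));           // 00
        System.out.println(hex(modifiedUtf8(nul)));                              // C0 80

        // U+1F600: four bytes in standard UTF-8; modified UTF-8 encodes each
        // surrogate separately as three bytes, as CESU-8 does.
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary)));                    // ED A0 BD ED B8 80
    }
}
```

      The two-byte form of \u0000 is what distinguishes modified UTF-8 from CESU-8; the six-byte form of supplementary characters is what both share.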

      JUSTIFICATION :
      I'm thinking about how to solve the much voted for bug 4244499.

      My approach would be to move the encoding from the native code into the Java classes and let a Charset object control which encoding to use. To remain backward compatible, however, I would have to use modified UTF-8 by default.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Charset.forName("x-modified-UTF-8")
      returns a Charset that encodes to modified UTF-8 and decodes any UTF-8 variant.
      ACTUAL -
      UnsupportedCharsetException thrown.

      ---------- BEGIN SOURCE ----------
      import java.nio.charset.Charset;
      import java.util.Map;

      public class CharTest {
          public static void main(String[] args) throws Exception {
              // List every available charset together with its aliases.
              for (Map.Entry<String, Charset> p :
                      Charset.availableCharsets().entrySet()) {
                  System.out.println(p.getKey());
                  for (String a : p.getValue().aliases()) {
                      System.out.println("\t" + a);
                  }
              }
              // Currently throws UnsupportedCharsetException.
              System.out.println(Charset.forName("x-modified-UTF-8").name());
          }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      It is of course possible to implement the modified UTF-8 encoding manually, as has been done with Data*Stream or Zip*Stream. It would even be possible to have a Charset field to choose the charset, and have the special value null denote modified UTF-8. Every encoding or decoding would then check if the Charset object is present, and otherwise fall back to the manual implementation.
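
      The workaround described above could be sketched roughly as follows. This is a minimal illustration, not the code used in Data*Stream or Zip*Stream: the class name FallbackEncoder and its methods are hypothetical, and only the encoding direction is shown.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class FallbackEncoder {
    // Hypothetical convention from the workaround: null denotes modified UTF-8.
    private final Charset charset;

    public FallbackEncoder(Charset charset) {
        this.charset = charset;
    }

    /** Encodes with the configured Charset, or falls back to the manual encoder. */
    public byte[] encode(String s) {
        if (charset != null) {
            return s.getBytes(charset);
        }
        return encodeModifiedUtf8(s);
    }

    /** Manual modified UTF-8 encoder working on UTF-16 code units. */
    static byte[] encodeModifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // ASCII except NUL: one byte.
                out.write(c);
            } else if (c <= 0x07FF) {
                // NUL and U+0080..U+07FF: two bytes, so NUL becomes C0 80.
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Everything else, including each surrogate: three bytes.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

      With this sketch, new FallbackEncoder(null).encode("\u0000") yields the two bytes C0 80, while passing a real Charset delegates to String.getBytes(Charset). The null check at every call site is the duplication that a dedicated Charset would remove.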

            sherman Xueming Shen
            ndcosta Nelson Dcosta (Inactive)