JDK-8215464

Java API documentation on CharsetDecoder usage should be refined


      ADDITIONAL SYSTEM INFORMATION :
      Windows 10/7, JRE 1.8.0_171 (as well as 1.8.0_191).

      A DESCRIPTION OF THE PROBLEM :
      The API documentation for class java.nio.charset.CharsetDecoder is not entirely accurate. The following excerpt, taken from https://docs.oracle.com/javase/8/docs/api/java/nio/charset/CharsetDecoder.html, could be refined:

      The input byte sequence is provided in a byte buffer or a series of such buffers. The output character sequence is written to a character buffer or a series of such buffers. A decoder should always be used by making the following sequence of method invocations, hereinafter referred to as a decoding operation:

      1. Reset the decoder via the reset method, unless it has not been used before;

      2. Invoke the decode method zero or more times, as long as additional input may be available, passing false for the endOfInput argument and filling the input buffer and flushing the output buffer between invocations;

      3. Invoke the decode method one final time, passing true for the endOfInput argument; and then

      4. Invoke the flush method so that the decoder can flush any internal state to the output buffer.
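
      Rendered literally as code, the documented sequence might look like the following sketch (the channel-based input, buffer sizes, and class name are my assumptions, not part of the documentation; CoderResult error checking is omitted for brevity):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DocumentedProtocol {
    // A literal rendering of the documented steps 1-4.
    static String decode(ReadableByteChannel ch, CharsetDecoder cd) throws IOException {
        ByteBuffer bb = ByteBuffer.allocate(16);
        CharBuffer cb = CharBuffer.allocate(64);    // assumed "large enough" output buffer
        StringBuilder sb = new StringBuilder();
        cd.reset();                                 // step 1: reset
        while (ch.read(bb) != -1) {                 // step 2: while more input is available
            bb.flip();
            cd.decode(bb, cb, false);               //   decode with endOfInput == false
            bb.compact();                           //   keep any undecoded residue bytes
            cb.flip(); sb.append(cb); cb.clear();   //   drain the output buffer
        }
        bb.flip();
        cd.decode(bb, cb, true);                    // step 3: final decode, exactly once
        cd.flush(cb);                               // step 4: flush internal state
        cb.flip(); sb.append(cb);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "西班牙国王".getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(utf8));
        System.out.println(decode(ch, StandardCharsets.UTF_8.newDecoder()));
    }
}
```

      This works whenever the output buffer is large enough at step 3 - the problem described next appears when it is not.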

      The usage protocol above omits an issue: the final decode call (with endOfInput as true) cannot handle the scenario where the output CharBuffer does not have enough room to hold the decoded content for all the remaining bytes in the ByteBuffer. In that scenario the ByteBuffer is not exhausted (some residue bytes are left). By the protocol, this decode call (with endOfInput as true) is made only once, so the subsequent flush call, which takes no ByteBuffer as an input parameter, will miss those residue bytes.

      The following is an example to showcase that scenario (tested in JRE 1.8.0_171):

      import java.nio.ByteBuffer;
      import java.nio.CharBuffer;
      import java.nio.charset.*;

      // the 5 Chinese chars below (meaning "king of Spain") might not display well
      // final byte[] data = "西班牙国王".getBytes("UTF-8");
      // assign the UTF-8 bytes of the 5 Chinese chars above directly into a byte array
      final byte[] data = new byte[] {
          (byte)0xE8, (byte)0xA5, (byte)0xBF, (byte)0xE7, (byte)0x8F, (byte)0xAD,
          (byte)0xE7, (byte)0x89, (byte)0x99, (byte)0xE5, (byte)0x9B, (byte)0xBD,
          (byte)0xE7, (byte)0x8E, (byte)0x8B };
      System.out.println("data size "+data.length);
      ByteBuffer bb = ByteBuffer.allocate(data.length);
      bb.put(data).flip();
      System.out.println("byte buffer has "+bb.remaining()+" remaining bytes");
      System.out.println("-----------");
      CharsetDecoder cd = Charset.forName("UTF-8").newDecoder();
      CharBuffer cb = CharBuffer.allocate(4);
      System.out.println("decode with endOfInput==true ...");
      CoderResult cr = cd.decode(bb, cb, true);
      System.out.println("byte buffer has "+bb.remaining()+" remaining bytes");
      System.out.println("coder result "+(cr.isUnderflow()?"underflow":(cr.isOverflow()?"overflow":"error")));
      cb.flip();
      System.out.println("char buffer has "+cb.remaining()+" remaining chars");
      while (cb.remaining()>0) System.out.print(cb.get());
      System.out.println();
      System.out.println("-----------");
      cb.clear();
      System.out.println("flush ...");
      cr = cd.flush(cb);
      System.out.println("byte buffer has "+bb.remaining()+" remaining bytes");
      System.out.println("coder result "+(cr.isUnderflow()?"underflow":(cr.isOverflow()?"overflow":"error")));
      cb.flip();
      System.out.println("char buffer has "+cb.remaining()+" remaining chars");
      while (cb.remaining()>0) System.out.print(cb.get());

      and here's the output:

      data size 15
      byte buffer has 15 remaining bytes
      -----------
      decode with endOfInput==true ...
      byte buffer has 3 remaining bytes
      coder result overflow
      char buffer has 4 remaining chars
      西班牙国
      -----------
      flush ...
      byte buffer has 3 remaining bytes
      coder result underflow
      char buffer has 0 remaining chars

      So, does this mean that the CharsetDecoder API design is flawed, or that the UTF-8 decoder implementation is flawed? Probably not - decoder implementations are not obliged to cache any input bytes, although they do have to cache some decoded characters when the output char buffer is not large enough. The decoders would work perfectly well in the scenario above if the protocol were more refined. Here are the refinements I propose to the protocol:

      (a) The pre-conditions for calling decode(,,true), in step #3 of the current document above, should be changed to:
      (i) there exists a previous decode(,,false) call whose ByteBuffer parameter holds the final input bytes (no more data after that), and
      (ii) the decode call in (i) above results in CoderResult.UNDERFLOW.
      The ByteBuffer parameter, after (i) and (ii), is either empty (no remaining bytes) or holds some residue bytes. It can now be safely handed as a parameter to the decode(,,true) call.
      (b) Because of (i) in (a), step #2 in the current document, which states "zero or more times", should be changed to "one or more times".

      That keeps the protocol requirement that decode(,,true) be called once and only once. In my tests the refined protocol worked well in the presented scenario (as well as some other unusual scenarios).
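
      A sketch of the refined protocol, applied to the same 15 UTF-8 bytes with the same deliberately small 4-char output buffer (the helper name and the single pre-filled input buffer are my simplifications; real code would refill the ByteBuffer between decode(,,false) calls, and would check for error results):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class RefinedProtocol {
    // Refined protocol: decode(,,false) until UNDERFLOW, then decode(,,true)
    // once and only once, then flush until UNDERFLOW, draining the output
    // buffer after every call.
    static String decodeAll(byte[] data, CharsetDecoder cd, int outCapacity) {
        ByteBuffer bb = ByteBuffer.wrap(data);      // all input pre-filled here
        CharBuffer cb = CharBuffer.allocate(outCapacity);
        StringBuilder sb = new StringBuilder();
        CoderResult cr;
        do {                                        // decode(,,false) until UNDERFLOW
            cr = cd.decode(bb, cb, false);
            cb.flip(); sb.append(cb); cb.clear();
        } while (cr.isOverflow());
        // cr is now UNDERFLOW, so per pre-conditions (i) and (ii) it is
        // safe to hand bb to decode(,,true), once and only once.
        cd.decode(bb, cb, true);
        cb.flip(); sb.append(cb); cb.clear();
        do {                                        // flush until UNDERFLOW
            cr = cd.flush(cb);
            cb.flip(); sb.append(cb); cb.clear();
        } while (cr.isOverflow());
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "西班牙国王".getBytes(StandardCharsets.UTF_8);
        // a 4-char output buffer reproduces the overflow scenario above,
        // yet all five characters are recovered
        System.out.println(decodeAll(data, StandardCharsets.UTF_8.newDecoder(), 4));
    }
}
```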

      The following are some deeper technical insights into the proposal above, for those who doubt it. To simplify the discussion, I assume that in all the scenarios below the CharBuffer has room for at least one character during decode calls (whether endOfInput is true or false) and flush calls. (A no-room CharBuffer in those calls won't break the proposed protocol, and won't introduce any new issues into the existing protocol. But that's another topic.)

      Doubt #1: in (a) above, after (i) and (ii), if the ByteBuffer is empty (no residue bytes), it seems you don't have to call decode(,,true), but can call flush directly.
      Answer: that would move away from a uniform protocol of calling decode(,,true) once, and would make the protocol details unduly complicated. It would also complicate decoder implementations, which would then have to learn of the end of input from two sources - the decode(,,true) call and the first flush call.

      Doubt #2: in (a) above, after (i) and (ii), if the ByteBuffer has residue bytes, the decode(,,true) call is bound to return an error. Hence the client code can skip decode(,,true) and abort directly.
      Answer: residue bytes might not mean an error. For example, suppose there's a text encoding that says, among other things, that byte 0x00 should be translated to character 'a' if it is the last byte, and to 'b' if it is followed by byte 0x01. For such an encoding, a single residue byte of 0x00 will not result in an error when decode(,,true) is called - it will happily return an 'a' instead. Such a text encoding is probably nerdy in the real world, but is still technically sound.

      Doubt #3: (a) proposed above is unnecessarily restrictive. (ii) of (a) can be relaxed to "if the ByteBuffer's remaining bytes are no more than the CharBuffer's remaining room, it's safe to call decode(,,true)".
      Answer: the claim implies that a byte can be decoded to at most one char, which is not always true. Suppose there's a text encoding in which byte 0x00 translates to 100 'a' characters, and the ByteBuffer has two such bytes in it (0x00 0x00). A call to decode(,,true) with an empty CharBuffer of capacity 90 will consume the first 0x00 byte and return an overflow result, with the CharBuffer holding 90 'a' characters, the decoder caching the remaining 10 'a' internally, and the ByteBuffer still holding the last 0x00 byte. The subsequent flush will emit those 10 'a' (if the CharBuffer is emptied before the call), but the last 0x00 byte in the ByteBuffer will be lost, as the flush method does not accept a ByteBuffer as input.
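
      That hypothetical one-byte-to-100-chars encoding is easy to sketch as a custom Charset (the class name, canonical name, and caching strategy below are my inventions, purely for illustration). Running it reproduces the lost-byte scenario just described:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical charset: each 0x00 byte decodes to 100 'a' characters.
public class HundredA extends Charset {
    public HundredA() { super("X-HUNDRED-A", null); }
    public boolean contains(Charset cs) { return cs instanceof HundredA; }
    public CharsetEncoder newEncoder() { throw new UnsupportedOperationException(); }
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 100f, 100f) {
            private int pending = 0;   // decoded 'a' chars not yet written out
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (true) {
                    while (pending > 0) {            // emit cached chars first
                        if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                        out.put('a'); pending--;
                    }
                    if (!in.hasRemaining()) return CoderResult.UNDERFLOW;
                    in.get();                        // consume one 0x00 byte ...
                    pending = 100;                   // ... which decodes to 100 chars
                }
            }
            protected CoderResult implFlush(CharBuffer out) {
                while (pending > 0) {                // flush cached output chars only;
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    out.put('a'); pending--;         // residue input bytes are invisible here
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public static void main(String[] args) {
        CharsetDecoder cd = new HundredA().newDecoder();
        ByteBuffer bb = ByteBuffer.wrap(new byte[] { 0x00, 0x00 }); // 200 chars' worth
        CharBuffer cb = CharBuffer.allocate(90);
        CoderResult cr = cd.decode(bb, cb, true);
        System.out.println(cr + ", input bytes left: " + bb.remaining()); // overflow, 1 byte left
        cb.clear();
        cr = cd.flush(cb);
        cb.flip();
        System.out.println(cr + ", flushed chars: " + cb.remaining()
                + ", input bytes left: " + bb.remaining()); // 10 chars flushed; the byte is lost
    }
}
```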

      Doubt #4: (a) proposed above is unnecessarily restrictive. (ii) of (a) can be relaxed to "if the ByteBuffer has only one remaining byte, it's safe to call decode(,,true)".
      Answer: not always true. Take the same text encoding above. The ByteBuffer has two bytes, both 0x00. You call decode(,,false) first to consume one byte. Suppose the CharBuffer is empty with a capacity of 40: the call will consume one 0x00 byte, fill the CharBuffer with 40 'a' chars, and cache 60 'a' chars inside the decoder. A subsequent call to decode(,,true) (since the ByteBuffer now has only one byte left) will not consume that last 0x00 byte. Instead the decoder will fill the CharBuffer with another 40 'a' (coder result overflow), reducing its internal cache to 20 'a' and leaving the last 0x00 byte in the ByteBuffer. Thus a subsequent flush call will not pick up that last input byte. However, if you keep calling decode(,,false) until you get an underflow, the subsequent decode(,,true) call will surely pick up that last 0x00 byte (and then you keep calling flush until you get an underflow, as required by the current protocol).

      Thus the underflow status of the last decode(,,false) call (before a decode(,,true) call) is not only sufficient, but also necessary.



            Naoto Sato
            Webbug Group