JDK / JDK-6636317

Optimize UTF-8 coder for ASCII input


    • b35
    • generic
    • generic
    • Not verified

      The UTF-8 coder can get a dramatic speedup from a special method
      that handles only ASCII and delegates to a general-purpose method
      if the input contains non-ASCII.

      Here's the kind of method I'm thinking of:

      private CoderResult decodeArrayLoop(ByteBuffer src, CharBuffer dst) {
          byte[] sa = src.array();
          int sp = src.arrayOffset() + src.position();
          int sl = src.arrayOffset() + src.limit();

          char[] da = dst.array();
          int dp = dst.arrayOffset() + dst.position();
          int dl = dst.arrayOffset() + dst.limit();

          CoderResult result = null;

          for (;;) {
              if (sp >= sl) {
                  result = CoderResult.UNDERFLOW;
                  break;
              }
              int b = sa[sp];
              if (b < 0)        // high bit set: non-ASCII, fall through to the general decoder
                  break;
              if (dp >= dl) {
                  result = CoderResult.OVERFLOW;
                  break;
              }
              da[dp++] = (char) b;
              sp++;
          }
          src.position(sp - src.arrayOffset());
          dst.position(dp - dst.arrayOffset());
          return result != null ? result : decodeArrayLoop1(src, dst);
      }
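The core of the fast path is just a scan-and-copy over the ASCII prefix. A minimal standalone sketch of that idea (the method and class names here are mine for illustration, not from the proposed patch):

```java
public class AsciiFastPath {
    // Copies bytes to chars until the first negative (non-ASCII) byte
    // or the end of input, returning how many chars were written.
    // A real decoder would hand the remainder to the general-purpose loop.
    static int decodeAsciiPrefix(byte[] src, int sp, int sl, char[] dst, int dp) {
        int start = dp;
        while (sp < sl) {
            int b = src[sp];
            if (b < 0)          // high bit set: not ASCII, stop here
                break;
            dst[dp++] = (char) b;
            sp++;
        }
        return dp - start;
    }
}
```

Because the loop body is a single compare, copy, and two increments, the JIT can typically turn it into very tight machine code, which is where the speedup comes from.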

      The non-ASCII decoder case can be sped up as well by avoiding the big switch.
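One way to avoid the switch is to classify the leading byte with range tests on its sign-extended value. A hedged sketch with illustrative names (decodeOne is not from the actual patch), covering only the 1-, 2-, and 3-byte cases:

```java
public class Utf8Dispatch {
    // Decodes one UTF-8 sequence starting at src[sp] into dst[dp],
    // dispatching on the leading byte with comparisons rather than a
    // switch on (b >> 4). Returns the new source position, or -1 for
    // the cases elided here. Sketch only: no bounds or validity checks.
    static int decodeOne(byte[] src, int sp, char[] dst, int dp) {
        int b1 = src[sp];                    // sign-extended byte
        if (b1 >= 0) {                       // 0xxxxxxx: ASCII
            dst[dp] = (char) b1;
            return sp + 1;
        } else if ((b1 >> 5) == -2) {        // 110xxxxx: 2-byte sequence
            int b2 = src[sp + 1];
            dst[dp] = (char) (((b1 & 0x1f) << 6) | (b2 & 0x3f));
            return sp + 2;
        } else if ((b1 >> 4) == -2) {        // 1110xxxx: 3-byte sequence
            int b2 = src[sp + 1], b3 = src[sp + 2];
            dst[dp] = (char) (((b1 & 0x0f) << 12)
                              | ((b2 & 0x3f) << 6)
                              | (b3 & 0x3f));
            return sp + 3;
        }
        return -1;                           // 4-byte and malformed cases elided
    }
}
```

The `(b1 >> 5) == -2` test works because the byte is sign-extended: arithmetic shift of a `110xxxxx` byte leaves only the sign and the `10` tag, which compare equal to -2.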
      More minor improvements:

      ---

      We can get rid of the code below,
      since our implementation always guarantees it,
      and users cannot create their own buggy ByteBuffer
      or CharBuffer implementations, and even if they did,
      our code is allowed to assume it is non-buggy.

      // assert (sp <= sl);
      // sp = (sp <= sl ? sp : sl);

      ---

      In the ASCII case, the &-ing with 0x7f is useless,
      since the 0x80 bit is already guaranteed to be off.

      // da[dp++] = (char)(b1 & 0x7f);
      da[dp++] = (char) b1;

      ---

      More deviously, we can snatch a few cycles in the 2-byte case
      as follows:

      da[dp++] = (char) (((b1 << 6) ^ b2) ^ 0x0f80);
      // da[dp++] = ((char)(((b1 & 0x1f) << 6) |
      //                    ((b2 & 0x3f) << 0)));
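The XOR form works because both bytes are sign-extended, so the constant 0x0f80 cancels exactly the `110.....`/`10......` tag bits that the shift and XOR smear into the result. The identity is small enough to check exhaustively; a quick sanity check (mine, not from the report):

```java
public class TwoByteXorCheck {
    // For every well-formed 2-byte UTF-8 sequence (b1 in 0xC2..0xDF,
    // b2 in 0x80..0xBF), the XOR trick must agree with the masked form.
    static boolean check() {
        for (int i = 0xC2; i <= 0xDF; i++) {
            for (int j = 0x80; j <= 0xBF; j++) {
                int b1 = (byte) i;  // sign-extended, as in the decoder
                int b2 = (byte) j;
                char fast = (char) (((b1 << 6) ^ b2) ^ 0x0f80);
                char slow = (char) (((b1 & 0x1f) << 6) | (b2 & 0x3f));
                if (fast != slow)
                    return false;
            }
        }
        return true;
    }
}
```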


      ---

      Only significant for smaller coding operations, but we should only
      instantiate a Surrogate.Generator or Surrogate.Parser in the unlikely
      (in the real world) event of surrogates in the input stream.

      if (sgg == null)
          sgg = new Surrogate.Generator();
      int gn = sgg.generate(uc, n, da, dp, dl);
      ....
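For illustration, the work such a generator does can be expressed with the public java.lang.Character API; a self-contained sketch (Character's surrogate methods, not the internal Surrogate.Generator):

```java
public class SurrogateSketch {
    // Writes code point uc into da at dp, returning the number of chars
    // written: 1 for a BMP code point, 2 for a surrogate pair. This is
    // the work the lazily-created generator performs in the decoder.
    static int generate(int uc, char[] da, int dp) {
        if (Character.isBmpCodePoint(uc)) {
            da[dp] = (char) uc;
            return 1;
        }
        da[dp] = Character.highSurrogate(uc);
        da[dp + 1] = Character.lowSurrogate(uc);
        return 2;
    }
}
```

Since supplementary code points are rare in typical input, deferring the allocation until this path is actually hit avoids per-operation garbage.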

      ---
      The comparison below is vacuously true, since c is of type char.

      if (c <= '\uFFFF') {
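As a quick demonstration (char is an unsigned 16-bit type whose maximum value is '\uFFFF', so the branch is always taken):

```java
public class CharRangeCheck {
    // Even the largest possible char satisfies c <= '\uFFFF',
    // so the comparison can never be false and the test is dead code.
    static boolean vacuouslyTrue() {
        char c = Character.MAX_VALUE;
        return c <= '\uFFFF';
    }
}
```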

      ---

            sherman Xueming Shen
            martin Martin Buchholz
