JDK-6957230

CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be 3


    • b121
    • x86
    • linux
    • Verified

      A DESCRIPTION OF THE REQUEST :
      Short summary: CharsetEncoder.maxBytesPerChar() returns 4.0 for UTF-8, but the *real* value should be 3.0. A code point can produce a 4-byte UTF-8 sequence, but only supplementary code points do so, and each of those requires *two UTF-16 chars* (a surrogate pair). Such code points therefore have a bytes-per-char value of 2. The true maximum of 3 comes from code points in the range U+0800 to U+FFFF, which encode to 3 bytes from a single char.
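      The ratio can be checked for a few representative code points. The following is a minimal sketch, not part of the original report; the class name Utf8BytesPerChar and the helper bytesPerChar are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BytesPerChar {
    // Hypothetical helper: UTF-8 byte count divided by UTF-16 char count
    // for the string holding a single code point.
    static double bytesPerChar(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length / (double) s.length();
    }

    public static void main(String[] args) {
        int[] samples = {
            0x41,     // 1 byte,  1 char
            0xE9,     // 2 bytes, 1 char
            0x20AC,   // 3 bytes, 1 char
            0x1F600   // 4 bytes, 2 chars (surrogate pair)
        };
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + ": " + bytesPerChar(cp) + " bytes per char");
        }
    }
}
```

      The supplementary code point U+1F600 yields 4 bytes over 2 chars, so its ratio is lower than that of U+20AC.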


      JUSTIFICATION :
      This is a performance issue, not a correctness issue: the code path for String.getBytes("UTF-8") ends up allocating a *worst case* sized buffer, computed from this value. Reducing it from 4.0 to 3.0 shrinks that buffer by a quarter, which will reduce garbage collection rates for string processing applications.
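      The effect on buffer sizing can be illustrated with a simplified model. This sketch assumes the worst-case size is simply the char count multiplied by maxBytesPerChar(); it is not the JDK's actual allocation code, and the class name WorstCaseBuffer and the helper worstCaseBytes are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    // Hypothetical helper: upper bound on encoded size, assuming the
    // buffer is sized as charCount * maxBytesPerChar.
    static int worstCaseBytes(int charCount, float maxBytesPerChar) {
        return (int) (charCount * (double) maxBytesPerChar);
    }

    public static void main(String[] args) {
        String s = "hello, world";
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

        // Bytes the string actually needs once encoded.
        int actual = s.getBytes(StandardCharsets.UTF_8).length;

        System.out.println("Actual encoded length:     " + actual);
        System.out.println("Bound with reported value: "
                + worstCaseBytes(s.length(), encoder.maxBytesPerChar()));
        System.out.println("Bound with 4.0:            "
                + worstCaseBytes(s.length(), 4.0f));
        System.out.println("Bound with 3.0:            "
                + worstCaseBytes(s.length(), 3.0f));
    }
}
```

      For this 12-char ASCII string, the bound drops from 48 to 36 bytes under the model, while the encoded output itself needs only 12.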


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Charset.forName("UTF-8").newEncoder().maxBytesPerChar() should return 3.0

      See the example code for a program that computes and verifies this value.
      ACTUAL -
      Charset.forName("UTF-8").newEncoder().maxBytesPerChar() returns 4.0

      ---------- BEGIN SOURCE ----------
      import java.nio.charset.Charset;

      public class Test {
          public static void main(String[] arguments)
                  throws java.io.UnsupportedEncodingException {
              System.out.println("Reported max bytes per char: " +
                      Charset.forName("UTF-8").newEncoder().maxBytesPerChar());

              double maxBytesPerChar = -1;
              for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
                  String s = new String(Character.toChars(i));
                  assert 0 < s.length() && s.length() <= 2;
                  byte[] utf8 = s.getBytes("UTF-8");

                  double bytesPerChar = utf8.length / (double) s.length();
                  if (bytesPerChar > maxBytesPerChar) {
                      maxBytesPerChar = bytesPerChar;
                  }
              }

              System.out.println("Computed real max bytes per char: " +
                      maxBytesPerChar);
          }
      }

      ---------- END SOURCE ----------

            sherman Xueming Shen
            webbuggrp Webbug Group