JDK-6764325

(str) String.getBytes(Charset) is slower than getBytes(String)


    • Type: Bug
    • Resolution: Fixed
    • Priority: P4
    • 9
    • 6u10
    • core-libs
    • inapplicable
    • x86
    • linux

      FULL PRODUCT VERSION :
      Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
      Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

      and

      Java(TM) SE Runtime Environment (build 1.7.0-ea-b38)
      Java HotSpot(TM) 64-Bit Server VM (build 14.0-b05, mixed mode)

      ADDITIONAL OS VERSION INFORMATION :
      Linux 2.6.27.3 #1 SMP Sat Oct 25 10:15:42 EEST 2008 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ AuthenticAMD GNU/Linux


      A DESCRIPTION OF THE PROBLEM :
      byte[] String.getBytes(Charset charset) is slower than byte[] String.getBytes(String encoding).
      It should not be, because the latter has to do extra work (looking up the charset by name) compared to the former.

      There are two reasons why the Charset version is slower:
      1) java.lang.StringCoding.encode(Charset, ...) needlessly copies the char[]
      - This causes a slowdown of 4-7% for small strings. The slowdown grows for larger strings.

      2) java.lang.StringCoding.encode(Charset, ...) always creates a new StringEncoder
      - Creating a new StringEncoder is slower than reusing the thread-local cached one when the same charset is used repeatedly

        Suggested replacement code for Java6 and Java5:

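      static byte[] encode(Charset cs, char[] ca, int off, int len) {
          // Reuse the thread-local cached encoder when it was created for this same Charset instance.
          StringEncoder se = deref(encoder);
          if (se == null || se.cs != cs) {
              // Cache miss: create an encoder for cs and store it for later calls on this thread.
              se = new StringEncoder(cs, cs.name());
              set(encoder, se);
          }
          // Encode directly from ca, without copying the char[] first.
          return se.encode(ca, off, len);
      }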

      I did not use cs.equals(se.cs) because I believe Charset instances are cached, so it is not easy to create two distinct Charset instances with the same name.
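
The caching idea in the suggested patch can be tried outside the JDK with public APIs only. The following is a minimal sketch under stated assumptions, not the JDK's internal code: the class name CachedEncoderSketch is hypothetical, java.nio.charset.CharsetEncoder stands in for the internal StringEncoder, and StandardCharsets requires Java 7 or later. It keeps one encoder per thread, replaces it only when a different Charset instance is requested, and wraps the char[] instead of copying it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of the JDK.
public class CachedEncoderSketch {
    // One cached encoder per thread, because CharsetEncoder is not thread-safe.
    private static final ThreadLocal<CharsetEncoder> CACHE = new ThreadLocal<>();

    public static byte[] encode(Charset cs, char[] ca, int off, int len) {
        CharsetEncoder ce = CACHE.get();
        // Identity comparison, mirroring "se.cs != cs" in the suggested patch.
        if (ce == null || ce.charset() != cs) {
            ce = cs.newEncoder()
                   .onMalformedInput(CodingErrorAction.REPLACE)
                   .onUnmappableCharacter(CodingErrorAction.REPLACE);
            CACHE.set(ce);
        }
        try {
            // CharBuffer.wrap creates a view over ca (no copy of the input);
            // encode(CharBuffer) resets the encoder before encoding.
            ByteBuffer bb = ce.encode(CharBuffer.wrap(ca, off, len));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrown unchecked for simplicity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        char[] ca = "abcd".toCharArray();
        byte[] cached = encode(StandardCharsets.ISO_8859_1, ca, 0, ca.length);
        byte[] direct = "abcd".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(cached, direct));
    }
}
```

Whether this cache pays off depends on how often the charset changes between calls on the same thread; pattern T2 below exercises exactly that case.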

      Tests:
      T1: Repeated encoding of 4-character strings in ISO-8859-1.
      T2: Encoding of 4-character strings alternating between ISO-8859-1 and UTF-8, with two consecutive calls per encoding before switching.

      M1 = getBytes(String)
      M2 = getBytes(Charset)
      M3 = getBytes(Charset) with attached modifications
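
The benchmark source is not attached to this report. The sketch below is one possible way to time M1 against M2 for the T1 and T2 call patterns; the class name GetBytesTiming, the iteration count, and the round count are assumptions made for illustration. It is a naive System.nanoTime loop, so results are only indicative; a harness such as JMH would give more trustworthy numbers. M3 cannot be measured this way because it requires a patched JDK.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical timing sketch; not the reporter's original benchmark.
public class GetBytesTiming {
    // T1: always ISO-8859-1.
    public static final Charset[] T1 = { StandardCharsets.ISO_8859_1 };
    // T2: two calls with each encoding before switching.
    public static final Charset[] T2 = {
        StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1,
        StandardCharsets.UTF_8, StandardCharsets.UTF_8
    };

    /** Returns { elapsedNanos, totalBytesProduced }. */
    public static long[] run(String s, Charset[] pattern, boolean byName, int iters) {
        // Resolve names up front so name() is not part of the timed loop.
        String[] names = new String[pattern.length];
        for (int i = 0; i < names.length; i++) {
            names[i] = pattern[i].name();
        }
        long total = 0; // consumed below so the JIT cannot discard the encoding work
        long start = System.nanoTime();
        try {
            for (int i = 0; i < iters; i++) {
                int k = i % pattern.length;
                byte[] b = byName ? s.getBytes(names[k])    // M1
                                  : s.getBytes(pattern[k]); // M2
                total += b.length;
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
        return new long[] { System.nanoTime() - start, total };
    }

    public static void main(String[] args) {
        String s = "abcd";
        int iters = 5_000_000;
        // Early rounds mostly serve as JIT warm-up; compare the later ones.
        for (int round = 0; round < 5; round++) {
            double t1 = (double) run(s, T1, false, iters)[0] / run(s, T1, true, iters)[0];
            double t2 = (double) run(s, T2, false, iters)[0] / run(s, T2, true, iters)[0];
            System.out.printf("round %d: T1 M2/M1 = %.3f, T2 M2/M1 = %.3f%n", round, t1, t2);
        }
    }
}
```

On JDK releases that contain the fix for this issue, the ratios are not expected to match the Java 6 and Java 7 figures reported here.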

      Results for Java6 (time relative to M1; lower is faster):
              M1      M2      M3
      T1   1.000   1.792   1.000
      T2   1.000   1.073   0.902

      Results for Java7 (time relative to M1; lower is faster):
              M1      M2      M3
      T1   1.000   1.792   1.042
      T2   1.000   0.974   0.872

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Compare the speed of String.getBytes(Charset) with that of String.getBytes(String).

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      String.getBytes(Charset) should be as fast as or faster than String.getBytes(String)
      ACTUAL -
      String.getBytes(Charset) is up to 80% slower than String.getBytes(String)

      REPRODUCIBILITY :
      This bug can be reproduced always.

      CUSTOMER SUBMITTED WORKAROUND :
      Use String.getBytes(charset.name()) instead of String.getBytes(charset)
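
A minimal sketch of the workaround as a helper, assuming the Charset was obtained through the usual lookup (Charset.forName or StandardCharsets) so that its canonical name resolves again. The class and method names are hypothetical. The by-name overload declares a checked UnsupportedEncodingException, which the helper converts to an unchecked one.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating the workaround; only useful on affected JDK versions.
public class GetBytesWorkaround {
    public static byte[] encodeByName(String s, Charset cs) {
        try {
            // Route through the by-name overload, which uses the cached encoder.
            return s.getBytes(cs.name());
        } catch (UnsupportedEncodingException e) {
            // Not expected for a charset that was itself looked up by name.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] viaName = encodeByName("abcd", StandardCharsets.UTF_8);
        byte[] viaCharset = "abcd".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(viaName, viaCharset));
    }
}
```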

            Assignee: Unassigned
            Reporter: Roger Yeung (ryeung, Inactive)