Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-4877689

(str) Need a ByteArray implementation for searching a string stored as byte

XMLWordPrintable

    • Icon: Enhancement Enhancement
    • Resolution: Unresolved
    • Icon: P4 P4
    • None
    • 1.4.2_02
    • core-libs
    • sparc
    • solaris_9

      ###@###.### 2003-06-11

      I propose to create a new class called ByteArray to help searching a pattern
      in a character set stored as bytes for all network related applications. Or
      we should allow a String be stored as byte[] instead of char[] as indicated
      in Bug Id: 4091286

      I am writing a java based HTTP driver code for SPECjApp2003 and SPECweb 200x
      java harness. The driver has to parse the HTTP header every time it
      gets a response buffer. This involves heavy searching for multiple patterns
      in the header. Currently java only provides indexOf() method in String class
      for searching a pattern. But since network data can only be read() as byte
      arrays, we have to convert the byte array to a new String in order to take
      advantage of this method. This impose a huge performance inefficiency both
      in terms of memory utilization and response time. Since the String object
      is created for each HTTP response and is used only once, we see huge amount
      of Garbage Collection activities and response time of header processing in
      not acceptable.

      Because of this problem, our java harness for SPECweb200x benchmark supports
      2x fewer throughput than the C based implementation. For this reason, a few
      companies in SPEC committee (including Microsoft) had voted to use C/C++ for
      the next gerneration of the benchmark but this will definitely hurt the idea
      of having a platform independent benchmark.

      As a workaround, we implemented a ByteArray class which takes the algorithm
      of String.indexOf() and search against on byte[]. It instantly brought the
      java harness performance to be the same level as C based implementation.

      We wrote a microbenchmark (attached) for measure the perf difference of
      3 approaches. Note the 3rd approach is listed just for perf comparison
      purpose to prove how bad to have to create a new String every time.

      1. allocation a new String every time a response comes in
           a. read header into a byte[]
           b. create a new String() to hold the header
           c. parse the header for 3 pattens
              (CRLF, Content-Length:, Connection: Close)
           d. repeat the above b & c 20000 times
         cpu time is measured for d.

      2. search against the byte[] directly
           a. read header into a byte[]
           b. search directly against byte[] using a new class called ByteArray
              which implements the same searching algorithm as String.indexOf()
           c. repeat b 20000 times
         cpu time is measured for c.

      3. re-use the String (not feasible for a real HTTP driver since each
         response is different)
           a. read header into a byte[]
           b. create a new String() to hold the header
           c. parse the header for 3 patterns
              (CRLF, Content-Length:, Connection: Close)
           d. repeat only c 20000 times
         cpu time is measured for d.

      The perf number looks like (in msec):

      # java StringPerf 20000

      1. create new string: 908
      2. opt byte[] parser: 398
      3. re-use old string: 328

      This proves that converting to a new String is very costly. For all
      network related application, it's important to have a faster way
      to parse the network data without having to convert everything into
      new Strings.
       
      Of course users can always implement their own ByteArray class to help them
      do the search directly. But this has to be done for each of such application.
      An enchancement of java.lang package can help offload this burden from the
      user and can implement a optimized search algorithm as the one for
      String.indexOf.

      As a side note, a sub-optimized search algorithm can slow down the byte
      array parser substaintially.

      4. sub-opt byte[] parser: 754

            Unassigned Unassigned
            nisun Ning Sun (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Imported:
              Indexed: