###@###.### 2003-06-11
I propose to create a new class called ByteArray to help searching a pattern
in a character set stored as bytes for all network related applications. Or
we should allow a String be stored as byte[] instead of char[] as indicated
in Bug Id: 4091286
I am writing a java based HTTP driver code for SPECjApp2003 and SPECweb 200x
java harness. The driver has to parse the HTTP header every time it
gets a response buffer. This involves heavy searching for multiple patterns
in the header. Currently java only provides indexOf() method in String class
for searching a pattern. But since network data can only be read() as byte
arrays, we have to convert the byte array to a new String in order to take
advantage of this method. This impose a huge performance inefficiency both
in terms of memory utilization and response time. Since the String object
is created for each HTTP response and is used only once, we see huge amount
of Garbage Collection activities and response time of header processing in
not acceptable.
Because of this problem, our java harness for SPECweb200x benchmark supports
2x fewer throughput than the C based implementation. For this reason, a few
companies in SPEC committee (including Microsoft) had voted to use C/C++ for
the next gerneration of the benchmark but this will definitely hurt the idea
of having a platform independent benchmark.
As a workaround, we implemented a ByteArray class which takes the algorithm
of String.indexOf() and search against on byte[]. It instantly brought the
java harness performance to be the same level as C based implementation.
We wrote a microbenchmark (attached) for measure the perf difference of
3 approaches. Note the 3rd approach is listed just for perf comparison
purpose to prove how bad to have to create a new String every time.
1. allocation a new String every time a response comes in
a. read header into a byte[]
b. create a new String() to hold the header
c. parse the header for 3 pattens
(CRLF, Content-Length:, Connection: Close)
d. repeat the above b & c 20000 times
cpu time is measured for d.
2. search against the byte[] directly
a. read header into a byte[]
b. search directly against byte[] using a new class called ByteArray
which implements the same searching algorithm as String.indexOf()
c. repeat b 20000 times
cpu time is measured for c.
3. re-use the String (not feasible for a real HTTP driver since each
response is different)
a. read header into a byte[]
b. create a new String() to hold the header
c. parse the header for 3 patterns
(CRLF, Content-Length:, Connection: Close)
d. repeat only c 20000 times
cpu time is measured for d.
The perf number looks like (in msec):
# java StringPerf 20000
1. create new string: 908
2. opt byte[] parser: 398
3. re-use old string: 328
This proves that converting to a new String is very costly. For all
network related application, it's important to have a faster way
to parse the network data without having to convert everything into
new Strings.
Of course users can always implement their own ByteArray class to help them
do the search directly. But this has to be done for each of such application.
An enchancement of java.lang package can help offload this burden from the
user and can implement a optimized search algorithm as the one for
String.indexOf.
As a side note, a sub-optimized search algorithm can slow down the byte
array parser substaintially.
4. sub-opt byte[] parser: 754
I propose to create a new class called ByteArray to help searching a pattern
in a character set stored as bytes for all network related applications. Or
we should allow a String be stored as byte[] instead of char[] as indicated
in Bug Id: 4091286
I am writing a java based HTTP driver code for SPECjApp2003 and SPECweb 200x
java harness. The driver has to parse the HTTP header every time it
gets a response buffer. This involves heavy searching for multiple patterns
in the header. Currently java only provides indexOf() method in String class
for searching a pattern. But since network data can only be read() as byte
arrays, we have to convert the byte array to a new String in order to take
advantage of this method. This impose a huge performance inefficiency both
in terms of memory utilization and response time. Since the String object
is created for each HTTP response and is used only once, we see huge amount
of Garbage Collection activities and response time of header processing in
not acceptable.
Because of this problem, our java harness for SPECweb200x benchmark supports
2x fewer throughput than the C based implementation. For this reason, a few
companies in SPEC committee (including Microsoft) had voted to use C/C++ for
the next gerneration of the benchmark but this will definitely hurt the idea
of having a platform independent benchmark.
As a workaround, we implemented a ByteArray class which takes the algorithm
of String.indexOf() and search against on byte[]. It instantly brought the
java harness performance to be the same level as C based implementation.
We wrote a microbenchmark (attached) for measure the perf difference of
3 approaches. Note the 3rd approach is listed just for perf comparison
purpose to prove how bad to have to create a new String every time.
1. allocation a new String every time a response comes in
a. read header into a byte[]
b. create a new String() to hold the header
c. parse the header for 3 pattens
(CRLF, Content-Length:, Connection: Close)
d. repeat the above b & c 20000 times
cpu time is measured for d.
2. search against the byte[] directly
a. read header into a byte[]
b. search directly against byte[] using a new class called ByteArray
which implements the same searching algorithm as String.indexOf()
c. repeat b 20000 times
cpu time is measured for c.
3. re-use the String (not feasible for a real HTTP driver since each
response is different)
a. read header into a byte[]
b. create a new String() to hold the header
c. parse the header for 3 patterns
(CRLF, Content-Length:, Connection: Close)
d. repeat only c 20000 times
cpu time is measured for d.
The perf number looks like (in msec):
# java StringPerf 20000
1. create new string: 908
2. opt byte[] parser: 398
3. re-use old string: 328
This proves that converting to a new String is very costly. For all
network related application, it's important to have a faster way
to parse the network data without having to convert everything into
new Strings.
Of course users can always implement their own ByteArray class to help them
do the search directly. But this has to be done for each of such application.
An enchancement of java.lang package can help offload this burden from the
user and can implement a optimized search algorithm as the one for
String.indexOf.
As a side note, a sub-optimized search algorithm can slow down the byte
array parser substaintially.
4. sub-opt byte[] parser: 754
- relates to
-
JDK-4091286 (str) String.getBytes() is inefficient, can't pass dest byte[], app can't reuse buffs
-
- Open
-