Type: Enhancement
Resolution: Fixed
Priority: P3
Affected Version: 1.4.2, 6
Resolved In Build: b78
CPU: generic, x86
OS: generic, solaris_2.5.1
Name: rmT116609 Date: 03/01/2004
A DESCRIPTION OF THE REQUEST :
I have a large number of byte[] arrays and I know the charset name (UTF-8) of the data in them. I am using JDK 1.4.2. I was wondering what the most efficient way is to create java.lang.String objects from these byte arrays. The existing String constructor accepts a charset name, and when I look at the Sun Java source code, it performs comparisons and other checks before it actually does the conversion, which seems inefficient to me.
What is missing is a String constructor that takes a CharsetDecoder as an argument. Is there any way to get this into JDK 1.5?
I am looking for the most efficient solution because this is such a basic operation.
From what I can see, the performance improvement is close to 40%: in the runs below, the old API takes about 37% longer than the new constructor.
I am running the tests on a 2.2 GHz Intel P4 box with the Java version shown below.
C:\cadyformatter\src\java\lang>java -version
java version "1.4.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0_01-b03)
Java HotSpot(TM) Client VM (build 1.4.0_01-b03, mixed mode)
C:\cadyformatter\src\java\lang>java -Xbootclasspath:.;c:\jdk1.4\jre\lib\rt.jar StringTest
New api time : 10453
Old api time : 14282
C:\cadyformatter\src\java\lang>java -Xbootclasspath:.;c:\jdk1.4\jre\lib\rt.jar StringTest
New api time : 10515
Old api time : 14297
C:\cadyformatter\src\java\lang>java -Xbootclasspath:.;c:\jdk1.4\jre\lib\rt.jar StringTest
New api time : 10453
Old api time : 14281
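The StringTest source is not included in this report. As a rough illustration only, a comparison of this kind could be sketched as follows. The class DecodeTiming and its method names are hypothetical, the sketch targets a current JDK rather than 1.4, and it uses the public CharsetDecoder.decode(ByteBuffer) convenience method instead of a patched java.lang.String, so its timings are not comparable to the numbers above.

```java
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical harness; NOT the reporter's StringTest.
class DecodeTiming {

    // Builds a UTF-8 decoder that substitutes bad input instead of failing.
    static CharsetDecoder newUtf8Decoder() {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    // Existing API: the charset is resolved from its name on every call.
    // Returns the total number of chars decoded so the work cannot be skipped.
    static long decodeByName(byte[][] inputs, int rounds) {
        long chars = 0;
        try {
            for (int r = 0; r < rounds; r++) {
                for (byte[] b : inputs) {
                    chars += new String(b, 0, b.length, "UTF-8").length();
                }
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        return chars;
    }

    // Alternative: one decoder instance is reused for every array.
    // The decoder is not thread-safe, so it must stay on one thread.
    static long decodeWithDecoder(byte[][] inputs, int rounds, CharsetDecoder cd) {
        long chars = 0;
        try {
            for (int r = 0; r < rounds; r++) {
                for (byte[] b : inputs) {
                    chars += cd.decode(ByteBuffer.wrap(b)).toString().length();
                }
            }
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected with REPLACE
        }
        return chars;
    }

    public static void main(String[] args) {
        byte[][] inputs = new byte[1000][];
        for (int i = 0; i < inputs.length; i++) {
            inputs[i] = ("sample text " + i).getBytes(Charset.forName("UTF-8"));
        }
        int rounds = 1000;
        long t0 = System.nanoTime();
        decodeByName(inputs, rounds);
        long t1 = System.nanoTime();
        decodeWithDecoder(inputs, rounds, newUtf8Decoder());
        long t2 = System.nanoTime();
        System.out.println("Charset name time : " + (t1 - t0) / 1000000 + " ms");
        System.out.println("Decoder time      : " + (t2 - t1) / 1000000 + " ms");
    }
}
```

A loop like this has no warm-up phase, so the results are only indicative; a benchmark framework would give more trustworthy numbers.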
The constructor I added to java.lang.String is:
public String(byte[] bytes, int offset, int length, CharsetDecoder cd)
{
    // Note: the decoder is not thread-safe; throw a concurrency exception if it is misused
    if (cd == null)
        throw new NullPointerException("null charset decoder");
    checkBounds(bytes, offset, length);
    cd.onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE);
    int en = (int)(cd.maxCharsPerByte() * length);
    char[] ca = new char[en];
    cd.reset();
    ByteBuffer bb = ByteBuffer.wrap(bytes, offset, length);
    CharBuffer cb = CharBuffer.wrap(ca);
    try {
        CoderResult cr = cd.decode(bb, cb, true);
        if (!cr.isUnderflow())
            cr.throwException();
        cr = cd.flush(cb);
        if (!cr.isUnderflow())
            cr.throwException();
    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
        throw new Error(x);
    }
    value = ca;
    count = cb.position();
}
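For readers who cannot patch java.lang.String, the same decoding steps can be approximated in ordinary application code. This is a minimal sketch, not part of the proposal: the class DecoderStrings and its decode method are hypothetical names, the bounds check stands in for the private checkBounds helper, and the result is built with the public String(char[], int, int) constructor, which copies the characters once more than the in-class version would.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper mirroring the proposed constructor from outside String.
class DecoderStrings {

    static String decode(byte[] bytes, int offset, int length, CharsetDecoder cd) {
        if (cd == null)
            throw new NullPointerException("null charset decoder");
        // Stand-in for String's private checkBounds(bytes, offset, length).
        if (offset < 0 || length < 0 || offset > bytes.length - length)
            throw new IndexOutOfBoundsException("offset " + offset + ", length " + length);

        // As in the proposal, this reconfigures the caller's decoder.
        cd.onMalformedInput(CodingErrorAction.REPLACE)
          .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Worst-case output size, rounded up so it is never too small.
        char[] ca = new char[(int) Math.ceil(cd.maxCharsPerByte() * (double) length)];
        cd.reset();
        ByteBuffer bb = ByteBuffer.wrap(bytes, offset, length);
        CharBuffer cb = CharBuffer.wrap(ca);
        try {
            CoderResult cr = cd.decode(bb, cb, true);
            if (!cr.isUnderflow())
                cr.throwException();
            cr = cd.flush(cb);
            if (!cr.isUnderflow())
                cr.throwException();
        } catch (CharacterCodingException x) {
            // Substitution is enabled, so this is not expected.
            throw new IllegalStateException(x);
        }
        // Copies the decoded chars; the in-class version assigns the array directly.
        return new String(ca, 0, cb.position());
    }
}
```

A caller would typically create one decoder per thread, for example with Charset.forName("UTF-8").newDecoder(), and pass it to every call.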
JUSTIFICATION :
Converting bytes to strings is one of the most common operations in I/O-bound applications, and without this constructor it is inefficient.
Also note that if I used a custom UTF-8 converter that I wrote myself rather than the default one, I might get much faster performance (I will write up another test case for this and send it later).
In a server-side environment this is the most frequently performed operation, so it should be as efficient as possible. I have noticed that many open-source servers, such as the Caucho Resin server, have gone to extra lengths to make this as fast as possible by working around the current bottleneck in the java.lang.String class.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
A String constructor that takes a CharsetDecoder as an argument, plus further optimizations for UTF-8 strings, which are a common case.
ACTUAL -
Currently, a charset name string is passed as the argument. Internally this leads to a thread-local lookup and a string comparison before the conversion is actually performed. For small strings, this overhead seems to be greater than the cost of the conversion itself.
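One way to avoid the per-call name lookup without a decoder argument is to resolve the Charset once and pass the Charset object itself. The sketch below assumes Java 6 or later, where String has constructors that accept a java.nio.charset.Charset (the form that the related issue JDK-6402819 refers to as String(a, charset)). The class CachedCharsetDecode and the method fromUtf8 are hypothetical names, and no claim is made here about how its speed compares with the name-based constructor.

```java
import java.nio.charset.Charset;

// Hypothetical helper: resolve the charset once, reuse it for every conversion.
class CachedCharsetDecode {

    // Looked up a single time when the class is initialized.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Uses the Charset-accepting String constructor (Java 6+).
    // Malformed input is replaced with the charset's default replacement string.
    static String fromUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, UTF8);
    }
}
```

Unlike a shared CharsetDecoder, a Charset object is safe to use from multiple threads, so the constant can be shared freely.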
(Incident Review ID: 240188)
======================================================================
###@###.### 2004-03-01
- relates to
JDK-6402819 String(a, charset) slower than String(a, charsetname) (Resolved)
JDK-6393232 (spec) String methods which take Charset should specify behaviour for invalid bytes (Resolved)
JDK-6400767 Method for binary data in strings (Closed)