FULL PRODUCT VERSION :
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)
and
Java(TM) SE Runtime Environment (build 1.7.0-ea-b38)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b05, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Linux 2.6.27.3 #1 SMP Sat Oct 25 10:15:42 EEST 2008 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ AuthenticAMD GNU/Linux
A DESCRIPTION OF THE PROBLEM :
byte[] String.getBytes(Charset charset) is slower than byte[] String.getBytes(String encoding).
It should not be, because the latter has to do extra work (looking up the charset by name) compared to the former.
There are two reasons why the Charset version is slower:
1) java.lang.StringCoding.encode(Charset, ...) needlessly copies the char[]
   - This causes a slowdown of 4-7% for small strings. The slowdown grows for large strings.
2) java.lang.StringCoding.encode(Charset, ...) always creates a new StringEncoder
   - Creating a new StringEncoder is slower than reusing the thread-local cached one when the same charset is used repeatedly.
Suggested replacement code for Java6 and Java5:
static byte[] encode(Charset cs, char[] ca, int off, int len) {
    StringEncoder se = deref(encoder);
    if (se == null || se.cs != cs) {
        se = new StringEncoder(cs, cs.name());
        set(encoder, se);
    }
    return se.encode(ca, off, len);
}
I did not use cs.equals(se.cs) because I think Charset instances are cached, so it is not easily possible to create two distinct Charset instances with the same name.
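The same caching idea can be sketched outside the JDK using only public API. This is an illustration, not the proposed patch: StringCoding and StringEncoder are JDK internals, so the sketch substitutes a ThreadLocal holding a java.nio.charset.CharsetEncoder. The class name CachedEncoderSketch and the REPLACE error actions are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical public-API analogue of the suggested StringCoding change.
class CachedEncoderSketch {
    // One cached encoder per thread; replaced whenever the charset changes.
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            new ThreadLocal<CharsetEncoder>();

    static byte[] encode(Charset cs, char[] ca, int off, int len) {
        CharsetEncoder ce = ENCODER.get();
        // Identity comparison, mirroring "se.cs != cs" in the report.
        if (ce == null || ce.charset() != cs) {
            ce = cs.newEncoder()
                   .onMalformedInput(CodingErrorAction.REPLACE)
                   .onUnmappableCharacter(CodingErrorAction.REPLACE);
            ENCODER.set(ce);
        }
        try {
            // encode(CharBuffer) resets the encoder first, so reuse is safe.
            ByteBuffer bb = ce.encode(CharBuffer.wrap(ca, off, len));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrown unchecked.
            throw new IllegalStateException(e);
        }
    }
}
```

If the identity check ever misses (two distinct Charset instances for the same name), the only cost is creating a new encoder, so the result stays correct either way.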
Tests:
T1: Repeated encoding of 4-character strings in ISO-8859-1.
T2: Encoding of 4-character strings, alternating between ISO-8859-1 and UTF-8 so that there are always 2 calls with each encoding before switching.
M1 = getBytes(String)
M2 = getBytes(Charset)
M3 = getBytes(Charset) with attached modifications
Results for Java6 (time relative to M1; lower is faster):
      M1     M2     M3
T1    1.000  1.792  1.000
T2    1.000  1.073  0.902
Results for Java7 (time relative to M1; lower is faster):
      M1     M2     M3
T1    1.000  1.792  1.042
T2    1.000  0.974  0.872
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compare the speed of String.getBytes(Charset) with that of String.getBytes(String).
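A rough way to make this comparison is sketched below. It is not the benchmark used for the results in this report; the class name GetBytesBench, the iteration count and the warm-up scheme are all assumptions, and a simple timing loop like this gives only an approximate ratio.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical micro-benchmark for test T1 (repeated ISO-8859-1 encoding).
class GetBytesBench {
    // Written after every run so the JIT cannot discard the encoding work.
    static volatile long sink;

    // Nanoseconds spent on iters calls of String.getBytes(String).
    static long timeByName(String s, String name, int iters) {
        long total = 0;
        long start = System.nanoTime();
        try {
            for (int i = 0; i < iters; i++) {
                total += s.getBytes(name).length;
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        long elapsed = System.nanoTime() - start;
        sink = total;
        return elapsed;
    }

    // Nanoseconds spent on iters calls of String.getBytes(Charset).
    static long timeByCharset(String s, Charset cs, int iters) {
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            total += s.getBytes(cs).length;
        }
        long elapsed = System.nanoTime() - start;
        sink = total;
        return elapsed;
    }

    public static void main(String[] args) {
        String s = "abcd";
        Charset latin1 = Charset.forName("ISO-8859-1");
        int iters = 5000000;
        // Warm-up rounds so both paths are compiled before measuring.
        for (int r = 0; r < 5; r++) {
            timeByName(s, latin1.name(), iters);
            timeByCharset(s, latin1, iters);
        }
        long byName = timeByName(s, latin1.name(), iters);
        long byCharset = timeByCharset(s, latin1, iters);
        System.out.println("getBytes(Charset) / getBytes(String) = "
                + (double) byCharset / byName);
    }
}
```

A printed ratio above 1.0 means the Charset overload was slower in that run. Results will vary with the JVM version and hardware.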
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
String.getBytes(Charset) should be as fast as or faster than String.getBytes(String)
ACTUAL -
String.getBytes(Charset) is up to 80% slower than String.getBytes(String)
REPRODUCIBILITY :
This bug can be reproduced always.
CUSTOMER SUBMITTED WORKAROUND :
Use String.getBytes(charset.name()) instead of String.getBytes(charset)
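The workaround can be wrapped in a small helper so that callers keep a Charset-based signature. This is a minimal sketch; the class and method names are hypothetical. Because String.getBytes(String) declares UnsupportedEncodingException, the helper has to catch it, although it should not occur for a name taken from an existing Charset.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper applying the workaround from this report.
class GetBytesWorkaround {
    static byte[] getBytes(String s, Charset cs) {
        try {
            // Go through the name-based overload, which uses the cached encoder.
            return s.getBytes(cs.name());
        } catch (UnsupportedEncodingException e) {
            // Not expected: the name comes from a Charset that already exists.
            throw new IllegalStateException(e);
        }
    }
}
```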