-
Bug
-
Resolution: Fixed
-
P2
-
1.1.1
-
1.2alpha
-
sparc
-
solaris_2.5
-
Not verified
The JLS 20.12.10 (page 535) describes the java.lang.String.hashCode method in great detail. Unfortunately, it doesn't describe what the code actually does. The JLS says that for short strings the hashCode is (excuse the made up syntax for integer exponentiation):
((c[0] * 37^0) + (c[1] * 37^1) + (c[2] * 37^2) + ...)
and something similar, but sampling, for long strings. In fact, the code we have been shipping since 1994 is
public int hashCode() {
int h = 0;
int off = offset;
char val[] = value;
int len = count;
if (len < 16) {
for (int i = len ; i > 0; i--) {
h = (h * 37) + val[off++];
}
} else {
// only sample some characters
int skip = len / 8;
for (int i = len ; i > 0; i -= skip, off += skip) {
h = (h * 39) + val[off];
}
}
return h;
}
in which the exponentiation has been strength-reduced to multiplication, but in the process is applied to the characters of the string in reverse order.
Here's a test program that shows the difference:
% cat java/HashTest.java
public class HashTest {
public static void main(String[] argv) {
String one = new String("\u0001");
int oneHash = one.hashCode();
System.out.println("oneHash: " + Integer.toString(oneHash));
int oneJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))));
System.out.println("((1 * 37^0)): " +
Integer.toString(oneJLS));
String two = new String("\u0001\u0002");
int twoHash = two.hashCode();
System.out.println("twoHash: " + Integer.toString(twoHash));
int twoJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))) +
(((int) '\u0002') * ((int) Math.pow(37.0, 1))));
System.out.println("((1 * 37^0) + (2 * 37^1)): " +
Integer.toString(twoJLS));
String three = new String("\u0001\u0002\u0003");
int threeHash = three.hashCode();
System.out.println("threeHash: " + Integer.toString(threeHash));
int threeJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))) +
(((int) '\u0002') * ((int) Math.pow(37.0, 1))) +
(((int) '\u0003') * ((int) Math.pow(37.0, 2))));
System.out.println("((1 * 37^0) + (2 * 37^1) + (3 * 37^2)): " +
Integer.toString(threeJLS));
}
}
% $VM12/build/bin/java -classpath classes:$VM12/build/classes HashTest
oneHash: 1
((1 * 37^0)): 1
twoHash: 39
((1 * 37^0) + (2 * 37^1)): 75
threeHash: 1446
((1 * 37^0) + (2 * 37^1) + (3 * 37^2)): 4182
You can see that it's off even by the two-character string.
Usually I side with the JLS over the code, but in this case I think the JLS was trying to describe what the code was doing but got it wrong. I don't think we can change the code, since there will be a lot of codes out there that are depending on the unchanging computation of the String.hashCode method (I think of all those serialized hash tables which would be confused if they were deserialized in a VM whose String.hashCode method was different).
Either way, something needs to be changed, and quickly.
peter.kessler@Eng 1997-04-16
It turns out that there's another, unrelated bug in the spec: there is an error in the formulae for calculating the increment (k) and maximum value (m)
that are used in the sigma expression for the "sampled hash" that is used
if n > 15. As a result, the specified expression references charcters
that lie out of bounds, and would cause runtime exceptions if correctly
implemented.
joshua.bloch@Eng 1997-04-24
((c[0] * 37^0) + (c[1] * 37^1) + (c[2] * 37^2) + ...)
and something similar, but sampling, for long strings. In fact, the code we have been shipping since 1994 is
public int hashCode() {
int h = 0;
int off = offset;
char val[] = value;
int len = count;
if (len < 16) {
for (int i = len ; i > 0; i--) {
h = (h * 37) + val[off++];
}
} else {
// only sample some characters
int skip = len / 8;
for (int i = len ; i > 0; i -= skip, off += skip) {
h = (h * 39) + val[off];
}
}
return h;
}
in which the exponentiation has been strength-reduced to multiplication, but in the process is applied to the characters of the string in reverse order.
Here's a test program that shows the difference:
% cat java/HashTest.java
public class HashTest {
public static void main(String[] argv) {
String one = new String("\u0001");
int oneHash = one.hashCode();
System.out.println("oneHash: " + Integer.toString(oneHash));
int oneJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))));
System.out.println("((1 * 37^0)): " +
Integer.toString(oneJLS));
String two = new String("\u0001\u0002");
int twoHash = two.hashCode();
System.out.println("twoHash: " + Integer.toString(twoHash));
int twoJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))) +
(((int) '\u0002') * ((int) Math.pow(37.0, 1))));
System.out.println("((1 * 37^0) + (2 * 37^1)): " +
Integer.toString(twoJLS));
String three = new String("\u0001\u0002\u0003");
int threeHash = three.hashCode();
System.out.println("threeHash: " + Integer.toString(threeHash));
int threeJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))) +
(((int) '\u0002') * ((int) Math.pow(37.0, 1))) +
(((int) '\u0003') * ((int) Math.pow(37.0, 2))));
System.out.println("((1 * 37^0) + (2 * 37^1) + (3 * 37^2)): " +
Integer.toString(threeJLS));
}
}
% $VM12/build/bin/java -classpath classes:$VM12/build/classes HashTest
oneHash: 1
((1 * 37^0)): 1
twoHash: 39
((1 * 37^0) + (2 * 37^1)): 75
threeHash: 1446
((1 * 37^0) + (2 * 37^1) + (3 * 37^2)): 4182
You can see that it's off even by the two-character string.
Usually I side with the JLS over the code, but in this case I think the JLS was trying to describe what the code was doing but got it wrong. I don't think we can change the code, since there will be a lot of codes out there that are depending on the unchanging computation of the String.hashCode method (I think of all those serialized hash tables which would be confused if they were deserialized in a VM whose String.hashCode method was different).
Either way, something needs to be changed, and quickly.
peter.kessler@Eng 1997-04-16
It turns out that there's another, unrelated bug in the spec: there is an error in the formulae for calculating the increment (k) and maximum value (m)
that are used in the sigma expression for the "sampled hash" that is used
if n > 15. As a result, the specified expression references charcters
that lie out of bounds, and would cause runtime exceptions if correctly
implemented.
joshua.bloch@Eng 1997-04-24