Bug
Resolution: Fixed
P3
7
b06
generic
generic
Not verified
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-2215449 | 8 | Vladimir Kozlov | P3 | Resolved | Fixed | b08 |
JDK-2214903 | 7u2 | Vladimir Kozlov | P3 | Closed | Fixed | b08 |
A performance regression of approximately 12.7% was observed in jdk7b144 compared with jdk6u25 (553 MB/s vs. 623 MB/s). The
benchmark comes from the Hadoop Common community and is a pure Java CRC32 implementation of update(); I changed the
test to report scores only for the 65536-byte buffer size.
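The report does not include the benchmark driver. The following is only a sketch of how an MB/s score for a single 65536-byte buffer might be measured. The class name CrcThroughput, the iteration counts, the warm-up loop and the 1 MB = 2^20 bytes convention are all assumptions. It times java.util.zip.CRC32 only because that class ships with the JDK. Any java.util.zip.Checksum implementation can be passed instead. Recent HotSpot builds typically compile CRC32 to an intrinsic, so its numbers are not comparable with a pure Java implementation. Hand-rolled timing loops are also sensitive to JIT warm-up, and a harness such as JMH is the more reliable choice for comparing JDK builds.

```java
import java.util.Random;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class CrcThroughput {
    // Buffer size the reporter restricted the test to.
    static final int SIZE = 65536;
    // Consumes checksum results so the timed work has an observable effect.
    static volatile long sink;

    /** Checksums buf {@code iterations} times; returns MB/s, assuming 1 MB = 2^20 bytes. */
    static double mbPerSecond(Checksum sum, byte[] buf, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sum.reset();
            sum.update(buf, 0, buf.length);
            sink ^= sum.getValue();
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        double megabytes = (double) buf.length * iterations / (1024.0 * 1024.0);
        return megabytes / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[SIZE];
        new Random(42).nextBytes(buf);
        Checksum crc = new CRC32();
        // Untimed warm-up rounds so the JIT can compile update() before measuring.
        for (int round = 0; round < 5; round++) {
            mbPerSecond(crc, buf, 200);
        }
        System.out.printf("%.1f MB/s%n", mbPerSecond(crc, buf, 2000));
    }
}
```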
The method has a while loop. When I compared the generated code for jdk7b144 with jdk6u25, I saw that jdk6 unrolled the loop
and jdk7 did not, so I set -XX:LoopUnrollLimit=0 for both to put them on a common footing. This didn't change jdk7 at all
(as expected) and dropped jdk6's score to 601 MB/s, narrowing the gap to 8.7%. Then I tried -XX:-UseLoopPredicate on its own:
jdk6 dropped slightly to 610 MB/s and jdk7 rose slightly to 572 MB/s, a 6.6% gap. Combining -XX:LoopUnrollLimit=0 and
-XX:-UseLoopPredicate gives 528 MB/s for jdk6 and 542 MB/s for jdk7, so jdk7 is now the faster of the two, which is a little
confusing to me.
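The 8.7% and 6.6% figures match expressing each gap relative to the jdk7 score, i.e. (jdk6 - jdk7) / jdk7. The small check below recomputes every gap from the reported scores under that convention. The class and method names ScoreGaps and relativeGap are hypothetical. By the same convention, the default-flags run (623 vs. 553 MB/s) works out to about 12.7%.

```java
public class ScoreGaps {
    /** Percentage by which the jdk6u25 score exceeds the jdk7b144 score, relative to jdk7b144. */
    static double relativeGap(double jdk6, double jdk7) {
        return (jdk6 - jdk7) / jdk7 * 100.0;
    }

    public static void main(String[] args) {
        String[] runs = {"default flags", "-XX:LoopUnrollLimit=0", "-XX:-UseLoopPredicate", "both flags"};
        // {jdk6u25, jdk7b144} scores in MB/s, copied from the report.
        double[][] scores = {{623, 553}, {601, 553}, {610, 572}, {528, 542}};
        for (int i = 0; i < runs.length; i++) {
            System.out.printf("%s: %+.1f%%%n", runs[i], relativeGap(scores[i][0], scores[i][1]));
        }
    }
}
```

Running it prints +12.7%, +8.7%, +6.6% and -2.6% for the four configurations. The last value is negative because jdk7b144 is ahead once both loop unrolling and loop predication are disabled.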
I have attached the generated code (Solaris Studio print) for both.
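Loop unrolling replicates a loop body so that each trip through the loop does the work of several original iterations, which cuts down on loop-control overhead. C2 performs this on its intermediate representation, not on Java source, and -XX:LoopUnrollLimit bounds the node count of loop bodies it is willing to unroll. A value of 0 therefore turns unrolling off. The source-level analogy below is only an illustration, and the names UnrollAnalogy, rolled and unrolledBy2 are hypothetical. It applies a 2x unroll by hand to a byte-at-a-time CRC-32 loop shaped like the tail loop of update(). The table is the standard reflected CRC-32 table for polynomial 0xEDB88320.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class UnrollAnalogy {
    // Standard reflected CRC-32 table (polynomial 0xEDB88320); plays the role of T8_0.
    static final int[] T = new int[256];

    static {
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int bit = 0; bit < 8; bit++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            T[i] = c;
        }
    }

    /** Rolled loop: one byte and one loop test per iteration, like the tail loop of update(). */
    static int rolled(int crc, byte[] b, int off, int len) {
        while (len > 0) {
            crc = (crc >>> 8) ^ T[(crc ^ b[off++]) & 0xff];
            len--;
        }
        return crc;
    }

    /** Same work unrolled by two: half as many loop tests, plus a cleanup step for odd lengths. */
    static int unrolledBy2(int crc, byte[] b, int off, int len) {
        while (len > 1) {
            crc = (crc >>> 8) ^ T[(crc ^ b[off]) & 0xff];
            crc = (crc >>> 8) ^ T[(crc ^ b[off + 1]) & 0xff];
            off += 2;
            len -= 2;
        }
        if (len > 0) {
            crc = (crc >>> 8) ^ T[(crc ^ b[off]) & 0xff];
        }
        return crc;
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // CRC-32 starts from all ones and inverts the final value.
        long viaRolled = ~rolled(-1, data, 0, data.length) & 0xffffffffL;
        long viaUnrolled = ~unrolledBy2(-1, data, 0, data.length) & 0xffffffffL;
        CRC32 ref = new CRC32();
        ref.update(data, 0, data.length);
        System.out.printf("%08x %08x %08x%n", viaRolled, viaUnrolled, ref.getValue());
    }
}
```

Running main prints cbf43926 three times, which is the standard CRC-32 check value for "123456789". When unrolling is allowed, C2 typically derives an unrolled main loop from rolled-style code on its own, with separate pre- and post-loops handling the leftover iterations.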
Here's the source of the benchmarked update() method:
```java
// crc and the lookup tables T8_0 .. T8_7 are fields of the enclosing class (not shown).
public void update(byte[] b, int off, int len) {
    // Main loop: consumes 8 bytes per iteration via eight table lookups.
    while (len > 7) {
        int c0 = b[off++] ^ crc;
        int c1 = b[off++] ^ (crc >>>= 8);
        int c2 = b[off++] ^ (crc >>>= 8);
        int c3 = b[off++] ^ (crc >>>= 8);
        crc = (T8_7[c0 & 0xff] ^ T8_6[c1 & 0xff])
            ^ (T8_5[c2 & 0xff] ^ T8_4[c3 & 0xff]);

        crc ^= (T8_3[b[off++] & 0xff] ^ T8_2[b[off++] & 0xff])
             ^ (T8_1[b[off++] & 0xff] ^ T8_0[b[off++] & 0xff]);

        len -= 8;
    }
    // Tail loop: remaining 0..7 bytes, one table lookup each.
    while (len > 0) {
        crc = (crc >>> 8) ^ T8_0[(crc ^ b[off++]) & 0xff];
        len--;
    }
}
```
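The excerpt depends on the fields crc and T8_0 through T8_7, which are not shown. The sketch below makes it runnable under an assumption about how those fields are defined, which we believe matches Hadoop's PureJavaCrc32. The assumption is that crc holds the bit-inverted CRC-32 and that T8_k[i] is the CRC-32 contribution of byte i followed by k zero bytes (a "slicing-by-8" layout). Here the tables are computed at class-initialization time rather than hard-coded. The class name SlicingBy8Crc32 is hypothetical. The body of update() is unchanged from the report, and main checks the result against java.util.zip.CRC32.

```java
import java.util.Random;
import java.util.zip.CRC32;

public class SlicingBy8Crc32 {
    // T8[k][i] = CRC-32 contribution of byte i followed by k zero bytes
    // (reflected polynomial 0xEDB88320); assumed to match Hadoop's table layout.
    private static final int[][] T8 = new int[8][256];

    static {
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int bit = 0; bit < 8; bit++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            T8[0][i] = c; // classic byte-at-a-time table
        }
        for (int k = 1; k < 8; k++) {
            for (int i = 0; i < 256; i++) {
                int prev = T8[k - 1][i];
                // Push the previous entry through one more zero byte.
                T8[k][i] = (prev >>> 8) ^ T8[0][prev & 0xff];
            }
        }
    }

    // Named aliases so update() reads exactly as in the report.
    private static final int[] T8_0 = T8[0], T8_1 = T8[1], T8_2 = T8[2], T8_3 = T8[3];
    private static final int[] T8_4 = T8[4], T8_5 = T8[5], T8_6 = T8[6], T8_7 = T8[7];

    // Running CRC kept bit-inverted: starts at all ones, inverted again by getValue().
    private int crc = 0xffffffff;

    public void reset() {
        crc = 0xffffffff;
    }

    public long getValue() {
        return (~crc) & 0xffffffffL;
    }

    public void update(byte[] b, int off, int len) {
        while (len > 7) { // 8 bytes per iteration
            int c0 = b[off++] ^ crc;
            int c1 = b[off++] ^ (crc >>>= 8);
            int c2 = b[off++] ^ (crc >>>= 8);
            int c3 = b[off++] ^ (crc >>>= 8);
            crc = (T8_7[c0 & 0xff] ^ T8_6[c1 & 0xff])
                ^ (T8_5[c2 & 0xff] ^ T8_4[c3 & 0xff]);

            crc ^= (T8_3[b[off++] & 0xff] ^ T8_2[b[off++] & 0xff])
                 ^ (T8_1[b[off++] & 0xff] ^ T8_0[b[off++] & 0xff]);

            len -= 8;
        }
        while (len > 0) { // remaining 0..7 bytes
            crc = (crc >>> 8) ^ T8_0[(crc ^ b[off++]) & 0xff];
            len--;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[65536];
        new Random(1).nextBytes(data);
        SlicingBy8Crc32 mine = new SlicingBy8Crc32();
        mine.update(data, 0, data.length);
        CRC32 ref = new CRC32();
        ref.update(data, 0, data.length);
        System.out.println(mine.getValue() == ref.getValue()); // expected: true
    }
}
```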
Matching sections: line 2b0 onwards in jdk6u25.txt and line e0 onwards in jdk7b144.txt.
Backported by:
- JDK-2215449 No loop unrolling done in jdk7b144 for a test update() while loop (Resolved)
- JDK-2214903 No loop unrolling done in jdk7b144 for a test update() while loop (Closed)

Relates to:
- JDK-5091921 Sign flip issues in loop optimizer (Closed)