Details
-
Enhancement
-
Resolution: Fixed
-
P4
-
17, 20
-
b05
-
x86_64
Backports
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8310197 | 17.0.9-oracle | Tobias Hartmann | P4 | Resolved | Fixed | b01 |
JDK-8308190 | 17.0.8 | Andrei Pangin | P4 | Resolved | Fixed | b03 |
Description
When comparing performance of JDK built-in CRC32C intrinsics with the native implementation, we noticed that JDK performs notably worse on smaller array sizes (up to 1024 bytes).
The main reason is that the algorithm based on the "Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction" paper [1], processes input buffer in chunks of 6144, 2064, and 648 bytes [2], and the rest of the buffer is processed sequentially with 4-byte variant of CRC32 instruction [3].
Experiments show that CRC32C computation performance can be improved for smaller inputs without performance impact for larger inputs, if the smallest chunk size is reduced from 648 to 216 bytes. Actually, the original paper [1] suggests the algorithms CRC_216 / CRC_240 for processing blocks of 216 and 240 bytes respectively. The value 648 might have resulted from an "x3" mistake (the algorithm accumulates 3 checksums per iteration).
Additionally, when processing the tail (less than 216 bytes), we can use 8-byte variant of CRC32 instruction instead of 4-byte variant to speed up the algorithm even further.
[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
[2] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/crc32c.h#L29-L50
[3] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8231-L8236
The proposed PR includes the following changes:
1) CRC32C_LOW is decreased from 8*27 to 8*9 (which corresponds to 216 bytes).
2) CRC32C_MIDDLE is decreased from 8*86 to 8*74 to compensate for performance drop for mid-size arrays. The value 74 has been chosen empirically. According to experiments, values from 70 to 75 yield the best results, while with higher or lower values, there is a degradation for certain array sizes.
3) On x86_64, CRC32 r32, [m32] instruction is replaced with CRC32 r64, [m64] in the tail loop.
4) Unconditional JMP is replaced with a conditional jump at the end of the loop.
5) The hot tail loop is aligned.
Notes:
- The changes are only for x86 architecture.
- If the hardware supports AVX-512 extensions + VPCLMULQDQ instruction (introduced with Ice Lake), another version of CRC32C intrinsics is generated, and the changes do not have effect.
=== Performance ===
Performance was validated with the attached JMH benchmark: CrcBench.java
Tested hardware:
- Intel Xeon Platinum 8259CL
- Intel Xeon Platinum 8124M
- Intel Core i7-1280P
- AMD EPYC 7251
In general, for arrays smaller than 648 bytes, the new version demonstrated 1.5x - 3.5x higher throughput.
For arrays larger than 32 KB, performance remained the same.
For mid-size arrays, the new version appeared better overall, but a minor performance drop (<5%) was observed on certain array sizes, e.g. 724 bytes. This is explained by the reduction of the algorithm's smallest chunk size from 648 to 216 bytes.
See the attached table for details.
The main reason is that the algorithm based on the "Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction" paper [1], processes input buffer in chunks of 6144, 2064, and 648 bytes [2], and the rest of the buffer is processed sequentially with 4-byte variant of CRC32 instruction [3].
Experiments show that CRC32C computation performance can be improved for smaller inputs without performance impact for larger inputs, if the smallest chunk size is reduced from 648 to 216 bytes. Actually, the original paper [1] suggests the algorithms CRC_216 / CRC_240 for processing blocks of 216 and 240 bytes respectively. The value 648 might have resulted from an "x3" mistake (the algorithm accumulates 3 checksums per iteration).
Additionally, when processing the tail (less than 216 bytes), we can use 8-byte variant of CRC32 instruction instead of 4-byte variant to speed up the algorithm even further.
[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
[2] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/crc32c.h#L29-L50
[3] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8231-L8236
The proposed PR includes the following changes:
1) CRC32C_LOW is decreased from 8*27 to 8*9 (which corresponds to 216 bytes).
2) CRC32C_MIDDLE is decreased from 8*86 to 8*74 to compensate for performance drop for mid-size arrays. The value 74 has been chosen empirically. According to experiments, values from 70 to 75 yield the best results, while with higher or lower values, there is a degradation for certain array sizes.
3) On x86_64, CRC32 r32, [m32] instruction is replaced with CRC32 r64, [m64] in the tail loop.
4) Unconditional JMP is replaced with a conditional jump at the end of the loop.
5) The hot tail loop is aligned.
Notes:
- The changes are only for x86 architecture.
- If the hardware supports AVX-512 extensions + VPCLMULQDQ instruction (introduced with Ice Lake), another version of CRC32C intrinsics is generated, and the changes do not have effect.
=== Performance ===
Performance was validated with the attached JMH benchmark: CrcBench.java
Tested hardware:
- Intel Xeon Platinum 8259CL
- Intel Xeon Platinum 8124M
- Intel Core i7-1280P
- AMD EPYC 7251
In general, for arrays smaller than 648 bytes, the new version demonstrated 1.5x - 3.5x higher throughput.
For arrays larger than 32 KB, performance remained the same.
For mid-size arrays, the new version appeared better overall, but a minor performance drop (<5%) was observed on certain array sizes, e.g. 724 bytes. This is explained by the reduction of the algorithm's smallest chunk size from 648 to 216 bytes.
See the attached table for details.
Attachments
Issue Links
- backported by
-
JDK-8308190 Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs
- Resolved
-
JDK-8310197 Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs
- Resolved
- links to
-
Commit openjdk/jdk17u-dev/e55bea42
-
Commit openjdk/jdk/8c70bf3f
-
Review openjdk/jdk17u-dev/1345
-
Review openjdk/jdk/11838
(1 links to)