Loading...

XML

Word

Printable

Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version/s: 21
Affects Version/s: 17, 20
Component/s: hotspot
Labels:

Subcomponent:
compiler
Resolved In Build:
b05
CPU:

x86_64

Issue	Fix Version	Assignee	Priority	Status	Resolution	Resolved In Build
JDK-8310197	17.0.9-oracle	Tobias Hartmann	P4	Resolved	Fixed	b01
JDK-8308190	17.0.8	Andrei Pangin	P4	Resolved	Fixed	b03

When comparing performance of JDK built-in CRC32C intrinsics with the native implementation, we noticed that JDK performs notably worse on smaller array sizes (up to 1024 bytes).

The main reason is that the algorithm based on the "Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction" paper [1], processes input buffer in chunks of 6144, 2064, and 648 bytes [2], and the rest of the buffer is processed sequentially with 4-byte variant of CRC32 instruction [3].

Experiments show that CRC32C computation performance can be improved for smaller inputs without performance impact for larger inputs, if the smallest chunk size is reduced from 648 to 216 bytes. Actually, the original paper [1] suggests the algorithms CRC_216 / CRC_240 for processing blocks of 216 and 240 bytes respectively. The value 648 might have resulted from an "x3" mistake (the algorithm accumulates 3 checksums per iteration).

Additionally, when processing the tail (less than 216 bytes), we can use 8-byte variant of CRC32 instruction instead of 4-byte variant to speed up the algorithm even further.

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
[2] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/crc32c.h#L29-L50
[3] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8231-L8236

The proposed PR includes the following changes:

1) CRC32C_LOW is decreased from 8*27 to 8*9 (which corresponds to 216 bytes).
2) CRC32C_MIDDLE is decreased from 8*86 to 8*74 to compensate for performance drop for mid-size arrays. The value 74 has been chosen empirically. According to experiments, values from 70 to 75 yield the best results, while with higher or lower values, there is a degradation for certain array sizes.
3) On x86_64, CRC32 r32, [m32] instruction is replaced with CRC32 r64, [m64] in the tail loop.
4) Unconditional JMP is replaced with a conditional jump at the end of the loop.
5) The hot tail loop is aligned.

Notes:
- The changes are only for x86 architecture.
- If the hardware supports AVX-512 extensions + VPCLMULQDQ instruction (introduced with Ice Lake), another version of CRC32C intrinsics is generated, and the changes do not have effect.

=== Performance ===

Performance was validated with the attached JMH benchmark: CrcBench.java

Tested hardware:
- Intel Xeon Platinum 8259CL
- Intel Xeon Platinum 8124M
- Intel Core i7-1280P
- AMD EPYC 7251

In general, for arrays smaller than 648 bytes, the new version demonstrated 1.5x - 3.5x higher throughput.
For arrays larger than 32 KB, performance remained the same.
For mid-size arrays, the new version appeared better overall, but a minor performance drop (<5%) was observed on certain array sizes, e.g. 724 bytes. This is explained by the reduction of the algorithm's smallest chunk size from 648 to 216 bytes.

See the attached table for details.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

crc32c-perf-curve.png
87 kB
2023-01-05 07:07
crc32c-perf-graph.png
36 kB
2023-01-05 07:07
crc32c-perf-table.png
74 kB
2023-01-05 07:07
CrcBench.java
1 kB
2023-01-03 18:22

backported by

JDK-8308190 Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs

Resolved

JDK-8310197 Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs

Resolved

links to

Commit openjdk/jdk17u-dev/e55bea42

Commit openjdk/jdk/8c70bf3f

Review openjdk/jdk17u-dev/1345

Review openjdk/jdk/11838

(1 links to)

Assignee:: Andrei Pangin
Reporter:: Andrei Pangin
Votes:: 0 Vote for this issue
Watchers:: 6 Start watching this issue

Created:: 2023-01-03 06:24
Updated:: 2023-07-04 23:30
Resolved:: 2023-01-06 11:05

Details

Backports

Description

Attachments

Attachments

Issue Links

Activity

People

Dates