Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8299544

Improve performance of CRC32C intrinsics (non-AVX-512) for small inputs

XMLWordPrintable

    • b05
    • x86_64

        When comparing performance of JDK built-in CRC32C intrinsics with the native implementation, we noticed that JDK performs notably worse on smaller array sizes (up to 1024 bytes).

        The main reason is that the algorithm based on the "Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction" paper [1], processes input buffer in chunks of 6144, 2064, and 648 bytes [2], and the rest of the buffer is processed sequentially with 4-byte variant of CRC32 instruction [3].

        Experiments show that CRC32C computation performance can be improved for smaller inputs without performance impact for larger inputs, if the smallest chunk size is reduced from 648 to 216 bytes. Actually, the original paper [1] suggests the algorithms CRC_216 / CRC_240 for processing blocks of 216 and 240 bytes respectively. The value 648 might have resulted from an "x3" mistake (the algorithm accumulates 3 checksums per iteration).

        Additionally, when processing the tail (less than 216 bytes), we can use 8-byte variant of CRC32 instruction instead of 4-byte variant to speed up the algorithm even further.

        [1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
        [2] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/crc32c.h#L29-L50
        [3] https://github.com/openjdk/jdk/blob/92dfc735f2297441a99b3e39464fb8f77a354d55/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L8231-L8236

        The proposed PR includes the following changes:

        1) CRC32C_LOW is decreased from 8*27 to 8*9 (which corresponds to 216 bytes).
        2) CRC32C_MIDDLE is decreased from 8*86 to 8*74 to compensate for performance drop for mid-size arrays. The value 74 has been chosen empirically. According to experiments, values from 70 to 75 yield the best results, while with higher or lower values, there is a degradation for certain array sizes.
        3) On x86_64, CRC32 r32, [m32] instruction is replaced with CRC32 r64, [m64] in the tail loop.
        4) Unconditional JMP is replaced with a conditional jump at the end of the loop.
        5) The hot tail loop is aligned.

        Notes:
         - The changes are only for x86 architecture.
         - If the hardware supports AVX-512 extensions + VPCLMULQDQ instruction (introduced with Ice Lake), another version of CRC32C intrinsics is generated, and the changes do not have effect.

        === Performance ===

        Performance was validated with the attached JMH benchmark: CrcBench.java

        Tested hardware:
         - Intel Xeon Platinum 8259CL
         - Intel Xeon Platinum 8124M
         - Intel Core i7-1280P
         - AMD EPYC 7251

        In general, for arrays smaller than 648 bytes, the new version demonstrated 1.5x - 3.5x higher throughput.
        For arrays larger than 32 KB, performance remained the same.
        For mid-size arrays, the new version appeared better overall, but a minor performance drop (<5%) was observed on certain array sizes, e.g. 724 bytes. This is explained by the reduction of the algorithm's smallest chunk size from 648 to 216 bytes.

        See the attached table for details.

          1. CrcBench.java
            1 kB
          2. crc32c-perf-curve.png
            crc32c-perf-curve.png
            87 kB
          3. crc32c-perf-graph.png
            crc32c-perf-graph.png
            36 kB
          4. crc32c-perf-table.png
            crc32c-perf-table.png
            74 kB

              apangin Andrei Pangin
              apangin Andrei Pangin
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Created:
                Updated:
                Resolved: