Loading...

Type: Enhancement
Resolution: Unresolved
Priority: P3
Fix Version/s: tbd
Affects Version/s: 21, 22
Component/s: hotspot
Labels:

Subcomponent:
compiler
CPU:

aarch64
OS:

generic

As described in the thread below, the issue is with 2-element reductions. They are currently not deemed profitable, and may not be profitable on some platforms.

The reported example first assumed that the issue was with bitCount, but it is really the issue with 2-element reductions not being vectorized. We have a 2-element reduction because of 16 byte on aarch64 machine, and long-vectors can thus only have 2 elements.

Relates to JDK-8307516, and may be fixed with JDK-8340093.

---------------------- Original Description -------------------------

Contrary to intuition Integer::bitcount performs better that Long::bitcount on the same underlying data on AArch64.

Summary

The issue is that Long::bitcount does not vectorized, while Integer::bitcount uses NEON SIMD instructions to operate on groups of 4 32-bit ints at a time. This is a counterintuitive, as one would expect striding with a wider bit-width to be more performant.

Linux x64 shows no such issue - performance of int and long bit counts is identical.

Background

We ran into this anomaly when implementing hamming distance between two bit vectors. The values for each bit vector are packed into a byte[], where each bit represents a feature of the vector. We then chose to view the vector data as long, via MethodHandles::byteArrayViewVarHandle(long[].class, ..), in order to perform an xor between the vectors and count the set bits. Surprisingly, implementing the same but using MethodHandles::byteArrayViewVarHandle(int[].class, ..) performs significantly better on ARM.

There may be more going on between our implementation of hamming distance and plain use of bitCount, but it's likely that this is the root cause for the performance differential that we see. Additionally, it is useful to analyse the plain use of bitCount independently (of our hamming distance impl).

Long::bitCount - unrolling appears to almost "just" stamp out the non-unrolled loop body 8 times. So, one unrolled iteration operates on 8 64-bit long values at a time, which covers 512 bits per loop iteration.

Integer::bitCount - unrolling vectorizes. So, one unrolled iteration operates on 32 32-bit int values at a time - effectively loading 4 32-bit int values into a single 128-bit register. And it does this 8 times, which covers 1024 bits per loop iteration.

The int variant vectorizes (uses NEON SIMD) when it unrolls, whereas the long variant does not. The result is that the unrolled int variant processes twice as many values per iteration of the unrolled loop, when compared to that of the long variant.

Trivial benchmark:

public class BitCountBenchmark {
    @Param({ "100000" })
    int times = 100_000;

    @Param({ "1024" })
    int size = 1024;

    long[][] la;
    int[][] ia;

    @Setup
    public void setup() {
        Random rand = new Random();
        this.ia = new int[times][size];
        this.la = new long[times][size / 2];
        for (int i = 0; i < times; i++) {
            for (int j = 0; j < size / 2; j++) {
                int x = rand.nextInt();
                int y = rand.nextInt();
                ia[i][j * 2] = x;
                ia[i][j * 2 + 1] = y;
                la[i][j] = (((long)x) << 32) | (y & 0xffffffffL);
            }
        }

        if (bitCountLong() != bitCountInt()) {
            throw new AssertionError();
        }
    }

    @Benchmark
    public int bitCountLong() {
        int tot = 0;
        for (int i = 0; i < times; i++) {
            tot += longBitCount(la[i]);
        }
        return tot;
    }

    @Benchmark
    public int bitCountInt() {
        int tot = 0;
        for (int i = 0; i < times; i++) {
            tot += intBitCount(ia[i]);
        }
        return tot;
    }

    static int longBitCount(long[] la) {
        int distance = 0;
        for (int i = 0; i < la.length; i++) {
            distance += Long.bitCount(la[i]);
        }
        return distance;
    }

    static int intBitCount(int[] ia) {
        int distance = 0;
        for (int i = 0; i < ia.length; i++) {
            distance += Integer.bitCount(ia[i]);
        }
        return distance;
    }

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

BitCountBenchmark.java
3 kB
2024-07-09 06:01
BitCount.java
23 kB
2024-07-09 08:04

relates to

JDK-8307516 C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction

Open

JDK-8340093 C2 SuperWord: implement cost model

Open

JDK-8357530 C2 SuperWord: Diagnostic flag AutoVectorizationOverrideProfitability

Resolved

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates