-
Enhancement
-
Resolution: Unresolved
-
P3
-
21, 22
-
aarch64
-
generic
Contrary to intuition Integer::bitcount performs better that Long::bitcount on the same underlying data on AArch64.
Summary
The issue is that Long::bitcount does not vectorized, while Integer::bitcount uses NEON SIMD instructions to operate on groups of 4 32-bit ints at a time. This is a counterintuitive, as one would expect striding with a wider bit-width to be more performant.
Linux x64 shows no such issue - performance of int and long bit counts is identical.
Background
We ran into this anomaly when implementing hamming distance between two bit vectors. The values for each bit vector are packed into a byte[], where each bit represents a feature of the vector. We then chose to view the vector data as long, via MethodHandles::byteArrayViewVarHandle(long[].class, ..), in order to perform an xor between the vectors and count the set bits. Surprisingly, implementing the same but using MethodHandles::byteArrayViewVarHandle(int[].class, ..) performs significantly better on ARM.
There may be more going on between our implementation of hamming distance and plain use of bitCount, but it's likely that this is the root cause for the performance differential that we see. Additionally, it is useful to analyse the plain use of bitCount independently (of our hamming distance impl).
Long::bitCount - unrolling appears to almost "just" stamp out the non-unrolled loop body 8 times. So, one unrolled iteration operates on 8 64-bit long values at a time, which covers 512 bits per loop iteration.
Integer::bitCount - unrolling vectorizes. So, one unrolled iteration operates on 32 32-bit int values at a time - effectively loading 4 32-bit int values into a single 128-bit register. And it does this 8 times, which covers 1024 bits per loop iteration.
The int variant vectorizes (uses NEON SIMD) when it unrolls, whereas the long variant does not. The result is that the unrolled int variant processes twice as many values per iteration of the unrolled loop, when compared to that of the long variant.
Trivial benchmark:
public class BitCountBenchmark {
@Param({ "100000" })
int times = 100_000;
@Param({ "1024" })
int size = 1024;
long[][] la;
int[][] ia;
@Setup
public void setup() {
Random rand = new Random();
this.ia = new int[times][size];
this.la = new long[times][size / 2];
for (int i = 0; i < times; i++) {
for (int j = 0; j < size / 2; j++) {
int x = rand.nextInt();
int y = rand.nextInt();
ia[i][j * 2] = x;
ia[i][j * 2 + 1] = y;
la[i][j] = (((long)x) << 32) | (y & 0xffffffffL);
}
}
if (bitCountLong() != bitCountInt()) {
throw new AssertionError();
}
}
@Benchmark
public int bitCountLong() {
int tot = 0;
for (int i = 0; i < times; i++) {
tot += longBitCount(la[i]);
}
return tot;
}
@Benchmark
public int bitCountInt() {
int tot = 0;
for (int i = 0; i < times; i++) {
tot += intBitCount(ia[i]);
}
return tot;
}
static int longBitCount(long[] la) {
int distance = 0;
for (int i = 0; i < la.length; i++) {
distance += Long.bitCount(la[i]);
}
return distance;
}
static int intBitCount(int[] ia) {
int distance = 0;
for (int i = 0; i < ia.length; i++) {
distance += Integer.bitCount(ia[i]);
}
return distance;
}
Summary
The issue is that Long::bitcount does not vectorized, while Integer::bitcount uses NEON SIMD instructions to operate on groups of 4 32-bit ints at a time. This is a counterintuitive, as one would expect striding with a wider bit-width to be more performant.
Linux x64 shows no such issue - performance of int and long bit counts is identical.
Background
We ran into this anomaly when implementing hamming distance between two bit vectors. The values for each bit vector are packed into a byte[], where each bit represents a feature of the vector. We then chose to view the vector data as long, via MethodHandles::byteArrayViewVarHandle(long[].class, ..), in order to perform an xor between the vectors and count the set bits. Surprisingly, implementing the same but using MethodHandles::byteArrayViewVarHandle(int[].class, ..) performs significantly better on ARM.
There may be more going on between our implementation of hamming distance and plain use of bitCount, but it's likely that this is the root cause for the performance differential that we see. Additionally, it is useful to analyse the plain use of bitCount independently (of our hamming distance impl).
Long::bitCount - unrolling appears to almost "just" stamp out the non-unrolled loop body 8 times. So, one unrolled iteration operates on 8 64-bit long values at a time, which covers 512 bits per loop iteration.
Integer::bitCount - unrolling vectorizes. So, one unrolled iteration operates on 32 32-bit int values at a time - effectively loading 4 32-bit int values into a single 128-bit register. And it does this 8 times, which covers 1024 bits per loop iteration.
The int variant vectorizes (uses NEON SIMD) when it unrolls, whereas the long variant does not. The result is that the unrolled int variant processes twice as many values per iteration of the unrolled loop, when compared to that of the long variant.
Trivial benchmark:
public class BitCountBenchmark {
@Param({ "100000" })
int times = 100_000;
@Param({ "1024" })
int size = 1024;
long[][] la;
int[][] ia;
@Setup
public void setup() {
Random rand = new Random();
this.ia = new int[times][size];
this.la = new long[times][size / 2];
for (int i = 0; i < times; i++) {
for (int j = 0; j < size / 2; j++) {
int x = rand.nextInt();
int y = rand.nextInt();
ia[i][j * 2] = x;
ia[i][j * 2 + 1] = y;
la[i][j] = (((long)x) << 32) | (y & 0xffffffffL);
}
}
if (bitCountLong() != bitCountInt()) {
throw new AssertionError();
}
}
@Benchmark
public int bitCountLong() {
int tot = 0;
for (int i = 0; i < times; i++) {
tot += longBitCount(la[i]);
}
return tot;
}
@Benchmark
public int bitCountInt() {
int tot = 0;
for (int i = 0; i < times; i++) {
tot += intBitCount(ia[i]);
}
return tot;
}
static int longBitCount(long[] la) {
int distance = 0;
for (int i = 0; i < la.length; i++) {
distance += Long.bitCount(la[i]);
}
return distance;
}
static int intBitCount(int[] ia) {
int distance = 0;
for (int i = 0; i < ia.length; i++) {
distance += Integer.bitCount(ia[i]);
}
return distance;
}