Type: Enhancement
Resolution: Unresolved
Priority: P3
Affects Version/s: 11, 17, 18
CPU: x86
If you have a dumb benchmark like this:
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayCopyBench {
    @Param({"1024"})
    int size;

    byte[] pad;
    long[] source, destination;

    @Setup(Level.Iteration)
    public void setUp() {
        Random r = new Random(42);
        // Randomly-sized padding shifts the alignment of the arrays allocated below
        pad = new byte[r.nextInt(1024)];
        source = new long[size];
        destination = new long[size];
        for (int i = 0; i < size; ++i) {
            source[i] = r.nextInt();
        }
        // Promote all the arrays
        System.gc();
    }

    @Benchmark
    public void arraycopy() {
        System.arraycopy(source, 0, destination, 0, size);
    }
}
Run it with JDK 9b107 on an i7-4790K @ 4.0 GHz, Linux x86_64, and you will see that performance fluctuates a lot:
# Warmup Iteration 1: 351.178 ns/op
# Warmup Iteration 2: 385.568 ns/op
# Warmup Iteration 3: 366.771 ns/op
# Warmup Iteration 4: 341.570 ns/op
# Warmup Iteration 5: 420.488 ns/op
Iteration 1: 309.817 ns/op
Iteration 2: 346.652 ns/op
Iteration 3: 408.156 ns/op
Iteration 4: 343.857 ns/op
Iteration 5: 137.810 ns/op
Iteration 6: 283.327 ns/op
Iteration 7: 356.355 ns/op
Iteration 8: 319.256 ns/op
Iteration 9: 136.157 ns/op
Iteration 10: 302.372 ns/op
Iteration 11: 299.792 ns/op
Iteration 12: 389.018 ns/op
Iteration 13: 329.284 ns/op
Iteration 14: 142.508 ns/op
Iteration 15: 297.566 ns/op
Since every run executes the same generated code, which ends up calling jlong_disjoint_arraycopy, and the hottest piece of code is the AVX-assisted copy:
1.90% 0.69% │ ↗ │││ 0x00007feb44f11a70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
36.10% 36.21% │ │ │││ 0x00007feb44f11a76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
10.28% 11.38% │ │ │││ 0x00007feb44f11a7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
29.87% 26.29% │ │ │││ 0x00007feb44f11a82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
15.40% 18.50% ↘ │ │││ 0x00007feb44f11a88: add $0x8,%rdx
╰ │││ 0x00007feb44f11a8c: jle Stub::jlong_disjoint_arraycopy+48 0x00007feb44f11a70
...the suspicion obviously falls on data alignment.
This work aims to substantially improve arraycopy performance and reduce its run-to-run variance on x86_64.
The key things the new code does:
*) Aligns the destination address to perform aligned stores: the cost of misaligned stores, especially of wide AVX stores, is horrible;
*) Uses the widest copy available, employing basically all the vector registers we have for the bulk copy: this minimizes some ill effects from dependencies and forwarding;
*) Does the tail copies with the widest operations possible: this optimizes copy lengths that are not exactly a power of 2.
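The three steps above can be sketched in C. This is a hypothetical scalar model (the function name copy_aligned_sketch is made up), not the actual HotSpot stub, which unrolls over ymm/zmm registers in assembly; here a "wide" operation is modeled as a 4-word (32-byte) group of stores:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the copy strategy for disjoint long[] arrays:
 * align destination, bulk-copy wide, finish with one wide overlapping tail. */
void copy_aligned_sketch(uint64_t *dst, const uint64_t *src, size_t n) {
    size_t i = 0;

    /* 1) Peel leading elements until the destination is 32-byte aligned,
     *    so the bulk loop performs aligned wide stores. */
    while (i < n && ((uintptr_t)(dst + i) & 31) != 0) {
        dst[i] = src[i];
        i++;
    }

    /* 2) Bulk copy in the widest chunks (4 words = 32 bytes here; the real
     *    stubs unroll over all available vector registers). */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* 3) Tail: instead of a scalar loop, redo the last full-width chunk as
     *    one wide, possibly overlapping copy -- safe because the arrays are
     *    disjoint -- covering lengths that are not a multiple of the width. */
    if (i < n) {
        if (n >= 4) {
            size_t j = n - 4;
            dst[j]     = src[j];
            dst[j + 1] = src[j + 1];
            dst[j + 2] = src[j + 2];
            dst[j + 3] = src[j + 3];
        } else {
            for (; i < n; i++) {
                dst[i] = src[i];
            }
        }
    }
}
```

The overlapping tail in step 3 rewrites a few already-copied words with the same values, which is harmless for disjoint arrays and cheaper than a scalar cleanup loop.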
relates to:
- JDK-8310159 Bulk copy with Unsafe::arrayCopy is slower compared to memcpy (Resolved)
- JDK-8277893 Arraycopy stress tests (Resolved)
- JDK-8279621 x86_64 arraycopy stubs should use 256-bit copies with AVX=1 (Closed)
- JDK-8149758 Small arraycopy of non-constant length is slower than individual load/stores (Open)