-
Enhancement
-
Resolution: Fixed
-
P4
-
25
-
b25
-
riscv
-
linux
Currently, the array filling stub uses a loop to handle fills of fewer than 8 bytes:
```
  // Handle copies less than 8 bytes.
  Label L_loop1, L_loop2, L_exit2;
  __ bind(L_fill_elements);
  __ beqz(count, L_exit2);
  switch (t) {
    case T_BYTE:
      __ bind(L_loop1);
      __ sb(value, Address(to, 0));
      __ addi(to, to, 1);
      __ subiw(count, count, 1);
      __ bnez(count, L_loop1);
      break;
    case T_SHORT:
      __ bind(L_loop2);
      __ sh(value, Address(to, 0));
      __ addi(to, to, 2);
      __ subiw(count, count, 2 >> shift);
      __ bnez(count, L_loop2);
      break;
    case T_INT:
      __ sw(value, Address(to, 0));
      break;
    default: ShouldNotReachHere();
  }
```
We can eliminate the loop for the T_BYTE and T_SHORT cases by unrolling the sb and sh stores.
We observed performance gains for small byte array fills, while the short and int results are essentially unchanged:
Before:
Benchmark                 (size)  Mode  Cnt   Score   Error  Units
ArrayFill.fillByteArray        7  avgt   12  27.036 ± 0.061  ns/op
ArrayFill.fillIntArray         7  avgt   12  28.628 ± 0.013  ns/op
ArrayFill.fillShortArray       7  avgt   12  30.775 ± 0.008  ns/op
ArrayFill.zeroByteArray        7  avgt   12  27.076 ± 0.013  ns/op
ArrayFill.zeroIntArray         7  avgt   12  28.624 ± 0.003  ns/op
ArrayFill.zeroShortArray       7  avgt   12  30.776 ± 0.009  ns/op
After:
Benchmark                 (size)  Mode  Cnt   Score   Error  Units
ArrayFill.fillByteArray        7  avgt   12  19.347 ± 0.079  ns/op
ArrayFill.fillIntArray         7  avgt   12  28.639 ± 0.012  ns/op
ArrayFill.fillShortArray       7  avgt   12  30.777 ± 0.015  ns/op
ArrayFill.zeroByteArray        7  avgt   12  19.646 ± 0.599  ns/op
ArrayFill.zeroIntArray         7  avgt   12  28.631 ± 0.008  ns/op
ArrayFill.zeroShortArray       7  avgt   12  30.780 ± 0.009  ns/op
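The unrolling idea described above can be illustrated in plain C++. This is only a sketch of the control flow, not the stub code from the actual change: the function names `fill_tail_bytes_loop` and `fill_tail_bytes_unrolled` are hypothetical, and a C++ compiler may transform either version on its own, so no performance claim is implied. The loop version mirrors the current T_BYTE case (store, advance, decrement, branch back); the unrolled version replaces the backward branch with a straight-line sequence of stores, each followed by a forward exit check.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical reference version: mirrors the current T_BYTE loop
// (sb, addi, subiw, bnez back to the loop head).
inline void fill_tail_bytes_loop(std::uint8_t* to, std::uint8_t value, int count) {
  assert(count >= 0 && count < 8);
  while (count != 0) {
    *to++ = value;
    --count;
  }
}

// Hypothetical unrolled version: at most 7 byte stores, no backward branch.
// Each "if (...) return" plays the role of a forward branch to L_exit2.
inline void fill_tail_bytes_unrolled(std::uint8_t* to, std::uint8_t value, int count) {
  assert(count >= 0 && count < 8);
  if (count == 0) return;  // corresponds to beqz(count, L_exit2)
  to[0] = value; if (count == 1) return;
  to[1] = value; if (count == 2) return;
  to[2] = value; if (count == 3) return;
  to[3] = value; if (count == 4) return;
  to[4] = value; if (count == 5) return;
  to[5] = value; if (count == 6) return;
  to[6] = value;           // count == 7, the largest tail below 8 bytes
}
```

The same pattern would apply to the T_SHORT case with at most three 2-byte stores, since fewer than 8 bytes leaves at most three short elements.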
- links to
-
Commit(master) openjdk/jdk/78d0dc75
-
Review(master) openjdk/jdk/25350