I did some experiments comparing the performance of memory segment bulk operations against plain Java loops. Here are some results (unscientific benchmark attached):
FILL
Benchmark Mode Cnt Score Error Units
BulkOps.segment_fill avgt 10 119323.358 ± 3484.991 ns/op
BulkOps.segment_fill_int_loop avgt 10 2055700.828 ± 101325.298 ns/op
BulkOps.segment_fill_long_loop avgt 10 47875.953 ± 1727.711 ns/op
COPY
Benchmark Mode Cnt Score Error Units
BulkOps.segment_copy_static avgt 10 86283.631 ± 4562.169 ns/op
BulkOps.segment_copy_static_int_loop avgt 10 82480.038 ± 3476.123 ns/op
BulkOps.segment_copy_static_long_loop avgt 10 78929.262 ± 2100.533 ns/op
BulkOps.segment_copy_static_small avgt 10 4.346 ± 0.037 ns/op
BulkOps.segment_copy_static_small_int_loop avgt 10 5.110 ± 0.055 ns/op
BulkOps.segment_copy_static_small_long_loop avgt 10 4.208 ± 0.026 ns/op
MISMATCH
Benchmark Mode Cnt Score Error Units
BulkOps.mismatch_large_segment avgt 10 38011.887 ± 2219.403 ns/op
BulkOps.mismatch_large_segment_int_loop avgt 10 778412.959 ± 11380.481 ns/op
BulkOps.mismatch_large_segment_long_loop avgt 10 283515.423 ± 7737.791 ns/op
BulkOps.mismatch_small_segment avgt 10 2.719 ± 0.097 ns/op
BulkOps.mismatch_small_segment_int_loop avgt 10 2.963 ± 0.030 ns/op
BulkOps.mismatch_small_segment_long_loop avgt 10 2.892 ± 0.011 ns/op
Overall, great progress. I think we're close to being able to just use plain loops for these routines in the memory segment implementation classes (and maybe even in ByteBuffer).
One notable hiccup is that loops using int induction variables are still significantly slower than those using long induction variables.
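The benchmark source is attached rather than inlined above, but for context, the two induction-variable shapes being compared presumably look something like the following sketch (class and method names are mine, not the benchmark's):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class FillLoops {

    // int induction variable: currently optimizes noticeably worse in C2
    // than the long variant (note the implicit int-to-long widening on
    // every comparison and offset computation)
    static void fillIntLoop(MemorySegment seg, byte value) {
        for (int i = 0; i < seg.byteSize(); i++) {
            seg.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    // long induction variable: same logical loop, much better codegen today
    static void fillLongLoop(MemorySegment seg, byte value) {
        for (long i = 0; i < seg.byteSize(); i++) {
            seg.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(1024);
            fillLongLoop(seg, (byte) 42);
            System.out.println(seg.get(ValueLayout.JAVA_BYTE, 1023)); // 42
        }
    }
}
```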
Another issue (but this one is known) is that the intrinsic for mismatch is still faster than a loop -- this is due to limitations with autovectorization and control flow (as mismatch needs to branch out of the loop if a mismatch is detected).
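To make the control-flow issue concrete, here is a minimal sketch of what a plain-loop equivalent of MemorySegment.mismatch looks like (my own code, not the benchmark's): the early exit in the loop body is exactly the branch that autovectorization currently can't handle.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class MismatchLoop {

    // Plain-loop equivalent of MemorySegment.mismatch: returns the offset of
    // the first differing byte, or -1 if the segments are identical.
    static long mismatch(MemorySegment a, MemorySegment b) {
        long len = Math.min(a.byteSize(), b.byteSize());
        for (long i = 0; i < len; i++) {
            // This early exit introduces control flow into the loop body,
            // which currently prevents C2 autovectorization; the intrinsic
            // sidesteps the limitation with hand-written vector code.
            if (a.get(ValueLayout.JAVA_BYTE, i) != b.get(ValueLayout.JAVA_BYTE, i)) {
                return i;
            }
        }
        return a.byteSize() == b.byteSize() ? -1 : len;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment a = arena.allocate(8); // Arena allocations are zeroed
            MemorySegment b = arena.allocate(8);
            b.set(ValueLayout.JAVA_BYTE, 5, (byte) 1);
            System.out.println(mismatch(a, b)); // first difference at offset 5
        }
    }
}
```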
- relates to JDK-8331659: C2 SuperWord: investigate failed vectorization in compiler/loopopts/superword/TestMemorySegment.java (Closed)