Enhancement | Resolution: Fixed | P3 | 8, 11, 15, 16 | b25
If you look at how the current CASE(_new) is structured, you will notice an odd thing:
  if (ik->is_initialized() && ik->can_be_fastpath_allocated()) {
    size_t obj_size = ik->size_helper();
    oop result = NULL;
    if (UseTLAB) {
      result = (oop) THREAD->tlab().allocate(obj_size);
    }
    if (result == NULL) {
      // allocate from inline contiguous alloc
    }
    if (result != NULL) {
      // initialize the object and return
    }
  }
}
// Slow case allocation
CALL_VM(InterpreterRuntime::_new(THREAD, METHOD->constants(), index),
        handle_exception);
// return
The oddity here is: when the TLAB is depleted and rejects the allocation, we fall through to the inline contiguous alloc block, which allocates the object in shared eden. That allocation is likely to succeed, and then we return from this path. But the TLAB would never get replenished! To do that, we need to hit the slowpath allocation in InterpreterRuntime::_new, let it enter the runtime, and ask the GC for a new TLAB!
So in the end, when +UseTLAB is enabled for Zero, the code only uses the very first issued TLAB, and then always falls through to inline contiguous allocation until eden is completely depleted. The inline contiguous block performs a CAS increment on the shared eden top, which can be heavily contended under allocation-heavy workloads.
I have observed this when supplying +UseTLAB to my ad-hoc Zero runs -- it was still slow. I think we can just remove the inline contiguous allocation block and let the whole thing slide to the slowpath on failure. This would also resolve the issue of enabling Zero for GCs that do not support inline contiguous allocs (anything beyond Serial and Parallel).
Sample allocation benchmark:
- original: 302 +- 5 ns/op
- original +UseTLAB: 291 +- 5 ns/op
- patched +UseTLAB: 233 +- 3 ns/op
blocks:
- JDK-8256497 Zero: enable G1 and Shenandoah GCs (Resolved)
- JDK-8256499 Zero: enable Epsilon GC (Resolved)
relates to:
- JDK-8255782 Turn UseTLAB and ResizeTLAB from product_pd to product, defaulting to "true" (Resolved)