- Bug
- Resolution: Fixed
- P3
- repo-shenandoah
It seems there is a race that results in deadlock or livelock over the ShenandoahControlThread::_regulator_lock.
This needs debugging. One possibility is that, on rare occasions, the V() (signal) operation is performed before the corresponding P() (wait) operation, so the P() operation is never released. In one 20-minute execution of an Extremem workload, the last heuristics request was accepted at time 559.076s, and this was 2.261s after sleeping 1ms following the previous invocation of regulator_sleep(), which occurred at time 557.979s. This very long delay was presumably a combination of the delay on this particular trigger and the delay on the previous trigger, which would have caused this invocation of regulator_sleep() to begin late. After this, no more heuristics requests were accepted during the remaining 700s of execution. Instead, we limped along, repeatedly ignoring heuristics requests until we experienced allocation failures, at which point we would perform degenerated or full GCs.
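To make the suspected V()-before-P() ordering concrete, here is a minimal sketch of the general lost-wakeup pattern and its usual remedy. It is not the ShenandoahControlThread code: `WakeupChannel`, `post()`, `wait_for_post()`, and `early_post_is_observed()` are hypothetical names, and standard C++ primitives stand in for HotSpot's Monitor. The assumption is that a waiter which blocks without checking recorded state can miss a notification issued earlier, whereas a waiter that checks a flag under the lock cannot.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a wake-up channel between a requesting thread and
// a waiting thread. Illustration only; not the actual HotSpot implementation.
class WakeupChannel {
 public:
  // "V": record the wake-up under the lock, then notify any waiter.
  void post() {
    {
      std::lock_guard<std::mutex> guard(lock_);
      pending_ = true;  // The state survives even if nobody is waiting yet.
    }
    cv_.notify_one();
  }

  // "P": block until a wake-up has been recorded or the timeout expires.
  // Returns true if a wake-up was consumed, false on timeout.
  bool wait_for_post(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> guard(lock_);
    // The predicate is checked before blocking, so a post() that ran earlier
    // is observed immediately instead of being lost.
    bool woke = cv_.wait_for(guard, timeout, [this] { return pending_; });
    if (woke) {
      pending_ = false;
    }
    return woke;
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  bool pending_ = false;
};

// Issues the "V" strictly before the "P" and reports whether it was observed.
bool early_post_is_observed() {
  WakeupChannel channel;
  channel.post();
  return channel.wait_for_post(std::chrono::milliseconds(50));
}
```

If the waiter instead called a bare wait with no flag, the early notification in `early_post_is_observed()` would have no effect and the waiter would block until some later notification arrived, which would be consistent with the stall described above if that is indeed the cause.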
We have since made improvements to the GenShen implementation that seem to make this issue even more difficult to reproduce. It may be easiest to reproduce using the following commit on the make-instantaneous-alloc-rate-trigger-quicker branch of openjdk/shenandoah.
commit d6bc97ebb9b88047bbf012bcc79330bba9b727b2 (HEAD -> make-instantaneous-alloc-rate-trigger-quicker, origin/make-instantaneous-alloc-rate-trigger-quicker)
Author: Kelvin Nilsen <kdnilsen@amazon.com>
Date: Wed Sep 27 13:30:37 2023 +0000
Checkpoint this code so we can pursue suspected deadlock JBS issue
- links to
- Review openjdk/shenandoah/332