With JDK-8350441, ZGC got infrastructure to prefer allocations to end up on a specific NUMA node. When a new object is allocated, it is preferably placed on the NUMA node that is local to the allocating thread. This strategy improves access speeds for mutators working on that object, as long as it continues to be used by threads on the same NUMA node. However, when relocating objects, ZGC may move (migrate) objects away from the NUMA node they were originally allocated on. This means that if a page is selected as part of the Relocation Set, its objects could end up on another NUMA node, breaking the NUMA locality we strived for when allocating.
As a follow-up to JDK-8350441, we should consider adding NUMA-awareness to ZGC's relocation phase. NUMA-awareness here consists of two main features:
First, GC threads should strive to preserve the NUMA locality of the objects they relocate, meaning that objects should ideally be relocated to a page that is on the same NUMA node as the source page.
Mutator threads should take a different approach: we know that the mutator relocating an object is also going to access it, so we migrate the object to the NUMA node associated with the relocating thread. This strategy is already used in upstream/mainline and requires no code changes (specifically, ZObjectAllocator already tracks per-CPU Small pages). Medium pages, however, are shared between CPUs and thus carry no guarantee about which NUMA node they are on. At the same time, mutator relocation of objects in Medium pages is uncommon, so there is little to gain from introducing NUMA-awareness to that operation. It can instead be addressed as a follow-up.
Secondly, when a GC thread chooses a page to relocate objects from, it should choose page(s) that are local to its own NUMA node, to speed up performance by working on local memory. There are multiple ways this could be achieved, but the main goal should be to (1) start working on pages that are local to the GC thread's NUMA node, and (2) when finished with pages on its own NUMA node, start working on (help out with) pages associated with other NUMA nodes.
There are many considerations with the above approach. Some key observations:
* The NUMA node associated with the GC thread should be polled/checked at regular intervals, to account for the fact that the GC thread might have migrated to another CPU, and in turn another NUMA node. It is probably enough to check the associated NUMA node before claiming a new page and starting to relocate objects.
* Also, by no longer starting with the entries/pages with the least live bytes, but instead ordering by the NUMA node a page is associated with, we might not relocate the objects with the least live bytes first. This is really only a problem if the machine is fully saturated and there are allocation stalls. It is also worth mentioning that in a common NUMA configuration, accessing remote memory takes twice as long as accessing local memory. This means that a local page could be relocated roughly twice as fast, which could release memory faster than starting with a remote page that has fewer live bytes.
* The strategy is an optimization for mutators and might make the GC take a bit longer to complete the relocation phase. The current strategy is to move objects to the NUMA node associated with the GC thread, regardless of where the object was originally allocated. This improves GC speed, at the potential cost of mutators no longer accessing local memory. However, since ZGC is a concurrent garbage collector, a somewhat longer relocation phase is not a huge issue if the mutators receive a speedup in return.
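For the polling point above, a minimal sketch of checking the worker's current NUMA node just before claiming a page (assuming a Linux host; the cpu-to-node table is a hypothetical stand-in for what HotSpot's OS layer would provide, e.g. via libnuma's numa_node_of_cpu):

```cpp
#include <cassert>
#include <cstdint>
#include <sched.h>   // sched_getcpu() (Linux-specific)
#include <vector>

// Sketch of re-checking the worker's NUMA node at page-claim granularity.
// sched_getcpu() is a cheap vDSO-backed call on Linux, so querying it once
// per claimed page adds negligible overhead. The thread may of course
// migrate right after the check, but per-page granularity is likely
// accurate enough for a locality heuristic.
uint32_t current_numa_node(const std::vector<uint32_t>& cpu_to_node) {
  int cpu = sched_getcpu();  // CPU this thread is running on right now
  assert(cpu >= 0 && (size_t)cpu < cpu_to_node.size());
  return cpu_to_node[(size_t)cpu];  // node that CPU belongs to
}
```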