I suspect that the pools have outlived their usefulness, and first (possibly naive) performance measurements show they can actually harm performance. They probably stem from a time when malloc was very expensive, especially in multi-threaded environments; that is usually not the case today.
---
In hotspot's Arena code, we pool arena chunks: when an Arena is done, its chunks are returned to a pool for possible reuse by a different Arena. Arena chunks themselves are allocated via malloc/free.
Every five seconds the arena pools are trimmed by a dedicated task, so we cache Arena chunks for up to five seconds before returning them to the libc.
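For illustration, here is a minimal sketch of that scheme. This is not HotSpot code; the class and function names are made up, and real pools keep separate free lists per chunk size, but the shape is the same: one global lock, a free list, and a periodic trim.

// Simplified sketch of the chunk pooling scheme (not HotSpot code).
// ChunkPool, Chunk and the function names are made up for illustration.
#include <cstdlib>
#include <mutex>

struct Chunk {
  Chunk* next;
  size_t size;
};

class ChunkPool {
  std::mutex _lock;            // stand-in for the global ThreadCritical lock
  Chunk*     _free_list = nullptr;

public:
  // An Arena needs a chunk: reuse a pooled one if available, otherwise malloc.
  Chunk* acquire(size_t payload_size) {
    {
      std::lock_guard<std::mutex> g(_lock);
      if (_free_list != nullptr) {
        Chunk* c = _free_list;
        _free_list = c->next;
        return c;
      }
    }
    Chunk* c = static_cast<Chunk*>(std::malloc(sizeof(Chunk) + payload_size));
    if (c != nullptr) c->size = payload_size;
    return c;
  }

  // An Arena is destroyed: its chunks go back into the pool, not to the libc.
  void release(Chunk* c) {
    std::lock_guard<std::mutex> g(_lock);
    c->next = _free_list;
    _free_list = c;
  }

  // Run periodically (in HotSpot: every few seconds) to hand cached chunks
  // back to the libc.
  void trim() {
    std::lock_guard<std::mutex> g(_lock);
    while (_free_list != nullptr) {
      Chunk* c = _free_list;
      _free_list = c->next;
      std::free(c);
    }
  }
};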
The chunk management causes headaches with NMT, since chunks can be accounted to different flags at different points in their life. We have some strange workarounds for that (see JDK-8325890). It would be nice to be able to shed that complexity.
When we return chunks to the libc via free, there is a high likelihood that the libc will also cache the memory for us, for use by subsequent mallocs. The extent to which that happens depends on the libc implementation, but it is very clearly the case with glibc at least.
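A crude way to observe that caching is the sketch below. The pointer-equality check is only a heuristic that happens to hold on glibc; nothing here is guaranteed by the libc contract.

// Crude demo: on glibc, a freed block of a given size is typically handed
// straight back by the next malloc of the same size - the libc caches it.
// This is implementation-specific behavior, not a guarantee.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
  void* a = std::malloc(64 * 1024);
  void* guard = std::malloc(64 * 1024);  // keeps 'a' from merging into the heap top
  uintptr_t a_addr = reinterpret_cast<uintptr_t>(a);
  std::free(a);
  void* b = std::malloc(64 * 1024);      // likely receives the cached block again
  std::printf("freed block at %#zx, next malloc returned %p (%s)\n",
              static_cast<size_t>(a_addr), b,
              (reinterpret_cast<uintptr_t>(b) == a_addr) ? "reused" : "not reused");
  std::free(b);
  std::free(guard);
  return 0;
}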
I can think of two reasons why we cache Arena chunks:
1) to mitigate malloc() call costs when re-acquiring chunks after an idle period. The best use case is the compiler: at startup, many compilations run in quick succession, each one building up arena footprint, with short idle periods between one compilation finishing and the next one starting. The chunks are still in the pool, so no malloc is needed to re-acquire them for new Arenas.
2) to mitigate footprint issues when we run Arenas in many threads. The underlying libc may prefer thread-local allocations and hand out new memory instead of cached memory. Reusing chunks may make sense to keep the libc arena footprint low (see the sketch below).
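To illustrate point (2), here is a small glibc-specific sketch. malloc_stats() is a glibc extension, and the per-thread arena behavior is an assumption about glibc's allocator that will differ on other libcs.

// glibc-specific sketch: threads that malloc concurrently tend to get spread
// across several libc arenas, each of which retains its own cached memory.
// Memory freed into one thread's arena is not necessarily reused by another.
#include <malloc.h>   // malloc_stats() - glibc extension
#include <cstdlib>
#include <thread>
#include <vector>

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; i++) {
    threads.emplace_back([] {
      // Allocate and free repeatedly to provoke per-thread arena creation.
      for (int j = 0; j < 100000; j++) {
        void* p = std::malloc(32 * 1024);
        std::free(p);
      }
    });
  }
  for (auto& t : threads) t.join();
  // On glibc, this typically reports several arenas, each holding free memory.
  malloc_stats();
  return 0;
}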
However, the chunk pools come with costs, too.
- Access to them is synchronized with ThreadCritical. That is awful for performance: a simple test [1] (Mac M1) shows that re-acquiring a chunk from the pool in a multithreaded, contended scenario costs more than 4x as much as a malloc. I therefore believe that whatever benefit (1) brings is eaten up many times over by the ThreadCritical calls. (A standalone sketch of this kind of comparison follows after this list.)
Note that our habit of sprinkling code with ResourceMark can make this even worse: if your allocation requires another chunk, a nearby ResourceMark will cause that chunk to be released again immediately. You essentially play a very costly game of chunk ping-pong with the chunk pool.
BTW! We run through these ThreadCritical sections every time we start a thread! That is because each thread carries a ResourceArea, which proactively allocates a chunk on creation. So, starting 1000 threads will gift us 1000 ThreadCritical invocations.
- Opportunity costs: caching chunks in hotspot means they can only be reused for Arena allocations. Caching the memory at the libc level means every malloc in the process can reuse it.
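To make the cost comparison in the first bullet concrete, here is a rough standalone analogue of that kind of test. It is not the benchmark from [1]; the thread count, chunk size and iteration count are arbitrary, and absolute numbers will vary with platform and libc. It merely contrasts a free list behind one global lock with plain malloc/free under contention.

// Rough standalone analogue of the "pool with global lock vs. plain malloc"
// comparison: N threads repeatedly acquire and release a 32 KB block, either
// through a mutex-guarded free list or straight from the libc.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

static constexpr int    kThreads    = 8;
static constexpr int    kIterations = 200000;
static constexpr size_t kChunkSize  = 32 * 1024;

struct Node { Node* next; };

static std::mutex g_lock;              // stand-in for ThreadCritical
static Node*      g_free_list = nullptr;

static void* pool_acquire() {
  {
    std::lock_guard<std::mutex> g(g_lock);
    if (g_free_list != nullptr) {
      Node* n = g_free_list;
      g_free_list = n->next;
      return n;
    }
  }
  return std::malloc(kChunkSize);
}

static void pool_release(void* p) {
  std::lock_guard<std::mutex> g(g_lock);
  Node* n = static_cast<Node*>(p);
  n->next = g_free_list;
  g_free_list = n;
}

template <typename Acquire, typename Release>
static double run(Acquire acquire, Release release) {
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; t++) {
    threads.emplace_back([&] {
      for (int i = 0; i < kIterations; i++) {
        void* p = acquire();
        release(p);
      }
    });
  }
  for (auto& t : threads) t.join();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
  double pool_s   = run(pool_acquire, pool_release);   // leaks a few pooled chunks at exit; fine for a demo
  double malloc_s = run([] { return std::malloc(kChunkSize); },
                        [](void* p) { std::free(p); });
  std::printf("pool (global lock): %.3fs, plain malloc/free: %.3fs\n", pool_s, malloc_s);
  return 0;
}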
More tests are needed to investigate this, especially point (2) above.
[1] https://github.com/openjdk/jdk/compare/master...tstuefe:jdk:cost-of-arena-pools
Relates to: JDK-8325890 - NMT: Arena Chunk value correction causes multiple errors in reporting (Open)
Links to: Review(master) openjdk/jdk/20411