JDK-8350441

ZGC: Decouple memory from pages into a Mapped Cache


    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P4
    • Fix Version/s: tbd
    • Affects Version/s: None
    • Component: hotspot
    • Subcomponent: gc

      ZGC divides the heap into regions, called pages, of three size classes: Small, Medium and Large, which are allocated from the page allocator. From here on, "pages" refers to pages in ZGC, not to be confused with pages in the operating system (OS).

      Inside the page allocator, ZGC uses a page cache for caching pages whose memory is committed and mapped. This is a sound strategy given that the aim is to allocate pages as quickly as possible: taking an already prepared page from the cache is fast, since there is no need to repeat expensive OS operations.
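
      A minimal sketch (hypothetical code, not ZGC's implementation) of why this is fast: a cache hit hands back an already committed and mapped page without any OS interaction, while a miss forces the caller onto the slow path.

          #include <cstddef>
          #include <list>

          struct Page {
            size_t size;  // backed by committed, mapped memory
          };

          class PageCache {
            std::list<Page*> _pages;  // simplified; ZGC keeps per-size-class lists
          public:
            Page* take(size_t size) {
              for (auto it = _pages.begin(); it != _pages.end(); ++it) {
                if ((*it)->size == size) {
                  Page* page = *it;
                  _pages.erase(it);
                  return page;  // fast path: no commit/map needed
                }
              }
              return nullptr;   // miss: caller must commit and map new memory
            }

            void put(Page* page) { _pages.push_back(page); }
          };

          int main() {
            PageCache cache;
            Page page{2 * 1024 * 1024};
            cache.put(&page);                // return a page to the cache
            return cache.take(page.size) == &page ? 0 : 1;  // hit: same page back
          }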

      In a typical Java program, most allocations are small and end up on Small pages, which results in the page cache mostly containing Small pages. If (or rather, when) a larger allocation is attempted, a page large enough to satisfy it might not be available in the page cache. In this case, smaller pages are "flushed" from the cache so that their physical memory can be combined to satisfy the larger allocation; a runnable sketch of this remapping follows the list below. The process of "flushing" is not without issues, and the two main problems are:
          1. The virtual address space will likely become fragmented, as a new *contiguous* virtual address range must be allocated to satisfy the larger allocation. This issue becomes more pressing the longer a program runs.
          2. The process of moving memory around is not cheap: it requires unmapping and mapping memory, which results in TLB shootdowns.
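
      The sketch below (Linux-only, hypothetical, and heavily simplified compared to ZGC's backing store) makes the cost concrete: physical memory held in a memfd is unmapped from two scattered virtual ranges and remapped into one new contiguous range. Each munmap can trigger TLB shootdowns, and the fresh reservation consumes yet another virtual address range.

          // Requires Linux and glibc >= 2.27 for memfd_create(); error
          // checking omitted for brevity.
          #include <sys/mman.h>
          #include <unistd.h>
          #include <cstddef>
          #include <cstdio>

          int main() {
            const size_t granule = 2 * 1024 * 1024;  // 2MB, like ZGranuleSize
            int fd = memfd_create("fake-heap", 0);   // stand-in for physical memory
            ftruncate(fd, 2 * granule);

            // Two "Small pages" mapped at unrelated virtual addresses.
            char* a = (char*)mmap(nullptr, granule, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
            char* b = (char*)mmap(nullptr, granule, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, granule);
            a[0] = 1;
            b[0] = 2;

            // "Flush": unmap both pages (each munmap may shoot down TLBs) ...
            munmap(a, granule);
            munmap(b, granule);

            // ... then remap the same physical memory into a new, contiguous
            // virtual range, fragmenting the address space a little more.
            char* big = (char*)mmap(nullptr, 2 * granule, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
            printf("%d %d\n", big[0], big[granule]);  // "1 2": same physical memory
            munmap(big, 2 * granule);
            close(fd);
            return 0;
          }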

      To combat these two issues, we propose moving from the page cache to a mapped cache. The mapped cache does not cache pages; instead, it caches ranges of mapped memory. The main benefit of this is that adjacent virtual memory ranges can be merged to create larger mappings, allowing larger allocation requests to succeed without "flushing" the cache (see the sketch following the list below). With this new strategy we expect:
          1. To have the largest possible contiguous memory ranges available in the mapped cache, no longer limited by the abstraction of a page.
          2. To more often succeed with Medium allocations without having to "flush" the cache.
          3. Large pages to remain expensive to allocate, though that cost is typically dwarfed by costs outside the page allocator.
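
      As a minimal sketch of the core idea (hypothetical code, not the proposed implementation, and using std::map in place of the intended red-black tree), a mapped cache merges a newly inserted range with its neighbours whenever they are contiguous:

          #include <cstddef>
          #include <cstdint>
          #include <iterator>
          #include <map>

          class MappedCache {
            std::map<uintptr_t, size_t> _ranges;  // start address -> size
          public:
            void insert(uintptr_t start, size_t size) {
              auto next = _ranges.lower_bound(start);
              // Merge with the predecessor if it ends exactly where we begin.
              if (next != _ranges.begin()) {
                auto prev = std::prev(next);
                if (prev->first + prev->second == start) {
                  start = prev->first;
                  size += prev->second;
                  _ranges.erase(prev);
                }
              }
              // Merge with the successor if we end exactly where it begins.
              if (next != _ranges.end() && start + size == next->first) {
                size += next->second;
                _ranges.erase(next);
              }
              _ranges[start] = size;
            }

            // Remove a range of at least `size`, returning its start (0 if none).
            uintptr_t remove(size_t size) {
              for (auto it = _ranges.begin(); it != _ranges.end(); ++it) {
                if (it->second >= size) {
                  uintptr_t start = it->first;
                  size_t tail = it->second - size;
                  _ranges.erase(it);
                  if (tail > 0) {
                    insert(start + size, tail);  // give back the unused tail
                  }
                  return start;
                }
              }
              return 0;  // no fit: caller maps new memory instead of flushing
            }
          };

          int main() {
            MappedCache cache;
            cache.insert(0x200000, 0x200000);  // a cached 2MB range
            cache.insert(0x400000, 0x200000);  // adjacent range, merged to 4MB
            return cache.remove(0x400000) == 0x200000 ? 0 : 1;  // 4MB fit found
          }

      In the example above, two adjacent 2MB ranges merge into a single 4MB range, so a Medium-sized request is satisfied straight from the cache, with no flushing involved.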

      To store memory ranges, the mapped cache uses a self-balancing binary search tree with added optimizations. Since the tree stores mapped memory, it can use that memory to intrusively store data about itself whilst the memory is unused, removing the need to depend on malloc, which can affect latency negatively. The intention is to use the runtime's red-black tree (JDK-8345314, JDK-8349211).
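
      The intrusive idea can be sketched as follows (assumed layout, unrelated to the actual tree from JDK-8345314): whilst a range sits unused in the cache, its own first bytes hold the tree node, so no separate node allocation is needed.

          #include <cstddef>
          #include <new>

          struct IntrusiveNode {
            IntrusiveNode* left;
            IntrusiveNode* right;
            IntrusiveNode* parent;
            size_t size;  // size of the free range this node lives inside
          };

          // Place the node at the start of the cached (committed, mapped,
          // currently unused) range itself; no malloc involved.
          IntrusiveNode* make_node(void* range_start, size_t range_size) {
            return new (range_start) IntrusiveNode{nullptr, nullptr, nullptr,
                                                   range_size};
          }

          int main() {
            alignas(IntrusiveNode) char range[4096];  // stand-in for a mapped range
            IntrusiveNode* node = make_node(range, sizeof(range));
            return node->size == sizeof(range) ? 0 : 1;
          }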

      With the mapped cache, pages are created once correctly sized memory has been obtained, decoupling memory from pages as much as possible. This comes with the benefit that pages are never recycled or re-sized, which has previously been a source of issues (JDK-8339161).
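
      A sketch of the resulting allocation path (all names below are hypothetical stand-ins, not ZGC's API), where the Page object is constructed only after its memory exists:

          #include <cstddef>
          #include <cstdint>

          struct Page {
            uintptr_t start;
            size_t size;
          };

          // Stub: in the real design this queries the mapped cache.
          static uintptr_t cache_remove(size_t size) { (void)size; return 0; }
          // Stub: in the real design this commits and maps new memory.
          static uintptr_t map_new_memory(size_t size) { (void)size; return 0x200000; }

          Page* alloc_page(size_t size) {
            uintptr_t mem = cache_remove(size);  // try cached mapped memory first
            if (mem == 0) {
              mem = map_new_memory(size);        // miss: pay the OS costs
            }
            // The Page is created only once correctly sized memory exists, and
            // it is never recycled or re-sized afterwards.
            return new Page{mem, size};
          }

          int main() {
            Page* page = alloc_page(2 * 1024 * 1024);
            return page->start != 0 ? 0 : 1;
          }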

      The main concepts that need to be re-designed when decoupling memory from pages are uncommitting memory and NUMA-awareness. To summarise:

      Uncommit: If a page has been sitting inside the page cache for longer than some period of time, it is considered disposable and its memory can be uncommitted. When memory is decoupled from pages, cached ranges grow and shrink as memory is merged into and removed from the mapped cache. This makes it difficult to reason about how long a given piece of memory has been inside the cache, requiring a different solution for determining when and how much to uncommit. To solve this, the mapped cache keeps track of a watermark level, indicating how much memory has been unused in the cache since the last uncommit (or since program start). The watermark level is taken into account when deciding how much to uncommit.
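
      One way to realise the watermark (assumed semantics, sketched below) is as the low-water mark of cache occupancy since the last uncommit: whatever amount the cache never dropped below has been unused for the entire interval, making it a safe uncommit candidate.

          #include <algorithm>
          #include <cstddef>

          class UncommitWatermark {
            size_t _cached = 0;  // memory currently in the cache
            size_t _min = 0;     // low-water mark since the last uncommit
          public:
            void on_insert(size_t size) { _cached += size; }

            void on_remove(size_t size) {
              _cached -= size;
              _min = std::min(_min, _cached);  // removals can lower the mark
            }

            // Memory eligible for uncommit; resets for the next interval.
            size_t eligible_for_uncommit() {
              size_t eligible = _min;
              _min = _cached;
              return eligible;
            }
          };

          int main() {
            UncommitWatermark watermark;
            watermark.on_insert(8);             // cache grows to 8 units
            watermark.eligible_for_uncommit();  // end of first interval: mark = 8
            watermark.on_remove(2);             // usage dips to 6 units ...
            watermark.on_insert(2);             // ... and recovers to 8
            return watermark.eligible_for_uncommit() == 6 ? 0 : 1;
          }

      Here the cache briefly dipped from 8 to 6 units during the interval, so only the 6 units that were never touched are considered eligible for uncommit.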

      NUMA: Currently, ZGC interleaves memory across all NUMA nodes with a granularity of ZGranuleSize (2MB), which is the same size as a Small page. As a result, a Small page ends up on a single, preferably local, NUMA node, whilst larger allocations will (likely) span multiple NUMA nodes. When moving to a mapped cache, having to track, and take into account, which NUMA node(s) memory ranges are allocated on would interfere with the benefits of merging ranges. To maintain the benefits of the mapped cache whilst also supporting NUMA-local allocation, the page allocator instead uses multiple mapped caches, one per NUMA node. This comes with its own set of challenges, but at the same time makes it easier to allocate not only Small pages, but larger allocations as well, on a single NUMA node instead of interleaving.
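
      Structurally (a hypothetical sketch, not the proposed code), this amounts to indexing a set of independent caches by NUMA node, preferring the allocating thread's local node and falling back to remote nodes:

          #include <cstddef>
          #include <cstdint>
          #include <vector>

          struct MappedCache {
            // Stub standing in for the merging cache sketched earlier.
            uintptr_t remove(size_t size) { (void)size; return 0; }
          };

          class PageAllocator {
            std::vector<MappedCache> _caches;  // one mapped cache per NUMA node
          public:
            explicit PageAllocator(size_t numa_nodes) : _caches(numa_nodes) {}

            uintptr_t alloc(size_t size, size_t local_node) {
              // Prefer memory cached on the allocating thread's node ...
              uintptr_t mem = _caches[local_node].remove(size);
              // ... then fall back to the other nodes' caches.
              for (size_t i = 0; mem == 0 && i < _caches.size(); i++) {
                if (i != local_node) {
                  mem = _caches[i].remove(size);
                }
              }
              return mem;  // 0: commit and map new memory, ideally on local_node
            }
          };

          int main() {
            PageAllocator allocator(4);  // e.g. a 4-node machine
            return allocator.alloc(2 * 1024 * 1024, 0) == 0 ? 0 : 1;
          }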

            Assignee: Joel Sikstrom
            Reporter: Joel Sikstrom