Reduce metaspace waste by dynamically merging and splitting metaspace chunks.
Improve the existing Metaspace Chunk allocator to reduce Out-Of-Memory errors due to clogging up metaspace with chunks of the "wrong" chunk size.
The following are not goals of this proposal:
- Replacing the chunk allocator with a different one, e.g. a malloc()-based one or a buddy allocator.
- Changing the chunk types (specialized, medium, small) or their sizes.
- Changing the distinction between Non-Class- and Class-Metaspace.
Success would show as increased reuse of metaspace chunks, fewer chunks resting in the freelist, and fewer Out-Of-Memory errors due to a filled-up Metaspace.
Chunks of a given size cannot be reused as chunks of a different size. There are pathological allocation patterns which lead to the metaspace filling up with chunks of one size. Although free, these chunks cannot be reused if chunks of a different size are needed.
Example: an application has a large number of class loaders, each one allocating only a few small classes. These class loaders will not need much metaspace, and with the current implementation will only be given small metaspace chunks. This leads to the metaspace being filled with small chunks. When those class loaders are unloaded, the small chunks are freed and added to the freelist.
Now a single class loader continues to work and starts allocating medium-sized chunks. The small chunks will not be reused. With a limit in place (CompressedClassSpaceSize or MaxMetaspaceSize), the VM may hit an OOME from metaspace even though there are plenty of free chunks, because they are locked into the wrong size.
For a demonstration of this effect please see the two attached example programs:
- Example2 (in test3.zip) loads a number of small classes, each in its own classloader, collects them and then starts loading large classes in a single class loader. It uses dynamic compilation and therefore may run a bit sluggish on debug VMs.
- MetaOOM does basically the same, but uses a different class loader for each large class.
Also note the attached output files. They show the output of the Example2 program with CompressedClassPointers enabled and a CompressedClassSpaceSize of 10M, running into an OOME. Both use SAP-internal metaspace statistic printouts, which were done at the point of the OOME. Most important is the "-- ChunkManager --" section, which shows how many chunks of which size are residing in the freelists.
- output_no_patch.txt shows how metaspace is wasted in Example2. The "-- ChunkManager --" section shows that almost 6MB of free space are locked into small chunks. This amounts to almost 60% of the CompressedClass space.
- output_with_patch.txt shows how the same program ran with an SAP-internal patch in place (see below for details). Here, when an OOM finally happens, metaspace is almost completely used up; only a measly 36K are still residing inside the freelists. There is almost no waste.
Also note that without the patch, the VM manages to load ~1000 large classes before hitting OOM; with the patch, it manages to load ~3000 classes.
The printouts also show an ASCII-art metaspace map, another feature we added to our VM, which shows in the former case a lot of small chunks unused (lower-case "s"), in the latter case almost no unused chunks (all letters are uppercase). For the latter case, it also shows less fragmentation.
(Please note that we would be happy to contribute both the statistics printout and the metaspace map to the OpenJDK; however, for now they are not part of this JEP.)
In order to enable small chunks to be reused as larger chunks, multiple neighboring smaller chunks can - if they are all free - be merged to form a larger chunk. Similarly, larger chunks can be split up into smaller chunks if small chunks are needed and only large chunks are available.
As already mentioned, a variant of this solution is already implemented as a patch internally at SAP. The following points describe this particular implementation and also serve as a proposal of how an implementation in the OpenJDK could work:
- If a chunk is returned to the freelist, a check is performed to see if it can be merged with its neighboring chunks (chunks adjacent to this chunk in the virtual space) to form a larger chunk. This is possible if the neighboring chunks are also free. The neighboring chunks are then removed from the freelist, merged with the just-freed chunk, and the resulting larger chunk is placed back in the freelist.
As a result, metaspace will now fill up with larger chunks where possible. This reduces the chance of situations where we need a larger chunk, but only smaller chunks are free.
- If a small chunk is requested from the freelist, but only larger chunks are available, a larger chunk is taken from the freelist and split into n smaller chunks. n-1 smaller chunks are returned to the freelist and one smaller chunk is returned to the caller.
This takes care of the reverse problem: metaspace is filled with large chunks, but smaller chunks are needed.
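The two rules above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the SAP patch or HotSpot code: there are just two chunk sizes, offsets are counted in small-chunk units, a large chunk covers four small ones, and all names (ToyChunkManager, return_small, take_small, kLargeUnits, kNoChunk) are hypothetical. It also looks up neighbors in a set rather than in a bit mask.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Toy model, not HotSpot code: offsets are counted in small-chunk units,
// and one large chunk covers kLargeUnits small chunks (assumed value).
constexpr std::size_t kLargeUnits = 4;
constexpr std::size_t kNoChunk = static_cast<std::size_t>(-1);

struct ToyChunkManager {
    std::set<std::size_t> free_small;  // start offsets of free small chunks
    std::set<std::size_t> free_large;  // start offsets of free large chunks

    // Put a large chunk on the freelist; it must sit on a large-chunk boundary.
    void return_large(std::size_t start) {
        assert(start % kLargeUnits == 0);
        free_large.insert(start);
    }

    // Return a small chunk. If every small chunk in the enclosing aligned
    // slot is now free, replace them with one large chunk (merge).
    void return_small(std::size_t start) {
        free_small.insert(start);
        const std::size_t base = start - (start % kLargeUnits);
        for (std::size_t i = 0; i < kLargeUnits; ++i) {
            if (free_small.count(base + i) == 0) return;  // a neighbor is in use
        }
        for (std::size_t i = 0; i < kLargeUnits; ++i) free_small.erase(base + i);
        free_large.insert(base);
    }

    // Hand out a small chunk. If none is free, split a large chunk into
    // kLargeUnits small ones: n-1 stay on the freelist, one goes to the caller.
    std::size_t take_small() {
        if (free_small.empty()) {
            if (free_large.empty()) return kNoChunk;
            const std::size_t base = *free_large.begin();
            free_large.erase(free_large.begin());
            for (std::size_t i = 0; i < kLargeUnits; ++i) free_small.insert(base + i);
        }
        const std::size_t start = *free_small.begin();
        free_small.erase(free_small.begin());
        return start;
    }
};
```

In this model, taking a small chunk from a manager that only holds one free large chunk splits it, and returning that small chunk merges the four pieces back into the large chunk.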
In order to simplify coding and to increase the chance of defragmentation, chunks are allocated aligned to their respective chunk size: specialized chunks at specialized-chunk-size boundaries, small chunks at small-chunk-size boundaries and so on. Humongous chunks are excluded from this rule, but they are still aligned to specialized-chunk-size boundaries like before.
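A consequence of this alignment rule is that the candidate larger chunk for any given address can be found by plain rounding. The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, and offsets are counted in units of the smallest chunk size, so a non-class medium chunk spans 64 units.

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `chunk_units`; this is where a
// chunk of that size could start if the current top is not aligned.
inline std::size_t align_up(std::size_t offset, std::size_t chunk_units) {
    assert(chunk_units > 0);
    return ((offset + chunk_units - 1) / chunk_units) * chunk_units;
}

// Start of the aligned slot of size `chunk_units` that contains `offset`;
// this is the prospective larger chunk a freed smaller chunk could merge into.
inline std::size_t enclosing_slot(std::size_t offset, std::size_t chunk_units) {
    assert(chunk_units > 0);
    return offset - (offset % chunk_units);
}
```

For example, with 64-unit medium chunks, a chunk freed at unit 70 can only ever become part of the medium chunk starting at unit 64.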
When a chunk is returned, a check determines whether it can be merged with its neighbors. This requires a cheap way to test whether the neighboring chunks are free. A bit mask is used for that, with each bit representing one smallest-chunk-sized area (all chunks are a multiple of this size and start at an address aligned to this size). The bit indicates whether the area is free (0) or in use (1). For the neighboring chunks to count as free, all bits covering the prospective larger chunk must be 0. For the most common scenario, in which smaller chunks are merged to form a medium chunk, this is very cheap: medium chunks are 32 (class-space) or 64 (non-class-space) times the smallest chunk size, and are now aligned to medium-chunk-size. This means that all bits representing the prospective medium chunk form a 32-bit or 64-bit aligned integer in the bit mask, which can be loaded as such and compared with zero.
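The bit mask check can be sketched as follows for the non-class case, where a medium chunk covers 64 smallest-chunk units and therefore exactly one 64-bit word. This is an illustration only: the class name OccupancyMap and its methods are hypothetical, and the sketch assumes that untouched areas start out as free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per smallest-chunk-sized unit: 0 = free, 1 = in use.
// Hypothetical helper, sized for medium chunks of 64 units.
class OccupancyMap {
public:
    explicit OccupancyMap(std::size_t units) : words_((units + 63) / 64, 0) {}

    // Mark `count` units starting at `first_unit` as in use or free.
    void set_range(std::size_t first_unit, std::size_t count, bool in_use) {
        assert(first_unit + count <= words_.size() * 64);
        for (std::size_t u = first_unit; u < first_unit + count; ++u) {
            const std::uint64_t bit = std::uint64_t(1) << (u % 64);
            if (in_use) {
                words_[u / 64] |= bit;
            } else {
                words_[u / 64] &= ~bit;
            }
        }
    }

    // True if the whole aligned 64-unit slot containing `unit` is free.
    // Because medium chunks are aligned to their size, the slot maps to a
    // single word, so one load and one compare with zero suffice.
    bool medium_slot_is_free(std::size_t unit) const {
        assert(unit / 64 < words_.size());
        return words_[unit / 64] == 0;
    }

private:
    std::vector<std::uint64_t> words_;
};
```

The class-space case would work the same way with 32-bit words.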
Instead of merging chunks on the fly, chunks could be coalesced only in an OOM situation, by iterating over the whole metaspace and attempting to coalesce neighboring chunks. However, the runtime costs are difficult to predict and higher than when merging only around those chunks which are returned to the freelist.
The humongous chunks need to be treated specially in a number of places because they need to be excluded from the "align at chunk size" rule. That is because their size is not known and potentially very large. They still intermix with the other chunks in the virtual space, though. This makes the coding more complicated than necessary.
Alternatively, the humongous chunks could be allocated elsewhere, e.g. at the end of the VirtualSpace. That would mean that the VirtualSpaceNode would have two high-water marks, one growing from the bottom as before - with all the normal chunks living below it - one growing from the top, accommodating the humongous chunks. For the implementation at SAP, we refrained from doing this, because the changes to the code base would become too large.
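The two-high-water-mark idea can be sketched in a few lines. This only illustrates the alternative, which was not implemented. The type and member names (TwoEndedSpace, low_mark, high_mark) are hypothetical, sizes are counted in abstract units, and alignment handling is left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a space with two high-water marks: normal chunks
// grow upward from the bottom, humongous chunks grow downward from the top.
struct TwoEndedSpace {
    std::size_t low_mark;   // first unit above the normal chunks
    std::size_t high_mark;  // first unit of the lowest humongous chunk

    explicit TwoEndedSpace(std::size_t units) : low_mark(0), high_mark(units) {}

    // Allocate a normal chunk at the bottom; fails if the marks would cross.
    bool allocate_normal(std::size_t units, std::size_t* start) {
        assert(units > 0 && start != nullptr);
        if (units > high_mark - low_mark) return false;
        *start = low_mark;
        low_mark += units;
        return true;
    }

    // Allocate a humongous chunk at the top; fails if the marks would cross.
    bool allocate_humongous(std::size_t units, std::size_t* start) {
        assert(units > 0 && start != nullptr);
        if (units > high_mark - low_mark) return false;
        high_mark -= units;
        *start = high_mark;
        return true;
    }
};
```

With this layout the normal chunks below low_mark would never have a humongous chunk as a neighbor, which is what would remove the special cases.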
The implementation described above is (as part of the normal nightly tests run at SAP) tested with TCK, jtreg, and a large number of self-written regression tests, as well as a collection of benchmarks (SPECjvm98, SPECjvm2008, SPECjbb2005).
In addition, a small test case was developed to demonstrate the problem, which shows considerable improvement when running with the fix.
More tests are needed to stress every angle of metachunk allocation.
Risks and Assumptions
There is a performance overhead due to on-the-fly merging and splitting. We think this overhead is small - in our internal tests, its effects were not discernible. However, there may be pathological cases where these costs become larger.
The metaspace coding will become more intricate with this fix, which carries the usual risk of introducing new errors. However, the code could be made simpler in other places - e.g. methods like "get_small_chunk_and_allocate" would not be needed anymore - which may negate the added complexity.