This issue is only observable on NUMA systems when running on a subset of the available NUMA nodes. This includes running the JVM with NUMA disabled on such a system, i.e., by binding the JVM to only use memory (and CPU) from a single node, which we believe is not an uncommon configuration. We initially observed this behavior in the development and testing of Automatic Heap Sizing for ZGC, which scales memory usage for the Java heap more fluidly than traditional ergonomic heap sizing.
When running in a configuration where we only use a subset of the available NUMA nodes, with large pages configured on all nodes, we might over-reserve large pages. This likely happens because reserving large pages checks against the system-wide limit, not against the large pages we can actually use (those configured on the NUMA nodes we are bound to). Once large pages have been over-reserved, there is a race over who touches them first, which can be any of the subsystems in the JVM using large pages. So far we've observed the Java heap and the CodeCache heap racing, where either one of them ends up getting a SIGBUS.
$ echo 512 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
512
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
512
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
256
$ cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
256
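To make the mismatch concrete, here is a minimal sketch (not what the JVM does) that compares the system-wide large page pool with the pool on the nodes a process is bound to. It reads the same sysfs files as the commands above. The function names and the `sysfs_root` parameter are made up for illustration, and the list of bound nodes is assumed to be supplied by the caller.

```python
from pathlib import Path


def read_count(path):
    """Read a single integer from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def system_hugepages(size_kb=2048, sysfs_root="/sys"):
    """Size of the system-wide pool for the given large page size."""
    return read_count(
        Path(sysfs_root) / "kernel/mm/hugepages"
        / f"hugepages-{size_kb}kB" / "nr_hugepages"
    )


def node_hugepages(nodes, size_kb=2048, sysfs_root="/sys"):
    """Sum of the per-node pools for the given NUMA node ids."""
    return sum(
        read_count(
            Path(sysfs_root) / f"devices/system/node/node{n}"
            / "hugepages" / f"hugepages-{size_kb}kB" / "nr_hugepages"
        )
        for n in nodes
    )


def unusable_hugepages(bound_nodes, size_kb=2048, sysfs_root="/sys"):
    """Pages counted in the system-wide pool that live on nodes we are not bound to.

    A reservation sized against the system-wide pool can exceed what the
    bound nodes can back by up to this many pages.
    """
    return system_hugepages(size_kb, sysfs_root) - node_hugepages(
        bound_nodes, size_kb, sysfs_root
    )
```

With the configuration shown above and a process bound to node 0, `system_hugepages()` would return 512, `node_hugepages([0])` would return 256, and `unusable_hugepages([0])` would return 256, i.e. half of the pages the system-wide limit advertises cannot be used by that process.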
Some stack traces of the SIGBUS errors we have observed:
ZGC's Java heap:
$ java -XX:+UseZGC -XX:+UseLargePages -jar dacapo-23.11-MR2-chopin.jar h2
Stack: [0x00007f1c280fc000,0x00007f1c281fc000], sp=0x00007f1c281faad0, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x1162e98] ZMappedCache::cache_insert(AbstractRBTree<zoffset, IntrusiveRBNode, ZMappedCache::EntryCompare>::Cursor const&, ZVirtualMemory const&)+0x78 (rbTree.hpp:84)
V [libjvm.so+0x1164ad4] ZMappedCache::insert(ZVirtualMemory const&)+0x1a4 (zMappedCache.cpp:618)
V [libjvm.so+0x117dd7c] ZPageAllocator::free_memory(ZArray<ZVirtualMemory>*)+0x8c (zPageAllocator.cpp:696)
V [libjvm.so+0x118178d] ZPageAllocator::free_page(ZPage*)+0x6d (zPageAllocator.cpp:2266)
V [libjvm.so+0x11903e3] ZRelocateTask::work()+0x723 (zRelocate.cpp:1047)
V [libjvm.so+0x1130920] WorkerThread::run()+0x80 (workerThread.cpp:71)
V [libjvm.so+0x10558ff] Thread::call_run()+0x9f (thread.cpp:243)
V [libjvm.so+0xe2c976] thread_native_entry(Thread*)+0xc6 (os_linux.cpp:929)
siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00000400037ff8d8
CodeCache heap:
$ java -XX:+UseZGC -XX:+UseLargePages -jar dacapo-23.11-MR2-chopin.jar h2
Stack: [0x00007fc2afdff000,0x00007fc2afeff000], sp=0x00007fc2afefb0a0, free space=1008k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x969f4f] CodeHeap::allocate(unsigned long)+0xbf (checkedCast.hpp:40)
V [libjvm.so+0x6bf1d9] CodeCache::allocate(unsigned int, CodeBlobType, bool, CodeBlobType)+0x89 (codeCache.cpp:548)
V [libjvm.so+0xde6385] nmethod::new_nmethod(methodHandle const&, int, int, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, CompLevel, char*, int, JVMCINMethodData*)+0x185 (nmethod.cpp:1644)
V [libjvm.so+0x64ba7a] ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, bool, bool, bool, bool, int)+0x3fa (ciEnv.cpp:1064)
V [libjvm.so+0xe42a5c] PhaseOutput::install_code(ciMethod*, int, AbstractCompiler*, bool, bool)+0xec (output.cpp:3207)
V [libjvm.so+0x6f1b52] Compile::Code_Gen()+0x742 (compile.cpp:3164)
V [libjvm.so+0x6f4176] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x15e6 (compile.cpp:895)
V [libjvm.so+0x609f31] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1c1 (c2compiler.cpp:147)
V [libjvm.so+0x6faf4f] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x94f (compileBroker.cpp:2345)
V [libjvm.so+0x6fc180] CompileBroker::compiler_thread_loop()+0x4d0 (compileBroker.cpp:1989)
V [libjvm.so+0x9f2548] JavaThread::thread_main_inner() [clone .part.0]+0xb8 (javaThread.cpp:772)
V [libjvm.so+0x10558ff] Thread::call_run()+0x9f (thread.cpp:243)
V [libjvm.so+0xe2c976] thread_native_entry(Thread*)+0xc6 (os_linux.cpp:929)
siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00007fc378800000