@(#)READ_ME 1.3 13/02/21

This directory contains the test program for the following bug:

    6799919 Recursive calls to report_vm_out_of_memory are handled
            incorrectly

Here are the files included:

    READ_ME           - this file
    Makefile          - GNU makefile for building and running the test
    agent_util.c      - JVM/TI demo agent utility code
    agent_util.h      - JVM/TI demo agent utility header
    debug_diffs.txt   - debug patch for the VM to make this bug reproducible
    doit.ksh          - a script to run the test
    HelloForever.java - A "Hello World!" Java program that sleeps forever.
    memEater.c        - a JVM/TI agent modeled after the agent from the
                        (unofficial) JVM/TI demo followRefsOnStack


Intro
=====

src/share/vm/utilities/debug.cpp: report_vm_out_of_memory() is a function
used to report native memory allocation failures. There are a couple of
issues with the function that are described in the bug report:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6799919


How the Test Works
==================

Java Part
---------

Since this bug is about native memory allocation failures, we don't need
a whole lot of Java code. In fact, less Java is better. In the early
stages of trying to reproduce this failure, the Java level
OutOfMemoryError (OOME) kept popping up and getting in the way.

The HelloForever.java test program is a version of "Hello World!" that
sleeps forever. This Java program serves the purpose of keeping the VM up
and running until the other parts of the test can make the VM crash.

Native Part
-----------

6799919 talks about recursive calls to report_vm_out_of_memory() and
multi-threaded calls to report_vm_out_of_memory(). Combine that with the
need to provoke a native memory allocation failure and that just cries
out for a JVM/TI agent. And there just happens to be some JVM/TI demo
code in the 'jdk' repo to help out along those lines.

agent_util.c and agent_util.h were copied from:

    jdk/src/share/demo/jvmti/agent_util/agent_util.c
    jdk/src/share/demo/jvmti/agent_util/agent_util.h

memEater.c was modeled after the (unofficial) JVM/TI demo program for
another bug called followRefsOnStack.c; it looks like followRefsOnStack.c
might have been modeled after:

    jdk/src/share/demo/jvmti/gctest/gctest.c

memEater.c is a pretty simple JVM/TI agent that launches two JVM/TI
"agent" threads from the JVMTI_EVENT_VM_INIT event handler. Both threads
execute the same worker() function that operates on a thread-specific
2K element array of buffer pointers. The allocation algorithm is simple:

    buf_size = 1MB;
    while (buf_size > 4) {
        allocate buffer of buf_size bytes
        if (alloc fails) {
            buf_size /= 2;
        }
    }

With two threads executing the above algorithm, a 32-bit VM rapidly runs
out of native memory and the VM falls over with a message like this one:

    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 152 bytes for AllocateHeap
    # An error report file with more information is saved as:
    # C:\6799919\hs_err_pid7588.log
    status=1

The message can vary in some of the details depending on what memory
allocation happens to fail in the VM. If this sounds imprecise, that's
because it is imprecise. In a multi-threaded application like Java, if
all available malloc() memory is used up, it is very hard to predict
where the next memory allocation failure will occur.

It is also important to note that the JVM/TI Allocate() calls made by the
agent threads _do not_ directly result in a call to the target function:
report_vm_out_of_memory().
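
For concreteness, the worker loop looks roughly like the following sketch
(a sketch only: the names and bounds here are made up; the real code is
in memEater.c and differs in detail):

    #include <jvmti.h>

    #define NUM_BUFS 2048   /* 2K element array of buffer pointers */

    /* Hypothetical worker; each agent thread runs one of these. */
    static void JNICALL
    worker(jvmtiEnv *jvmti, JNIEnv *jni, const void *arg)
    {
        unsigned char *bufs[NUM_BUFS];         /* local, so per-thread */
        jlong          buf_size = 1024 * 1024; /* start with 1MB requests */
        int            n = 0;

        while (buf_size > 4 && n < NUM_BUFS) {
            unsigned char *mem = NULL;
            jvmtiError     err = (*jvmti)->Allocate(jvmti, buf_size, &mem);

            if (err == JVMTI_ERROR_OUT_OF_MEMORY) {
                buf_size /= 2;     /* halve the request and try again */
            } else if (err == JVMTI_ERROR_NONE) {
                bufs[n++] = mem;   /* keep the block; the point is to use memory up */
            } else {
                break;             /* some other error; give up */
            }
        }
    }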
When a JVM/TI Allocate() call fails, the JVMTI_ERROR_OUT_OF_MEMORY error
is returned and the caller is expected to handle that failure in some
intelligent way if at all possible. In the case of our agent, our
response is to allocate smaller and smaller blocks until nothing larger
than 4 bytes can be allocated. Of course, this means that the next
os::malloc() call made by the VM is likely to result in a call to
report_vm_out_of_memory(), but that's the whole point of this exercise.

The native part of this test reliably provokes a single call to
report_vm_out_of_memory(), but that's not good enough. We're trying to
reliably get a recursive call or multi-threaded calls to
report_vm_out_of_memory() and that requires debugging code in the VM.

VM Part
-------

There are three pieces of debugging code in the VM to help make the
target failure mode reproducible; see the debug_diffs.txt file which
contains a patch for use with "hg import".

1) an upper limit on os::malloc() allocated memory

As mentioned above, the native part of the test reliably causes a 32-bit
VM to run out of native memory. A 64-bit VM is another story, especially
when run on an OS like Win64 that doesn't (easily) support something
like 'ulimit -d' or 'ulimit -v'.

In src/share/vm/runtime/os.cpp, the os::malloc() function is limited to
allocating a total of no more than 768MB of memory. Once that limit is
reached, os::malloc() returns NULL to simulate a real native memory
allocation failure.

2) forcing an os::malloc() failure in the right place and time

The VM hack in #1 is enough to reliably cause an os::malloc() failure in
any VM. However, remember that the JVM/TI agent threads are only
enabling another random part of the VM to fail an os::malloc() call. It
turns out that the next failure point is typically in the exit path for
the JVM/TI agent threads.

JVM/TI agent threads are full-blown JavaThreads that happen to be
running native code. One of the semantics of a JavaThread is that
java.lang.Thread.exit() is always called on a JavaThread, even on a
JavaThread that has simply returned from main() or on a JVM/TI agent
thread that has finished executing its native code and returned.

Yes, the JVM/TI agent threads are trying to call java.lang.Thread.exit()
and the VM blows up trying to allocate some housekeeping data structures
for that call. The sad part of this story is that the housekeeping data
is a JNIHandleBlock which is allocated under the protection of a lock so
there's no way to get two threads failing an allocation on that path at
the same time. Enter the next VM hack.

In src/share/vm/prims/jvmtiImpl.cpp, the
JvmtiAgentThread::call_start_function() function is modified to
os::malloc() a 1K block right after the native code returns. If that
allocation fails, then vm_exit_out_of_memory() is called which results
in a call to report_vm_out_of_memory(). Since both JVM/TI agent threads
have nicely used up all the malloc() memory and returned, we have two
threads racing to VM hack #2.

3) forcing two threads into the gauntlet at the same time

This bug is all about a race in report_vm_out_of_memory() so the last
hack is some code to make sure that both threads are lined up at the
start line for the race.

In src/share/vm/utilities/debug.cpp: report_vm_out_of_memory(), the
RawMonitor_lock monitor is used to block both threads at the same point.
Why RawMonitor_lock? Because it's not likely to be in use at this point
in time so it has been hijacked for this hack. The first thread to grab
RawMonitor_lock calls wait().
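
Schematically, the gate works like the following sketch (expressed with
POSIX threads purely for illustration; the names here are made up and
the actual hack uses RawMonitor_lock inside the VM, see debug_diffs.txt):

    #include <pthread.h>

    /* Illustration only: the real hack lives in report_vm_out_of_memory(). */
    static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
    static int             arrivals  = 0;

    static void rendezvous(void)
    {
        pthread_mutex_lock(&gate_lock);
        if (++arrivals < 2) {
            /* first thread in the cage: wait for the second */
            while (arrivals < 2)
                pthread_cond_wait(&gate_cond, &gate_lock);
        } else {
            /* second thread in the cage: wake everyone and go */
            pthread_cond_broadcast(&gate_cond);
        }
        pthread_mutex_unlock(&gate_lock);
        /* both threads leave here at (nearly) the same time */
    }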
When the second thread grabs RawMonitor_lock, it notifies "all threads",
releases RawMonitor_lock and the race is on. One thread will manage to
be "first" in the racy block and calls
VMError(thread, file, line, size, message).report_and_die(). The other
thread bypasses the report_and_die() call and calls vm_abort(true).

Wait... one thread is trying to report a native memory allocation
failure and the other thread aborts the VM? That can't be good...

The Consequences
----------------

If the vm_abort(true) call happens quickly enough, the "first" thread
may not get a chance to report anything. Have you ever seen a test
failure where the "java" cmd simply exited with "exit code == 1" with no
error message and no hs_err_pid file? This race could be the bug that
caused that "drive by shooting".

Similarly, have you ever seen a test failure where the hs_err_pid file
appears to be incomplete or partially written? Again, this race could be
the bug that caused that "UFO".


Reproducing the Bug
===================

Reproducing the bug requires building the test, building a VM to include
the special debug patch, installing the special VM in a JDK and then
running the test.

Building the Test
-----------------

    $ mkdir <test_dir>
    $ cd <test_dir>

    # download 6799919_test.tgz into <test_dir>

    $ gunzip < 6799919_test.tgz | tar xvfp -

    # Use something like this to add Microsoft Tools to your environment:
    #
    # $ . add_ms_env VS2010       # for 32-bit
    # $ . add_ms_env -l VS2010    # for 64-bit

    $ make JDK=$JAVA_HOME OSNAME=win32    # or 'linux' or 'solaris'

Building the VM
---------------

    $ hg clone -r b861c8af2510 \
        http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot my_hotspot
    $ hg clone -r cb57f84b031c \
        http://closedjdk.us.oracle.com/hsx/hotspot-rt/hotspot/make/closed \
        my_hotspot/make/closed
    $ hg clone -r 60755a7f98f8 \
        http://closedjdk.us.oracle.com/hsx/hotspot-rt/hotspot/src/closed \
        my_hotspot/src/closed
    $ hg clone -r 26a8c0e935cb \
        http://closedjdk.us.oracle.com/hsx/hotspot-rt/hotspot/test/closed \
        my_hotspot/test/closed

    $ cd my_hotspot
    $ cp -p <test_dir>/debug_diffs.txt .
    $ hg import debug_diffs.txt

    # If you want to use the same VM to also exercise the fix, then
    # do the following:
    #
    # $ cp -p <test_dir>/debug.cpp.debug_and_fix src/share/vm/utilities/debug.cpp
    #
    # The above will replace the debug.cpp file created by the import of
    # debug_diffs.txt with a version that has both the debug hooks and
    # the fix in place.

Use your favorite build method to build your VM bits. Once built, copy
your VM bits into the proper location in $JAVA_HOME.

Running the Test
----------------

    $ cd <test_dir>
    $ make JDK=$JAVA_HOME OSNAME=win32 \
        TEST_ARGS=-client test 2>&1 | tee make.log

The above test invocation assumes that you installed your VM bits as the
"client" VM. Obviously, if you used a different name for your VM, then
change the args.

The above test invocation should reproduce the failing test result with
something that looks like this at the end:

    Worker thread 2 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - first thread is in the cage!
    XXX - faking malloc failure.
    Worker thread 1 failed to allocate a 8 byte buffer...
    Worker thread 1 allocated 1 buffers of size 8.
    Worker thread 1 buf_size=4...
    Worker thread 1 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - second thread is in the cage!
    XXX - second thread is leaving the cage!
    XXX - first thread is leaving the cage!
    XXX - one thread is calling VMError
    XXX - one thread is calling vm_abort
    + status=1
    + echo status=1
    status=1
    + exit 1
    make: *** [test] Error 1

In the above example, the thread calling vm_abort() managed to kill off
the VM before the thread calling VMError managed to output anything to
either stderr or the hs_err_pid file.

If you chose to include the fix with your VM, then you can exercise the
fix along with the special debug patch:

    $ make JDK=$JAVA_HOME OSNAME=win32 \
        TEST_ARGS="-client -XX:+UseNewCode2" test 2>&1 | tee make.log

The above test invocation should reproduce a passing test result with
something that looks like this at the end:

    Worker thread 1 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - first thread is in the cage!
    XXX - faking malloc failure.
    Worker thread 2 failed to allocate a 8 byte buffer...
    Worker thread 2 allocated 1 buffers of size 8.
    Worker thread 2 buf_size=4...
    Worker thread 2 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - second thread is in the cage!
    XXX - second thread is leaving the cage!
    XXX - first thread is leaving the cage!
    [thread 5856 also had an error]
    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 1024 bytes for cannot allocate dummy 1K block
    # An error report file with more information is saved as:
    # C:\6799919\hs_err_pid2052.log
    XXX - faking malloc failure.
    + status=1
    + echo status=1
    status=1
    + exit 1
    make: *** [test] Error 1