@(#)READ_ME 1.3 13/02/21

This directory contains the test program for the following bug:

    6799919 Recursive calls to report_vm_out_of_memory are handled
            incorrectly

Here are the files included:

    READ_ME           - this file
    Makefile          - GNU makefile for building and running the test
    agent_util.c      - JVM/TI demo agent utility code
    agent_util.h      - JVM/TI demo agent utility header
    debug_diffs.txt   - debug patch for the VM to make this bug reproducible
    doit.ksh          - a script to run the test
    HelloForever.java - A "Hello World!" Java program that sleeps forever.
    memEater.c        - a JVM/TI agent modeled after the agent from the
                        (unofficial) JVM/TI demo followRefsOnStack


Intro
=====

src/share/vm/utilities/debug.cpp: report_vm_out_of_memory() is a function
used to report native memory allocation failures. There are a couple of
issues with the function that are described in the bug report:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6799919


How the Test Works
==================

Java Part
---------

Since this bug is about native memory allocation failures, we don't need
a whole lot of Java code. In fact, less Java is better. In the early
stages of trying to reproduce this failure, the Java level
OutOfMemoryError (OOME) kept popping up and getting in the way.

The HelloForever.java test program is a version of "Hello World!" that
sleeps forever. This Java program serves the purpose of keeping the VM up
and running until the other parts of the test can make the VM crash.

Native Part
-----------

6799919 talks about recursive calls to report_vm_out_of_memory() and
multi-threaded calls to report_vm_out_of_memory(). Combine that with the
need to provoke a native memory allocation failure and that just cries
out for a JVM/TI agent. And there just happens to be some JVM/TI demo
code in the 'jdk' repo to help out along those lines.

agent_util.c and agent_util.h were copied from:

    jdk/src/share/demo/jvmti/agent_util/agent_util.c
    jdk/src/share/demo/jvmti/agent_util/agent_util.h

memEater.c was modeled after the (unofficial) JVM/TI demo program for
another bug called followRefsOnStack.c; it looks like followRefsOnStack.c
might have been modeled after:

    jdk/src/share/demo/jvmti/gctest/gctest.c

memEater.c is a pretty simple JVM/TI agent that launches two JVM/TI
"agent" threads from the JVMTI_EVENT_VM_INIT event handler. Both threads
execute the same worker() function that operates on a thread-specific
2K element array of buffer pointers. The allocation algorithm is simple:

    buf_size = 1MB;
    while (buf_size > 4) {
        allocate buffer of buf_size bytes
        if (alloc fails) {
            buf_size /= 2;
        }
    }

With two threads executing the above algorithm, a 32-bit VM rapidly runs
out of native memory and the VM falls over with a message like this one:

    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 152 bytes for AllocateHeap
    # An error report file with more information is saved as:
    # C:\6799919\hs_err_pid7588.log
    status=1

The message can vary in some of the details depending on what memory
allocation happens to fail in the VM. If this sounds imprecise, that's
because it is imprecise. In a multi-threaded application like Java, if
all available malloc() memory is used up, it is very hard to predict
where the next memory allocation failure will occur.

It is also important to note that the JVM/TI Allocate() calls made by the
agent threads _do not_ directly result in a call to the target function:
report_vm_out_of_memory().
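
For concreteness, the worker loop looks roughly like the following sketch
(a sketch only: the names and bounds here are made up; the real code is
in memEater.c and differs in detail):

    #include <jvmti.h>

    #define NUM_BUFS 2048   /* 2K element array of buffer pointers */

    /* Hypothetical worker; each agent thread runs one of these. */
    static void JNICALL
    worker(jvmtiEnv *jvmti, JNIEnv *jni, const void *arg)
    {
        unsigned char *bufs[NUM_BUFS];         /* local, so per-thread */
        jlong          buf_size = 1024 * 1024; /* start with 1MB requests */
        int            n = 0;

        while (buf_size > 4 && n < NUM_BUFS) {
            unsigned char *mem = NULL;
            jvmtiError     err = (*jvmti)->Allocate(jvmti, buf_size, &mem);

            if (err == JVMTI_ERROR_OUT_OF_MEMORY) {
                buf_size /= 2;     /* halve the request and try again */
            } else if (err == JVMTI_ERROR_NONE) {
                bufs[n++] = mem;   /* keep the block; the point is to use memory up */
            } else {
                break;             /* some other error; give up */
            }
        }
    }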
When a JVM/TI Allocate() call fails, the JVMTI_ERROR_OUT_OF_MEMORY error
is returned and the caller is expected to handle that failure in some
intelligent way if at all possible. In the case of our agent, our
response is to allocate smaller and smaller blocks until nothing larger
than 4 bytes can be allocated. Of course, this means that the next
os::malloc() call made by the VM is likely to result in a call to
report_vm_out_of_memory(), but that's the whole point of this exercise.

The native part of this test reliably provokes a single call to
report_vm_out_of_memory(), but that's not good enough. We're trying to
reliably get a recursive call or multi-threaded calls to
report_vm_out_of_memory() and that requires debugging code in the VM.

VM Part
-------

There are three pieces of debugging code in the VM to help make the
target failure mode reproducible; see the debug_diffs.txt file which
contains a patch for use with "hg import".

1) an upper limit on os::malloc() allocated memory

As mentioned above, the native part of the test reliably causes a 32-bit
VM to run out of native memory. A 64-bit VM is another story, especially
when run on an OS like Win64 that doesn't (easily) support something
like 'ulimit -d' or 'ulimit -v'.

In src/share/vm/runtime/os.cpp, the os::malloc() function is limited to
allocating a total of no more than 768MB of memory. Once that limit is
reached, os::malloc() returns NULL to simulate a real native memory
allocation failure.

2) forcing an os::malloc() failure in the right place and time

The VM hack in #1 is enough to reliably cause an os::malloc() failure in
any VM. However, remember that the JVM/TI agent threads are only
enabling another random part of the VM to fail an os::malloc() call. It
turns out that the next failure point is typically in the exit path for
the JVM/TI agent threads.

JVM/TI agent threads are full-blown JavaThreads that happen to be
running native code. One of the semantics of a JavaThread is that
java.lang.Thread.exit() is always called on a JavaThread, even on a
JavaThread that has simply returned from main() or on a JVM/TI agent
thread that has finished executing its native code and returned.

Yes, the JVM/TI agent threads are trying to call java.lang.Thread.exit()
and the VM blows up trying to allocate some housekeeping data structures
for that call. The sad part of this story is that the housekeeping data
is a JNIHandleBlock which is allocated under the protection of a lock so
there's no way to get two threads failing an allocation on that path at
the same time. Enter the next VM hack.

In src/share/vm/prims/jvmtiImpl.cpp, the
JvmtiAgentThread::call_start_function() function is modified to
os::malloc() a 1K block right after the native code returns. If that
allocation fails, then vm_exit_out_of_memory() is called which results
in a call to report_vm_out_of_memory(). Since both JVM/TI agent threads
have nicely used up all the malloc() memory and returned, we have two
threads racing to VM hack #2.

3) forcing two threads into the gauntlet at the same time

This bug is all about a race in report_vm_out_of_memory() so the last
hack is some code to make sure that both threads are lined up at the
start line for the race.

In src/share/vm/utilities/debug.cpp: report_vm_out_of_memory(), the
RawMonitor_lock monitor is used to block both threads at the same point.
Why RawMonitor_lock? Because it's not likely to be in use at this point
in time so it has been hijacked for this hack. The first thread to grab
RawMonitor_lock calls wait().
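
Schematically, the gate works like the following sketch (expressed with
POSIX threads purely for illustration; the names here are made up and
the actual hack uses RawMonitor_lock inside the VM, see debug_diffs.txt):

    #include <pthread.h>

    /* Illustration only: the real hack lives in report_vm_out_of_memory(). */
    static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
    static int             arrivals  = 0;

    static void rendezvous(void)
    {
        pthread_mutex_lock(&gate_lock);
        if (++arrivals < 2) {
            /* first thread in the cage: wait for the second */
            while (arrivals < 2)
                pthread_cond_wait(&gate_cond, &gate_lock);
        } else {
            /* second thread in the cage: wake everyone and go */
            pthread_cond_broadcast(&gate_cond);
        }
        pthread_mutex_unlock(&gate_lock);
        /* both threads leave here at (nearly) the same time */
    }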
When the second thread grabs RawMonitor_lock, it notifies "all threads",
releases RawMonitor_lock and the race is on. One thread will manage to
be "first" in the racy block and calls
VMError(thread, file, line, size, message).report_and_die(). The other
thread bypasses the report_and_die() call and calls vm_abort(true).

Wait... one thread is trying to report a native memory allocation
failure and the other thread aborts the VM? That can't be good...

The Consequences
----------------

If the vm_abort(true) call happens quickly enough, the "first" thread
may not get a chance to report anything. Have you ever seen a test
failure where the "java" cmd simply exited with "exit code == 1" with no
error message and no hs_err_pid file? This race could be the bug that
caused that "drive by shooting".

Similarly, have you ever seen a test failure where the hs_err_pid file
appears to be incomplete or partially written? Again, this race could be
the bug that caused that "UFO".


Reproducing the Bug
===================

Reproducing the bug requires building the test, building a VM to include
the special debug patch, installing the special VM in a JDK and then
running the test.

Building the Test
-----------------

    $ mkdir <test_dir>
    $ cd <test_dir>

    # download 6799919_test.tgz into <test_dir>

    $ gunzip < 6799919_test.tgz | tar xvfp -

    # Use something like this to add Microsoft Tools to your environment:
    #
    # $ . add_ms_env VS2010       # for 32-bit
    # $ . add_ms_env -l VS2010    # for 64-bit

    $ make JDK=$JAVA_HOME OSNAME=win32    # or 'linux' or 'solaris'

Building the VM
---------------

    $ hg clone -r b861c8af2510 \
        http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot my_hotspot
    $ hg clone -r cb57f84b031c \
        http://closedjdk.us.oracle.com/hsx/hotspot-rt/hotspot/make/closed \
        my_hotspot/make/closed
    $ hg clone -r 60755a7f98f8 \
        http://closedjdk.us.oracle.com/hsx/hotspot-rt/hotspot/src/closed \
        my_hotspot/src/closed
    $ hg clone -r 26a8c0e935cb \
        http://closedjdk.us.oracle.com/hsx/hotspot-rt/hotspot/test/closed \
        my_hotspot/test/closed

    $ cd my_hotspot
    $ cp -p <test_dir>/debug_diffs.txt .
    $ hg import debug_diffs.txt

    # If you want to use the same VM to also exercise the fix, then
    # do the following:
    #
    # $ cp -p <test_dir>/debug.cpp.debug_and_fix src/share/vm/utilities/debug.cpp
    #
    # The above will replace the debug.cpp file created by the import of
    # debug_diffs.txt with a version that has both the debug hooks and
    # the fix in place.

Use your favorite build method to build your VM bits. Once built, copy
your VM bits into the proper location in $JAVA_HOME.

Running the Test
----------------

    $ cd <test_dir>
    $ make JDK=$JAVA_HOME OSNAME=win32 \
        TEST_ARGS=-client test 2>&1 | tee make.log

The above test invocation assumes that you installed your VM bits as the
"client" VM. Obviously, if you used a different name for your VM, then
change the args.

The above test invocation should reproduce the failing test result with
something that looks like this at the end:

    Worker thread 2 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - first thread is in the cage!
    XXX - faking malloc failure.
    Worker thread 1 failed to allocate a 8 byte buffer...
    Worker thread 1 allocated 1 buffers of size 8.
    Worker thread 1 buf_size=4...
    Worker thread 1 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - second thread is in the cage!
    XXX - second thread is leaving the cage!
    XXX - first thread is leaving the cage!
    XXX - one thread is calling VMError
    XXX - one thread is calling vm_abort
    + status=1
    + echo status=1
    status=1
    + exit 1
    make: *** [test] Error 1

In the above example, the thread calling vm_abort() managed to kill off
the VM before the thread calling VMError managed to output anything to
either stderr or the hs_err_pid file.

If you chose to include the fix with your VM, then you can exercise the
fix along with the special debug patch:

    $ make JDK=$JAVA_HOME OSNAME=win32 \
        TEST_ARGS="-client -XX:+UseNewCode2" test 2>&1 | tee make.log

The above test invocation should reproduce a passing test result with
something that looks like this at the end:

    Worker thread 1 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - first thread is in the cage!
    XXX - faking malloc failure.
    Worker thread 2 failed to allocate a 8 byte buffer...
    Worker thread 2 allocated 1 buffers of size 8.
    Worker thread 2 buf_size=4...
    Worker thread 2 finished 401 loops...
    XXX - allocating dummy block!
    XXX - faking malloc failure.
    XXX - second thread is in the cage!
    XXX - second thread is leaving the cage!
    XXX - first thread is leaving the cage!
    [thread 5856 also had an error]
    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 1024 bytes for cannot allocate dummy 1K block
    # An error report file with more information is saved as:
    # C:\6799919\hs_err_pid2052.log
    XXX - faking malloc failure.
    + status=1
    + echo status=1
    status=1
    + exit 1
    make: *** [test] Error 1