-
Enhancement
-
Resolution: Won't Fix
-
P4
-
None
-
11, 17, 18, 19
Summary: ProcessHandle/OnExitTest.java depends on the ability of the OS to clean up zombies. On misconfigured systems that fail to clean up zombies, it may fail intermittently.
This mainly affects misconfigured Docker instances. Bare-metal systems are usually fine (init or systemd will clean up). Normal Docker images are usually fine too, since the standard convention is to use a Bourne shell as ENTRYPOINT (PID 1), which can function as a reaper.
But when running in a misconfigured Docker container whose ENTRYPOINT is a binary that cannot reap child processes, zombies will not be reaped and will confuse ProcessHandle/OnExitTest.java. A prominent example is Jenkins CI, which runs Docker by starting the container with its ENTRYPOINT overridden with "cat" (no arguments). This is a bone-headed [1] trick [2] to keep the Docker container alive, but running "cat" as PID 1 effectively disables the reaping of orphaned child processes [3]. Thus, when running jtreg tests in a Docker container inside Jenkins CI, ProcessHandle/OnExitTest.java may fail. Note that this is timing-dependent; see the reproduction notes below.
---------------
Test details:
ProcessHandle/OnExitTest.java spawns a process tree consisting of three layers:
```
test -> P -> Pa -> Pa1
                -> Pa2
          -> Pb -> Pb1
                -> Pb2
          -> Pc -> Pc1
                -> Pc2
```
Each child is connected via a pipe to its parent and listens to it. When that pipe breaks, the child breaks the pipes to its own children, waits for them, then terminates.
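To inspect such a tree while the test is running, one can walk it with the public `ProcessHandle` API. The following is a minimal sketch and is not part of the test sources; the class name `ProcessTree` and the output format are made up for illustration.

```java
public class ProcessTree {

    /** Renders the process tree below (and including) root, one process per line. */
    static String render(ProcessHandle root) {
        StringBuilder sb = new StringBuilder();
        render(root, 0, sb);
        return sb.toString();
    }

    private static void render(ProcessHandle h, int depth, StringBuilder sb) {
        sb.append("  ".repeat(depth))          // two spaces of indentation per layer
          .append(h.pid())
          .append(' ')
          .append(h.info().command().orElse("?"))
          .append(h.isAlive() ? "" : " (exited)")
          .append('\n');
        // children() returns a snapshot of the direct children only
        h.children().forEach(c -> render(c, depth + 1, sb));
    }

    public static void main(String[] args) {
        // Starting at the current JVM; a test would start at the handle of P.
        System.out.print(render(ProcessHandle.current()));
    }
}
```

Because `children()` is only a snapshot, processes that start or exit during the walk may be missing or stale in the output.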
The test calls `Process.destroy()` on the root of the process tree, `P`. `P` receives a SIGTERM and shuts down. In doing so it breaks the pipes to its immediate children `Pa`, `Pb`, and `Pc`, then terminates. This also terminates the Reaper daemon threads in `P` that wait on its immediate children `Pa`, `Pb`, and `Pc`. Those child processes, in turn, break out of their listen loop, break the pipes to their children, wait on their children, then terminate themselves.
What happens depends on timing: If `Pa`, `Pb` and `Pc` are slower to terminate than their respective reaper threads in parent `P`, they will not be reaped by `P`. Instead, they become orphaned, and what then happens depends on the system. If child process adoption works, then someone will adopt and reap the orphans. If not, they zombify.
If the child processes zombify, the test will wait for them forever, since `ProcessHandle.isAlive()` returns true for zombified processes. Therefore, the test hangs and runs into a timeout.
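The hang can be turned into a diagnosable failure by bounding the wait. The sketch below is not how OnExitTest is written; it only illustrates waiting on `ProcessHandle.onExit()` futures with a deadline and reporting which ones never completed. The class name `BoundedWait` and the helper `pendingAfter` are hypothetical.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWait {

    /**
     * Waits up to timeout for all futures to complete and returns the keys
     * whose futures are still pending afterwards.
     */
    static <K> List<K> pendingAfter(Map<K, CompletableFuture<?>> exits, Duration timeout) {
        CompletableFuture<Void> all =
                CompletableFuture.allOf(exits.values().toArray(new CompletableFuture<?>[0]));
        try {
            all.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // expected when something is stuck; fall through and report it
        } catch (ExecutionException e) {
            // a future failed; it still counts as done below
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<K> pending = new ArrayList<>();
        exits.forEach((key, future) -> {
            if (!future.isDone()) {
                pending.add(key);
            }
        });
        return pending;
    }

    public static void main(String[] args) {
        // Collect exit futures for all descendants of the current JVM.
        Map<ProcessHandle, CompletableFuture<?>> exits = new LinkedHashMap<>();
        ProcessHandle.current().descendants().forEach(h -> exits.put(h, h.onExit()));

        for (ProcessHandle stuck : pendingAfter(exits, Duration.ofSeconds(5))) {
            System.out.println("still reported alive: pid " + stuck.pid());
        }
    }
}
```

A zombified descendant would show up in the "still reported alive" list, since its `onExit()` future does not complete while `isAlive()` keeps returning true.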
---------------
Reproduction: This can be easily reproduced with Docker:
1) Since this is timing-dependent, introduce a delay in jdk/java/lang/ProcessHandle/JavaChild.java, in the listener loop "stdin" case, after breaking from the input loop and before exiting:
```
--- a/test/jdk/java/lang/ProcessHandle/JavaChild.java
+++ b/test/jdk/java/lang/ProcessHandle/JavaChild.java
@@ -360,6 +360,7 @@ private static volatile int commandSeq = 0; // Command sequence number
}
}
}
+ Thread.sleep(10000);
sendResult(action, "done");
return; // normal exit from JavaChild Process
case "parent":
```
2) Start a standard ubuntu container, but as entrypoint give it cat, mimicking the wrongheaded behavior of Jenkins CI [2][3]:
```
thomas@starfish $ docker run -td --name u1 -v /shared:/shared ubuntu:20.04 cat
51cf74b1a0df55372e92db888d13f9d429dac175b1ef80803025a9f21f0e15ee
```
(in my case, /shared is a shared host volume containing OpenJDK sources, binaries, and jtreg)
3) Let's see inside:
```
thomas@starfish $ docker exec u1 ps -Alf
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1 0 0 80 0 - 665 wait_w 08:46 pts/0 00:00:00 cat
4 R root 7 0 0 80 0 - 1475 - 08:47 ? 00:00:00 ps -Alf
```
As we can see, PID 1 is "cat" - unsurprisingly not able to reap any processes.
4) Start jtreg inside the container:
```
thomas@starfish $ docker exec u1 sh -c 'JT_JAVA=/shared/projects/openjdk/jdks/sapmachine11/ /shared/projects/openjdk/jtreg-prebuilt/jtreg/bin/jtreg -jdk:/shared/projects/openjdk/jdk-jdk/output-fastdebug/images/jdk /shared/projects/openjdk/jdk-jdk/source/test/jdk/java/lang/ProcessHandle/OnExitTest.java'
Directory "JTwork" not found: creating
Directory "JTreport" not found: creating
Test results: error: 1
Report written to /JTreport/html/report.html
Results written to /JTwork
Error: Some tests failed or other problems occurred.
```
As we can see, the test fails. The .jtr file will report that the test timed out.
5) Let's look inside again:
```
thomas@starfish $ docker exec u1 ps -Alf
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1 0 0 80 0 - 665 wait_w 08:46 pts/0 00:00:00 cat
4 Z root 549 1 0 80 0 - 0 - 08:48 ? 00:00:00 [java] <defunct>
4 Z root 558 1 0 80 0 - 0 - 08:48 ? 00:00:00 [java] <defunct>
4 Z root 561 1 0 80 0 - 0 - 08:48 ? 00:00:00 [java] <defunct>
4 R root 825 0 0 80 0 - 1475 - 08:53 ? 00:00:00 ps -Alf
```
As we can see, the test left zombies behind. These fool the "isAlive" check that the JDK uses to determine whether the child processes are still alive, and thereby cause the test to time out.
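The `Z` state shown by `ps` can also be read programmatically on Linux from `/proc/<pid>/stat`, whose third field is the process state. The following sketch is Linux-specific and is not what the JDK or the test does; the class `ZombieCheck` and its methods are hypothetical helpers for telling a zombie apart from a process that is genuinely still running.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class ZombieCheck {

    /**
     * Extracts the state character from the contents of /proc/<pid>/stat,
     * which has the form "pid (comm) state ppid ...". The command name may
     * itself contain spaces or parentheses, so we search for the last ')'.
     */
    static char parseState(String statLine) {
        int close = statLine.lastIndexOf(')');
        if (close < 0 || close + 2 >= statLine.length()) {
            throw new IllegalArgumentException("unexpected stat format: " + statLine);
        }
        return statLine.charAt(close + 2); // skip ") "
    }

    /** Returns the state of the given pid, or empty if it cannot be read. */
    static Optional<Character> stateOf(long pid) {
        Path stat = Path.of("/proc", Long.toString(pid), "stat");
        try {
            return Optional.of(parseState(Files.readString(stat)));
        } catch (IOException | IllegalArgumentException e) {
            return Optional.empty(); // no /proc, process gone, or unreadable
        }
    }

    /** True if the handle is reported alive but the kernel marks it as a zombie. */
    static boolean isZombie(ProcessHandle handle) {
        return handle.isAlive()
                && stateOf(handle.pid()).map(s -> s == 'Z').orElse(false);
    }

    public static void main(String[] args) {
        ProcessHandle.current().descendants()
                .filter(ZombieCheck::isZombie)
                .forEach(h -> System.out.println("zombie: pid " + h.pid()));
    }
}
```

On systems without `/proc`, `stateOf` returns empty and `isZombie` reports false, so the check degrades to "unknown" rather than failing.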
[1] https://issues.jenkins.io/browse/JENKINS-39748
[2] https://stackoverflow.com/questions/55369726/jenkins-docker-container-always-adds-cat-command
[3] https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
relates to: JDK-8284874 Add comment to ProcessHandle/OnExitTest to describe zombie problem (Resolved)