Type: Bug
Resolution: Incomplete
Priority: P4
Fix Version: None
Affects Version: 17
CPU: generic
OS: generic
ADDITIONAL SYSTEM INFORMATION :
Red Hat Enterprise Linux Server release 7.9 (Maipo)
Linux <clip> 3.10.0-1160.99.1.el7.x86_64 #1 SMP Thu Aug 10 10:46:21 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
# java --version
openjdk 17.0.8 2023-07-18 LTS
OpenJDK Runtime Environment Corretto-17.0.8.7.1 (build 17.0.8+7-LTS)
OpenJDK 64-Bit Server VM Corretto-17.0.8.7.1 (build 17.0.8+7-LTS, mixed mode, sharing)
A DESCRIPTION OF THE PROBLEM :
Some weeks ago, we noticed an RSS memory spike for our Java micro-services. The increase is on the scale of 570 MB -> 1.4 GB, for example. The spike is only visible in RSS memory graphs; the JVM's internal memory measurements (heap, code cache, etc.) don't show any spike at all. We are using Micrometer and Telegraf to get the graphs into Grafana via InfluxDB. We enabled NMT, but the extra memory usage is not visible in the NMT reports.
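For reference, this is roughly how NMT was enabled and queried on our side (a sketch; the exact options and the <pid> placeholder need to be adapted):
---
# JVM started with native memory tracking in summary mode
java -XX:NativeMemoryTracking=summary ... <our service>

# Queried periodically while the tests were running
jcmd <pid> VM.native_memory summary

# Baseline + diff to look for growth between two points in time
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff
---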
Using pmap on Linux (invocation sketched below, after the listing), we can see that after the memory shoots up there are multiple entries like this:
---
00007f96b4000000 65508 65508 65508 rw--- [ anon ]
00007f9774000000 65516 65516 65516 rw--- [ anon ]
00007f96a4000000 65524 65440 65440 rw--- [ anon ]
00007f969c000000 65528 65516 65516 rw--- [ anon ]
00007f96bc000000 65528 63168 63168 rw--- [ anon ]
00007f96a8000000 65532 65532 65532 rw--- [ anon ]
---
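The listing above was collected roughly like this (a sketch; the exact pmap invocation and sorting may have differed):
---
# pmap -x prints Address / Kbytes / RSS / Dirty / Mode / Mapping;
# sorting numerically by the RSS column brings the large anonymous regions to the end
pmap -x <pid> | sort -n -k3 | tail -n 20
---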
Unfortunately, we lack the expertise to determine which part of the JVM reserves those mappings.
We noticed that when the RSS memory shoots up, there is also a CPU spike of 10-30 s at the very same time. One of our services gets the shoot-up after running automated tests for ~1.5 hours, so we started dumping logs from "top -H -p <pid>" to see which thread is active during the spike. The thread turned out to be "C2 CompilerThre" (name truncated by top). We then did similar dumping using "jstack -l $PID" at a 1 s interval, which showed that the C2 CompilerThread was working for 10-30 seconds on one specific Java method of ours. When I tested disabling C2 compilation completely, the CPU spike and the RSS memory shoot-up disappeared.
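The periodic dumping was done with a simple loop along these lines (a sketch, not the exact script; $PID is the service's process id):
---
# Once per second: per-thread CPU usage (batch mode) and a full thread dump
while true; do
    date >> top-threads.log
    top -b -H -n 1 -p $PID >> top-threads.log
    date >> jstack.log
    jstack -l $PID >> jstack.log
    sleep 1
done
---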
I tried excluding that method from compilation using "-XX:CompileCommand=exclude", but this did not remove the spike. The jstack dumps showed that the C2 compiler was now stuck on another method (one called by the previously excluded method). I iteratively excluded more methods from compilation; after excluding 4 methods, the CPU spike and RSS memory shoot-up disappeared. I then tested excluding only the method from the last iteration, and that alone was enough to make the problem go away. The method that triggers the problem does not look special Java-wise, but it is closed source and references code in multiple directions. Unfortunately, I can't share it, and I'm not sure it would help before the problem is narrowed down further.
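For completeness, the options used were of this form (class and method names here are placeholders, not the real ones; the C2-disabling flag shown is one way to do it and may not be exactly what we used):
---
# Exclude a single method from JIT compilation (placeholder names)
-XX:CompileCommand=exclude,com.example.SomeService::handleRequest

# One way to disable C2 completely: stop tiered compilation at C1
-XX:TieredStopAtLevel=1
---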
I can reproduce the issue every time in our test automation labs, but at this point I'm not able to provide a simple reproducer. All of the above troubleshooting took me a month (in calendar time), as I had no earlier experience troubleshooting such an issue. To move forward I will probably need some guidance so we can pinpoint the specific problem in C2 that causes this behaviour.
I did not find any open/closed bug report with these symptoms.
(I would have included the graphs, but it does not seem to be possible to attach them. Sorry for the report being in a war-story format; I was trying to describe how the problem was found and what was observed.)
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Not available at this point. We can reproduce it in our test labs, though.
FREQUENCY : often