Type: Bug
Resolution: Unresolved
Priority: P3
Affects Version/s: 17.0.1, 17.0.3
Reported by Paul Wagland <paul@kungfoocoder.org> on jdk-updates-dev:
Hi all,
We have been running into a problem with Java 17.0.1, 17.0.2, and now 17.0.3.
# java --version
openjdk 17.0.3 2022-04-19
OpenJDK Runtime Environment Temurin-17.0.3+7 (build 17.0.3+7)
OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (build 17.0.3+7, mixed mode, sharing)
The issue that we are seeing is very similar in behaviour to https://bugs.openjdk.org/browse/JDK-8277212, with a couple of notable exceptions:
1. We are seeing this issue in 17.0.3, whereas JDK-8277212 was resolved in either 17.0.2 or 17.0.3-oracle.
2. We are using G1GC, not ZGC
Unfortunately, we don’t know how to reproduce this issue; we are seeing it happen about once a week across our fleet. We have about 300 instances of Java running various different workloads.
What happens is that on a machine where this occurs, after a period of time we see CPU usage suddenly spike, and the gc logs suddenly start getting flooded with ICBufferFull messages, at a rate of about 430 per second. When taking thread dumps to see what is happening at that time, we find most threads end up very deep in a regex processing stack; unfortunately we have one of these fairly central to the product, so this ends up affecting most threads. It seems that almost everything that isn’t regex parsing operates at, or near to, full speed. This is one of the reasons that this feels similar to JDK-8277212, since that was mashing the vtables, and because of the way that regexes work, they would appear to use a lot of vtable lookups, so any slowdown in that area is likely to affect them greatly. Note that this is pure supposition, and I would love to know if there is any way we can prove or disprove this, either via testing or additional logging.
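One way to test the vtable/inline-cache theory directly would be a tiny probe that times a deliberately megamorphic virtual call site; if its ns/call degrades in step with the regex paths during an episode, that points at dispatch rather than at the regex engine itself. The sketch below is purely illustrative; the interface and class names are made up, not taken from the product.

// Hypothetical probe: times a deliberately megamorphic virtual call site.
// All names (Shape, Circle, ...) are illustrative placeholders.
public class MegamorphicProbe {
    interface Shape { double area(); }
    static final class Circle   implements Shape { public double area() { return 3.0; } }
    static final class Square   implements Shape { public double area() { return 4.0; } }
    static final class Triangle implements Shape { public double area() { return 5.0; } }
    static final class Hexagon  implements Shape { public double area() { return 6.0; } }

    public static void main(String[] args) {
        // Four receiver types at a single call site keeps it megamorphic
        // (beyond the bimorphic inline-cache limit), forcing vtable dispatch.
        Shape[] shapes = { new Circle(), new Square(), new Triangle(), new Hexagon() };
        double sink = 0;
        for (int warm = 0; warm < 5; warm++) sink += run(shapes, 1_000_000);  // warm up the JIT
        int calls = 10_000_000;
        long t0 = System.nanoTime();
        sink += run(shapes, calls);
        long elapsed = System.nanoTime() - t0;
        System.out.printf("megamorphic call: %dns total, %.1fns/call (sink=%f)%n",
                elapsed, (double) elapsed / calls, sink);
    }

    static double run(Shape[] shapes, int calls) {
        double sum = 0;
        for (int i = 0; i < calls; i++) {
            sum += shapes[i & 3].area();   // megamorphic virtual dispatch
        }
        return sum;
    }
}

Running something like this periodically and logging the output alongside the gc log would show whether the degradation is confined to the regex code or affects megamorphic dispatch in general.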
Unfortunately, as mentioned above, we are not able to reproduce this in our test environments, and we are only seeing this in customer environments, and then not with any clearly identifiable cause. We have seen the issue about a dozen times, over 17.0.1, 17.0.2, and recently, 17.0.3.
We have done some performance testing, and during the slow-down we get, for our regex processing test:
10 iterations of testString 1 took 10700ms (1070004460ns/match) (matched=false)
10 iterations of testString 2 took 8965ms (896560879ns/match) (matched=true)
Normal timings look like:
18503 iterations of testString 1 took 1000ms (54047ns/match) (matched=false)
40954 iterations of testString 2 took 1000ms (24417ns/match) (matched=true)
So that represents a slowdown of roughly 20,000 times!
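For reference, the kind of fixed-time harness that produces the numbers above can be as small as the following sketch; the pattern and input string are placeholders, not the production regex.

import java.util.regex.Pattern;

// Hypothetical harness: runs a regex for a fixed wall-clock budget and reports
// iterations and ns/match in the format quoted above. Pattern and input are
// placeholders, not the production regex.
public class RegexTimingProbe {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("ab*c");      // placeholder pattern
        String testString = "abbbbbc";                  // placeholder input
        long budgetNanos = 1_000_000_000L;              // run for roughly 1 second
        long start = System.nanoTime();
        long iterations = 0;
        boolean matched = false;
        while (System.nanoTime() - start < budgetNanos) {
            matched = pattern.matcher(testString).matches();
            iterations++;
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d iterations of testString 1 took %dms (%dns/match) (matched=%b)%n",
                iterations, elapsed / 1_000_000, elapsed / iterations, matched);
    }
}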
We have another test that is mostly map-lookup heavy; during the high-CPU episodes it shows:
doMapOperations:2:31076ns
doMapOperations:3:16951ns
doMapOperations:4:14192ns
While normally it has:
doMapOperations:2:27392ns
doMapOperations:3:7987ns
doMapOperations:4:6079ns
So a slowdown of maybe a factor of 2, which is perfectly explainable by the very high CPU/system load.
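The map-heavy check is similar in spirit; a minimal sketch (the key set, sizes, and round numbering are placeholders) might look like:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the map-lookup check: times a batch of HashMap gets
// and prints ns per batch, in the spirit of the doMapOperations figures above.
public class MapTimingProbe {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put("key-" + i, i);                 // placeholder keys
        }
        for (int round = 2; round <= 4; round++) {  // rounds 2..4, as in the log above
            long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < 1_000; i++) {
                sum += map.get("key-" + (i * 7 % 10_000));
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("doMapOperations:%d:%dns (sum=%d)%n", round, elapsed, sum);
        }
    }
}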
Our JVM startup arguments are:
-javaagent:/opt/jmx-exporter/jmx-exporter.jar=9404:/opt/jmx-exporter/config.yaml
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
-Dnop
-Djava.security.egd=file:/dev/./urandom
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/…/logs
-Djava.awt.headless=true
-XX:+PrintFlagsFinal
-DRP_CONTEXTPATH=
-DRP_UNPACKWAR=true
-XX:FlightRecorderOptions=stackdepth=256
-XX:+UnlockDiagnosticVMOptions
-XX:+DebugNonSafepoints
-Xlog:gc*,safepoint=info:file=/opt/…/logs/verbosegc.log:time,uptimemillis:filecount=10,filesize=50M
-Xlog:class*=info:file=/opt/…/logs/classload.log:time,uptimemillis,level,tags:filecount=10,filesize=50M
-XX:+UseG1GC
-XX:G1PeriodicGCInterval=10000
-XX:+ParallelRefProcEnabled
-XX:MaxDirectMemorySize=25M
-XX:MaxMetaspaceSize=300M
-XX:MetaspaceSize=100M
-Xss1024K
-XX:ReservedCodeCacheSize=150M
-XX:+SegmentedCodeCache
-Xmx862m
-XX:MaxMetaspaceFreeRatio=20
-XX:MinMetaspaceFreeRatio=5
-XX:MaxHeapFreeRatio=40
-XX:MinHeapFreeRatio=5
-Djdk.nio.maxCachedBufferSize=8192
-XX:SoftMaxHeapSize=646m
-Dlog4j2.disableJmx=true
-Djdk.tls.ephemeralDHKeySize=2048
-Djava.protocol.handler.pkgs=org.apache.catalina.webresources
-Dorg.apache.catalina.security.SecurityListener.UMASK=0027
-Dignore.endorsed.dirs=
-Dcatalina.base=/opt/…/server1
-Dcatalina.home=/opt/…/tomcat
-Djava.io.tmpdir=/opt/…/temp
Can anyone think of anything that we can do to ameliorate this problem, apart from not using regexes? We are moving towards that anyway, but I think the regexes are the victim here, not the cause of the problem, since in normal use they are “fast enough”.
If there is any information that I can add, please let me know!
Cheers,
Paul
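One way to gather more evidence during an episode, without changing the VM flags above, is to sample the standard compilation and class-loading MXBeans from inside the affected JVM and line the samples up with the ICBufferFull bursts in verbosegc.log. Below is a minimal sketch, assuming a small helper can be started from application code; the class and method names are illustrative.

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Hedged diagnostic sketch: a daemon thread, started inside the affected JVM,
// that periodically samples total JIT compilation time and class counts so
// bursts of recompilation or class churn can be correlated with ICBufferFull
// episodes in the logs. Uses only standard java.lang.management APIs.
public final class CompilationSampler {
    public static void start(long intervalMillis) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        Thread t = new Thread(() -> {
            long lastCompileMs = jit.getTotalCompilationTime();
            while (true) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    return;
                }
                long compileMs = jit.getTotalCompilationTime();
                System.out.printf("jit compile time +%dms, loaded classes=%d, unloaded=%d%n",
                        compileMs - lastCompileMs,
                        classes.getLoadedClassCount(),
                        classes.getUnloadedClassCount());
                lastCompileMs = compileMs;
            }
        }, "compilation-sampler");
        t.setDaemon(true);
        t.start();
    }
}

Calling CompilationSampler.start(10_000) at application startup would print one line every ten seconds that can be compared against the timestamps in the existing gc and class-load logs.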
relates to: JDK-8277212 GC accidentally cleans valid megamorphic vtable inline caches (Closed)