JDK-8291596: Massive slowdown in regex processing while GC log fills with ICBufferFull messages

      Reported by Paul Wagland <paul@kungfoocoder.org> on jdk-updates-dev:

      Hi all,

      We have been running into a problem with Java 17.0.1, 17.0.2, and now 17.0.3.

      # java --version
      openjdk 17.0.3 2022-04-19
      OpenJDK Runtime Environment Temurin-17.0.3+7 (build 17.0.3+7)
      OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (build 17.0.3+7, mixed mode, sharing)


      The issue that we are seeing is very similar in behaviour to https://bugs.openjdk.org/browse/JDK-8277212, with a couple of notable exceptions:
      1. We are seeing this issue in 17.0.3, whereas JDK-8277212 was resolved in either 17.0.2 or 17.0.3-oracle.
      2. We are using G1GC, not ZGC

      Unfortunately, we don’t know how to reproduce this issue; we see it happen about once a week across our fleet. We have about 300 instances of Java running various different workloads.

      What happens is that on an affected machine, after a period of time, CPU usage suddenly spikes and the GC logs start getting flooded with ICBufferFull messages at a rate of about 430 per second. When we take thread dumps at that time, we find most threads very deep in a regex-processing stack; unfortunately we have one such regex fairly central to the product, so this ends up affecting most threads. Almost everything that isn’t regex processing seems to operate at, or near to, full speed. This is one of the reasons this feels similar to JDK-8277212: that bug was mashing the VTABLES, and because of the way regexes work they appear to use a lot of VTABLE lookups, so any slowdown in that area is likely to affect them greatly. Note that this is pure supposition, and I would love to know if there is any way we can prove or disprove it, either via testing or via additional logging.
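
      For context on the VTABLE supposition: OpenJDK's java.util.regex.Pattern compiles a regex into a chain of internal node objects, and matching walks that chain through virtual calls, so its hot call sites are resolved via HotSpot inline caches and vtable/itable stubs (ICBufferFull safepoints are requested when the inline-cache buffer fills and must be flushed). Purely as an illustration of that call shape, and not product code, a minimal megamorphic call site looks like this:

      // Illustrative sketch only: a megamorphic call site, similar in shape to the
      // dispatch java.util.regex.Pattern performs across its internal Node subclasses.
      // HotSpot resolves such calls through inline caches, falling back to
      // vtable/itable stubs once several receiver types have been seen.
      interface Step {
          boolean apply(int i);
      }

      public class MegamorphicSketch {
          static final Step[] STEPS = {
              i -> i % 2 == 0,   // each lambda is a distinct class, so the call
              i -> i % 3 == 0,   // site below observes many receiver types
              i -> i % 5 == 0,
              i -> i > 0
          };

          public static void main(String[] args) {
              long hits = 0;
              for (int i = 0; i < 10_000_000; i++) {
                  // Megamorphic call: the JIT cannot devirtualize this to a single
                  // target, so it stays on the inline-cache / stub dispatch path.
                  if (STEPS[i & 3].apply(i)) {
                      hits++;
                  }
              }
              System.out.println("hits=" + hits);
          }
      }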

      Unfortunately, as mentioned above, we are not able to reproduce this in our test environments; we only see it in customer environments, and then without any clearly identifiable cause. We have seen the issue about a dozen times across 17.0.1, 17.0.2 and, most recently, 17.0.3.

      We have done some performance testing; during the slowdown our regex processing test reports:

      10 iterations of testString 1 took 10700ms (1070004460ns/match) (matched=false)
      10 iterations of testString 2 took 8965ms (896560879ns/match) (matched=true)

      Normal timings look like:
      18503 iterations of testString 1 took 1000ms (54047ns/match) (matched=false)
      40954 iterations of testString 2 took 1000ms (24417ns/match) (matched=true)

      So that represents a slowdown of roughly 20,000 times!
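
      The test code itself is not part of the report; a rough sketch of a timing loop that would produce output in the format quoted above might look like the following (the pattern, the test strings and the one-second budget are assumptions, not the real product regex):

      import java.util.regex.Pattern;

      // Hypothetical reconstruction of the regex timing test; only the output
      // format mirrors the figures quoted above, everything else is a placeholder.
      public class RegexTimingSketch {
          private static final Pattern PATTERN =
                  Pattern.compile("(\\w+)@(\\w+)\\.com");   // stand-in for the product regex

          static void time(String label, String input) {
              long budgetNs = 1_000_000_000L;   // run each string for roughly one second
              long start = System.nanoTime();
              long iterations = 0;
              boolean matched = false;
              while (System.nanoTime() - start < budgetNs) {
                  matched = PATTERN.matcher(input).find();
                  iterations++;
              }
              long elapsedNs = System.nanoTime() - start;
              System.out.printf("%d iterations of %s took %dms (%dns/match) (matched=%b)%n",
                      iterations, label, elapsedNs / 1_000_000, elapsedNs / iterations, matched);
          }

          public static void main(String[] args) {
              time("testString 1", "no address in this input");   // expected: matched=false
              time("testString 2", "someone@example.com");        // expected: matched=true
          }
      }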

      We have another test that is mostly map-lookup heavy; during the high CPU it reports:
      doMapOperations:2:31076ns
      doMapOperations:3:16951ns
      doMapOperations:4:14192ns

      While normally it has:
      doMapOperations:2:27392ns
      doMapOperations:3:7987ns
      doMapOperations:4:6079ns

      So a slowdown of maybe a factor of 2, which is perfectly explainable by the very high CPU/system load.
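
      Again, the test itself is not included here; a comparable sketch, with hypothetical map contents and lookup counts and only the doMapOperations:<step>:<ns> output format taken from the lines above, could look like this:

      import java.util.HashMap;
      import java.util.Map;

      // Hypothetical reconstruction of the map-lookup timing test; keys and lookup
      // counts are placeholders, only the "doMapOperations:<step>:<ns>" output format
      // follows the report.
      public class MapTimingSketch {
          private static final Map<String, String> MAP = new HashMap<>();
          static {
              for (int i = 0; i < 10_000; i++) {
                  MAP.put("key-" + i, "value-" + i);
              }
          }

          static void doMapOperations() {
              long hits = 0;   // keep a side effect so the JIT cannot drop the lookups
              for (int step = 2; step <= 4; step++) {
                  long start = System.nanoTime();
                  for (int i = 0; i < 100; i++) {
                      if (MAP.containsKey("key-" + (i * step))) {
                          hits++;
                      }
                  }
                  long elapsedNs = System.nanoTime() - start;
                  System.out.println("doMapOperations:" + step + ":" + elapsedNs + "ns");
              }
              System.out.println("lookups that hit: " + hits);
          }

          public static void main(String[] args) {
              doMapOperations();
          }
      }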

      Our JVM startup arguments are:
      -javaagent:/opt/jmx-exporter/jmx-exporter.jar=9404:/opt/jmx-exporter/config.yaml
      --add-opens=java.base/java.lang=ALL-UNNAMED
      --add-opens=java.base/java.io=ALL-UNNAMED
      --add-opens=java.base/java.util=ALL-UNNAMED
      --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
      --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
      -Dnop
      -Djava.security.egd=file:/dev/./urandom
      -XX:+HeapDumpOnOutOfMemoryError
      -XX:HeapDumpPath=/opt/…/logs
      -Djava.awt.headless=true
      -XX:+PrintFlagsFinal
      -DRP_CONTEXTPATH=
      -DRP_UNPACKWAR=true
      -XX:FlightRecorderOptions=stackdepth=256
      -XX:+UnlockDiagnosticVMOptions
      -XX:+DebugNonSafepoints
      -Xlog:gc*,safepoint=info:file=/opt/…/logs/verbosegc.log:time,uptimemillis:filecount=10,filesize=50M
      -Xlog:class*=info:file=/opt/…/logs/classload.log:time,uptimemillis,level,tags:filecount=10,filesize=50M
      -XX:+UseG1GC
      -XX:G1PeriodicGCInterval=10000
      -XX:+ParallelRefProcEnabled
      -XX:MaxDirectMemorySize=25M
      -XX:MaxMetaspaceSize=300M
      -XX:MetaspaceSize=100M
      -Xss1024K
      -XX:ReservedCodeCacheSize=150M
      -XX:+SegmentedCodeCache
      -Xmx862m
      -XX:MaxMetaspaceFreeRatio=20
      -XX:MinMetaspaceFreeRatio=5
      -XX:MaxHeapFreeRatio=40
      -XX:MinHeapFreeRatio=5
      -Djdk.nio.maxCachedBufferSize=8192
      -XX:SoftMaxHeapSize=646m
      -Dlog4j2.disableJmx=true
      -Djdk.tls.ephemeralDHKeySize=2048
      -Djava.protocol.handler.pkgs=org.apache.catalina.webresources
      -Dorg.apache.catalina.security.SecurityListener.UMASK=0027
      -Dignore.endorsed.dirs=
      -Dcatalina.base=/opt/…/server1
      -Dcatalina.home=/opt/…/tomcat
      -Djava.io.tmpdir=/opt/…/temp

      Can anyone think of anything we can do to ameliorate this problem, apart from not using regexes? We are already moving in that direction, but I think the regexes are the victim here rather than the cause, since in normal use they are “fast enough”.

      If there is any information that I can add, please let me know!

      Cheers,
      Paul

            vlivanov Vladimir Ivanov
            dholmes David Holmes