JDK / JDK-8301341

LinkedTransferQueue does not respect timeout for poll()


    • Resolved In Build: b08
    • CPU: generic
    • OS: generic
    • Verification: Verified

        ADDITIONAL SYSTEM INFORMATION :
        Linux 5.15.0-58-generic #64-Ubuntu SMP x86_64 GNU/Linux
        Ubuntu 22.04.1 LTS
        openjdk 17.0.5 2022-10-18
        OpenJDK Runtime Environment (build 17.0.5+8-Ubuntu-2ubuntu122.04)
        OpenJDK 64-Bit Server VM (build 17.0.5+8-Ubuntu-2ubuntu122.04, mixed mode, sharing)

        A DESCRIPTION OF THE PROBLEM :
        There is code that uses a LinkedTransferQueue and, under certain circumstances, does not return from `poll`. The code in question is https://github.com/openhab/openhab-core/blob/main/bundles/org.openhab.core/src/main/java/org/openhab/core/common/QueueingThreadPoolExecutor.java#L193. I modified it locally to add its own measurements and logging:

        ```
        long start = System.nanoTime();
        final Runnable runnable = taskQueue.poll(2, TimeUnit.SECONDS);
        long end = System.nanoTime();
        long duration = end - start;
        if (duration > 3000000000L) {
            logger.warn("Polling the task queue in thread pool '{}' took {}ms",
                    threadPoolName, duration / 1000000L);
        }
        ```

        When it hangs, CPU use on that thread goes to 100%, and the stack trace is:
        ```
        "safeCall-queue" Id=8239 in RUNNABLE
            at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:652)
            at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
            at java.base@17.0.5/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1294)
            at org.openhab.core.common.QueueingThreadPoolExecutor$1.run(QueueingThreadPoolExecutor.java:193)
            at java.base@17.0.5/java.lang.Thread.run(Thread.java:833)
        ```

        This stack trace is consistent: no matter how many times I sample it, I have not caught it executing on a different line.

        At this point, I used another thread to inspect the queue and confirmed that, from its point of view, the queue is empty. I was unable to inspect the internal data structures of the queue via reflection, because `head` and `tail` are `volatile transient` (a sketch of what such an inspection might involve is included further below). I _am_ able to dislodge it by forcing an item to be pushed onto the queue (line 118 in that file), at which point my logs showed the following:

        ```
        2023-01-27 08:05:17.338 [WARN ] [re.common.QueueingThreadPoolExecutor] - Polling the task queue in thread pool 'safeCall' took 1377089ms
        ```

        Definitely longer than 2s! I'm not sure if the `awaitMatch` method is messing up its time-keeping, or if it's getting confused by some rare intermediate state. It's not completely deadlocked, because pushing a new element allows it to break out. And it's definitely spinning and chewing up CPU, not parking itself.
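
        As an aside on the inspection attempt mentioned above: a sketch of how one might dump the queue's internal node chain for diagnosis is below. The field and inner-class names come from the JDK 17 sources; on JDK 9+ this also assumes the JVM is started with `--add-opens java.base/java.util.concurrent=ALL-UNNAMED`, otherwise `setAccessible` will fail.

        ```
        import java.lang.reflect.Field;
        import java.util.concurrent.LinkedTransferQueue;

        // Diagnostic sketch: walk the internal node chain of a LinkedTransferQueue.
        // Requires: --add-opens java.base/java.util.concurrent=ALL-UNNAMED
        public class QueueDumper {
            static void dumpInternals(LinkedTransferQueue<?> queue) throws Exception {
                Field headField = LinkedTransferQueue.class.getDeclaredField("head");
                headField.setAccessible(true); // throws InaccessibleObjectException without --add-opens
                Object node = headField.get(queue);
                while (node != null) {
                    Field itemField = node.getClass().getDeclaredField("item");
                    Field nextField = node.getClass().getDeclaredField("next");
                    itemField.setAccessible(true);
                    nextField.setAccessible(true);
                    System.out.println("node item=" + itemField.get(node));
                    Object next = nextField.get(node);
                    node = (next == node) ? null : next; // self-linked nodes are off-list; stop to avoid looping
                }
            }
        }
        ```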

        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        Run openHAB 3.4.1 on my hardware, presumably with some dependence on my particular configuration. I can reproduce it within a few hours of restarting the process, and no less often than once per day.
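
        For anyone attempting a standalone reproduction, a minimal harness with the same general shape (timed polls on a shared LinkedTransferQueue with a sporadic producer) might look like the sketch below. It mirrors the usage pattern in QueueingThreadPoolExecutor but is not a confirmed reproducer.

        ```
        import java.util.concurrent.LinkedTransferQueue;
        import java.util.concurrent.TimeUnit;

        public class PollTimeoutHarness {
            public static void main(String[] args) throws InterruptedException {
                LinkedTransferQueue<Runnable> queue = new LinkedTransferQueue<>();

                // Producer: sporadically offers a no-op task, mimicking bursts of work.
                Thread producer = new Thread(() -> {
                    try {
                        while (true) {
                            Thread.sleep(500);
                            queue.offer(() -> { });
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                producer.setDaemon(true);
                producer.start();

                // Consumer: each poll should return within roughly 2 seconds,
                // whether or not an element was available.
                while (true) {
                    long start = System.nanoTime();
                    Runnable task = queue.poll(2, TimeUnit.SECONDS);
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                    if (elapsedMs > 3_000) {
                        System.err.println("poll() took " + elapsedMs + "ms; timeout not honoured");
                    }
                    if (task != null) {
                        task.run();
                    }
                }
            }
        }
        ```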

        EXPECTED VERSUS ACTUAL BEHAVIOR :
        EXPECTED -
        `poll` returns after 2 seconds, even if the queue is empty.
        ACTUAL -
        `poll` does not return, and CPU usage on the polling thread gets pegged at 100%.

        CUSTOMER SUBMITTED WORKAROUND :
        I'm able to notice the situation and trigger an item to be pushed onto the pertinent queue, dislodging it.
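
        A sketch of automating that workaround is below, assuming a reference to the affected task queue and some progress signal from the polling thread are available; the names here (PollWatchdog, lastPollCompletedNanos) are illustrative and not part of openHAB.

        ```
        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.TimeUnit;
        import java.util.function.LongSupplier;

        // Hypothetical watchdog: if the poller has made no progress for a while,
        // offer a no-op task so the stuck poll() matches it and returns.
        class PollWatchdog implements Runnable {
            private final BlockingQueue<Runnable> taskQueue;   // the affected queue
            private final LongSupplier lastPollCompletedNanos; // progress signal from the poller

            PollWatchdog(BlockingQueue<Runnable> taskQueue, LongSupplier lastPollCompletedNanos) {
                this.taskQueue = taskQueue;
                this.lastPollCompletedNanos = lastPollCompletedNanos;
            }

            @Override
            public void run() {
                long stalledNanos = System.nanoTime() - lastPollCompletedNanos.getAsLong();
                // The poll timeout is 2s; a much longer stall suggests the poller is stuck.
                if (stalledNanos > TimeUnit.SECONDS.toNanos(30)) {
                    taskQueue.offer(() -> { }); // dislodges the spinning awaitMatch()
                }
            }
        }
        ```

        Scheduled at a fixed rate (for example via a ScheduledExecutorService), this has the same effect as the manual intervention described above.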

        FREQUENCY : often


              Assignee: Viktor Klang (vklang)
              Reporter: Webbug Group (webbuggrp)
              Votes: 0
              Watchers: 8
