Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8186226

Improve ConcurrentSkipListMap performance on weak memory model machines

XMLWordPrintable

    • Icon: Enhancement Enhancement
    • Resolution: Fixed
    • Icon: P3 P3
    • 10
    • None
    • core-libs
    • None

      Doug Lea writes
      """
      As of JDK9 (and in part internally with JDK8), it's possible to
      improve performance of some concurrently-readable data structures
      using techniques that substantially reduce the number of volatile-mode
      reads. This has only a small impact on TSO machines (including X86)
      but can be very noticeable on weaker processors (ARM, POWER, RISCV).
      ConcurrentSkipListMap (CSLM) is the best target in j.u.c for putting
      these techniques into place. Even though CSLM is rarely as fast as
      ConcurrentHashMap for insertion and removal, it should be among the
      fastest concurrent data structures for traversal etc (plus, sometimes
      you need things to be sorted, so need CSLM anyway). And this
      generally seems to hold after reworking CSLM: on some POWER8 and ARMv8
      processors tested (thanks to Paul McKenney/OSUOSL and Andrew
      Haley/RedHat) performance improvements seem to range from a few
      percent to a factor of six, depending on operation load mix, key
      types, size, etc. Performance on X86 is also a little better. The
      main interaction is with cache misses -- large and/or heavily
      contended maps encounter a lot of them, limiting speedups. A
      paste of some of the internal documentation below describes the
      man ideas (which as usual took much longer to figure out how to put
      into place than I had first guessed).

      While revisiting CSLM, a couple of other improvements were made. When
      it was first introduced, we didn't have the contention-avoiding
      LongAdder to track size, so lived with very slow O(n) estimation.
      Even though the javadocs tell people not to call size(), some still
      do/did. There's now no reason to do this. Also, the indexing
      originally maxed out at levels corresponding to a billion elements
      (slowing down after that), which is now up to essentially unbounded
      (Long.MAX_VALUE) elements. Plus other minor things, with possibly
      a few more to come. I haven't completely given up on ways to reduce
      floating-garbage issues common to all concurrent linked structures.
      And there are several opportunities to use and discover more related
      techniques for other j.u.c classes.

      ... internal documentation excerpts ...

           * This class provides concurrent-reader-style memory consistency,
           * ensuring that read-only methods report status and/or values no
           * staler than those holding at method entry. This is done by
           * performing all publication and structural updates using
           * (volatile) CAS, placing an acquireFence in a few access
           * methods, and ensuring that linked objects are transitively
           * acquired via dependent reads (normally once) unless performing
           * a volatile-mode CAS operation (that also acts as an acquire and
           * release). This form of fence-hoisting is similar to RCU
           * and related techniques (see McKenney's online book
           *
      https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html)
           * It minimizes overhead that may otherwise occur when using so
           * many volatile-mode reads. Using explicit acquireFences is
           * logistically easier than targeting particular fields to be read
           * in acquire mode: fences are just hoisted up as far as possible,
           * to the entry points or loop headers of a few methods. A
           * potential disadvantage is that these few remaining fences are
           * not easily optimized away by compilers under exclusively
           * single-thread use. It requires some care avoid volatile mode
           * reads of other fields. (Note that the memory semantics of a
           * reference dependently read in plain mode exactly once are
           * equivalent to those for atomic opaque mode.) Iterators and
           * other traversals encounter each node and value exactly once.
           * Other operations locate an element (or position to insert an
           * element) via a sequence of dereferences. This search is broken
           * into two parts. Method findPredecessor (and its specialized
           * embeddings) searches index nodes only, returning a base-level
           * predecessor of the key. Callers carry out the base-level
           * search, restarting if encountering a marker preventing link
           * modification. In some cases, it is possible to encounter a
           * node multiple times while descending levels. For mutative
           * operations, the reported value is validated using CAS (else
           * retrying), preserving linearizability with respect to each
           * other. Others may return any (non-null) value holding in the
           * course of the method call.
      """

            martin Martin Buchholz
            martin Martin Buchholz
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: