Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8223051

support loops with long (64b) trip counts

XMLWordPrintable

    • Icon: Enhancement Enhancement
    • Resolution: Fixed
    • Icon: P4 P4
    • 16
    • 13
    • hotspot
    • b21

      Many core loop transformations apply to counted loops, which
      are those with a calculated trip count. Transformations
      include unrolling, iteration range splitting (array RCE),
      and strip mining (JDK-8186027). The optimizer performs many
      complicated pattern matches to detect and transform counted
      loop.

      Most or all of these pattern matches and transformations
      apply to loops with 32-bit control variables and arithmetic.
      This makes sense as long as bulk operations apply only to
      Java arrays, since those arrays can only span a 31-bit index
      range. Newer APIs for larger blocks of bulk data will
      introduce 64-bit indexes, such as Panama's native arrays and
      (possibly) range-expanded byte buffers. Under the hood, the
      Unsafe API routinely works with 64-bit addresses and address
      arithmetic. Loops which work on such data structures
      naturally use 64-bit values, either as direct Java longs, or
      as wrapped cursor structure with incrementing long
      components (Panama pointers).

      There needs to be a story for transforming such long-running
      loops. This RFE is a request for that story.

      A complicating factor is that sometimes counted loops have
      no safepoints, on the assumption that the largest possible
      iteration (across 32 bits of dynamic range) won't cause the
      JVM's safepoint mechanism to malfunction due to a
      non-responsive thread stuck in such a counted loop. This
      assumption is invalid in the 64-bit case. Luckily, we have
      a (relatively new) optimization which can address this
      problem, by strip-mining a single very long running loop
      into a sequence (outer loop) of loops of with appropriately
      bounded trip counts.

      This observation leads to a suggested approach to this RFE.
      Generalize the strip-mining transform operate on
      trip-counted loops which use 64-bit arithmetic. In the
      inner loops, use a normal 32-bit counter, constrained (of
      course) to avoid overflow. This should enable all the other
      loop transforms to work on the inner loops, without them
      "knowing" that they are working on a strip-mined 64-bit
      loop.

      The strip mining transform, along with the index variable
      size reduction, would look something like this:

      ```
      void processRange(long start, long end) {
        for (long li = start; li < end; li++) {
          checkIndexLong(li, a.length());
          a.put(li, a.get(li) + 1);
        }
        ==STRIPMINE==>
      void processRange(long start, long end) {
        for (long li = start; li < end; ) {
          int stripMax = LoopStripMiningIter;
          int strip = (li + stripMax < end) ? stripMax : (int)(end - li);
          for (int si = 0; si < strip; si++) {
            long li_ = li + si;
            checkIndexLong(li_, a.length());
            a.put(li_, a.get(li_) + 1);
          }
          li += strip;
        }
      ```

      In the future, Valhalla will introduce primitive-like
      "inline types" that can wrap hardware vectors of 128 bits or
      larger. The possibility exists (in theory) of treating
      those larger types as capable of arithmetic. In that case,
      the JIT may encounter loops over other types besides Java
      longs, and wider than 64 bits. The initial version of this
      RFE need to support that, but it should be factored so that
      adding more kinds of index types (besides T_INT and T_LONG)
      will not require a total rewrite. I say "should" not "must"
      because this point is a goal not a requirement of this RFE.

      A separate part of getting the benefit from this work is
      recognizing the places where range checking occurs, so that
      iteration range splitting can be done (inside each strip of
      the outer loop). The JVM does this in a hardwired way for
      Java arrays, and also (as of JDK-8042997) on library-coded
      data structures which happen to use the intrinsic method
      `Objects.checkIndex` to check their indexes.

      For library-coded data structures with 64-bit indexes, the
      JIT can be set up to look for a 64-bit version of
      `checkIndex`. (That is TBD, but see `checkIndexLong`
      above.) Range checks on the long index can be reduced to
      range checks on the shortened inner loop index.

      Here is an example, based on the previous one, of splitting
      the inner loop into pre-, main, and post-loops:

      ```
        ==RCE==>
      void processRange(long start, long end) {
        long mainStartLI = Long.max(start, 0);
        long mainLimitLI = Long.min(end, a.length());
        for (long li = start; li < end; ) {
          int stripMax = LoopStripMiningIter;
          int strip = (li+stripMax < end) ? stripMax : (int)(end-li);
          int si = 0;
          int mainStartSI, mainLimitSI;
          if (li+strip <= mainStartLI)
            mainStartSI = mainLimitSI = strip; //or 0
          else if (li >= mainLimitLI)
            mainStartSI = mainLimitSI = 0; //or strip
          else {
            mainStartSI = (int)Long.max(mainStartLI-li, 0);
            mainLimitSI = (int)Long.min(mainLimitLI-li, strip);
          }
          for (; si < mainStartSI; si++) { //PreLoop
            long li_ = li + si;
            checkIndexLong(li_, a.length());
            a.put(li_, a.get(li_) + 1);
          }
          for (; si < mainLimitSI; si++) { //MainLoop
            long li_ = li + si;
            //RCE!// checkIndexLong(li_, a.length());
            a.put(li_, a.get(li_) + 1);
          }
          for (; si < strip; si++) { //PostLoop
            (... same as PreLoop ...)
          }
          li += strip;
      }
      ```

      Or it may be more reasonable to split the outer loop itself
      into pre/main/post copies, in which case the 32-bit versions
      of the split values are not computed.

      (Note: The gathering of constraints is a complex matter.
      See `PhaseIdealLoop::add_constraint` in C2 for more details.
      It seems it would need to be generalized to handle
      constraints on 64-bit values.)

      The request to generalize iteration range splitting over
      64-bit ranges should be broken out into a separate RFE or
      subtask, when work starts on this RFE.

            roland Roland Westrelin
            jrose John Rose
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: