Loading...

XML

Word

Printable

Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version/s: 16
Affects Version/s: 13
Component/s: hotspot
Labels:
- c2
- c2-loopopts

Subcomponent:
compiler
Resolved In Build:
b21

Many core loop transformations apply to counted loops, which
are those with a calculated trip count. Transformations
include unrolling, iteration range splitting (array RCE),
and strip mining (~~JDK-8186027~~). The optimizer performs many
complicated pattern matches to detect and transform counted
loop.

Most or all of these pattern matches and transformations
apply to loops with 32-bit control variables and arithmetic.
This makes sense as long as bulk operations apply only to
Java arrays, since those arrays can only span a 31-bit index
range. Newer APIs for larger blocks of bulk data will
introduce 64-bit indexes, such as Panama's native arrays and
(possibly) range-expanded byte buffers. Under the hood, the
Unsafe API routinely works with 64-bit addresses and address
arithmetic. Loops which work on such data structures
naturally use 64-bit values, either as direct Java longs, or
as wrapped cursor structure with incrementing long
components (Panama pointers).

There needs to be a story for transforming such long-running
loops. This RFE is a request for that story.

A complicating factor is that sometimes counted loops have
no safepoints, on the assumption that the largest possible
iteration (across 32 bits of dynamic range) won't cause the
JVM's safepoint mechanism to malfunction due to a
non-responsive thread stuck in such a counted loop. This
assumption is invalid in the 64-bit case. Luckily, we have
a (relatively new) optimization which can address this
problem, by strip-mining a single very long running loop
into a sequence (outer loop) of loops of with appropriately
bounded trip counts.

This observation leads to a suggested approach to this RFE.
Generalize the strip-mining transform operate on
trip-counted loops which use 64-bit arithmetic. In the
inner loops, use a normal 32-bit counter, constrained (of
course) to avoid overflow. This should enable all the other
loop transforms to work on the inner loops, without them
"knowing" that they are working on a strip-mined 64-bit
loop.

The strip mining transform, along with the index variable
size reduction, would look something like this:

```
void processRange(long start, long end) {
  for (long li = start; li < end; li++) {
    checkIndexLong(li, a.length());
    a.put(li, a.get(li) + 1);
  }
  ==STRIPMINE==>
void processRange(long start, long end) {
  for (long li = start; li < end; ) {
    int stripMax = LoopStripMiningIter;
    int strip = (li + stripMax < end) ? stripMax : (int)(end - li);
    for (int si = 0; si < strip; si++) {
      long li_ = li + si;
      checkIndexLong(li_, a.length());
      a.put(li_, a.get(li_) + 1);
    }
    li += strip;
  }
```

In the future, Valhalla will introduce primitive-like
"inline types" that can wrap hardware vectors of 128 bits or
larger. The possibility exists (in theory) of treating
those larger types as capable of arithmetic. In that case,
the JIT may encounter loops over other types besides Java
longs, and wider than 64 bits. The initial version of this
RFE need to support that, but it should be factored so that
adding more kinds of index types (besides T_INT and T_LONG)
will not require a total rewrite. I say "should" not "must"
because this point is a goal not a requirement of this RFE.

A separate part of getting the benefit from this work is
recognizing the places where range checking occurs, so that
iteration range splitting can be done (inside each strip of
the outer loop). The JVM does this in a hardwired way for
Java arrays, and also (as of ~~JDK-8042997~~) on library-coded
data structures which happen to use the intrinsic method
`Objects.checkIndex` to check their indexes.

For library-coded data structures with 64-bit indexes, the
JIT can be set up to look for a 64-bit version of
`checkIndex`. (That is TBD, but see `checkIndexLong`
above.) Range checks on the long index can be reduced to
range checks on the shortened inner loop index.

Here is an example, based on the previous one, of splitting
the inner loop into pre-, main, and post-loops:

```
  ==RCE==>
void processRange(long start, long end) {
  long mainStartLI = Long.max(start, 0);
  long mainLimitLI = Long.min(end, a.length());
  for (long li = start; li < end; ) {
    int stripMax = LoopStripMiningIter;
    int strip = (li+stripMax < end) ? stripMax : (int)(end-li);
    int si = 0;
    int mainStartSI, mainLimitSI;
    if (li+strip <= mainStartLI)
      mainStartSI = mainLimitSI = strip; //or 0
    else if (li >= mainLimitLI)
      mainStartSI = mainLimitSI = 0; //or strip
    else {
      mainStartSI = (int)Long.max(mainStartLI-li, 0);
      mainLimitSI = (int)Long.min(mainLimitLI-li, strip);
    }
    for (; si < mainStartSI; si++) { //PreLoop
      long li_ = li + si;
      checkIndexLong(li_, a.length());
      a.put(li_, a.get(li_) + 1);
    }
    for (; si < mainLimitSI; si++) { //MainLoop
      long li_ = li + si;
      //RCE!// checkIndexLong(li_, a.length());
      a.put(li_, a.get(li_) + 1);
    }
    for (; si < strip; si++) { //PostLoop
      (... same as PreLoop ...)
    }
    li += strip;
}
```

Or it may be more reasonable to split the outer loop itself
into pre/main/post copies, in which case the 32-bit versions
of the split values are not computed.

(Note: The gathering of constraints is a complex matter.
See `PhaseIdealLoop::add_constraint` in C2 for more details.
It seems it would need to be generalized to handle
constraints on 64-bit values.)

The request to generalize iteration range splitting over
64-bit ranges should be broken out into a separate RFE or
subtask, when work starts on this RFE.

relates to

JDK-8254998 C2: assert(!n->as_Loop()->is_transformed_long_loop()) failure with -XX:StressLongCountedLoop=1

Resolved

JDK-8255224 x86_32 tests fail with "bad AD file" after JDK-8223051

Resolved

JDK-8221358 need intrinsics for loop control that can guide iteration range splitting

Open

JDK-8256655 rework long counted loop handling

Resolved

JDK-8336293 C2 produces inconsistent results with loops with long trip counts

Closed

JDK-8256386 ARM32 tests fail with "bad AD file" after JDK-8223051

Resolved

JDK-8254734 "dead loop detected" assert failure with patch from 8223051

Resolved

JDK-8260407 cmp != __null && cmp->Opcode() == Op_CmpL failure with -XX:StressLongCountedLoop=200000000 in lucene

Resolved

JDK-8237082 Workaround C2 limitations when working with long loops

Closed

JDK-8336478 C2: assert(!n->as_Loop()->is_loop_nest_inner_loop() || _loop_opts_cnt == 0) failed: should have been turned into a counted loop

Closed

JDK-8259609 C2: optimize long range checks in long counted loops

Resolved

JDK-8244504 C2: refactor counted loop code in preparation for long counted loop

Resolved

JDK-8255150 Add utility methods to check long indexes and ranges

Resolved

links to

Commit openjdk/jdk/e76de189

Review openjdk/jdk/318

(8 relates to, 2 links to)

Assignee:: Roland Westrelin

Reporter:: John Rose

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Created:: 2019-04-26 19:54

Updated:: 2024-12-09 13:38

Resolved:: 2020-10-19 04:31

Details

Description

Attachments

Issue Links

Activity

People

Dates