-
Enhancement
-
Resolution: Unresolved
-
P4
-
25
Proposal: Optimize binary search loops by reusing some of our RCE transform logic, but applied to a loop index which is provably between the last two loop index values.
Background: C2 takes great care to recognize loop indexes that range in an arithmetic sequence. In such a case, the index's minimum and maximum values may be pre-computed in the loop header.
If these indexes are also used to address a (loop-invariant) array, the array length can be compared against the min/max index values, and if the range [min,max] is included in the valid array index range, the loop body does not need range checks. This is range check elimination, or RCE.
The technique applies to multiple arrays at one time. It also applies when the array indexes are simple (linear, non-overflowing) functions of the loop index.
RCE also applies in the very common case where there is no static proof available that the array indexes (as generated by the loop index in [min,max]) are completely in range. In that case, the array indexes and array lengths are "reverse engineered" to discover the largest unchecked range for the loop index, call them min2/max2. The loop is then cloned ("split") into three copies, pre, main, and post loop. The main loop handles all loop indexes in the range [min2,max2] and does not need per-use range checking. The pre (or post) loop handles the residual range [min,min2) (or (max2,max]).
It's lovely and complicated. It has been extended over the years to cover not only 32-bit loop counts but 64-bit ones. It has also been extended to take account, not only of intrinsic array range checks, but also user-defined range checks. These two extensions make the optimization relevant to non-native arrays, such as may be found in Project Panama.
Proposal Details: Define a new loop type in the C2 IR, a BinarySearchLoop (like CountedLoop). Analogous to CountedLoop, track uses of a pair of index variables, ensuring that they meet a constrained pattern of access. At least after the first two iterations of the loop, each index have been range-checked (unconditionally). (The indexes might be checked before the loop. The loop can be peeled up to two times to ensure this if desired.) On every back-edge, if an index is updated, it must be updated to some kind of mean between the two indexes. There is a range of possible patterns for the mean, but every pattern should reflect a computation of a value that is provably (to C2) not outside the interval spanned by the two index variables, just before the mean is computed. (There is only a slight benefit to insisting that the mean is strictly inside that spanning interval. The point is that neither index makes any excursion **outside** the initial interval. This means that, as long as there is a dominating check in the loop header of the two initial index values, all other range checks (of those indexes, corresponding to the first two checks) may be eliminated.
There is no need for a pre/main/post structure, but a limited two-trip pre-loop may be required to get the effect of peeling out the first two checks (to establish the allowed span).
Overflow must be excluded, as overflow is excluded in C2’s classic RCE.
The pattern for a mean computation should include many usages like the following:
mn = (hi + lo) >>> 1
mn = lo + ((hi - lo) >>> 1)
mn = hi - ((hi - lo) >>> 1)
mn = lo + ((hi - lo) >>> CON) [CON%NBIT>0]
mn = hi - ((hi - lo) >>> CON) [CON%NBIT>0]
mn = min(hi, max(lo, EXPR))
mn = (hi + lo) / 2 [DC]
mn = (hi + lo) >> 1 [DC]
mn = lo + ((hi - lo) >> 1) [DC]
mn = lo + ((hi - lo) / CON) [DC; CON>=1]
mn = lo + min(max(0, EXPR), (hi-lo)) [DC?]
mn = hi - min(max(0, EXPR), (hi-lo)) [DC?]
Some use cases might also have small increments which are harmless to the mean, as long as they can be proven not to drive an index value outside the known good interval:
The [DC] note means a domain constraint may be required to rule out overflow. As is well known, some buggy binary search control expressions can overflow to a value which is **not** a mean between the inputs.
mn = (hi-1 + lo) >>> 1 [require hi>lo]
mn = (hi + lo+1) >>> 1 [require hi>lo]
mn = (hi-1 + lo) / 2 [require hi>lo; DC]
mn = (hi + lo+1) / 2 [require hi>lo; DC]
As a follow-on goal, the loop should be allowed to contain a counted sub-loop whose own sub-index traverses some sub-interval of [min,max].
Unlike RCE for arrays, this proposal can pull its weight even if the two loop indexes are used directly as array indexes, not via any transforming linear function. However, such transforms can be accommodated, in principle, just as they are for array RCE. It would require a more difficult “reverse engineering” of both the arrays and the array indexes, to find the valid range. This is done for array RCE, and it may be that that logic can be refactored to apply to binary search as well.
An enhancement to this proposal would be to allow the two indexes to range unchecked (outside of any dominating checks) during a pre-loop. As soon as enough dominating checks are done, and/or the indexes are known to be inside a known-good range (for the array lengths and index expressions), a main loop could be entered that elides all checks, as long as the indexes continue to stay in the known-good range.
The methods named java.util.Arrays::binarySearch should be checked to benefit from this optimization. There might also be Panama methods which should be checked.
Background: C2 takes great care to recognize loop indexes that range in an arithmetic sequence. In such a case, the index's minimum and maximum values may be pre-computed in the loop header.
If these indexes are also used to address a (loop-invariant) array, the array length can be compared against the min/max index values, and if the range [min,max] is included in the valid array index range, the loop body does not need range checks. This is range check elimination, or RCE.
The technique applies to multiple arrays at one time. It also applies when the array indexes are simple (linear, non-overflowing) functions of the loop index.
RCE also applies in the very common case where there is no static proof available that the array indexes (as generated by the loop index in [min,max]) are completely in range. In that case, the array indexes and array lengths are "reverse engineered" to discover the largest unchecked range for the loop index, call them min2/max2. The loop is then cloned ("split") into three copies, pre, main, and post loop. The main loop handles all loop indexes in the range [min2,max2] and does not need per-use range checking. The pre (or post) loop handles the residual range [min,min2) (or (max2,max]).
It's lovely and complicated. It has been extended over the years to cover not only 32-bit loop counts but 64-bit ones. It has also been extended to take account, not only of intrinsic array range checks, but also user-defined range checks. These two extensions make the optimization relevant to non-native arrays, such as may be found in Project Panama.
Proposal Details: Define a new loop type in the C2 IR, a BinarySearchLoop (like CountedLoop). Analogous to CountedLoop, track uses of a pair of index variables, ensuring that they meet a constrained pattern of access. At least after the first two iterations of the loop, each index have been range-checked (unconditionally). (The indexes might be checked before the loop. The loop can be peeled up to two times to ensure this if desired.) On every back-edge, if an index is updated, it must be updated to some kind of mean between the two indexes. There is a range of possible patterns for the mean, but every pattern should reflect a computation of a value that is provably (to C2) not outside the interval spanned by the two index variables, just before the mean is computed. (There is only a slight benefit to insisting that the mean is strictly inside that spanning interval. The point is that neither index makes any excursion **outside** the initial interval. This means that, as long as there is a dominating check in the loop header of the two initial index values, all other range checks (of those indexes, corresponding to the first two checks) may be eliminated.
There is no need for a pre/main/post structure, but a limited two-trip pre-loop may be required to get the effect of peeling out the first two checks (to establish the allowed span).
Overflow must be excluded, as overflow is excluded in C2’s classic RCE.
The pattern for a mean computation should include many usages like the following:
mn = (hi + lo) >>> 1
mn = lo + ((hi - lo) >>> 1)
mn = hi - ((hi - lo) >>> 1)
mn = lo + ((hi - lo) >>> CON) [CON%NBIT>0]
mn = hi - ((hi - lo) >>> CON) [CON%NBIT>0]
mn = min(hi, max(lo, EXPR))
mn = (hi + lo) / 2 [DC]
mn = (hi + lo) >> 1 [DC]
mn = lo + ((hi - lo) >> 1) [DC]
mn = lo + ((hi - lo) / CON) [DC; CON>=1]
mn = lo + min(max(0, EXPR), (hi-lo)) [DC?]
mn = hi - min(max(0, EXPR), (hi-lo)) [DC?]
Some use cases might also have small increments which are harmless to the mean, as long as they can be proven not to drive an index value outside the known good interval:
The [DC] note means a domain constraint may be required to rule out overflow. As is well known, some buggy binary search control expressions can overflow to a value which is **not** a mean between the inputs.
mn = (hi-1 + lo) >>> 1 [require hi>lo]
mn = (hi + lo+1) >>> 1 [require hi>lo]
mn = (hi-1 + lo) / 2 [require hi>lo; DC]
mn = (hi + lo+1) / 2 [require hi>lo; DC]
As a follow-on goal, the loop should be allowed to contain a counted sub-loop whose own sub-index traverses some sub-interval of [min,max].
Unlike RCE for arrays, this proposal can pull its weight even if the two loop indexes are used directly as array indexes, not via any transforming linear function. However, such transforms can be accommodated, in principle, just as they are for array RCE. It would require a more difficult “reverse engineering” of both the arrays and the array indexes, to find the valid range. This is done for array RCE, and it may be that that logic can be refactored to apply to binary search as well.
An enhancement to this proposal would be to allow the two indexes to range unchecked (outside of any dominating checks) during a pre-loop. As soon as enough dominating checks are done, and/or the indexes are known to be inside a known-good range (for the array lengths and index expressions), a main loop could be entered that elides all checks, as long as the indexes continue to stay in the known-good range.
The methods named java.util.Arrays::binarySearch should be checked to benefit from this optimization. There might also be Panama methods which should be checked.