Add a cross-lane operation that performs a scan (or prefix sum) of a vector's elements for a given associated operator e.g.,
public abstract Vector<E> scanLanes(VectorOperators.Associative op);
Although there is no known vector hardware instruction on common CPU architectures we can emit a very efficient sequence of permutation and associative instructions for the target hardware.
The scan should be an "inclusive" scan. That is, the last lane of the output will also be the result of reducing all the lanes. Also, the first lane of the output will always be equal to the first lane of the input. A so-called "exclusive" scan can be obtained by shifting the lanes by one, and shifting in the identity of the scan op (0 for ADD or OR, -1 for AND, etc.).
(Either kind of scan is easily turned into the other. But inclusive scans produce strictly more information exclusive scans, since converting the latter to the former requires an extra reduction operation.)
Reductions and scans, used as components of larger algorithms, often take a partial sum (or other accumulated value, for AND/OR/MAX/etc.) from data preceding the current vector, and contribute to partial sums of data that comes after the current vector. We could represent this directly in the API by providing an optional parameter for the leftward partial sum, for Vector::reduceLanes. But, for reduction, it is just as easy to add this partial sum to the result returned by reduceLanes. This is true whether there is a mask (for partial reduction) or not (for unconditional full reduction).
public Vector<E> scanLanes(VectorOperators.Associative op, ETYPE leftInput) {
var that = this.scanLanes(op);
return that.lanewise(op, that.broadcast(leftInput));
}
For scans, the situation might be more subtle. In the maskless (unconditional) case, if you are scanning across a sequence of vectors, if you don't have an optional parameter to feed in the leftward scan output, then you have to specify a workaround. In this case, the workaround is to "add" (or AND or OR or MAX) the appropriate leftward reduction value (the "last scan value to the left" because it is an inclusive scan) into ALL of the lanes of the current scan result. That is not much harder than doing the corresponding scalar "add" in the case of reduceLanes.
Besides the familiar distinction of scan vs. reduce (vector vs. scalar result) and inclusive vs. exclusive, there is another variation on behavior to consider here, which is whether to have a mask or not. Here we encounter so-called "partial" or "segmented" scans. This happens when the scan operation takes a lane mask operand as well. In the resulting segmented scan, each output lane gets a scan result, but the scanning is performed disjointly on segments, with each segment accumulating only values within that segment.
Here is a reference to segmented scans:
https://en.wikipedia.org/wiki/Segmented_scan
Here is a paper by a researcher who has specialized in their use:
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
There are at least five conventions for decoding a mask to derive a set of disjoint lane segments. Either the beginning or the end of a segment can be marked, and either a 1 or a 0 can mark the segment boundary. Alternatively, mask parity variations can mark the segment boundaries. Each representation has its advantages. For the Vector API, which tends to focus on left-to-right processing, it seems useful to say that a 1 bit (in the mask) defines a new segment, including the corresponding data lane, and also including (after that lane) any immediately following lanes that have a 0 bit (in the mask). So the bit-count of the mask is the number of complete segments.
If the mask begins with 0 (for the first lane), then that signals the first lane in the current vector continues a scan segment that was started (logically) before the current vector. We might call this a "continuing" scan segment. THIS case is where another scalar input comes in handy: We want a way to send in the value of the last lane of the previous scan result, from the previous vector scan, so that it can be added into the lanes which participate in the "continuing" scan segment.
(These conventions are less friendly to reverse-order processing, but that can be easily obtained by reversing all vectors and performing forward-order logic to their reversals.)
The implementation of a segmented scan is more expensive than an unsegmented scan, because it requires more bookkeeping. It is still a log-time operation (but with the constant multiplier about doubled).
At each point in the algorithm, each intermediate partial sum gets a contribution "from the left" (from an earlier lane) only if the relevant mask bit is clear. If some mask bit is set, then contributions "from the left" at that point are blocked, replaced by the null/identity value of the associative scan op. For the initial lane position, the contribution "from the left" is supplied by the input scalar argument, and it percolates only until the first lane which is selected with a 1 in the mask.
The declaration for a segmented scan could therefore look like this:
public abstract Vector<E> scanLanes(VectorOperators.Associative op, VectorMask<E> m, long leftInput);
public EVector scanLanes(VectorOperators.Associative op, VectorMask<E> m, ETYPE leftInput);
The optional "leftInput" parameter is typically taken from the last lane of the previously scanned vector, or is the identity element if there is none. The left-input is fed into the leftmost scan segment (the "continuing" segment) IF the first lane is unselected by the mask, and ignored otherwise.
If the left-input argument is not supported, or if the algorithm cannot supply it (because it is using thread parallelism as well as vector parallelism!), there is a workaround. The workaround is to use `VectorMask::firstTrue` to derive a mask that selects only the first segment, if it is a "continuing" segment, and no lanes otherwise. That derived mask would gate the addition of the left-input but just for the "continuing" segment. This is akin to the workaround for non-segmented scan which broadcasts the left-input to ALL lanes of the unsegmented scan result.
Note that an unsegmented scan is equivalent to a segmented scan with a mask where only the first lane is selected, so that the whole vector (and nothing else) is one long scan segment. It is also equivalent to a segmented scan with NO lanes selected (in the mask) and a null input value for the left-input.
It may be worthwhile to add a method or two to VectorMask to work with scan segments, under these conditions. The firstTrue method is a starting point. Perhaps one method could create a mask selecting the lanes UP TO the FIRST selected (1) bit in the input mask. A complementary method could create a mask selecting the lanes STARTING WITH the first selected (1) bit in the input mask, but stopping with the next selected bit. Or perhaps these should be scalar operations, hunting for indexes in the mask.
Whether the left-input parameter "pulls its weight" depends on the details of the assembly code for the segmented scan. It is probably more convenient to inject the left-input in the assembly code (and peephole it away if it is not present). That is, more convenient than asking the user to perform the tricky workaround of building a derived mask. Convenience here contemplates both maintainability and performance. It is probably faster for the compiler intrinsic to handle the possibility of an initial "continuing" segment, than to have the user code it explicitly. But in some cases at least (multi-threaded algorithms) the workaround mentioned above will be necessary, to patch together batches that have been handed off to differing threads.
Whether segmented scans as a whole "pull their weight" depends on (a) whether they are significantly faster than serial algorithms, and (b) whether they are likely to be used by more than one or two researchers.
Here is a pseudocode implementation of segmented scan which is O(VLEN) rather thatn O(log(VLEN)):
public Vector<E> scanLanes(VectorOperators.Associative op, VectorMask<E> m, ETYPE leftInput) {
ETYPE[] a = this.toArray();
var partial = leftInput;
var noLanesSet = VectorMask.fromLong(species(), 0);
var leftId = this.reduceLanes(op, noLanesSet);
//var leftId = VectorOperations.leftIdentity(op); // hypothetical API
for (int i = 0; i < a.length; i++) {
if (m.laneIsSet(i)) partial = leftId;
var ai = a[i];
ai = this.broadcast(partial).lanewise(op, this.broadcast(ai)).lane(0);
//ai = VectorOperations.execute(op, partial, ai); // hypothetical API
a[i] = ai;
partial = ai;
}
return fromArray(a);
}
An efficient O(log(VLEN)) implementation will resemble the non-segmented scan implementation, except that it may perform extra work on the binary tree of the lanes to ensure that partial sums are propagated only where the mask bits allow.
public abstract Vector<E> scanLanes(VectorOperators.Associative op);
Although there is no known vector hardware instruction on common CPU architectures we can emit a very efficient sequence of permutation and associative instructions for the target hardware.
The scan should be an "inclusive" scan. That is, the last lane of the output will also be the result of reducing all the lanes. Also, the first lane of the output will always be equal to the first lane of the input. A so-called "exclusive" scan can be obtained by shifting the lanes by one, and shifting in the identity of the scan op (0 for ADD or OR, -1 for AND, etc.).
(Either kind of scan is easily turned into the other. But inclusive scans produce strictly more information exclusive scans, since converting the latter to the former requires an extra reduction operation.)
Reductions and scans, used as components of larger algorithms, often take a partial sum (or other accumulated value, for AND/OR/MAX/etc.) from data preceding the current vector, and contribute to partial sums of data that comes after the current vector. We could represent this directly in the API by providing an optional parameter for the leftward partial sum, for Vector::reduceLanes. But, for reduction, it is just as easy to add this partial sum to the result returned by reduceLanes. This is true whether there is a mask (for partial reduction) or not (for unconditional full reduction).
public Vector<E> scanLanes(VectorOperators.Associative op, ETYPE leftInput) {
var that = this.scanLanes(op);
return that.lanewise(op, that.broadcast(leftInput));
}
For scans, the situation might be more subtle. In the maskless (unconditional) case, if you are scanning across a sequence of vectors, if you don't have an optional parameter to feed in the leftward scan output, then you have to specify a workaround. In this case, the workaround is to "add" (or AND or OR or MAX) the appropriate leftward reduction value (the "last scan value to the left" because it is an inclusive scan) into ALL of the lanes of the current scan result. That is not much harder than doing the corresponding scalar "add" in the case of reduceLanes.
Besides the familiar distinction of scan vs. reduce (vector vs. scalar result) and inclusive vs. exclusive, there is another variation on behavior to consider here, which is whether to have a mask or not. Here we encounter so-called "partial" or "segmented" scans. This happens when the scan operation takes a lane mask operand as well. In the resulting segmented scan, each output lane gets a scan result, but the scanning is performed disjointly on segments, with each segment accumulating only values within that segment.
Here is a reference to segmented scans:
https://en.wikipedia.org/wiki/Segmented_scan
Here is a paper by a researcher who has specialized in their use:
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
There are at least five conventions for decoding a mask to derive a set of disjoint lane segments. Either the beginning or the end of a segment can be marked, and either a 1 or a 0 can mark the segment boundary. Alternatively, mask parity variations can mark the segment boundaries. Each representation has its advantages. For the Vector API, which tends to focus on left-to-right processing, it seems useful to say that a 1 bit (in the mask) defines a new segment, including the corresponding data lane, and also including (after that lane) any immediately following lanes that have a 0 bit (in the mask). So the bit-count of the mask is the number of complete segments.
If the mask begins with 0 (for the first lane), then that signals the first lane in the current vector continues a scan segment that was started (logically) before the current vector. We might call this a "continuing" scan segment. THIS case is where another scalar input comes in handy: We want a way to send in the value of the last lane of the previous scan result, from the previous vector scan, so that it can be added into the lanes which participate in the "continuing" scan segment.
(These conventions are less friendly to reverse-order processing, but that can be easily obtained by reversing all vectors and performing forward-order logic to their reversals.)
The implementation of a segmented scan is more expensive than an unsegmented scan, because it requires more bookkeeping. It is still a log-time operation (but with the constant multiplier about doubled).
At each point in the algorithm, each intermediate partial sum gets a contribution "from the left" (from an earlier lane) only if the relevant mask bit is clear. If some mask bit is set, then contributions "from the left" at that point are blocked, replaced by the null/identity value of the associative scan op. For the initial lane position, the contribution "from the left" is supplied by the input scalar argument, and it percolates only until the first lane which is selected with a 1 in the mask.
The declaration for a segmented scan could therefore look like this:
public abstract Vector<E> scanLanes(VectorOperators.Associative op, VectorMask<E> m, long leftInput);
public EVector scanLanes(VectorOperators.Associative op, VectorMask<E> m, ETYPE leftInput);
The optional "leftInput" parameter is typically taken from the last lane of the previously scanned vector, or is the identity element if there is none. The left-input is fed into the leftmost scan segment (the "continuing" segment) IF the first lane is unselected by the mask, and ignored otherwise.
If the left-input argument is not supported, or if the algorithm cannot supply it (because it is using thread parallelism as well as vector parallelism!), there is a workaround. The workaround is to use `VectorMask::firstTrue` to derive a mask that selects only the first segment, if it is a "continuing" segment, and no lanes otherwise. That derived mask would gate the addition of the left-input but just for the "continuing" segment. This is akin to the workaround for non-segmented scan which broadcasts the left-input to ALL lanes of the unsegmented scan result.
Note that an unsegmented scan is equivalent to a segmented scan with a mask where only the first lane is selected, so that the whole vector (and nothing else) is one long scan segment. It is also equivalent to a segmented scan with NO lanes selected (in the mask) and a null input value for the left-input.
It may be worthwhile to add a method or two to VectorMask to work with scan segments, under these conditions. The firstTrue method is a starting point. Perhaps one method could create a mask selecting the lanes UP TO the FIRST selected (1) bit in the input mask. A complementary method could create a mask selecting the lanes STARTING WITH the first selected (1) bit in the input mask, but stopping with the next selected bit. Or perhaps these should be scalar operations, hunting for indexes in the mask.
Whether the left-input parameter "pulls its weight" depends on the details of the assembly code for the segmented scan. It is probably more convenient to inject the left-input in the assembly code (and peephole it away if it is not present). That is, more convenient than asking the user to perform the tricky workaround of building a derived mask. Convenience here contemplates both maintainability and performance. It is probably faster for the compiler intrinsic to handle the possibility of an initial "continuing" segment, than to have the user code it explicitly. But in some cases at least (multi-threaded algorithms) the workaround mentioned above will be necessary, to patch together batches that have been handed off to differing threads.
Whether segmented scans as a whole "pull their weight" depends on (a) whether they are significantly faster than serial algorithms, and (b) whether they are likely to be used by more than one or two researchers.
Here is a pseudocode implementation of segmented scan which is O(VLEN) rather thatn O(log(VLEN)):
public Vector<E> scanLanes(VectorOperators.Associative op, VectorMask<E> m, ETYPE leftInput) {
ETYPE[] a = this.toArray();
var partial = leftInput;
var noLanesSet = VectorMask.fromLong(species(), 0);
var leftId = this.reduceLanes(op, noLanesSet);
//var leftId = VectorOperations.leftIdentity(op); // hypothetical API
for (int i = 0; i < a.length; i++) {
if (m.laneIsSet(i)) partial = leftId;
var ai = a[i];
ai = this.broadcast(partial).lanewise(op, this.broadcast(ai)).lane(0);
//ai = VectorOperations.execute(op, partial, ai); // hypothetical API
a[i] = ai;
partial = ai;
}
return fromArray(a);
}
An efficient O(log(VLEN)) implementation will resemble the non-segmented scan implementation, except that it may perform extra work on the binary tree of the lanes to ensure that partial sums are propagated only where the mask bits allow.