Description
CFGLoop::compute_freq() assumes that the underlying CFG is reducible. Under this assumption, it computes the frequencies of each block within a loop L in a single forward pass (in reverse DFS postorder) over the members (blocks and loops) of L [1], assuming that the members form an acyclic graph (except for L's backedges). This assumption does not hold for irreducible graphs, where there might be additional cycles corresponding to nonnatural loops.
This leads to inaccurate frequency estimations for irreducible CFGs. For example, in the attached irreducible CFG countactual.pdf (the redblue color scale reflects estimated frequency), B8 has a lower estimated execution frequency than B7, but since B8 is the single successor of B7, it should have at least the same estimated execution frequency. This inaccuracy is due to the single forward pass: the frequencies of {B8, B9, B10, B11, B12, B13, B14, B16, B18, B20} are updated "too early" (before the frequency of B7 is updated with that of B22) and never revisited.
A possible solution is to transform the single forward pass into a proper system of equations to be solved iteratively (see e.g. [2]), as in the following Python/SymPy snippet:
from sympy import *
b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25 = symbols(['b3', 'l3', 'b5', 'b6', 'b7', 'l2', 'b14', 'b16', 'b18', 'b20', 'b21', 'b22', 'b23', 'b25'])
system = [
Eq(b3, 1),
Eq(l3, b3 * 1),
Eq(b5, l3 * 1),
Eq(b6, b5 * 0),
Eq(b7, b6 * 1 + b22 * 1),
Eq(l2, b7 * 1),
Eq(b14, l2 * 0.666),
Eq(b16, b14 * 1),
Eq(b18, b16 * 1),
Eq(b20, b18 * 1),
Eq(b21, b25 * 1 + b20 * 1),
Eq(b22, b21 * 0.02),
Eq(b23, b21 * 0.98),
Eq(b25, b5 * 1),
]
soln = solve(system, [b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25])
print(soln)
{b3: 1.00000000000000, l3: 1.00000000000000, b5: 1.00000000000000, b6: 0.0, b7: 0.0202699963514007, l2: 0.0202699963514007, b14: 0.0134998175700328, b16: 0.0134998175700328, b18: 0.0134998175700328, b20: 0.0134998175700328, b21: 1.01349981757003, b22: 0.0202699963514007, b23: 0.993229821218632, b25: 1.00000000000000}
countexpected.pdf shows the same CFG (and the corresponding result of GCM) using the accurate frequencies.
The above system of equations can be computed by reiterating the forward pass until a fix point is reached. This algorithm would require two passes for the vast majority of cases (reducible CFGs), and only suffer a slow down for irreducible CFGs, which could be mitigated with a limit on the number of iterations and a specified tolerance.
An alternative, wider solution could be to simply enforce reducibility through C2 by either bailing out on irreducible CFGs or transforming them into reducible ones by nodesplitting or dispatching.
[1] https://github.com/openjdk/jdk/blob/4a267f1bc2b025aae2cb9df7283156aeb0282406/src/hotspot/share/opto/gcm.cpp#L1790
[2] https://dl.acm.org/doi/10.1145/178243.178251 (Section 5).
This leads to inaccurate frequency estimations for irreducible CFGs. For example, in the attached irreducible CFG countactual.pdf (the redblue color scale reflects estimated frequency), B8 has a lower estimated execution frequency than B7, but since B8 is the single successor of B7, it should have at least the same estimated execution frequency. This inaccuracy is due to the single forward pass: the frequencies of {B8, B9, B10, B11, B12, B13, B14, B16, B18, B20} are updated "too early" (before the frequency of B7 is updated with that of B22) and never revisited.
A possible solution is to transform the single forward pass into a proper system of equations to be solved iteratively (see e.g. [2]), as in the following Python/SymPy snippet:
from sympy import *
b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25 = symbols(['b3', 'l3', 'b5', 'b6', 'b7', 'l2', 'b14', 'b16', 'b18', 'b20', 'b21', 'b22', 'b23', 'b25'])
system = [
Eq(b3, 1),
Eq(l3, b3 * 1),
Eq(b5, l3 * 1),
Eq(b6, b5 * 0),
Eq(b7, b6 * 1 + b22 * 1),
Eq(l2, b7 * 1),
Eq(b14, l2 * 0.666),
Eq(b16, b14 * 1),
Eq(b18, b16 * 1),
Eq(b20, b18 * 1),
Eq(b21, b25 * 1 + b20 * 1),
Eq(b22, b21 * 0.02),
Eq(b23, b21 * 0.98),
Eq(b25, b5 * 1),
]
soln = solve(system, [b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25])
print(soln)
{b3: 1.00000000000000, l3: 1.00000000000000, b5: 1.00000000000000, b6: 0.0, b7: 0.0202699963514007, l2: 0.0202699963514007, b14: 0.0134998175700328, b16: 0.0134998175700328, b18: 0.0134998175700328, b20: 0.0134998175700328, b21: 1.01349981757003, b22: 0.0202699963514007, b23: 0.993229821218632, b25: 1.00000000000000}
countexpected.pdf shows the same CFG (and the corresponding result of GCM) using the accurate frequencies.
The above system of equations can be computed by reiterating the forward pass until a fix point is reached. This algorithm would require two passes for the vast majority of cases (reducible CFGs), and only suffer a slow down for irreducible CFGs, which could be mitigated with a limit on the number of iterations and a specified tolerance.
An alternative, wider solution could be to simply enforce reducibility through C2 by either bailing out on irreducible CFGs or transforming them into reducible ones by nodesplitting or dispatching.
[1] https://github.com/openjdk/jdk/blob/4a267f1bc2b025aae2cb9df7283156aeb0282406/src/hotspot/share/opto/gcm.cpp#L1790
[2] https://dl.acm.org/doi/10.1145/178243.178251 (Section 5).
Attachments
Issue Links
 relates to

JDK8255763 C2: OSR miscompilation caused by invalid memory instruction placement
 Resolved

JDK8257146 C2: extend the scope of StressGCM
 Closed