Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8258895

C2: Refine execution frequency estimation for irreducible CFGs

XMLWordPrintable

    • Icon: Enhancement Enhancement
    • Resolution: Unresolved
    • Icon: P4 P4
    • tbd
    • 17
    • hotspot

      CFGLoop::compute_freq() assumes that the underlying CFG is reducible. Under this assumption, it computes the frequencies of each block within a loop L in a single forward pass (in reverse DFS postorder) over the members (blocks and loops) of L [1], assuming that the members form an acyclic graph (except for L's back-edges). This assumption does not hold for irreducible graphs, where there might be additional cycles corresponding to non-natural loops.

      This leads to inaccurate frequency estimations for irreducible CFGs. For example, in the attached irreducible CFG count-actual.pdf (the red-blue color scale reflects estimated frequency), B8 has a lower estimated execution frequency than B7, but since B8 is the single successor of B7, it should have at least the same estimated execution frequency. This inaccuracy is due to the single forward pass: the frequencies of {B8, B9, B10, B11, B12, B13, B14, B16, B18, B20} are updated "too early" (before the frequency of B7 is updated with that of B22) and never revisited.

      A possible solution is to transform the single forward pass into a proper system of equations to be solved iteratively (see e.g. [2]), as in the following Python/SymPy snippet:

      from sympy import *
      b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25 = symbols(['b3', 'l3', 'b5', 'b6', 'b7', 'l2', 'b14', 'b16', 'b18', 'b20', 'b21', 'b22', 'b23', 'b25'])
      system = [
          Eq(b3, 1),
          Eq(l3, b3 * 1),
          Eq(b5, l3 * 1),
          Eq(b6, b5 * 0),
          Eq(b7, b6 * 1 + b22 * 1),
          Eq(l2, b7 * 1),
          Eq(b14, l2 * 0.666),
          Eq(b16, b14 * 1),
          Eq(b18, b16 * 1),
          Eq(b20, b18 * 1),
          Eq(b21, b25 * 1 + b20 * 1),
          Eq(b22, b21 * 0.02),
          Eq(b23, b21 * 0.98),
          Eq(b25, b5 * 1),
          ]
      soln = solve(system, [b3, l3, b5, b6, b7, l2, b14, b16, b18, b20, b21, b22, b23, b25])
      print(soln)
      {b3: 1.00000000000000, l3: 1.00000000000000, b5: 1.00000000000000, b6: 0.0, b7: 0.0202699963514007, l2: 0.0202699963514007, b14: 0.0134998175700328, b16: 0.0134998175700328, b18: 0.0134998175700328, b20: 0.0134998175700328, b21: 1.01349981757003, b22: 0.0202699963514007, b23: 0.993229821218632, b25: 1.00000000000000}

      count-expected.pdf shows the same CFG (and the corresponding result of GCM) using the accurate frequencies.

      The above system of equations can be computed by reiterating the forward pass until a fix point is reached. This algorithm would require two passes for the vast majority of cases (reducible CFGs), and only suffer a slow down for irreducible CFGs, which could be mitigated with a limit on the number of iterations and a specified tolerance.

      An alternative, wider solution could be to simply enforce reducibility through C2 by either bailing out on irreducible CFGs or transforming them into reducible ones by node-splitting or dispatching.

      [1] https://github.com/openjdk/jdk/blob/4a267f1bc2b025aae2cb9df7283156aeb0282406/src/hotspot/share/opto/gcm.cpp#L1790
      [2] https://dl.acm.org/doi/10.1145/178243.178251 (Section 5).

            Unassigned Unassigned
            rcastanedalo Roberto Castaneda Lozano
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: