Enhancement | Resolution: Unresolved | P4 | 21, 25
The current profiling code that deals with receiver types maintains a flat array of (receiver-type, count) pairs in a table of TypeProfileWidth size. The code that gathers this profile is spread across the interpreter and C1, and it runs into several issues, both individually and collectively, that distort or mangle the profile. This is a likely source of run-to-run variance. It would be especially problematic in Leyden scenarios, where the AOT cache would capture and persist these misleading profiles.
The non-exhaustive list of issues is:
1. Polymorphic counter updates are missing. When the receiver type table is full, most code paths fall back to updating the shared "polymorphic" counter to indicate the polymorphic case. There is one exception: the code gated by C1OptimizeVirtualCallProfiling tries to use MDO slots when the receiver type is known at compile time -- and it *misses* the update for the polymorphic case.
2. The race when updating the counter. The counter is updated with a non-atomic RMW instruction such as addptr. This can lose an unbounded number of samples if some CPU is stalled in the middle of an update; even without stalls, counter updates can go missing. Fixing this might have a performance impact on profiling, if even uncontended atomics are slow.
3. The race when claiming the receiver slot. This one is the most egregious, so it deserves the most attention.
The runtime code normally searches the (receiver-type, count) table for the receiver. When the receiver is found, the profiling code proceeds to update the counter. This path is fine, modulo issue (2). But when the profiling code sees no matching receiver in the table, it attempts to install the receiver into the table. This is where things go off the rails.
Even with full sequential consistency, it is plausible that we end up with duplicate receivers in the table. Observe:
<T1, T2, T3 threads all go search for empty receiver row>
<T1 wants to install R1, T2 wants to install R2, T3 wants to install R2>
Thread 1: sees receivers[0] == null
Thread 2: sees receivers[0] == null
Thread 1: installs receivers[0] = R1
Thread 3: sees receivers[0] == R1
Thread 2: installs receivers[0] = R2 (overwrites R1)
Thread 3: sees receivers[1] == null
Thread 3: installs receivers[1] = R2
End result: receivers[0] == R2, receivers[1] == R2. The R1 record is lost, and with TypeProfileWidth=2 (the current default) there are no free slots left in the table, so the profile is misreported as megamorphic.
Note that C1OptimizeVirtualCallProfiling could make this kind of race condition semi-permanent:
<T1, T2 threads all go search for empty receiver row>
<T1 wants to install R1, T2 wants to install R2, C1 has a well-known receiver R1>
Thread 1: sees receivers[0] == null
Thread 2: sees receivers[0] == null
Thread 1: installs receivers[0] = R1
C1 compiler: sees receivers[0] == R1
Thread 2: installs receivers[0] = R2
C1 compiler: emits receivers_count[0]++ for R1
End result: C1 profiling code misattributes all counts that were destined for R1 to R2.
This can be fixed by only accepting atomic installations of receivers, thus preventing a receiver slot from flipping from one concrete type to another. In other words, only accept receiver slot transitions from null. Then, any code that updates a counter has to check whether the current receiver slot was indeed claimed for that particular receiver.
The scheme above works for runtime checks. But again, C1OptimizeVirtualCallProfiling throws a wrench into it by assigning null slots to receivers at compile time, which means it can flip a receiver slot to another receiver (unconditionally!), getting us back into trouble. Even worse, I believe two compilations can plausibly claim the *same* slot in the table, and that slot would then flip-flop between receivers as the compiled code executes.
<compile #1 has a well-known receiver R1>
<compile #2 has a well-known receiver R2>
C1 compile #1: sees receivers[0] == null
C1 compile #1: emits receivers[0] = R1
C1 compile #2: sees receivers[0] == null
C1 compile #2: emits receivers[0] = R2
Thread 1: executes #1, sets receivers[0] = R1
Thread 1: executes #2, sets receivers[0] = R2
End result: receivers[0] flip-flops between R1 and R2, and the associated counters get completely mixed up.
All these issues can and should be fixed together:
1. Common up the receiver type profiling code between the interpreter and C1
2. Rewrite the receiver type profiling code to only do atomic receiver slot installations
3. Optionally make the counter update code atomic (perhaps under a flag?)
4. Trim C1OptimizeVirtualCallProfiling to only claim slots when the receiver is actually installed
I think this will also subsume the runtime helper code density improvements (JDK-8353859), and that would give us a code size budget to do more complicated things in type profiling.
- duplicates: JDK-8353859 C1: x86: Optimize type profile helper (Closed)
- relates to: JDK-8353859 C1: x86: Optimize type profile helper (Closed)