It seems like it would be useful to take a step backward and examine the scenario that we're discussing. I had originally said that I was going to post one of my transaction diagrams, but something close to that already exists. Dave D's blog entry is here:

    https://blogs.oracle.com/dave/entry/a_race_in_locksupport_park

and here is the scenario from that blog entry:

    Appendix - elaborated scenario

    We have a volatile Java variable C, initially 0.
    Thread T1's Parker::_counter value is 1.

    - Thread T1 executes: while (C == 0) park() ;
    - Thread T2 executes: C=1; unpark(T1)

    Timing:

    1. Thread T1 loads C, observes 0, and calls park()
    2. Thread T1 in 1st park() invocation:
       2a. Load _counter (1)
       2b. Store _counter (0) -- languishes in store buffer
       2c. return from park()
    3. Thread T1: Load C (0)
    4. Thread T1: call park() a 2nd time
    5. Thread T2: Store C=1; MFENCE (by virtue of java volatile).
       MFENCE is a completely local operation, influencing only T2.
    6. Thread T2: call unpark(T1)
       6a. lock _mutex with atomic instruction such as CAS or LOCK:CMPXCHG
       6b. Load _counter (0) -- observes value from memory, completely legal
       6c. Store _counter (1)
       6d. unlock _mutex
    7. Thread T1's store of 0 into _counter finally drains from the store
       buffer and becomes globally visible, overwriting the value 1 just
       stored by T2
    8. Thread T1 continues in 2nd invocation of park()
       8a. Load _counter (0)
       8b. lock _mutex
       8c. Load _counter (0)
       8d. block on _condvar

I've changed Dave's text a bit; I made the bulleted items into numbered items so it's easier to talk about where we are in the scenario. I made no content changes.

Sidebar: You might ask why T1's Parker::_counter value is 1 at the start of the scenario. Remember that park() can return spuriously and that, while only one thread can call park(), many threads can call unpark(), and unpark() can also be called before park(). The initial setting of T1's Parker::_counter value at 1 just simulates a spurious return from park() or indicates that an unpark() was queued up (for whatever reason) before T1 called park(). Take your pick as to the reason. The key to the initial setup is that 'C' starts at zero (0) so T1 is going to be calling park() in a tight loop.

The first thing that "goes wrong" is in step "2b" when T1's "Store _counter(0)" languishes in the store buffer. Since park() is trying to be clever by being lockless at that stage, it opens a window that allows "Store _counter(0)" to languish. I don't think anyone disagrees that not using locks allows this languishing to occur.

Sidebar: Step "5" says that the MFENCE after "Store C=1" is completely local to T2. This statement is causing heartburn for some folks for a variety of reasons. The best thing to do is to ignore which instruction is used (MFENCE or LOCK:ADD or ...) and think about it in terms of Java volatile semantics or perhaps non-semantics. A "volatile store" does not mean that other threads see the new value immediately. Eventually: yes; immediately: no. A "volatile store" does not imply any particular execution order except in relation to other volatile stores. I _think_ the "outer bound" of T2's "Store C=1" being visible is when T2 locks the mutex in step "6a". At the point of the lock, full memory barrier semantics are in place so the "Store C=1" should now be visible to T1.

The second thing that "goes wrong" is in step "6a" when T2 locks the mutex. The "triangular race" assertion is that acquiring the lock doesn't cause the languishing store in T1 to be flushed. I think this is a key assertion of the triangular race theory.
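To make steps 2a-2c concrete, here is a minimal C++ sketch of the lockless fast path (my own simplification, not the actual HotSpot source; 'counter' and the function name are made up for illustration). The relaxed operations stand in for the plain loads and stores, and the plain store is exactly the one that can languish:

#include <atomic>

// Hedged sketch of the pre-fix lockless fast path (steps 2a-2c).
std::atomic<int> counter(1);  // stands in for Parker::_counter

bool park_fast_path() {
  if (counter.load(std::memory_order_relaxed) > 0) {  // 2a: Load _counter (1)
    counter.store(0, std::memory_order_relaxed);      // 2b: plain store; may
                                                      //     languish in the
                                                      //     store buffer
    return true;                                      // 2c: return from park()
  }
  return false;  // fall through to the locked slow path
}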
Step "6b" is discussed in comments to the blog entry, but I don't think T2's observation of a _counter value of zero (0) or not is relevant to the race. In fact, if T2 still sees _counter == 1, then T2 won't call "pthread_cond_signal(_cond)" and then the unpark() is even more likely to be "lost".

The third thing that "goes wrong" is in step "6d" when the unlock of the mutex that flushes T2's "Store _counter (1)" occurs. The "triangular race" assertion is that releasing the lock doesn't cause the languishing store in T1 to be flushed. I think this is another key assertion of the triangular race theory.

The last thing that "goes wrong" is in step "7" when T1's languishing "Store _counter (0)" finally drains and overwrites the value 1 that was just stored by T2.

By the time we get to step "8", according to the "triangular race" theory, it is all over except for the shouting. In step "8a", T1 observes a _counter value of zero which it had just flushed to global memory itself. T1 goes on to grab the lock and block on the condvar; the loss of T2's unpark() has resulted in the hang of T1.

The first of the key assertions above:

- acquiring the lock doesn't cause the languishing store in T1 to be flushed

is hard to believe, but maybe it is possible on weak memory model machines. I know almost nothing about weak memory model machines so I'll have to rely on the opinions of others.

The second of the key assertions above:

- releasing the lock doesn't cause the languishing store in T1 to be flushed

is also hard to believe, but it appears that a loose pthreads implementation is allowed to be lax about memory flushes on unlock as long as everyone uses locks. Our problem arises because we mix lockless use of the _counter field in park() with locked use in unpark().

So now let's take a look at the above scenario in transaction diagram format:

We have a volatile Java variable C, initially 0.
Thread T1's Parker::_counter value is 1.

- Thread T1 executes: while (C == 0) park();
- Thread T2 executes: C=1; unpark(T1)

Transaction Diagram #1:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
  park()                      C = 1  // C is volatile
  // _counter == 1            unpark(T1)
  if (_counter > 0) {         :
    _counter = 0              :
    // return to while        :
    // nothing to force       :
    // _counter flush         :
    return                    :
  }                           :
// while observes C == 0      :
park()                        lock(_mutex)
// T1 past 'C == 0' ck        // lock forces flush of C
// _counter = 0 still         // _counter seen as 1 by T2
// languishing in T1's        s = _counter
// store buffer               _counter = 1
if (_counter > 0) {           unlock(_mutex)
  _counter = 0                // _counter flushes
  return                      // s == 1 so no signal
}                             // even if s == 0, there
// trylock works              // is no thread to signal
if (!trylock(_mutex)) {       if (s < 1) {
  return                        signal(_cond)
}                             }
// trylock flushes
// _counter from T1's
// store buffer
// _counter == 0
if (_counter > 0) {
  _counter = 0
  unlock(_mutex)
  return
}
cond_wait(_cond, _mutex)      :
:                             :
_counter = 0
unlock(_mutex)

In my transaction diagram, I have T2's "s = _counter" saving a counter value of 1 instead of "Load _counter(0)" from Dave D's original. The lock and unlock operations by T2 are not causing T1's languishing "_counter = 0" to flush, so T2 doesn't see "_counter == 0". This means that T2 doesn't try to signal "_cond". However, even if T2 saw "_counter == 0", T1 isn't blocked in a cond_wait() call so there is no thread to signal.
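For reference, here is a C++/pthreads sketch of the unpark() side as it appears in T2's column of Transaction Diagram #1 (simplified names, not the actual HotSpot source). Every access to the counter field is made while holding the mutex:

#include <pthread.h>

// Hedged sketch of unpark() as shown in the diagrams.
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
int counter = 0;  // stands in for Parker::_counter; protected by the mutex
                  // here, but note that park()'s fast path reads it without
                  // the lock, which is the whole problem

void unpark() {
  pthread_mutex_lock(&mutex);    // 6a: lock _mutex
  int s = counter;               // 6b: remember the old value
  counter = 1;                   // 6c: make the permit available
  pthread_mutex_unlock(&mutex);  // 6d: unlock _mutex
  if (s < 1) {
    pthread_cond_signal(&cond);  // only signal if park() might be blocked
  }
}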
In transaction diagram format it is easier to see how T1's "_counter = 0" from the first park() call can happen after T2's "_counter = 1", which leads to T2's unpark() being "lost". This sliding of a store from T1's first call to park() into T1's second call to park() can happen because of the lockless nature of T1's first call to park(). Ahhh... the joys of lockless algorithms.

So now let's take a look at the transaction diagram with the original fix for 6822370, which added fence() calls:

We have a volatile Java variable C, initially 0.
Thread T1's Parker::_counter value is 1.

- Thread T1 executes: while (C == 0) park();
- Thread T2 executes: C=1; unpark(T1)

Transaction Diagram #2a:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
  park()                      C = 1  // C is volatile
  // _counter == 1            unpark(T1)
  if (_counter > 0) {         :
    _counter = 0              :
    fence()                   :
    // _counter is flushed    :
    // return to while        :
    return                    :
  }                           :
// while observes C == 0      :
park()                        lock(_mutex)
// T1 past 'C == 0' ck        // lock forces flush of C
// _counter == 0              // _counter seen as 0 by T2
if (_counter > 0) {           s = _counter
  _counter = 0                _counter = 1
  fence()                     unlock(_mutex)
  return                      // _counter flushes
}                             // s == 0
// trylock works              if (s < 1) {
if (!trylock(_mutex)) {         signal(_cond)
  return                      }
}
// _counter == 1
if (_counter > 0) {
  _counter = 0
  unlock(_mutex)
  fence()
  // return to while
  return
}
cond_wait(_cond, _mutex)
_counter = 0
unlock(_mutex)
fence()

The addition of the fence() prevents "_counter = 0" from languishing in T1's store buffer, so T1 can't overwrite T2's "_counter = 1" later. It looks like the fix for 6822370 should work as designed.

If the sequence in T2 happens earlier than shown in Transaction Diagram #2a, then T2's update of "C = 1" will be observed earlier by T1 and T1 will not make a second call to park():

Transaction Diagram #2b:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
  park()                      C = 1  // C is volatile
  :                           unpark(T1)
  // _counter == 1            lock(_mutex)
  if (_counter > 0) {         // lock forces flush of C
    _counter = 0              // _counter seen as 0 || 1 by T2
    fence()                   s = _counter
    // _counter is flushed    _counter = 1
    // return to while        unlock(_mutex)
    return                    // _counter flushes
  }                           // s == 0 || s == 1
// while observes C == 1      if (s < 1) {
                                signal(_cond)
                              }

If the sequence in T2 happens a little bit later than shown in Transaction Diagram #2a for the original fix for 6822370, then T1's second call to park() will see a conflict in the trylock() call. The trylock() failure will result in T1 returning from the second call to park(), T2's update of "C == 1" will be observed, and T1 will terminate the while-loop:

Transaction Diagram #2c:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
  park()                      C = 1  // C is volatile
  // _counter == 1            unpark(T1)
  if (_counter > 0) {         :
    _counter = 0              :
    fence()                   :
    // _counter is flushed    :
    // return to while        :
    return                    :
  }                           :
// while observes C == 0      :
park()                        :
// T1 past 'C == 0' ck        :
// _counter == 0              :
if (_counter > 0) {           :
  _counter = 0                :
  fence()                     lock(_mutex)
  return                      // lock forces flush of C
}                             // _counter seen as 0 by T2
// trylock fails              s = _counter
if (!trylock(_mutex)) {       _counter = 1
  // return to while          unlock(_mutex)
  return                      // _counter flushes
}                             // s == 0
                              if (s < 1) {
                                signal(_cond)
                              }
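Here is a sketch of the fixed fast path (again my own simplification with illustrative names), using a C++ seq_cst fence to play the role of the fence() shown in Diagrams #2a-#2c:

#include <atomic>

std::atomic<int> counter(0);  // stands in for Parker::_counter

bool park_fast_path_fixed() {
  if (counter.load(std::memory_order_relaxed) > 0) {
    counter.store(0, std::memory_order_relaxed);
    // fence(): a full two-way barrier, so the store above cannot
    // languish in the store buffer past this point
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return true;
  }
  return false;  // fall through to the locked slow path
}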
If the sequence in T2 happens even later than shown in Transaction Diagram #2a, then T1's second call to park() will block in the cond_wait() call and T2's signal() call will wake up T1. T1 will return from the second call to park(), T2's update of "C == 1" will be observed, and T1 will terminate the while-loop:

Transaction Diagram #2d:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
  park()                      C = 1  // C is volatile
  // _counter == 1            unpark(T1)
  if (_counter > 0) {         :
    _counter = 0              :
    fence()                   :
    // _counter is flushed    :
    // return to while        :
    return                    :
  }                           :
// while observes C == 0      :
park()                        :
// T1 past 'C == 0' ck        :
// _counter == 0              :
if (_counter > 0) {           :
  _counter = 0                :
  fence()                     :
  return                      :
}                             :
// trylock works              :
if (!trylock(_mutex)) {       :
  return                      :
}                             :
// _counter == 0              :
if (_counter > 0) {           :
  _counter = 0                :
  unlock(_mutex)              :
  fence()                     :
  return                      :
}                             :
cond_wait(_cond, _mutex)      :
:                             lock(_mutex)
:                             // lock forces flush of C
:                             // _counter seen as 0 by T2
:                             s = _counter
:                             _counter = 1
:                             unlock(_mutex)
:                             // _counter flushes
:                             // s == 0
:                             if (s < 1) {
:                               signal(_cond)
<_mutex relocked>             }
_counter = 0
unlock(_mutex)
fence()
return

So now let's take a look at some different scenarios for a racing park() versus a racing unpark() with the original fix for 6822370, which added fence() calls:

We have a volatile Java variable C, initially 0.
Thread T1's Parker::_counter value is 0.

- Thread T1 executes: while (C == 0) park();
- Thread T2 executes: C=1; unpark(T1)

In the first variation, T2 finishes most of its work before T1:

Transaction Diagram #3a:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
// T1 past 'C == 0' ck        C = 1  // C is volatile
park()                        unpark(T1)
:                             lock(_mutex)
:                             // lock forces flush of C
:                             s = _counter
:                             _counter = 1
:                             unlock(_mutex)
:                             // _counter flushes
// _counter == 1              // s == 0
if (_counter > 0) {           if (s < 1) {
  _counter = 0                  signal(_cond)
  fence()                     }
  // _counter is flushed
  // return to while
  return
}

In Transaction Diagram #3a, T2's "_counter = 1" gets flushed before T1 gets past the first lockless check of the _counter field. T1 returns from the park() call and terminates the while-loop.

In the next variation, T2's work overlaps a little more with T1:

Transaction Diagram #3b:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
// T1 past 'C == 0' ck        C = 1  // C is volatile
park()                        unpark(T1)
:                             lock(_mutex)
:                             // lock forces flush of C
// _counter == 0              s = _counter
if (_counter > 0) {           _counter = 1
  _counter = 0                unlock(_mutex)
  fence()                     // _counter flushes
  return                      // s == 0
}                             if (s < 1) {
// trylock works                signal(_cond)
if (!trylock(_mutex)) {       }
  return
}
// _counter == 1
if (_counter > 0) {
  _counter = 0
  unlock(_mutex)
  fence()
  // return to while
  return
}

In Transaction Diagram #3b, T1 gets past the first lockless check of the _counter field before T2's "_counter = 1" gets flushed. T2 manages to unlock the mutex before T1's trylock() call. So T1 holds the mutex lock when it observes the "_counter == 1" value, resets _counter, unlocks the mutex, returns from park(), and terminates the while-loop.
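Pulling T1's columns of the diagrams together, here is a hedged sketch of the overall park() control flow with the original fix's three fence() sites (heavily simplified; the real Parker::park() also deals with timeouts, interrupts, thread state, and so on, and this is not the actual HotSpot source):

#include <pthread.h>
#include <atomic>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
std::atomic<int> counter(0);  // stands in for Parker::_counter

void park() {
  // Lockless fast path -- fence() site #1:
  if (counter.load(std::memory_order_relaxed) > 0) {
    counter.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return;
  }
  // Return rather than block if the lock is contended:
  if (pthread_mutex_trylock(&mutex) != 0) {
    return;
  }
  // Re-check under the lock -- fence() site #2:
  if (counter.load(std::memory_order_relaxed) > 0) {
    counter.store(0, std::memory_order_relaxed);
    pthread_mutex_unlock(&mutex);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return;
  }
  pthread_cond_wait(&cond, &mutex);  // returns with the mutex relocked
  // Consume the permit -- fence() site #3:
  counter.store(0, std::memory_order_relaxed);
  pthread_mutex_unlock(&mutex);
  std::atomic_thread_fence(std::memory_order_seq_cst);
}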
In the next variation, T2's locked region overlaps with T1's trylock() call:

Transaction Diagram #3c:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
// T1 past 'C == 0' ck        C = 1  // C is volatile
park()                        unpark(T1)
:                             :
:                             :
// _counter == 0              :
if (_counter > 0) {           :
  _counter = 0                :
  fence()                     :
  return                      :
}                             lock(_mutex)
// trylock fails              // lock forces flush of C
if (!trylock(_mutex)) {       s = _counter
  // return to while          _counter = 1
  return                      unlock(_mutex)
}                             // _counter flushes
                              // s == 0
                              if (s < 1) {
                                signal(_cond)
                              }

In Transaction Diagram #3c, T1 gets past the first lockless check of the _counter field, but T1's trylock() collides with T2's locked region, so T1 returns from park() and terminates the while-loop.

In the last variation, T2's work happens after T1 has blocked:

Transaction Diagram #3d:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
// T1 past 'C == 0' ck        C = 1  // C is volatile
park()                        unpark(T1)
// _counter == 0              :
if (_counter > 0) {           :
  _counter = 0                :
  fence()                     :
  return                      :
}                             :
// trylock works              :
if (!trylock(_mutex)) {       :
  return                      :
}                             :
// _counter == 0              :
if (_counter > 0) {           :
  _counter = 0                :
  unlock(_mutex)              :
  fence()                     :
  return                      :
}                             :
cond_wait(_cond, _mutex)      :
:                             lock(_mutex)
:                             // lock forces flush of C
:                             // _counter seen as 0 by T2
:                             s = _counter
:                             _counter = 1
:                             unlock(_mutex)
:                             // _counter flushes
:                             // s == 0
:                             if (s < 1) {
:                               signal(_cond)
<_mutex relocked>             }
_counter = 0
unlock(_mutex)
fence()
return

In Transaction Diagram #3d, T1's call to park() will block in the cond_wait() call and T2's signal() call will wake up T1. T1 will return from park() and T1 will terminate the while-loop.

Lastly, let's take a look at a transaction diagram when the first fence() added by the original fix for 6822370 is replaced with an xchg() call.

We have a volatile Java variable C, initially 0.
Thread T1's Parker::_counter value is 1.

- Thread T1 executes: while (C == 0) park();
- Thread T2 executes: C=1; unpark(T1)

Transaction Diagram #4:

Thread T1                     Thread T2
===================           =================
while (C == 0)                :
  park()                      C = 1  // C is volatile
  // _counter == 1            unpark(T1)
  ret = xchg(0, _counter)     :
  // _counter is flushed      :
  // ret == 1                 :
  if (ret > 0) {              :
    // return to while        :
    return                    :
  }                           :
// while observes C == 0      :
park()                        lock(_mutex)
// T1 past 'C == 0' ck        // lock forces flush of C
// _counter == 0              // _counter seen as 0 by T2
ret = xchg(0, _counter)       s = _counter
// _counter is flushed        _counter = 1
// ret == 0                   unlock(_mutex)
if (ret > 0) {                // _counter flushes
  return                      // s == 0
}                             if (s < 1) {
// trylock works                signal(_cond)
if (!trylock(_mutex)) {       }
  return
}
// _counter == 1
if (_counter > 0) {
  _counter = 0
  unlock(_mutex)
  fence()
  // return to while
  return
}
cond_wait(_cond, _mutex)
_counter = 0
unlock(_mutex)
fence()

The original fix for 6822370 added a fence() call on the three code paths in Parker::park() where _counter was set to zero (0). When the first of those fence() calls is replaced by an Atomic::xchg() call, the code becomes more predictable and precise.
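Here is a sketch of that revised fast path, with std::atomic's exchange standing in for Atomic::xchg() (illustrative only, not the actual HotSpot code):

#include <atomic>

std::atomic<int> counter(0);  // stands in for Parker::_counter

bool park_fast_path_xchg() {
  // The exchange is one atomic read-modify-write with full-barrier
  // semantics, so reading the old value and making the store of 0
  // globally visible happen as a single indivisible step.
  int ret = counter.exchange(0, std::memory_order_seq_cst);  // xchg(0, _counter)
  return ret > 0;  // park() returns immediately if a permit was consumed
}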
In the original fix, here are the lockless and locked parts:

Thread T1                     Thread T2
===================           ==================
park()                        unpark(T1)
:                             lock(_mutex)
1 if (_counter > 0) {         1 s = _counter
2   _counter = 0              2 _counter = 1
3   fence()                   3 unlock(_mutex)
    return                    // _counter flushes
  }
  if (!trylock(_mutex)) {
  ...

and the bi-directional barrier semantics of fence() are only present on one branch of the if-statement. For the purposes of fixing the languishing store bug, calling fence() after "_counter = 0" (in all 3 places) is the minimum that needs to be done. To analyze the correctness of my previous sentence, you have to reason about lines 1-3 in both T1 and T2, how their execution order aligns, the memory effects of each line, etc. That's 9 combinations if you consider the semantics of "unlock(_mutex)" as one operation.

In the revised fix, here are the lockless and locked parts:

Thread T1                     Thread T2
===================           ==================
park()                        unpark(T1)
:                             lock(_mutex)
1 ret = xchg(0, _counter)     1 s = _counter
  if (ret > 0) {              2 _counter = 1
    return                    3 unlock(_mutex)
  }                           // _counter flushes
  if (!trylock(_mutex)) {
  ...

and the bi-directional barrier semantics of xchg() are present on both branches of the if-statement. For the purposes of fixing the languishing store bug, calling xchg() instead of adding the first fence() call is more than equivalent. To analyze the correctness of my previous sentence, you have to reason about line 1 in T1 versus lines 1-3 in T2, how their execution order aligns, the memory effects of each line, etc. That's 3 combinations if you consider the semantics of "unlock(_mutex)" as one operation.

Just to be clear, I haven't found a transaction diagram scenario that shows that switching the first fence() call added by the original fix for 6822370 to an Atomic::xchg() is required for correctness. The switch definitely helps me logically reason about the algorithm's correctness. However, as I have said before in other e-mail threads, I'm not well versed in weaker memory model machines. So it could be possible to construct a transaction diagram scenario that shows the switch from fence() to Atomic::xchg() is necessary on a weaker memory model machine.

The addition of fence() calls after the second and third settings of "_counter = 0" was done in the original fix for 6822370 for similar reasons: worries about weaker memory model semantics. Those two settings of "_counter = 0" are done while the mutex lock is held and they are followed by unlock(_mutex) operations.

In all of the transaction diagrams, I included a "// _counter flushes" comment after T2 did an "unlock(_mutex)" operation. However, there is concern that unlock(_mutex) may not force _counter to be flushed. It appears that a loose pthreads implementation is allowed to be lax about memory flushes on unlock as long as everyone uses locks. Since unpark() always uses locks, I think it is OK for the transaction diagrams to show T2's unlock() calls with the "// _counter flushes" comment. In support of this, I'll note that the original fix for 6822370 does not include fence() calls after unpark() calls unlock(_mutex). Dave D's latest changes also do not include fence() calls after unpark() calls unlock(_mutex).
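As a closing illustration of the mixing hazard (my own example with made-up names, not code from the fix): POSIX only promises visibility between accesses that are themselves made under the lock, so a lockless store sits outside that guarantee:

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int shared = 0;  // touched both with and without the lock -- the hazard

void locked_writer() {      // like unpark(): every access under the lock,
  pthread_mutex_lock(&m);   // so POSIX visibility rules apply
  shared = 1;
  pthread_mutex_unlock(&m);
}

void lockless_writer() {    // like the pre-fix park() fast path: this store
  shared = 0;               // may languish; another thread's lock/unlock
}                           // is not required to make it visible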