The performance of ThreadLocal, particularly ThreadLocal.remove, can degrade significantly under specific usage patterns. The attached reproducer shows that we can fairly easily get into a state where a single ThreadLocal.remove call takes about 1.5 seconds, for a map with ~580k entries and a table of ~1050k slots (load factor ~0.55).
```
java --add-opens=java.base/java.lang=ALL-UNNAMED ThreadLocalTest.java
...
Length: 1048576, numEntries: 582795, maxRun: 21245, time to remove random element: 1458.794ms
...
```
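For illustration, here is a rough sketch of the general shape of such a workload. This is not the attached ThreadLocalTest.java (which needs --add-opens because it inspects the map's internals reflectively), and it may not reproduce the pathology on every run since it depends on GC timing; the entry count and survival ratio below are made up. It just shows the idea: fill the current thread's ThreadLocalMap, let most keys become garbage so their entries go stale, then time one remove().
```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not the attached reproducer: populate the current thread's
// ThreadLocalMap, let most ThreadLocal keys become unreachable so their entries go
// stale after GC, then time a single remove() on a surviving key.
public class ThreadLocalRemoveSketch {
    public static void main(String[] args) throws Exception {
        int n = 600_000;                              // rough entry count (made up)
        List<ThreadLocal<Object>> live = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ThreadLocal<Object> tl = new ThreadLocal<>();
            tl.set(new Object());                     // inserts an entry into this thread's map
            if (i % 16 == 0) {
                live.add(tl);                         // keep a fraction of the keys strongly reachable
            }
            // the rest become unreachable; their entries turn stale once GC clears the weak keys
        }
        System.gc();                                  // encourage the weak keys to be cleared
        Thread.sleep(1000);

        ThreadLocal<Object> victim = live.get(ThreadLocalRandom.current().nextInt(live.size()));
        long start = System.nanoTime();
        victim.remove();                              // may trigger a long expungeStaleEntry scan
        System.out.printf("time to remove one element: %.3fms%n", (System.nanoTime() - start) / 1e6);
    }
}
```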
I have seen real applications that occasionally start spending nearly all of their CPU time in ThreadLocalMap.expungeStaleEntry (similar to the report at https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/43), and I suspect it may be this same issue.
I have only seen this in the wild with G1GC, and I have only managed to reproduce it with G1GC, so G1's behaviour seems to be an important factor in triggering the condition.
Some notes:
1. The problem occurs when we accrue long runs of elements in the hash table (possibly caused by primary clustering: https://en.wikipedia.org/wiki/Primary_clustering).
2. ThreadLocalMap does not use tombstones, which, if used correctly, can mitigate primary clustering (https://arxiv.org/abs/2107.01250).
3. expungeStaleEntry has quadratic cost in the length of the run. For example, when removing the first element of a run of 1000 elements, we iterate in the loop below 499,500 times in total (1000 outer iterations, each scanning 999/2 slots on average); a standalone simulation of this is sketched after the snippet:
```
// Unlike Knuth 6.4 Algorithm R, we must scan until
// null because multiple entries could have been stale.
while (tab[h] != null)
h = nextIndex(h, len);
```
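To see where the quadratic factor comes from, here is a small standalone simulation. It is a hedged model, not the JDK code: it replays only the rehash pass of expungeStaleEntry over a single dense run in which every entry hashes to the start of the run (extreme primary clustering), and counts executions of the "scan until null" inner loop. For a run of 1000 entries the count comes out on the order of 1000^2 / 2, in line with the estimate above.
```
// Hypothetical model of expungeStaleEntry's rehash pass, not the JDK source.
// tab[i] holds the home index of the entry occupying slot i (null = empty slot).
// Every entry in the run shares the same home index, i.e. extreme primary clustering.
public class ExpungeCostSketch {
    public static void main(String[] args) {
        int len = 1 << 20;              // table length, a power of two like ThreadLocalMap's
        int runLength = 1000;           // length of the dense run of colliding entries
        int start = 100;                // slot where the run begins
        Integer[] tab = new Integer[len];
        for (int i = 0; i < runLength; i++) {
            tab[start + i] = start;     // all entries hash to the first slot of the run
        }

        long innerIterations = 0;
        tab[start] = null;              // "remove" the first element of the run
        for (int i = start + 1; tab[i] != null; i = (i + 1) & (len - 1)) {
            int h = tab[i];             // the entry's home index
            if (h != i) {               // displaced entry: rehash it as expungeStaleEntry does
                Integer e = tab[i];
                tab[i] = null;
                while (tab[h] != null) {    // the "scan until null" loop from the snippet above
                    h = (h + 1) & (len - 1);
                    innerIterations++;
                }
                tab[h] = e;
            }
        }
        // For runLength = 1000 this prints roughly half a million iterations (~runLength^2 / 2).
        System.out.println("inner-loop iterations: " + innerIterations);
    }
}
```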