Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8178201 | 10 | Thomas Stuefe | P3 | Resolved | Fixed | b04 |
On AIX, we see sporadic asserts when running the jtreg tests:
-----------
# Internal Error (/priv/d031900/openjdk/jdk9-hs/source/hotspot/src/share/vm/runtime/thread.cpp:295), pid=1073374, tid=8739
# assert(_thr_current == 0L) failed: Thread::current already initialized
--------------- T H R E A D ---------------
Current thread (0xbabababababababa):
[error occurred during error reporting (printing current thread), id 0xe0000000]
------------
A newborn thread (usually the AttachListener) wants to initialize Thread::current(). Since JDK-8132510, Thread::current() is implemented with compiler-level TLS ("__thread"). Before that, it was implemented using pthread library TLS ("pthread_getspecific" etc.). So the code wants to initialize its instance of _thr_current (a __thread variable) but finds that it is not NULL. __thread variables should by default be initialized to 0 by the C-Runtime.
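To make the two mechanisms concrete, here is a minimal sketch of both side by side. It is not the HotSpot code: the `Thread` struct, the variable names, and the two `init_current_*` helpers are hypothetical stand-ins, and each helper returns false in the situation where the real code would hit the assert.

```cpp
#include <cassert>
#include <pthread.h>

struct Thread { int id; };  // hypothetical stand-in for the VM's Thread class

// Variant A: compiler-level TLS. The C-Runtime is expected to hand every
// new thread a zero-initialized copy of this variable.
static __thread Thread* thr_current = nullptr;

// Variant B: pthread library TLS. The key is created once per process;
// each thread's slot starts out as NULL.
static pthread_key_t thr_key;
static pthread_once_t thr_key_once = PTHREAD_ONCE_INIT;

static void make_key() {
  int rc = pthread_key_create(&thr_key, nullptr);
  assert(rc == 0);
  (void)rc;
}

// Returns false if the slot is already occupied, which is the condition
// the assert in the report checks for.
bool init_current_compiler_tls(Thread* t) {
  if (thr_current != nullptr) return false;
  thr_current = t;
  return true;
}

bool init_current_pthread_tls(Thread* t) {
  pthread_once(&thr_key_once, make_key);
  if (pthread_getspecific(thr_key) != nullptr) return false;
  return pthread_setspecific(thr_key, t) == 0;
}
```

In both variants the first initialization on a thread should succeed and a second one should be rejected; the bug described here corresponds to variant A rejecting the very first call on a fresh thread.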
In this case the __thread variable is filled with a "0xbababa..." pattern, which after analysis turned out to be the zap value we use in os::free() to mark freed memory before returning it to the C-Runtime.
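The zapping behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the os::free() implementation: the constant and function names are hypothetical, and the caller passes the block size explicitly because this sketch keeps no allocation header. Only the byte value 0xBA is taken from the pattern in the report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Byte value taken from the "0xbababa..." pattern; the name is hypothetical.
static const unsigned char kFreeZap = 0xBA;

// Fill a region with the zap byte so that stale reads are recognizable.
void zap_region(void* p, std::size_t size) {
  assert(p != nullptr);
  std::memset(p, kFreeZap, size);
}

// Zap the block, then return it to the C-Runtime.
void zapping_free(void* p, std::size_t size) {
  if (p == nullptr) return;
  zap_region(p, size);
  std::free(p);
}

// True if every byte of a pointer-sized value equals the zap byte, as in
// "Current thread (0xbabababababababa)".
bool looks_zapped(std::uintptr_t value) {
  for (std::size_t i = 0; i < sizeof(value); i++) {
    if (((value >> (8 * i)) & 0xFF) != kFreeZap) return false;
  }
  return true;
}
```

A check like `looks_zapped()` is one way to tell a zapped pointer apart from an arbitrary garbage value when reading a crash log.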
The memory backing the __thread variables lives in the process data segment, as does the C-heap memory, so an overwrite scenario is possible. In fact, __thread variable locations and malloc() locations are closely interleaved. From the address patterns, it looks like the C-Runtime just mallocs the backing memory for TLS instances as it goes along, for each newborn thread. It does not look like the C-Runtime pre-allocates memory for the TLS instances. All this is guesswork though: AIX is closed source, so there is no way to examine the implementation.
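One way to gather such address patterns is a small probe that starts a thread and records where its __thread variable and a fresh malloc() block live, along with the variable's initial value. The sketch below is hypothetical (the `Probe` struct and function names are made up for illustration) and says nothing about how the AIX C-Runtime actually allocates TLS; the addresses it reports are platform dependent.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <pthread.h>

static __thread void* tls_slot = nullptr;  // should start as NULL in every thread

struct Probe {
  void* initial_value;      // value of tls_slot seen first by the new thread
  std::uintptr_t tls_addr;  // address of the thread's tls_slot instance
  std::uintptr_t heap_addr; // address of a block malloc'ed by the same thread
};

static void* probe_thread(void* arg) {
  Probe* out = static_cast<Probe*>(arg);
  out->initial_value = tls_slot;
  out->tls_addr = reinterpret_cast<std::uintptr_t>(&tls_slot);
  void* block = std::malloc(64);
  out->heap_addr = reinterpret_cast<std::uintptr_t>(block);
  std::free(block);
  return nullptr;
}

// Starts one thread, waits for it, and returns what it observed.
Probe run_probe() {
  Probe p = {nullptr, 0, 0};
  pthread_t t;
  int rc = pthread_create(&t, nullptr, probe_thread, &p);
  assert(rc == 0);
  (void)rc;
  pthread_join(t, nullptr);
  return p;
}
```

Calling `run_probe()` repeatedly and comparing `tls_addr` against `heap_addr` shows how close the two kinds of allocations sit on a given platform; a non-NULL `initial_value` would be the symptom described in this report.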
There is a theoretical possibility that this is our fault, that the VM stomps over C-Runtime internal memory. However, after analyzing the issue I think that this is unlikely. It is more likely that the error is with the OS/C-Runtime. Here is why:
1) When examining the order of malloc/free calls, one can observe the malloc call which allocates the memory range that spans the location of the future-to-be bad TLS variable instance. It is malloced (via os::malloc()), then freed again (via os::free()). Nothing untoward happens, the VM is well behaved. It zaps the memory and hands it back to the C-Runtime. This zap value later shows up as the content of the newborn thread's TLS variable.
2) I never see *existing* TLS variables overwritten, only *new* TLS variables for newborn threads having the wrong initialization value. If we really were stomping around, we should have hit with a certain probability existing TLS variables too, or any other vital memory, and should see more diverse errors.
3) Error only happens on AIX 5.3, observed on two machines. No error seen on AIX 6.1 and AIX 7.2.
4) os::malloc/os::free establish and check guards (GuardedMemory). So simple cases of overwriters or double frees should be caught.
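The idea behind guard checking in point 4 can be sketched as follows. This is a simplified illustration, not the GuardedMemory implementation: the guard byte, the guard size, the header layout, and all function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

static const unsigned char kGuardByte = 0xAB;  // hypothetical guard pattern
static const std::size_t kGuardSize = 16;      // bytes of guard on each side
static const std::size_t kHeaderSize = 16;     // stores the user size, keeps alignment

static_assert(sizeof(std::size_t) <= kHeaderSize, "header must hold the size");

// Layout: [header][head guard][user data][tail guard]
void* guarded_malloc(std::size_t size) {
  unsigned char* raw = static_cast<unsigned char*>(
      std::malloc(kHeaderSize + 2 * kGuardSize + size));
  if (raw == nullptr) return nullptr;
  std::memset(raw, 0, kHeaderSize);
  std::memcpy(raw, &size, sizeof(size));
  std::memset(raw + kHeaderSize, kGuardByte, kGuardSize);
  unsigned char* user = raw + kHeaderSize + kGuardSize;
  std::memset(user + size, kGuardByte, kGuardSize);
  return user;
}

// True if neither guard has been overwritten.
bool guards_intact(const void* p) {
  assert(p != nullptr);
  const unsigned char* user = static_cast<const unsigned char*>(p);
  const unsigned char* head = user - kGuardSize;
  std::size_t size = 0;
  std::memcpy(&size, head - kHeaderSize, sizeof(size));
  for (std::size_t i = 0; i < kGuardSize; i++) {
    if (head[i] != kGuardByte || user[size + i] != kGuardByte) return false;
  }
  return true;
}

// Checks the guards, frees the block, and reports whether they were intact.
bool guarded_free(void* p) {
  if (p == nullptr) return true;
  bool ok = guards_intact(p);
  std::free(static_cast<unsigned char*>(p) - kGuardSize - kHeaderSize);
  return ok;
}
```

A scheme like this only catches writes that land in the guard regions of a live block, which is why it covers simple overwrites but not every conceivable form of memory corruption.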
I could still conceive of a highly far-fetched scenario (see comments) where we could be guilty of stomping over C-Runtime memory, but I find it unlikely. More likely, the OS/C-Runtime did not correctly initialize the __thread TLS variable for this thread.
I attempted to write a simple C reproduction case, but so far without success. We will contact IBM support and check for known bugs.
I propose to switch off compiler-based TLS and go back to pthread library TLS. (That should be simple: David preserved both code paths and added a compiler switch when he did the original changes for JDK-8132510.) There are no real advantages to compiler-level TLS, and pthread-level TLS worked for us on AIX for many years without problems. I would also like to reduce dependencies on the C-Runtime and compiler on AIX.
- backported by
-
JDK-8178201 [aix] assert(_thr_current == 0L) failed: Thread::current already initialized
- Resolved
- relates to
-
JDK-8132510 Replace ThreadLocalStorage with compiler/language-based thread-local variables
- Resolved