Type: Bug
Resolution: Fixed
Priority: P3
Affected Versions: 8, 11, 17, 21, 22
Resolved In Build: b08
OS: linux
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8315122 | 21.0.1 | Thomas Stuefe | P3 | Resolved | Fixed | b09 |
JDK-8321110 | 17.0.11-oracle | Poonam Bajaj Parhar | P3 | Resolved | Fixed | b01 |
JDK-8315396 | 17.0.10 | Thomas Stuefe | P3 | Resolved | Fixed | b01 |
Note that this effect is independent of any JVM switches; it happens regardless of -XX:+UseTransparentHugePages.
Demonstration:
10000 idle threads with a 100 MB pre-touched Java heap, -Xss2M, on x64, will consume:
A) Baseline (THP disabled on system): 369 MB
B) THP="always",
C) THP="always",
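The setups above differ in the system-wide THP mode. As a minimal sketch, assuming a Linux system that exposes the standard sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, the active mode could be determined like this. The function names are illustrative and not part of the JDK.

```python
def parse_thp_mode(text: str) -> str:
    """Return the active mode from the contents of the sysfs 'enabled' file.

    The kernel marks the active choice with square brackets,
    for example "always madvise [never]".
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def read_thp_mode(path: str = "/sys/kernel/mm/transparent_hugepage/enabled") -> str:
    """Read the system-wide THP mode; raises OSError where the file is missing."""
    with open(path) as f:
        return parse_thp_mode(f.read())
```

Parsing is kept separate from file access so the logic can be checked on machines without THP support.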
Cause:
The problem is caused by timing. When we create multiple Java threads, the following sequence of actions happens:
In the parent thread:
the parent thread calls pthread_create(3)
pthread_create(3) creates the thread stack by calling mmap(2)
pthread_create(3) calls clone(2) to start the child thread
repeat to start more threads
Each child thread:
queries its stack dimensions
handshakes with the parent to signal liveness
establishes guard pages at the low end of the stack
The thread stack mapping is established in the parent thread; the guard pages are placed by the child threads. There is a time window in which the thread stack is already mapped into address space, but guard pages still need to be placed.
If the parent is faster than the children, it will have created mappings faster than the children can place guard pages on them.
For the kernel, these thread stacks are just anonymous mappings. It places them adjacent to each other to reduce address space fragmentation. As long as no guard pages are placed yet, all these thread stack mappings (VMAs) have the same attributes - same permission bits, all anonymous. Hence, the kernel will fold them into a single large VMA.
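The folding step can be illustrated with a deliberately simplified model. This is only a sketch: the real kernel compares more attributes than the permission string used here, and the `Vma` type and `merge_adjacent` function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Vma(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address (exclusive)
    perms: str   # permission string as shown in /proc/<pid>/maps, e.g. "rw-p"


def merge_adjacent(vmas: List[Vma]) -> List[Vma]:
    """Fold neighbouring mappings that touch and carry the same permissions."""
    merged: List[Vma] = []
    for vma in sorted(vmas):
        if merged and merged[-1].end == vma.start and merged[-1].perms == vma.perms:
            # Same attributes and directly adjacent: extend the previous VMA.
            merged[-1] = merged[-1]._replace(end=vma.end)
        else:
            merged.append(vma)
    return merged
```

In this model, adjacent `rw-p` stacks without guard pages collapse into one entry, while a `---p` range between them keeps them apart.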
That VMA may be large enough to be eligible for huge pages. Now the JVM races with khugepaged: if khugepaged is faster than the JVM, it will have converted that larger VMA partly or fully into huge pages before the child threads start creating guard pages.
The child threads will catch up and create guard pages. That will splinter the large VMA into several smaller VMAs (two for each thread: one for the usable thread stack section, and one protected VMA for the guard pages). Each of these VMAs will typically be smaller than a huge page, and typically not huge-page-aligned. The huge pages created by khugepaged will mostly shatter into small pages, but these small pages remain paged-in. Effect: we pay for the memory of the whole thread stacks even though the threads have not started yet.
This is a similar effect as described in
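The splintering caused by late guard page placement can be modelled in a few lines. This is a sketch under simplifying assumptions: a VMA is reduced to a `(start, end, perms)` tuple, and `protect` is a hypothetical stand-in for the effect of changing the protection of an address range, not a real system call wrapper.

```python
from typing import List, Tuple

Vma = Tuple[int, int, str]  # (start, end_exclusive, perms)


def protect(vmas: List[Vma], start: int, end: int, perms: str = "---p") -> List[Vma]:
    """Model a protection change on [start, end): split overlapping VMAs."""
    result: List[Vma] = []
    for v_start, v_end, v_perms in vmas:
        lo, hi = max(v_start, start), min(v_end, end)
        if lo >= hi:                        # no overlap, keep unchanged
            result.append((v_start, v_end, v_perms))
            continue
        if v_start < lo:                    # part below the protected range
            result.append((v_start, lo, v_perms))
        result.append((lo, hi, perms))      # the newly protected part
        if hi < v_end:                      # part above the protected range
            result.append((hi, v_end, v_perms))
    return result
```

Applying `protect` once per thread to a single merged mapping yields two VMAs per thread, which is the pattern described above.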
Example:
Let's create three threads. Each thread stack, including guard pages, is 2M + 4K sized (+4K because of
Their thread stacks will be located at ([base .. end .. guard]):
T1: [7feea53ff000 .. 7feea5202000 .. 7feea51fe000]
T2: [7feea5600000 .. 7feea5403000 .. 7feea53ff000]
T3: [7feea5801000 .. 7feea5604000 .. 7feea5600000]
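The layout can be checked with plain address arithmetic. The sketch below only uses the addresses listed above; the helper names `describe` and `span_kb` are illustrative.

```python
# (base, end, guard) per thread as listed above; stacks grow downward,
# so base is the highest address and guard the lowest.
STACKS = {
    "T1": (0x7FEEA53FF000, 0x7FEEA5202000, 0x7FEEA51FE000),
    "T2": (0x7FEEA5600000, 0x7FEEA5403000, 0x7FEEA53FF000),
    "T3": (0x7FEEA5801000, 0x7FEEA5604000, 0x7FEEA5600000),
}


def describe(stacks):
    """Return (name, total kB, guard kB, usable kB) for every thread stack."""
    rows = []
    for name, (base, end, guard) in stacks.items():
        rows.append((name, (base - guard) // 1024, (end - guard) // 1024, (base - end) // 1024))
    return rows


def span_kb(stacks):
    """Size in kB of the range from the lowest guard address to the highest base."""
    lowest = min(guard for _, _, guard in stacks.values())
    highest = max(base for base, _, _ in stacks.values())
    return (highest - lowest) // 1024


for row in describe(STACKS):
    print(row)
print("total span kB:", span_kb(STACKS))
```

Each stack comes out at 2052 kB (2M + 4K), and each base equals the guard address of the next thread, so the three stacks are directly adjacent.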
After pthread_create(3), their thread stacks exist without JVM guard pages. The kernel merges the VMAs of their thread stacks into a single mapping larger than 6 MB. khugepaged then coalesces their small pages into 3 huge pages:
```
7feea51fe000-7feea5801000 rw-p 00000000 00:00 0 <<<------- all three stacks as one VMA
Size: 6156 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 6148 kB
Pss: 6148 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 6148 kB
Referenced: 6148 kB
Anonymous: 6148 kB
LazyFree: 0 kB
AnonHugePages: 6144 kB <<<---------- 3x2MB huge pages
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
VmFlags: rd wr mr mw me ac sd
```
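The three huge pages in the mapping above are consistent with simple alignment arithmetic. The sketch below assumes a 2 MB huge page size, as in this example, and counts how many aligned huge-page-sized ranges fit entirely inside an address range. The function name is illustrative.

```python
HUGE_PAGE = 2 * 1024 * 1024  # 2 MB, the huge page size in this example


def aligned_huge_pages(start: int, end: int, huge: int = HUGE_PAGE) -> int:
    """Count huge-page-aligned, huge-page-sized ranges fully inside [start, end)."""
    first = -(-start // huge) * huge   # round start up to a huge page boundary
    last = (end // huge) * huge        # round end down to a huge page boundary
    return max(0, (last - first) // huge)


# The merged mapping shown above.
print(aligned_huge_pages(0x7FEEA51FE000, 0x7FEEA5801000))
# A single usable stack section of 2036 kB, as it looks once guard pages exist.
print(aligned_huge_pages(0x7FEEA5202000, 0x7FEEA53FF000))
```

The merged range holds three aligned 2 MB ranges, while a single 2036 kB stack section holds none, which is why the split VMAs can no longer be backed by huge pages.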
Threads start and create their respective guard pages. The single VMA splinters into 6 smaller VMAs. The huge pages shatter into small pages that remain paged-in:
```
7feea51fe000-7feea5202000 ---p 00000000 00:00 0 <<----- guard pages for T1
Size: 16 kB
...
7feea5202000-7feea53ff000 rw-p 00000000 00:00 0 <<------ thread stack for T1
Size: 2036 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 2036 kB
Pss: 2036 kB
Private_Dirty: 2036 kB <<<-------- all pages resident
...
7feea53ff000-7feea5403000 ---p 00000000 00:00 0 <<----- guard pages for T2
Size: 16 kB
...
7feea5403000-7feea5600000 rw-p 00000000 00:00 0 <<------ thread stack for T2
Size: 2036 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 2036 kB
Pss: 2036 kB
Private_Dirty: 2036 kB <<<-------- all pages resident
...
7feea5600000-7feea5604000 ---p 00000000 00:00 0 <<----- guard pages for T3
Size: 16 kB
...
7feea5604000-7feea5801000 rw-p 00000000 00:00 0 <<------ thread stack for T3
Size: 2036 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 2036 kB
Pss: 2036 kB
Private_Dirty: 2036 kB <<<-------- all pages resident
...
```
- backported by
  - JDK-8315122 THPs cause huge RSS due to thread start timing issue (Resolved)
  - JDK-8315396 THPs cause huge RSS due to thread start timing issue (Resolved)
  - JDK-8321110 THPs cause huge RSS due to thread start timing issue (Resolved)
- relates to
  - JDK-8310233 Fix THP detection on Linux (Resolved)
  - JDK-8314139 TEST_BUG: runtime/os/THPsInThreadStackPreventionTest.java could fail on machine with large number of cores (Resolved)
  - JDK-8303215 Make thread stacks not use huge pages (Resolved)
  - JDK-8312585 Rename DisableTHPStackMitigation flag to THPStackMitigation (Resolved)
  - JDK-8312211 [Linux] Java guard page VMAs should clear the VM_ACCOUNT mm flag (Closed)
- links to
  - Commit openjdk/jdk17u-dev/e6b87a71
  - Commit openjdk/jdk21u/4cca633e
  - Commit openjdk/jdk/84b325b8
  - Review openjdk/jdk17u-dev/1679
  - Review openjdk/jdk17u-dev/1697
  - Review openjdk/jdk21u/103
  - Review openjdk/jdk/14919