Running G1 on a low-CPU container or host shows that a significant amount of memory is taken by the mark stacks. This is a major contributor to the footprint difference between Serial and G1, for example. Can we pick a more reasonable default for MarkStackSize? It might not be an issue for OSes that commit lazily; in that case, maybe NMT reporting should be changed to reflect this, as we already do for thread stacks.
Observe:
```
% cat Alloc.java
public class Alloc {
    static final int THREADS = 4;
    static final Object[] sinks = new Object[64*THREADS];
    static volatile boolean start;
    static volatile boolean stop;

    public static void main(String... args) throws Throwable {
        for (int t = 0; t < THREADS; t++) {
            int ft = t;
            new Thread(() -> work(ft * 64)).start();
        }
        start = true;
        Thread.sleep(10_000);
        stop = true;
    }

    public static void work(int idx) {
        while (!start) { Thread.onSpinWait(); }
        while (!stop) {
            sinks[idx] = new byte[128];
        }
    }
}
```
Run it with -XX:ActiveProcessorCount=1 to simulate running in a container. This also allows comparing what would happen if G1 ever became the universal default or were ergonomically selected under these conditions.
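To confirm that the flag took effect in a given run, a small check like the one below can help. This is only a sketch: the class name `EnvCheck` and its helper methods are made up for illustration, and it relies on `Runtime.availableProcessors()` reflecting the `-XX:ActiveProcessorCount` override and on the GC MXBean names identifying the collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the reproducer above.
public class EnvCheck {
    // Number of CPUs the JVM believes it has.
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Names of the registered GC MXBeans; they indicate which collector is in use.
    static List<String> collectors() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("CPUs seen by JVM: " + cpus());
        System.out.println("Collectors: " + collectors());
    }
}
```

Running it once with and once without `-XX:+UseG1GC`, both with `-XX:ActiveProcessorCount=1`, should show which collector each configuration ends up with.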
Serial takes about 172M committed with <0.5M spent for GC structures.
```
% build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m -XX:ActiveProcessorCount=1 -XX:NativeMemoryTracking=summary -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics Alloc.java
Total: reserved=1563070006, committed=172707382
malloc: 5533238 #44067, peak=12867944 #44146
mmap: reserved=1557536768, committed=167174144
- Java Heap (reserved=134217728, committed=134217728)
(mmap: reserved=134217728, committed=134217728, at peak)
...
- GC (reserved=442846, committed=442846)
(malloc=4574 #54) (peak=5294 #66)
(mmap: reserved=438272, committed=438272, at peak)
```
G1 takes 210M total, with about 38M of that for its GC structures. That's 22% more than Serial, ouch.
```
% build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m -XX:+UseG1GC -XX:ActiveProcessorCount=1 -XX:NativeMemoryTracking=summary -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics Alloc.java
Total: reserved=1603305555, committed=210878547
malloc: 7913555 #46230, peak=16417137 #43663
mmap: reserved=1595392000, committed=202964992
- Java Heap (reserved=134217728, committed=134217728)
(mmap: reserved=134217728, committed=134217728, at peak)
...
- GC (reserved=38471539, committed=38471539)
(malloc=2275187 #1710) (peak=2287227 #2091)
(mmap: reserved=36196352, committed=36196352, at peak)
(arena=0 #0) (peak=984 #1)
```
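To see which MarkStackSize the JVM actually settled on for a run, the flag can be read from inside the process. The following is a sketch under assumptions: the class name `MarkStackFlag` and the helper `flagValue` are hypothetical, and it assumes a HotSpot JVM where `com.sun.management.HotSpotDiagnosticMXBean` is available.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the reproducer above.
public class MarkStackFlag {
    // Returns the current value of a VM flag, or "unavailable" if this JVM does not expose it.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "unavailable";
        }
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the named VM option does not exist.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("MarkStackSize = " + flagValue("MarkStackSize"));
    }
}
```

Comparing the printed value with and without an explicit `-XX:MarkStackSize=...` shows what the ergonomics chose by default.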
Overriding MarkStackSize brings the overhead way down, to 178M total and 6M in GC structures, fairly close to Serial.
```
$ build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m -XX:ActiveProcessorCount=1 -XX:+UseG1GC -XX:NativeMemoryTracking=summary -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics -XX:MarkStackSize=128K Alloc.java
Total: reserved=1570822727, committed=178403911
malloc: 7936583 #46198, peak=44497815 #46777
mmap: reserved=1562886144, committed=170467328
- Java Heap (reserved=134217728, committed=134217728)
(mmap: reserved=134217728, committed=134217728, at peak)
...
- GC (reserved=5967579, committed=5967579)
(malloc=2277083 #1698) (peak=2289939 #2004)
(mmap: reserved=3690496, committed=3690496, at peak)
(arena=0 #0) (peak=984 #1)
```
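The comparison between the three runs can be recomputed directly from the NMT numbers above. This is a small sketch; the class name `NmtOverhead` and the helper `percentOver` are hypothetical, and the constants are copied from the committed figures in the outputs.

```java
import java.util.Locale;

// Hypothetical helper; constants are the committed sizes from the NMT outputs above.
public class NmtOverhead {
    // Percentage by which `value` exceeds `base`.
    static double percentOver(long base, long value) {
        return 100.0 * (value - base) / base;
    }

    public static void main(String[] args) {
        long serialTotal = 172_707_382L;      // Serial, total committed
        long g1Total = 210_878_547L;          // G1 default, total committed
        long g1SmallTotal = 178_403_911L;     // G1 with -XX:MarkStackSize=128K, total committed
        long g1GcMmap = 36_196_352L;          // G1 default, GC mmap committed
        long g1SmallGcMmap = 3_690_496L;      // G1 with -XX:MarkStackSize=128K, GC mmap committed

        System.out.printf(Locale.ROOT, "G1 default vs Serial: +%.1f%%%n",
                percentOver(serialTotal, g1Total));
        System.out.printf(Locale.ROOT, "G1 128K vs Serial: +%.1f%%%n",
                percentOver(serialTotal, g1SmallTotal));
        System.out.printf(Locale.ROOT, "GC mmap saved by override: %d MiB%n",
                (g1GcMmap - g1SmallGcMmap) / (1024 * 1024));
    }
}
```

With these inputs the default G1 run comes out roughly 22% above Serial, the 128K run roughly 3% above, and the GC mmap difference between the two G1 runs is 31 MiB.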