JDK-8332969

Improve Parallel Full GC to break even with Serial Full GC sooner


Details

    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P4
    • Fix Version/s: tbd
    • Affects Version/s: 23
    • Component/s: hotspot
    • Subcomponent: gc

    Description

      On some workloads, Parallel Full GC running in a single thread is slower than a (naively equivalent) Serial GC. After JDK-8329203, which pushed the break-even point quite low, Parallel breaks even with Serial when given 2..3 threads. It would be good to understand whether the remaining gap comes from intrinsic tradeoffs in the Parallel GC implementation, and/or whether the break-even point could be pushed even lower.

      This would simplify deployments, as users would not be forced to choose Serial over Parallel when only 1 CPU is available for the GC to run on.

      Example workload:

      ```
      % cat Fragger.java
      import java.util.concurrent.ThreadLocalRandom;

      public class Fragger {
          static final int COUNT = 15_000_000;
          static final Object[] ARR = new Object[COUNT];

          public static void main(String... args) {
              // Populate the heap with many small objects.
              for (int c = 0; c < COUNT; c++) {
                  ARR[c] = new byte[16];
              }

              // For 30 seconds: churn random slots for ~500 ms, then force a Full GC.
              long end = System.nanoTime() + 30_000_000_000L;
              while (System.nanoTime() < end) {
                  long nextGC = System.nanoTime() + 500_000_000L;
                  while (System.nanoTime() < nextGC) {
                      int idxFull = ThreadLocalRandom.current().nextInt(COUNT);
                      ARR[idxFull] = new byte[16];
                  }
                  System.gc();
              }
          }
      }

      % build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+UseSerialGC -Xlog:gc Fragger.java
      [24.901s][info][gc] GC(16) Pause Full (System.gc()) 790M->654M(3959M) 901.744ms
      [26.374s][info][gc] GC(17) Pause Full (System.gc()) 1010M->654M(3959M) 972.383ms
      [27.800s][info][gc] GC(18) Pause Full (System.gc()) 947M->654M(3959M) 925.915ms
      [29.214s][info][gc] GC(19) Pause Full (System.gc()) 905M->518M(3959M) 914.314ms
      [30.659s][info][gc] GC(20) Pause Full (System.gc()) 873M->654M(3959M) 943.951ms

      % build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+UseParallelGC -XX:ParallelGCThreads=1 -Xlog:gc Fragger.java
      [23.677s][info][gc] GC(11) Pause Full (System.gc()) 793M->518M(3925M) 1441.179ms
      [25.616s][info][gc] GC(12) Pause Full (System.gc()) 793M->518M(3925M) 1438.795ms
      [27.556s][info][gc] GC(13) Pause Full (System.gc()) 832M->518M(3925M) 1439.856ms
      [29.512s][info][gc] GC(14) Pause Full (System.gc()) 891M->518M(3925M) 1455.564ms
      [31.468s][info][gc] GC(15) Pause Full (System.gc()) 812M->518M(3925M) 1455.337ms
      ```
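
      For reference, the break-even behavior mentioned above can be re-checked on the same workload by only raising the thread count, e.g.:

      ```
      % build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xlog:gc Fragger.java
      % build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+UseParallelGC -XX:ParallelGCThreads=3 -Xlog:gc Fragger.java
      ```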

      The collectors behave a bit differently, even with periodic Full GCs, so some of the difference might hide there as well. But the difference is still present even in cycles that do a similar amount of work, e.g. "905M->518M(3959M) 914.314ms" vs. "891M->518M(3925M) 1455.564ms".

      Initial profiles are attached. The difference is likely at a lower level, e.g. Parallel's use of marking bitmaps vs. Serial's marking in object headers, etc. parallel-1.html shows that we spend quite a bit of time in `ParCompactionManager::mark_and_push`; I think it would be interesting to dive in there. A rough sketch of that path follows.
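
      The general shape of that hot path is roughly the following; this is a hedged sketch with stand-in `MarkBitmap` and `MarkStack` types, not the actual `ParCompactionManager` code:

      ```
      #include <atomic>
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // Hedged sketch, not HotSpot code: Parallel keeps mark state in a side
      // structure, so every visited reference pays for index math plus an
      // atomic test-and-set, while Serial can flip a bit in a header word it
      // touches anyway. One flag per object here, for brevity.
      struct MarkBitmap {
          std::vector<std::atomic<uint8_t>> marks;
          explicit MarkBitmap(size_t objects) : marks(objects) {}
          bool par_mark(size_t idx) {  // true iff this thread won the race
              return marks[idx].exchange(1, std::memory_order_acq_rel) == 0;
          }
      };

      struct MarkStack {
          std::vector<size_t> buf;  // per-GC-thread, so pushes need no locks
          void push(size_t idx) { buf.push_back(idx); }
      };

      // Mark-and-push: only the thread that actually marked the object pushes
      // it for later scanning, so each object is scanned exactly once.
      void mark_and_push(MarkBitmap& bm, MarkStack& stack, size_t obj_idx) {
          if (bm.par_mark(obj_idx)) {
              stack.push(obj_idx);
          }
      }
      ```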

      Some initial questions to ponder:
       a) Can/should Parallel use object headers for marking? (This likely has implications for the complexity of Compact Object Headers.)
       b) Can/should Parallel use memory_order_relaxed for bitmap accesses? (See the sketch after this list.)
       c) Maybe there is some minor code polishing in `ParCompactionManager::MarkingStatsCache::push` and friends that can give us a bit more of an edge?
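
      To make (a) and (b) more concrete, here is a minimal sketch of the two marking schemes, assuming a one-bit mark in a 64-bit header word and one bitmap bit per 8-byte heap word. All names are made up for illustration; this is not the actual HotSpot code:

      ```
      #include <atomic>
      #include <cstddef>
      #include <cstdint>

      // (a) Header marking, Serial-style idea: the mark is a bit in the
      // object's own header word, so no side structure is touched. Doing this
      // in Parallel would interact with Compact Object Headers, as noted.
      struct ObjHeader {
          std::atomic<uintptr_t> mark_word;
          static constexpr uintptr_t MARK_BIT = 1;
          bool try_mark() {
              uintptr_t old = mark_word.fetch_or(MARK_BIT, std::memory_order_relaxed);
              return (old & MARK_BIT) == 0;  // true iff we set the bit
          }
      };

      // (b) Side-bitmap marking, Parallel-style idea: one mark bit per 8-byte
      // heap word, kept away from the objects themselves.
      struct SideBitmap {
          std::atomic<uint64_t>* bits;  // sized elsewhere: one bit per heap word
          uintptr_t heap_base;
          bool par_mark(uintptr_t obj) {
              size_t bit    = (obj - heap_base) >> 3;  // heap-word index
              uint64_t mask = uint64_t(1) << (bit & 63);
              // Question (b): could this RMW be memory_order_relaxed? Marking
              // only needs each bit set once; it does not order other accesses.
              uint64_t old = bits[bit >> 6].fetch_or(mask);  // seq_cst default
              return (old & mask) == 0;
          }
      };
      ```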

            People

              Assignee: Unassigned
              Reporter: Aleksey Shipilev (shade)
