Type: Bug
Resolution: Unresolved
Priority: P4
Fix Version/s: None
Affects Version/s: 11.0.20
Component/s: hotspot
Labels:
- gc-g1
- oracle-gc-triage-seen

Subcomponent:
gc

### Please provide a brief summary of the bug

We observe rare G1 crashes in G1ConcurrentMark::mark_in_next_bitmap in at least AdoptOpenJDK/Temurin 11.0.4, 11.0.8, 11.0.16, 11.0.16.1, 11.0.18, and 11.0.19. We don't have a way to reproduce the issue, and it seemingly happens at random based on the reports sent to us by users. We have observed this specific crash 200 times on 144 different machines in the last 3 weeks.

I have included the full crash report of one of these crashes [here](https://github.com/adoptium/adoptium-support/files/12321954/g1_crash.txt). The reports are all nearly identical and share the same native frame stack.

They look like this:

Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (11.0.19+7, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)

```
Current thread (0x000001e522b25000): ConcurrentGCThread "G1 Conc#0" [stack: 0x0000003ed5500000,0x0000003ed5600000] [id=2332]

Stack: [0x0000003ed5500000,0x0000003ed5600000], sp=0x0000003ed55ffb90, free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0x3041e5]
V [jvm.dll+0x3040c2]
V [jvm.dll+0x2fbdcd]
V [jvm.dll+0x2fb491]
V [jvm.dll+0x30ce70]
V [jvm.dll+0x30c764]
V [jvm.dll+0x310c36]
V [jvm.dll+0x812840]
V [jvm.dll+0x79d6e4]
V [jvm.dll+0x64e915]
C [ucrtbase.dll+0x29363]
C [KERNEL32.DLL+0x126ad]
C [ntdll.dll+0x5aa68]

siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x0000000000000160
```

Mapping the DLL offsets to symbols using the PDB files Adoptium provides yields this stack:

```
bool G1ConcurrentMark::mark_in_next_bitmap(unsigned int,class oopDesc * __ptr64 const,unsigned __int64) __ptr64
bool G1CMTask::make_reference_grey(class oopDesc * __ptr64) __ptr64
static void OopOopIterateDispatch<class G1CMOopClosure>::Table::oop_oop_iterate<class ObjArrayKlass,unsigned int>(class G1CMOopClosure * __ptr64,class oopDesc * __ptr64,class Klass * __ptr64)
int oopDesc::oop_iterate_size<class G1CMOopClosure>(class G1CMOopClosure * __ptr64) __ptr64
void G1CMTask::drain_local_queue(bool) __ptr64
void G1CMTask::do_marking_step(double,bool,bool) __ptr64
virtual void G1CMConcurrentMarkingTask::work(unsigned int) __ptr64
virtual void GangWorker::loop(void) __ptr64
void Thread::call_run(void) __ptr64
static __int64 os::thread_cpu_time(class Thread * __ptr64,bool)
```

For reference, the code surrounding the crash is:

```cpp
inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, oop const obj, size_t const obj_size) {
  HeapRegion* const hr = _g1h->heap_region_containing(obj);
  return mark_in_next_bitmap(worker_id, hr, obj, obj_size);
}

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, HeapRegion* const hr, oop const obj, size_t const obj_size) {
  assert(hr != NULL, "just checking");
  assert(hr->is_in_reserved(obj), "Attempting to mark object at " PTR_FORMAT " that is not contained in the given region %u", p2i(obj), hr->hrm_index());

  if (hr->obj_allocated_since_next_marking(obj)) {
    return false;
  }
```

I have disassembled jvm.dll to determine what is happening.

```
1803041be  48 8b 41 08           MOV  RAX,qword ptr [RCX + this->_g1h]   ; RAX = this->_g1h
1803041c2  4c 8b f9              MOV  R15,this
1803041c5  4d 8b d0              MOV  R10,param_2                        ; R10 = param_2
1803041c8  44 8b e2              MOV  R12D,param_1
1803041cb  49 8b e9              MOV  RBP,param_3
1803041ce  49 8b f8              MOV  RDI,param_2
1803041d1  8b 88 c0 02 00 00     MOV  this,dword ptr [RAX + 0x2c0]       ; RCX = _regions._shift_by
1803041d7  48 8b 80 b0 02 00 00  MOV  RAX,qword ptr [RAX + 0x2b0]        ; RAX = _regions._biased_base
1803041de  49 d3 ea              SHR  R10,this                           ; R10 = param_2 >> RCX
1803041e1  4e 8b 14 d0           MOV  R10,qword ptr [RAX + R10*0x8]      ; R10 = _biased_base[R10] (8-byte entries)
1803041e5  4d 3b 82 60 01 00 00  CMP  param_2,qword ptr [R10 + 0x160]    ; crash here: compares param_2 with ->_next_top_at_mark_start
```

The compiled code is somewhat dense because the compiler inlines the calls to `heap_region_containing`, `addr_to_region`, `get_by_address`, `shift_by` and `biased_base`, as well as the call to the `mark_in_next_bitmap` overload that takes a `HeapRegion*`.

Inlining them in source form would look something like this:

```cpp
inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, oop const obj, size_t const obj_size) {
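  // Region lookup and the obj_allocated_since_next_marking() check, both inlined.
  if ((HeapWord*)obj >= _g1h->_hrm._regions._biased_base[(uintptr_t)obj >> _g1h->_hrm._regions._shift_by]->_next_top_at_mark_start) {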
    return false;
  }

  ...
}
```
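
To make the lookup easier to follow, here is a minimal, self-contained model of it. This is only a sketch and not HotSpot code: `Region`, `RegionTable` and `allocated_since_next_marking` are hypothetical names, the table is indexed by `(addr >> shift_by) - bias` instead of through a pre-biased pointer, and an empty slot is assumed to hold `nullptr`. The last function shows what the comparison would look like with the null check that the product build does not perform (the `assert(hr != NULL, ...)` in the original source is only active in debug builds).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for HeapRegion: only the field the crash site reads.
struct Region {
    std::uint64_t next_top_at_mark_start;
};

// Simplified model of a table that maps an address to its region pointer.
class RegionTable {
public:
    RegionTable(std::uint64_t heap_bottom, std::size_t num_regions, unsigned shift_by)
        : slots_(num_regions, nullptr),
          bias_(heap_bottom >> shift_by),
          shift_by_(shift_by) {
        assert(shift_by < 64);
    }

    // Registers a region for the slot that covers addr.
    void set(std::uint64_t addr, Region* r) { slots_.at(index_of(addr)) = r; }

    // Returns nullptr when the slot covering addr holds no region.
    // Throws std::out_of_range (via at()) for addresses outside the table.
    Region* get_by_address(std::uint64_t addr) const { return slots_.at(index_of(addr)); }

private:
    std::size_t index_of(std::uint64_t addr) const {
        return static_cast<std::size_t>((addr >> shift_by_) - bias_);
    }

    std::vector<Region*> slots_;
    std::uint64_t bias_;
    unsigned shift_by_;
};

// Null-checked version of the comparison at the crash site.
// region_found reports whether the lookup produced a region at all.
bool allocated_since_next_marking(const RegionTable& table, std::uint64_t obj, bool* region_found) {
    const Region* hr = table.get_by_address(obj);
    *region_found = (hr != nullptr);
    if (hr == nullptr) {
        return false;  // the compiled product code dereferences hr here instead
    }
    return obj >= hr->next_top_at_mark_start;
}
```

With a table of 1 MiB regions (`shift_by = 20`) in which only the first slot is populated, an address in the second region yields `region_found == false`, which corresponds to the situation the crash appears to hit.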

Note that the CMP instruction reads `qword ptr [R10 + 0x160]`, and the crash log reports `EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x0000000000000160`. As far as I can tell, this means the value loaded from the `_biased_base` array is 0x0, i.e. `hr` is null, and the crash is a null pointer dereference when reading `hr->_next_top_at_mark_start` at offset 0x160.
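
The arithmetic behind that conclusion can be checked with a small sketch. `MockHeapRegion` is a hypothetical layout, not the real `HeapRegion`: it only assumes that `_next_top_at_mark_start` sits at offset 0x160, as the disassembly suggests. The functions compute addresses only and never dereference anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: 0x160 bytes of unrelated fields, then the field
// that the CMP instruction reads.
struct MockHeapRegion {
    unsigned char other_fields[0x160];
    std::uint64_t next_top_at_mark_start;
};

static_assert(offsetof(MockHeapRegion, next_top_at_mark_start) == 0x160,
              "field must sit at the offset seen in the disassembly");

// Address the CPU reads for hr->next_top_at_mark_start, given the raw
// value of hr. Pure arithmetic, no memory access.
std::uint64_t field_read_address(std::uint64_t hr_value) {
    return hr_value + offsetof(MockHeapRegion, next_top_at_mark_start);
}

// Inverse: given the faulting address from the crash log, recover the
// value that hr must have had.
std::uint64_t implied_hr_value(std::uint64_t fault_address) {
    assert(fault_address >= offsetof(MockHeapRegion, next_top_at_mark_start));
    return fault_address - offsetof(MockHeapRegion, next_top_at_mark_start);
}
```

Under this assumed layout, `field_read_address(0)` is 0x160 and `implied_hr_value(0x160)` is 0, which matches the faulting address in the crash log and a null `hr`.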

I have almost no understanding of G1 or of most of the JDK, so I don't know where to go from here.

### Please provide steps to reproduce where possible

_No response_

### Expected Results

No crash

### Actual Results

Crash

### What Java Version are you using?

Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (11.0.19+7, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)

### What is your operating system and platform?

_No response_

### How did you install Java?

_No response_

### Did it work before?

_No response_

### Did you test with the latest update version?

_No response_

### Did you test with other Java versions?

_No response_

### Relevant log output

_No response_

Attachments:
- g1_crash_11020.txt (117 kB, 2023-08-20 15:24)
- g1_crash.txt (116 kB, 2023-08-20 15:24)
