Filed on behalf of Yannik Stradmann:
https://mail.openjdk.org/pipermail/hotspot-runtime-dev/2025-April/077952.html
I'd like to propose a change to hotspot's error handling when spawning native
threads in os::create_thread().
Currently, if EAGAIN is encountered, we retry three times back-to-back.
In recent years, I have seen instabilities on certain systems where back-to-back retries of native thread creation kept hitting the depleted resource pool and eventually failed.
I therefore propose introducing an exponential backoff when EAGAIN is hit during native thread creation. HotSpot would then be kinder to an already depleted resource, put less stress on the kernel, and be more robust on systems under high load.
For reference, I am attaching a patch against os_linux.cpp, which has been running in production on a mid-scale Jenkins cluster over the past three years. If you approve the modification, I'm happy to create a pull request that includes the other platforms (where applicable).
The current choice of constants is arbitrary and I'd welcome any suggestions here.
Please note that this is my first time contributing to OpenJDK, so please excuse any unfamiliarity with the process.
Yannik
diff --git a/src/hotspot/os/linux/os_linux.cpp b/src/hotspot/os/linux/os_linux.cpp
index 4e26797cd5b..2858fbba247 100644
--- a/src/hotspot/os/linux/os_linux.cpp
+++ b/src/hotspot/os/linux/os_linux.cpp
@@ -1064,10 +1064,28 @@ bool os::create_thread(Thread* thread, ThreadType thr_type,
ResourceMark rm;
pthread_t tid;
int ret = 0;
- int limit = 3;
- do {
+ int limit = 5;
+ useconds_t delay = 1'000;
+ constexpr useconds_t max_delay = 1'000'000;
+
+ while (true) {
ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
- } while (ret == EAGAIN && limit-- > 0);
+
+ if (ret != EAGAIN) {
+ break;
+ }
+
+ if (limit-- <= 0) {
+ break;
+ }
+
+ log_warning(os, thread)("Failed to start native thread (%s), retrying after %dus.", os::errno_name(ret), delay);
+ ::usleep(delay);
+ delay *= 2;
+ if (delay > max_delay) {
+ delay = max_delay;
+ }
+ }
char buf[64];
if (ret == 0) {
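
To make the retry schedule in the patch easier to reason about, here is a minimal standalone sketch of the same control flow. It is not HotSpot code: `create_with_backoff`, `RetryResult`, and the injected `create` and `sleep_us` callables are hypothetical names introduced only for illustration, standing in for `pthread_create` and `::usleep`. The default constants are the ones from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper, not part of HotSpot: records what the retry loop does.
struct RetryResult {
  int ret;                               // last return code from create()
  int attempts;                          // number of create() calls made
  std::vector<std::uint32_t> delays_us;  // delays requested, in order
};

// Mirrors the loop in the patch, with the thread-creation call and the sleep
// passed in as callables so the schedule can be inspected without sleeping.
RetryResult create_with_backoff(const std::function<int()>& create,
                                const std::function<void(std::uint32_t)>& sleep_us,
                                int limit = 5,
                                std::uint32_t delay_us = 1000,
                                std::uint32_t max_delay_us = 1000000) {
  // Assumed preconditions; they also rule out overflow when doubling.
  assert(limit >= 0 && delay_us <= max_delay_us);

  RetryResult result{0, 0, {}};
  while (true) {
    result.ret = create();
    ++result.attempts;

    if (result.ret != EAGAIN) {
      break;  // success, or an error that retrying will not fix
    }
    if (limit-- <= 0) {
      break;  // retry budget exhausted
    }

    sleep_us(delay_us);
    result.delays_us.push_back(delay_us);

    // Double the delay for the next round, capped at max_delay_us.
    delay_us = std::min<std::uint32_t>(delay_us * 2, max_delay_us);
  }
  return result;
}
```

Under these assumptions, a `create` that always returns EAGAIN is called six times with five sleeps of 1, 2, 4, 8, and 16 ms in between, about 31 ms in total. With `limit = 5` and a 1 ms starting delay, the 1 s cap is never reached, which may be worth keeping in mind when discussing the constants.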
Relates to: JDK-8268773 Improvements related to: Failed to start thread - pthread_create failed (EAGAIN) (Resolved)
Links to: Review(master) openjdk/jdk/24682