Last night's low-level threadfest seems to have worked pretty well, but there is
at least one significant remaining area of unsafety.
We learned yesterday that we must not try to acquire ANY monitor (or wait on any
monitor) while the scheduler is locked. That covers doing it explicitly as well as
calling some libc function like malloc() that uses our monitors under the covers. If
you do, and someone else owns the monitor, you will wait for it AND OTHER THINGS WILL
RUN. Since code under the scheduler lock assumes nothing else can run, you will die
some percentage of the time when this happens.
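One way to catch violations of this rule during testing is a checked monitor entry
that refuses to proceed while the scheduler is locked. This is only a sketch: the
names sched_lock(), sched_unlock(), checked_monitor_enter() and the monitor_t layout
are hypothetical stand-ins, not our real primitives, and the "monitor" here does not
model contention at all. Note that the check only bumps a counter, because printing
from inside the scheduler lock would itself break the rule.

```c
#include <assert.h>

/* Hypothetical stand-in for the scheduler lock: >0 means "scheduler locked". */
static int sched_lock_depth = 0;

/* Counts attempts to enter a monitor while the scheduler was locked. */
static int violations = 0;

static void sched_lock(void)   { sched_lock_depth++; }
static void sched_unlock(void) { sched_lock_depth--; }

/* Hypothetical monitor: just records an owner id, -1 when free. */
typedef struct { int owner; } monitor_t;

/*
 * Enter a monitor, but refuse if the scheduler is locked.
 * Returns 0 on success, -1 on a rule violation (monitor left untouched).
 */
static int checked_monitor_enter(monitor_t *m, int self)
{
    if (sched_lock_depth > 0) {
        violations++;          /* count only; no I/O in here */
        return -1;
    }
    m->owner = self;
    return 0;
}

static void checked_monitor_exit(monitor_t *m)
{
    m->owner = -1;
}
```

Inspecting the violations counter after a test run (once the scheduler is unlocked)
would show whether any SCHED_LOCKED path tried to take a monitor.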
GC thinks it is protected because it is wrapped in a scheduler lock. A hole that we
failed to plug yesterday is that GC does at least one thing that uses a monitor:
it calls sysMonitorEnumerateOver() to enumerate over the threads, and that locks
the thread queue monitor. There may be other things too, but some of the usual
suspects (malloc/free) appear innocent. Things like memset() are used but don't
lock. We need to inspect this and all SCHED_LOCKED code with a fine-toothed
comb.
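One possible shape for a fix, offered as a sketch and not as what the GC does today,
is to take the thread queue monitor BEFORE locking the scheduler, so the only place
we can block is outside the scheduler lock. Everything below is hypothetical: the
monitor_t flag, monitor_enter(), enumerate_threads_locked() and gc_scan_threads() are
illustrative names, the flag is not a real lock, and sysMonitorEnumerateOver() is
deliberately not reproduced since its signature is not given here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical monitor: a flag, not a real lock. */
typedef struct { int held; } monitor_t;

static monitor_t thread_queue_monitor;
static int sched_locked = 0;            /* stand-in for the scheduler lock */

static void monitor_enter(monitor_t *m)
{
    assert(!sched_locked);   /* the rule: never enter a monitor under the scheduler lock */
    m->held = 1;
}

static void monitor_exit(monitor_t *m) { m->held = 0; }

/* Hypothetical thread queue: a singly linked list. */
typedef struct thread { int id; struct thread *next; } thread_t;
static thread_t *thread_queue;

/* Walks the queue; the caller must already hold thread_queue_monitor. */
static int enumerate_threads_locked(void (*fn)(thread_t *, void *), void *arg)
{
    int n = 0;
    thread_t *t;
    assert(thread_queue_monitor.held);
    for (t = thread_queue; t != NULL; t = t->next) {
        fn(t, arg);
        n++;
    }
    return n;
}

static void mark_thread(thread_t *t, void *arg)
{
    (void)t;
    (*(int *)arg)++;         /* placeholder for real per-thread GC work */
}

/* Monitor first (may block, harmlessly), then scheduler lock, then enumerate. */
static int gc_scan_threads(void)
{
    int marked = 0;
    monitor_enter(&thread_queue_monitor);
    sched_locked = 1;
    enumerate_threads_locked(mark_thread, &marked);
    sched_locked = 0;
    monitor_exit(&thread_queue_monitor);
    return marked;
}
```

The ordering is the whole point: any wait happens before the scheduler is locked,
and the locks are released in reverse order. Whether this ordering is safe against
other code that takes the two locks the other way round would still need checking.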
The result of this hole will be unpredictable crashes just after GC, with unknown
frequency. Note that nothing done yesterday makes this worse than before the
changes -- we just didn't get everything plugged.
[Addendum:] Incidentally, the need to avoid monitors in SCHED_LOCKED() code
means that any tracing or debugging messages put in that code may
*significantly* perturb the system and reduce reliability. Things like printf()
presumably go through libc code that wants mutexes, which means acquiring
monitors. It might be better to just collect stats while SCHED_LOCKED() and only
do things like printing messages once you get out.
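A minimal sketch of that "collect now, print later" idea follows. It assumes a
fixed-size static buffer so that recording needs no allocation, no locks and no I/O;
trace_record(), trace_flush(), TRACE_MAX and the event layout are all made-up names
for illustration, and tags are assumed to be string literals so nothing is copied.

```c
#include <assert.h>
#include <stdio.h>

#define TRACE_MAX 64          /* arbitrary capacity for the sketch */

typedef struct {
    const char *tag;          /* assumed to point at a string literal */
    long value;
} trace_event;

static trace_event trace_buf[TRACE_MAX];
static int trace_count = 0;
static int trace_dropped = 0;

/* Intended for SCHED_LOCKED code: stores into a static array and nothing else. */
static void trace_record(const char *tag, long value)
{
    if (trace_count < TRACE_MAX) {
        trace_buf[trace_count].tag = tag;
        trace_buf[trace_count].value = value;
        trace_count++;
    } else {
        trace_dropped++;      /* buffer full: count the loss, do not block */
    }
}

/*
 * Call only AFTER the scheduler lock is released, since fprintf() may take
 * libc locks. Returns the number of events written and resets the buffer.
 */
static int trace_flush(FILE *out)
{
    int i, n = trace_count;
    for (i = 0; i < n; i++)
        fprintf(out, "%s=%ld\n", trace_buf[i].tag, trace_buf[i].value);
    if (trace_dropped > 0)
        fprintf(out, "dropped=%d\n", trace_dropped);
    trace_count = 0;
    trace_dropped = 0;
    return n;
}
```

The GC would call something like trace_record("roots", n) while locked, and the code
that releases the scheduler lock would call trace_flush(stderr) afterwards.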