Summary
Improve startup and warmup time by making native code from a previous run of an application instantly available when the HotSpot Java Virtual Machine starts. This will greatly reduce the initial load on the JIT compiler and hence its interference with the application, especially in configurations with fewer cores. The JIT is then free to delay the generation of native code unless and until the previously generated code proves insufficiently performant.
Goals
Help applications start up and warm up more quickly by shifting dynamic (JIT) method compilation from production runs to training runs, conveying the necessary native code via the AOT cache.
Do not require any change to the code of applications, libraries, or frameworks.
Do not introduce any new constraints on application execution.
Do not introduce new AOT workflows, but, rather, use the existing AOT cache creation commands.
Provide full interoperability between all optimization levels and execution modes in the HotSpot Java Virtual Machine, including new AOT code, existing JIT code, and the bytecode interpreter.
Motivation
To prepare the best possible native code for an application, we must first run the application.
This means that, initially, an application must execute by means of less-than-optimal techniques. During this initial period, called warmup, the actual application behavior must be observed (or profiled) in order to track which code paths and object types need to be prioritized for optimization. As profiles accumulate during warmup, the system is able to compile and install highly optimized native code, which is organized to provide the best possible performance. When application execution is fully transferred to this optimized code, it stays at peak performance, as long as the profiled code paths and object types continue to dominate performance.
The Java language is highly dynamic, which means that many application behaviors emerge on the fly; some application behaviors are resistant to static prediction. During application startup, the various classes in the program link themselves together dynamically, and create their initial working sets of objects dynamically by running setup code and the application’s main routine. During warmup, the application further adjusts its set of hot paths and data structures, as it moves from its initial configuration into its steady-state processing loops.
One might wish for a purely static compiler, like that of C++, which could somehow skip the startup and warmup steps, but such a compiler is a poor match for Java’s highly dynamic execution patterns. A static compiler is likely to produce too much code of low quality, since it is difficult to accurately anticipate the actual dynamic behavior of a typical Java application. Static compilers sometimes support profile guided optimization, but profiles are difficult to gather and can be inaccurate. If there is a speculation error, performance must be regained by rebuilding and restarting the application.
To respond to these challenges, Java virtual machines (or “VMs”) have brought just in time (or “JIT”) compilation into mainstream use. Early misconceptions that Java is only a slow interpreted programming language were replaced by enthusiastic adoption when HotSpot (and other virtual machines) introduced today’s optimizing Java compilers, which are competitive with offline “static” compilers, but run concurrently with the Java application itself. After a VM gathers application profile data, its JIT compiler can then generate code that executes as fast as possible. If performance suffers because of speculation failure, the VM can simply recompile affected methods, without disturbing the running application.
Thus, the use of JIT compilation supports Java’s inherent dynamism. Although at first the VM uses its bytecode interpreter to organize startup activities, the VM’s JIT compiler is also dynamically generating code, upgrading many thousands of methods to use native code. At first the JIT may generate less optimized general-purpose code for a given method. However, as that method executes many times, the VM also gathers additional profile information about its exact behavior, and eventually recompiles it, generating optimized code for the observed behaviors of the application’s steady state.
Methods which do not contribute significantly to performance might not be compiled or fully optimized; we say such methods are not “hot” enough to receive extra attention from the VM. (Oddly enough, executing a very cold method using single-use native code often leads to a net performance loss; in those cases, using the interpreter can improve startup performance.) But after the application has run for enough time, all of the “hot” methods are fully optimized. (This is the “hot spot” referred to by the name of the HotSpot VM!) At that point, the application is said to have reached “peak” performance.
In any given run, the peak performance comes from JIT code tuned to exactly the current behavior of the application. Where a static compiler would conservatively support all possible behaviors, the dynamic JIT assumes that rarely taken program paths or rarely used data types are irrelevant to performance, and does not let them complicate the optimized code. Such use of program behavior is called “profile guided optimization”, and Java JITs are masters of this craft.
In a Java VM, profile guided optimizations are “speculative”, in the sense that the JIT guesses good choices for code, but is prepared to correct that code if its guesses are wrong. As yet another form of dynamism in Java VMs, JIT code which encounters an unoptimized path or data type can be “deoptimized”. As mistakes in speculation are corrected by recompilation, the application regains its peak performance. Deoptimization and recompilation cycles happen routinely in practice, and can repeat many times as the application makes previously unexpected detours into new parts of the application logic.
The result of all this dynamism is that Java programs are easy to debug, configure, build, and deploy, without any compromise to application throughput.
There is a small problem in this pleasant picture. Every Java programmer has noticed at some point that the benefits of dynamism are paid for as a program starts up and warms up. Applications running on HotSpot do not start up instantly, and they can run slower than their expected peak performance for some time, until the JIT does all of its work. When looking in detail at processing costs, one can see the JIT using many CPU seconds generating code before the application is fully warmed up. For very large applications, warmup may even require minutes or hours, due to JIT activity.
It may seem that there is no shortcut, and that peak application performance is only attained after a CPU-intensive warmup period, including application execution, profiling, and optimizing JIT compilation.
Recent work has reduced these warmup costs, in part. JEP 483 shifts application linking and loading to a training run by means of the AOT cache. JEP 515 shifts profiling work in the same way, so that a production run starts with ready-made profile data and the JIT compiler can run immediately. But warmup is still delayed, by seconds or even minutes, because the JIT compilation of optimized code uses many computing resources. On some platforms, the latency of JIT compilation can be hidden by running many JIT threads in parallel, but this trick requires the allocation of processors beyond those immediately useful to the application. Surely it would be helpful if the heavy work of JIT compilation could be shifted to a training run as well.
Description
We extend the AOT cache, introduced by JEP 483 and previously extended by JEP 515, to store natively compiled method code assets, also known as AOT code. During a production run, a request for native method code, normally fulfilled by the JIT compiler, can be immediately fulfilled if a matching method is found in the AOT cache. There does not need to be any delay for profiling or JIT compilation, if appropriate AOT code is available. This means that warmup happens quickly, and with less consumption of computing resources.
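For example, a production run that uses an existing AOT cache is started in the same way as under JEP 483; this sketch uses hypothetical file and class names:

    java -XX:AOTCache=app.aot -cp app.jar com.example.App ...

If the cache contains suitable AOT code for the methods the application executes, that code is used immediately; otherwise execution proceeds as before, through the interpreter and the JIT.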
From the user’s point of view, all JIT compilation activity is transparent, except for effects on application performance. Likewise, all uses of AOT code are equally transparent. There are no new requirements on application configuration or VM invocation. Applications which use AOT code assets will usually start up and warm up more quickly. Even when peak performance requires additional JIT activity (to generate T4 code), there is likely to be less overall consumption of machine resources by JIT activity, and such activity will tend to spread more evenly across the lifetime of the application.
The presence of AOT code has two low-level effects: It makes the AOT cache larger, usually by a modest amount. And, it makes good native code appear quickly, almost as if the JIT is suddenly able to perform its compilation tasks instantly. The almost-instant loading of AOT code will cause even the earliest phases of application startup to run faster, since it is much faster to load precompiled code than to generate it from scratch. Application warmup will also be accelerated, since much profiling and JIT activity will be skipped, in favor of immediate use of AOT code assets.
Of course, if the application’s behavior in the production run is significantly different from the training run, some AOT code might not be usable, or it might be deoptimized and replaced. This is nothing new: JIT code also gets generated only conditionally (on proof of importance) and is then subject to deoptimization and replacement. (If you run a typical Java application with the option -XX:+PrintCompilation, and search for the string “made not entrant”, you will see many instances of the JIT replacing methods.) When generating new JIT code, AOT profiles are very useful, since they enable the optimizing JIT compiler to produce code that supports the appropriate hot code paths and hot object types, as observed during the training run.
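One way to observe such replacements (a sketch, with hypothetical application names) is to run the application with -XX:+PrintCompilation and filter its output:

    java -XX:+PrintCompilation -cp app.jar com.example.App | grep "made not entrant"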
AOT code implementation details
Just as the JIT compiler generates different optimization levels, called “tiers”, for different purposes, AOT code assets also come in several corresponding tiers and levels. Indeed, AOT code and JIT code are fundamentally the same kind of code; their names simply reflect their differing creation times, and (sometimes) slightly different optimization decisions.
The AOT code in tier A4 closely resembles fully optimized JIT code (T4). Like T4 code, A4 code assumes that all relevant class initializers have been executed, so it depends on a list of initialized classes. The VM delays usage of a particular A4 code asset until its dependent classes have been initialized.
The new AOT code tier AP4 (“pre-init code”) corresponds to fully optimized JIT code (T4), except that it functions correctly even if classes are not yet loaded. It can therefore be used immediately on VM startup, but it contains extra dynamic checks which may cause it to run a little slower than A4 or T4 code.
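As an illustration of this distinction, consider a method whose optimized code could depend on the initialization state of another class. This is a sketch using hypothetical application classes, not HotSpot internals:

    // Hypothetical application classes, shown only to illustrate the
    // initialization dependency.
    class Config {
        // Computed by Config's class initializer (<clinit>).
        static final int LIMIT = Integer.getInteger("app.limit", 100);
    }

    class Filter {
        static boolean withinLimit(int n) {
            // Optimized code for this method is simpler if it may assume that
            // Config has already been initialized.  A4 code makes that
            // assumption, so the VM withholds it until the dependent classes
            // are initialized.  AP4 ("pre-init") code makes no such assumption
            // and can be used from VM startup, at the cost of extra dynamic
            // checks.
            return n < Config.LIMIT;
        }
    }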
The AOT code tier A3 works the same as the JIT tier T3: It runs at a low optimization level, but collects the same kind of profile information that the VM interpreter gathers.
Similarly, the AOT code tiers A2 and A1 work the same as JIT tiers T2 and T1, but gather little or no profile information. Specifically, A2 and T2 gather method invocation counts, while A1 and T1 do not.
The interpreter, sometimes called “tier zero”, has no separate AOT aspect.
To robustly reach peak performance, even as application behaviors change, the HotSpot VM must gracefully orchestrate its use of all tiers. Running one-time computations in optimized native code can lose performance just as surely as running a hot loop in the interpreter. Specifically, when application behaviors stabilize, the VM must use T4 or A4 levels to get best performance. During startup and warmup, it must use an effective mix of the lower tiers to gather additional profile information (A3, T3, A2, T2) or simply plow through early one-shot computations (A1, T1, T0).
To these ends, the new AOT tiers integrate seamlessly with the existing JIT tiers and the interpreter. Unlike a natively compiled C++ method, an AOT-compiled method is not a “dead end” on the road to peak performance. An AOT method that becomes a bottleneck can be transparently replaced by the JIT, when better code quality becomes possible. The reverse is also possible: a JIT method whose optimizations over-speculate can be replaced by a slower but more robust AOT method. (Previously, such methods had to deoptimize all the way down to the interpreter.) Thus, the VM’s abilities to deoptimize and reoptimize JIT code are made faster and more robust by the availability of AOT code.
Although AOT and JIT code are fully interoperable, there are a few subtle differences, because JIT code is created in the context of a loaded and running application, whereas AOT code is created outside of the context of any running application. JIT code aggressively exploits dynamic information in the ambient VM, including class states, execution profiles, and field values. AOT code (especially AP4 “pre-init” code) must take care not to exploit information that might change in the production run.
Thus, AOT code is not simply the byproduct of JIT compilations within the training run. (More broadly, an AOT cache is not just a snapshot of the state at the end of a training run!) Instead, there is a division of labor: First, the training run gathers historical observations and dumps them into an AOT configuration file. Second, a separate assembly phase consults this file and generates an AOT cache, including AOT code, to accelerate later production runs.
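A sketch of this division of labor, using the existing commands from JEP 483 (the file and class names here are hypothetical):

    java -XX:AOTMode=record -XX:AOTConfiguration=app.aotconf -cp app.jar com.example.App ...
    java -XX:AOTMode=create -XX:AOTConfiguration=app.aotconf -XX:AOTCache=app.aot -cp app.jar

The first command performs the training run and records its observations in app.aotconf; the second performs the assembly phase, producing app.aot, which now also contains AOT code assets.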
In some of its details, AOT code may refrain from relying on class states (especially initialization order), execution profiles, or field values. In making these choices, the AOT compiler trades the robustness of an AOT cache against warmup time. It strives to create AOT code which is robustly applicable across many possible production runs, while running close to peak speed for the most likely production runs.
Testing
We will create new unit tests for this feature.
We will run existing AOT cache tests with this feature enabled, and ensure that they pass.
Alternatives
As has been demonstrated many times, Java can be supported by a pure static compiler. Static compilation is always accompanied by compromises to performance, agility, or compatibility. At present, best performance requires a balanced mix of AOT and JIT execution modes (plus the interpreter), as provided by this JEP.
Since AOT code can be loaded immediately on startup, it might seem that profiles in the AOT cache (added by JEP 515) are now useless. But they still have their role, when AOT code is replaced by the JIT.
Therefore, it is not presently a goal to rely completely on AOT code, as if a Java application were the same as a C++ application. When appropriate, applications can still make use of the interpreter, the JIT, and AOT profiles. Future work may investigate further minimization of JIT usage, and/or interpreter usage. However, initial experiments suggest that totally excluding the JIT often leads to lower peak performance. Likewise, excluding the interpreter results in bloated AOT cache files, which can be more expensive to load than running the interpreter.
Unlike a C++ application, a Java application is always compiled to use the highest and best instruction set architecture available at production time, including any available optional instructions. Vector ISAs change and develop, affecting the details of vectorized code generated by the HotSpot virtual machine. When running with an AOT cache that contains AOT code assets, the VM checks that the present processor can correctly execute those assets. This check can fail if the AOT cache was created on a newer machine but the production run is performed on an older model. The resulting execution is still correct, but it may exhibit lower performance, as some or all AOT code assets may be inappropriate for the current run.
Future work may investigate alternatives for finer control over optimization levels of AOT code, possibly allowing users to trade off speed for processor compatibility. Such work could potentially install several versions of a given AOT method, usable by differing processor levels. However, such fine control is not an initial goal.
Risks and Assumptions
There are no new risks beyond those already noted in JEP 483.
The base assumption of the AOT cache remains operative: A training run is assumed to be a good source of observations that, when passed through an AOT cache to a production run, will benefit the performance of that production run. This assumption applies fully to AOT code, which benefits similar production runs, without doing harm to divergent production runs.
Dependencies
This JEP is blocked by JDK-8362657: Make tables used in AOT assembly phase GC-safe (resolved).