Title: Enable execution of Java methods on GPU
Author: Eric Caspole
Organization: Advanced Micro Devices, Inc.
Owner:
Created: 2014/05/08
Type: Feature
State: Draft
Exposure: Open
Component: core/libs, vm/comp
Scope: Impl
JSR: 999
Discussion: sumatra dash dev at openjdk dot java dot net
Start: 2014/Q2
Effort: XL
Duration: L
Template: 1.0
Reviewed-by:
Endorsed-by:

Summary
-------
Enable Java applications to take advantage of GPUs, using the JDK 8 Stream API's parallel streams and lambdas as the programming model.

Goals
------
Enable seamless offload of Java 8 parallel stream operations to GPGPUs where possible.

By seamless we mean:

- No syntactic changes to the Java 8 parallel stream API
- Autodetection of the required hardware and software stack
- A heuristic that decides when offloading to the GPU will yield a performance gain
- Performance improvements for embarrassingly parallel workloads
- The same correctness (non-)guarantees the programmer already gets with multi-core parallelism
- Graceful fallback to normal CPU execution whenever offload fails
- No additional security risks
- Offloaded code maintains Java memory model correctness (find JSR)
- Where possible, enabling other JVM languages to be offloaded

Non-Goals
---------

- Not intended to offload all code, or all of the Java 8 Stream API, to the GPU
- No plan to support auto-vectorization or auto-parallelization offload to the GPU
- No support for devices that lack shared virtual memory
- Initially, not all GPU capabilities will be exposed to the Java language (for example, group-local memory)

Metrics
-------
An initial success metric is to offload a parallel workload written with the Stream API and observe better performance in that part of the application.

Motivation
----------
Many Java workloads are growing ever larger. For some of these workloads, GPUs offer computing that is more efficient in both power and performance, but earlier Java/GPU offload solutions such as Aparapi or JOCL are not integrated into the JDK and require their own programming models.

With Sumatra, we plan to offer seamless offload of some Stream API parallel lambda functions. The Stream API is designed to simplify parallel programming, and Sumatra is a natural extension of the parallel capability already present in the Stream API. Since Sumatra will be integrated into the JDK, it will simplify both the development and the deployment of offloadable applications compared to existing Java/GPU solutions.

Description
-----------
Our implementation uses the Heterogeneous System Architecture (HSA), supported by certain AMD APUs and a related software stack, together with the Graal JVM, which includes an HSAIL back end. The JDK is modified so that, for certain Stream API operations, the application's lambda function is extracted from the stream and compiled into an HSA kernel. The stream's data structures are examined to extract the lambda's arguments, which are then passed to the HSA kernel.

Current GPUs have hundreds to thousands of stream cores. Ideally, for parallelizable workloads, all of the stream cores can operate on the input data at the same time. We use the Stream API parallel() method as the indicator that it is safe to offload the following part of the stream, since the programmer explicitly requested parallel execution. For example, we have implemented offloadable versions of parallel().forEach() and of some parallel().reduce() operations in the Stream API.

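
For instance, an associative reduction such as the following sum is the kind of parallel().reduce() operation that could be an offload candidate. Note that the offload itself is transparent; this is ordinary Stream API code with no Sumatra-specific syntax:

```java
import java.util.stream.IntStream;

public class ReduceExample {
    public static void main(String[] args) {
        int[] data = new int[1024];

        // Fill the input in parallel; each iteration is independent.
        IntStream.range(0, data.length).parallel().forEach(i -> data[i] = i);

        // Sum all elements with an associative reduction. Because the
        // accumulator (Integer::sum) is associative, the runtime is free
        // to partition the work across CPU threads -- or, with Sumatra,
        // across GPU stream cores.
        int sum = IntStream.of(data).parallel().reduce(0, Integer::sum);
        System.out.println(sum);  // 0 + 1 + ... + 1023 = 523776
    }
}
```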
Work sent to a GPU is generally in the form of an array. The length of the input array is sometimes called the "range" in GPU terms, and the range indicates how many "work items" are in the task. In the GPU programming model it is common for each stream core to use its work-item id as an index into an array to fetch the data that stream core will process. In Sumatra, we find the source Java array in the stream, pass that array to the kernel, and use the work-item id to retrieve the array element for each stream core. Each stream core processes one array element, which corresponds to one iteration of the lambda in the Stream API.

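
Conceptually, each work item runs the lambda body once with its work-item id substituted for the stream's iteration variable. A minimal CPU-side sketch of that dispatch model follows; the method name launchKernel is hypothetical, and the real kernel is generated HSAIL rather than Java:

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class DispatchSketch {
    // Simulates launching a kernel over a range: in GPU terms, 'range'
    // work items are created and each runs the body with its own id.
    static void launchKernel(int range, IntConsumer body) {
        // On a GPU every work item can run concurrently; here we model
        // the same semantics with a parallel stream over the ids.
        IntStream.range(0, range).parallel().forEach(body);
    }

    public static void main(String[] args) {
        int[] in = {10, 20, 30, 40};
        int[] out = new int[in.length];

        // Each work item uses its id as the array index, so work item p
        // processes exactly one element: one "iteration" of the lambda.
        launchKernel(in.length, p -> out[p] = in[p] * 2);

        System.out.println(java.util.Arrays.toString(out));  // [20, 40, 60, 80]
    }
}
```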
Note that with HSA the GPU operates on main memory and has direct access to the Java heap, so there is no copying of data. Thus we can operate on Java objects and are not limited to arrays of primitive types.

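
Because the GPU shares the Java heap under HSA, a lambda that reads and writes fields of ordinary objects is, in principle, just as offloadable as one over an int[]. A hypothetical illustration (the Particle class is invented for this sketch):

```java
import java.util.Arrays;

public class ObjectExample {
    static class Particle {
        float x;   // position
        float vx;  // velocity
    }

    public static void main(String[] args) {
        Particle[] particles = new Particle[4];
        for (int i = 0; i < particles.length; i++) {
            particles[i] = new Particle();
            particles[i].x = i;
            particles[i].vx = 0.5f;
        }

        // The lambda dereferences heap objects directly; with HSA there
        // is no need to marshal the fields into a primitive buffer first.
        Arrays.stream(particles).parallel().forEach(p -> p.x += p.vx);

        System.out.println(particles[3].x);  // 3.0 + 0.5 = 3.5
    }
}
```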
Garbage collection cannot occur while a kernel is executing. Our prototype executes kernels from inside the JVM and does not use JNI, so no extra object pinning is required.

We support deoptimization of HSA kernels back to CPU execution, and we handle safepoints by deoptimizing back to the CPU. In this way, CPU execution of the application is neither blocked nor delayed by the execution of a kernel.

Here is a simple use of the parallel stream API showing examples of what can be offloaded:

```java
package simple;

import java.util.stream.IntStream;

public class Simple {

    public static void main(String[] args) {
        final int length = 8;
        int[] ina = new int[length];
        int[] inb = new int[length];
        int[] out = new int[length];

        // Initialize the input arrays - this is offloadable.
        // Each iteration of this lambda is independent and always
        // produces the same answer whether executed single-threaded,
        // by a CPU thread pool, or by a GPU kernel.
        IntStream.range(0, length).parallel().forEach(p -> {
            ina[p] = 1;
            inb[p] = 2;
        });

        // Sum each pair of elements into out[] - this is offloadable.
        // It meets the same criteria as the example above.
        IntStream.range(0, length).parallel().forEach(p -> {
            out[p] = ina[p] + inb[p];
        });

        // Print the results - this is not offloadable, since it calls
        // native code. It is also not really parallelizable even on the
        // CPU, since the printed messages could become interleaved.
        IntStream.range(0, length).forEach(p -> {
            System.out.println(out[p] + ", " + ina[p] + ", " + inb[p]);
        });
    }
}
```

Alternatives
------------
Several open-source packages are available that offload some Java methods to GPUs via OpenCL or CUDA. They generally require their own programming model, their own jars on the classpath, and native libraries:

- Aparapi
- Rootbeer
- JCuda / JOCL
- ScalaCL

Testing
-------
- Pass all JCK tests
- Develop new targeted tests for compilation failure and fallback to normal Java execution
- Develop new targeted tests for deoptimization, safepoints, and allocation from kernels

Risks and Assumptions
--------------------
- Offload solutions other than HSA require copying data over a bus to the offload device, so their benefit/penalty trade-offs will differ completely from those of an HSA-based solution.
- The floating-point behavior of GPUs differs from that specified by Java, so offloaded floating-point code may not produce bit-identical results.

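
One concrete source of divergence is the fused multiply-add (FMA) that GPU cores typically use: a*b+c is rounded once rather than twice, so a kernel can legally produce results that differ in the last bits from the canonical Java expression. The effect can be reproduced on the CPU with Math.fma, which was added in JDK 9 and is used here purely to illustrate the rounding difference:

```java
public class FmaExample {
    public static void main(String[] args) {
        double x = 1.0 + Math.ulp(1.0);  // 1 + 2^-52

        // With separate rounding the product is rounded before the
        // subtraction, so the expression cancels to exactly zero.
        double separate = x * x - x * x;

        // A fused multiply-add rounds only once, recovering the 2^-104
        // term that plain multiplication rounds away.
        double fused = Math.fma(x, x, -(x * x));

        System.out.println(separate);  // 0.0
        System.out.println(fused);     // 2^-104 (about 4.9e-32)
    }
}
```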
Dependences
-----------
- Our implementation depends on the HSA runtime; other offload platforms will have their own software layers.
- For HSA, modifications to the Linux kernel are required; these should become generally available in future distributions.

Impact
------
- JVM modifications similar to those we have implemented in the Graal JVM
- Possibly JDK modifications to direct the workload to the GPU, unless this can be done entirely in the JVM
- A new compiler back end to produce GPU kernels from the lambda methods, similar to the HSAIL back end we have implemented in the Graal JVM