JDK-4242172 : 2.0: poor performance in matrix calculations
        ======================================================


        Name: dkC59003 Date: 05/28/99



        Compared to the classic VM, HotSpot performance seems poor in typical
        matrix calculations.

        I have tested VM performance with my testing programs "Matrix2D.java",
        "Matrix2L.java", "Matrix2F.java", and "Matrix2I.java", which measure the
        time elapsed to calculate the matrix product A*A for a square matrix A.
        These four programs implement the same algorithm for different numeric
        types:
            Matrix2D.java -- multiplies matrices of the type double
            Matrix2F.java -- multiplies matrices of the type float
            Matrix2L.java -- multiplies matrices of the type long
            Matrix2I.java -- multiplies matrices of the type int
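        Only the source of "Matrix2D.java" is attached at the end of this
        report; the other three programs presumably differ only in the element
        type. As an illustration, here is a hypothetical sketch of the inner
        loop of "Matrix2I.java", assuming it mirrors the MatrixComputer.run()
        method of the attached Matrix2D source with int[][] elements:

            // Hypothetical sketch (assumption: Matrix2I mirrors Matrix2D's
            // MatrixComputer.run(), with int[][] instead of double[][]):
            for (int line = line0; line < line1; line++)
                for (int column = 0; column < size; column++) {
                    int sum = 0;                    // accumulate one element
                    for (int i = 0; i < size; i++)
                        sum += source[line][i] * source[i][column];
                    result[line][column] = sum;     // element of A*A
                }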

        The following HotSpot releases were tested:
          - HotSpot VM for Sparc, release 1.0fcs build B
          - HotSpot VM for Win32, release 1.0fcsE
          - HotSpot VM for Win32, release 2.0devA
          - HotSpot Client VM for Win32, release 1.3betaD

        Classic VM releases were used to compare performance:
          - JDK 1.2fcsV for Sparc, sunwjit ON
          - JDK 1.2fcsV for Win32, symcjit ON

        I have found these JITs to be substantially faster than those HotSpot
        releases. The following table shows JIT performance normalized to the
        performance of the HotSpot release measured on the same platform (the
        "overall" performance is computed as the average of the "float",
        "double", "int", and "long" columns):

            TABLE 1:
            JIT_performance / HotSpot_performance (percent):

                          ______________ 1 thread _______________
            Win32:         float   double      int     long   overall
            hs1.0fcsE       365%     370%     395%     353%      371%
            hs1.3betaD      282%     284%     318%     279%      291%
            hs2.0devA       127%     138%     121%     192%      145%
            Sparc:
            hs1.0fcs        327%     321%     293%     170%      278%

                          ______________ 2 threads ______________
            Win32:         float   double      int     long   overall
            hs1.0fcsE       423%     371%     479%     353%      407%
            hs1.3betaD      409%     389%     471%     345%      404%
            hs2.0devA       144%     135%     137%     213%      157%
            Sparc:
            hs1.0fcs        166%     176%     234%     180%      189%

        I have executed my tests on the following computers, each having 2 CPUs:
          - 2xCPU Pentium-II 350 MHz, 512 MB RAM, Windows NT 4.0 Server
          - 2xCPU SPARCstation-10, 64 MB RAM, Solaris 2.5.1

        Command lines were:
            ${JAVA_VM} Matrix2D 500 ${THREADS} 5 >report.${JAVA_ID}.double
            ${JAVA_VM} Matrix2F 500 ${THREADS} 5 >report.${JAVA_ID}.float
            ${JAVA_VM} Matrix2L 500 ${THREADS} 5 >report.${JAVA_ID}.long
            ${JAVA_VM} Matrix2I 500 ${THREADS} 5 >report.${JAVA_ID}.int
        where:
            JAVA_VM -- path to a Java VM wrapper
            JAVA_ID -- a short name for the corresponding VM
            THREADS -- either 1 or 2
            500 -- matrix A was 500x500 elements
            5 -- report average performance over 5 iterations of the calculation
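        For example, a concrete instantiation of these variables (hypothetical
        wrapper path, with JAVA_ID=hs2.0devA as in the tables below and
        THREADS=2) would be:
            java Matrix2D 500 2 5 >report.hs2.0devA.double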

        When the parameter THREADS was 2, each program "Matrix2_" split the
        calculations into two independent threads, so that the VM seemed to
        utilize both CPUs installed in the computer. (Except for the classic
        JIT for Sparc, which seemed not to utilize the second CPU.) When
        THREADS was 1, only one CPU seemed to be utilized.

        The classic JIT (both for Sparc and for Win32) and HotSpot 2.0 for
        Win32 demonstrated almost the same performance in all 5 iterations of
        the tests. In contrast, HotSpot 1.0 (both for Sparc and for Win32) and
        HotSpot Client for Win32 each seemed to need 3 iterations to adjust
        themselves for best performance. So, I decided to compare average
        performance over 5 iterations.

        Below are more detailed results of the performance testing, which also
        include the performance in the 1st and in the "best" (fastest)
        iteration. Here, absolute performance in MFLOPS (or integer MOPS) was
        calculated as:
            N*N*(N+N)/elapsed_time/1000000
        where:
            elapsed_time -- time (in seconds) elapsed to calculate A*A
            N=500 -- A is an NxN square matrix
        and:
            N*N*N -- number of numeric multiplications needed to compute A*A
            N*N*N -- number of numeric additions needed to compute A*A
            2*N*N*N -- total number of arithmetic operations needed to compute A*A
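        For example, with N=500 computing A*A takes N*N*(N+N) = 2.5e8 arithmetic
        operations; at the classic JIT's average of 17.7 MFLOPS (TABLE 2D, Win32,
        1 thread), this corresponds to an elapsed time of about
        2.5e8 / 17.7e6 = 14.1 seconds per matrix multiplication.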

        Relative performance of JIT was calculated as:
            (JIT_performance / HotSpot_performance) * 100%
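        For example, for hs2.0devA on Win32 with 1 thread, TABLE 2D gives average
        performances of 17.7 MFLOPS (classic JIT) and 12.8 MFLOPS (HotSpot), so
        the relative performance is (17.7 / 12.8) * 100% = 138%.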


            TABLE 2D: results of:
            >>>> java Matrix2D 500 ${THREADS} 5

            Performance (MFLOPS):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            classic_JIT     17.5     18.2     17.7    31.7     32.8     32.2
            hs1.0fcsE       2.58     11.1     4.79    4.17     21.0     8.67
            hs1.3betaD      2.54     9.84     6.24    4.07     18.7     8.28
            hs2.0devA       12.6     13.1     12.8    23.7     24.7     23.9
            Sparc:
            classic_JIT     1.67     1.81     1.71    1.79     1.79     1.71
            hs1.0fcs       0.304     1.11    0.533   0.504     2.54    0.973

            JIT_performance / HotSpot_performance (percent):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            hs1.0fcsE       678%     164%     370%    760%     156%     371%
            hs1.3betaD      689%     185%     284%    779%     175%     389%
            hs2.0devA       139%     139%     138%    134%     133%     135%
            Sparc:
            hs1.0fcs        549%     163%     321%    355%    70.5%     176%

            
            TABLE 2F: results of:
            >>>> java Matrix2F 500 ${THREADS} 5

            Performance (MFLOPS):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            classic_JIT     18.1     19.3     18.7    34.7     37.0     35.7
            hs1.0fcsE       2.72     12.5     5.13    4.27     24.2     8.44
            hs1.3betaD      2.69     10.4     6.62    4.19     20.3     8.72
            hs2.0devA       14.1     15.7     14.7    28.1     30.5     24.8
            Sparc:
            classic_JIT     2.10     2.17     2.09    2.13     2.16     1.81
            hs1.0fcs       0.347     1.48    0.640   0.551     3.20     1.09

            JIT_performance / HotSpot_performance (percent):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            hs1.0fcsE       665%     154%     365%    813%     153%     423%
            hs1.3betaD      673%     186%     282%    828%     182%     409%
            hs2.0devA       128%     123%     127%    123%     121%     144%
            Sparc:
            hs1.0fcs        605%     147%     327%    387%    67.5%     166%


            TABLE 2L: results of:
            >>>> java Matrix2L 500 ${THREADS} 5

            Performance (MOPS):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            classic_JIT     13.7     14.3     13.9    25.2     26.2     25.5
            hs1.0fcsE       2.49     6.42     3.94    4.05     12.4     7.23
            hs1.3betaD      2.54     6.59     4.99    4.05     12.7     7.40
            hs2.0devA       7.54     7.55     7.25    14.5     14.5     12.0
            Sparc:
            classic_JIT    0.790    0.815    0.784   0.789    0.810    0.798
            hs1.0fcs       0.275    0.837    0.460   0.447    0.447    0.443

            JIT_performance / HotSpot_performance (percent):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            hs1.0fcsE       550%     223%     353%    622%     211%     353%
            hs1.3betaD      539%     217%     279%    622%     206%     345%
            hs2.0devA       182%     189%     192%    174%     181%     213%
            Sparc:
            hs1.0fcs        287%    97.4%     170%    177%     181%     180%


            TABLE 2I: results of:
            >>>> java Matrix2I 500 ${THREADS} 5

            Performance (MOPS):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            classic_JIT     21.9     23.0     22.2    41.5     44.4     42.1
            hs1.0fcsE       2.72     19.2     5.62    4.09     37.4     8.79
            hs1.3betaD      2.70     11.6     6.98    4.15     22.5     8.93
            hs2.0devA       18.2     19.3     18.4    35.2     37.2     30.8
            Sparc:
            classic_JIT     1.97     2.00     1.95    1.99     2.00     1.95
            hs1.0fcs       0.360     1.69    0.665   0.559     2.81    0.832

            JIT_performance / HotSpot_performance (percent):
                          _______ 1 thread ______   ______ 2 threads ______
            Win32:         First     Best  Average   First     Best  Average
            hs1.0fcsE       805%     120%     395%   1015%     119%     479%
            hs1.3betaD      811%     198%     318%   1000%     197%     471%
            hs2.0devA       120%     119%     121%    118%     119%     137%
            Sparc:
            hs1.0fcs        547%     118%     293%    356%    70.1%     234%


        Below is the source of the program "Matrix2D.java":

        /* @(#)Matrix2D.java 1.4 99/05/27
         * Copyright 99/05/27 Sun Microsystems, Inc.
         */

        // package javasoft.sqe.tests.javaspeed;

        import java.io.*;

        /**
         * This class computes <code>A*A</code> for a square matrix <code>A</code>
         * having random elements, and estimates the processor time elapsed for
         * that computation. Command-line parameters for the method
         * <code>main(args[])</code> should prescribe the size of the matrix
         * <code>A</code>.
         *
         * @author Eugene I. Latkin
         */
        public class Matrix2D {

            public static void main (String args[]) {
                int exitCode = run(args, System.out);
                System.exit(exitCode + 95); // JCK-like exit status
            }

            public static int run (String args[], PrintStream out) {
                if ((args == null) || (args.length < 1) || (args.length > 3)) {
                    out.println("Execute:");
                    out.println("    java Matrix2D matrix_size [threads [iterations]]");
                    return 2; // failure
                }

                int size = Integer.parseInt(args[0]);
                if ((size < 100) || (size > 10000)) {
                    out.println("Matrix size should be 100 to 10000 lines & columns.");
                    return 2; // failure
                }

                int threads = 1;
                if (args.length >= 2)
                    threads = Integer.parseInt(args[1]);
                if ((threads < 1) || (threads > size)) {
                    System.err.println("Threads number should be 1 to matrix size.");
                    System.exit(1);
                }
                if ((size % threads) != 0) {
                    out.println("Threads number should evenly divide matrix size.");
                    return 2; // failure
                }

                int iterations = 1;
                if (args.length >= 3)
                    iterations = Integer.parseInt(args[2]);
                if ((iterations < 1) || (iterations > 100)) {
                    out.println("Iterations number should be 1 to 100.");
                    return 2; // failure
                }

                out.println("Should try " + iterations + " iteration(s):");
                out.println("==========================" +
                    ((iterations > 99) ? "==" : (iterations > 9) ? "=" : ""));
                out.println("");

                double overallTime = 0;
                double bestTime = Double.POSITIVE_INFINITY;
                double bestPerformance = 0;
                int bestIteration = 0;
                for (int i = 0; i < iterations; i++) {
                    double seconds = elapsedTime(i, out, size, threads);
                    double performance = size*size*(size+size)/seconds/1e6;
                    if ((i == 0) || (performance > bestPerformance)) {
                        bestTime = seconds;
                        bestPerformance = performance;
                        bestIteration = i + 1;
                    }
                    overallTime += seconds;
                }
                double averageTime = overallTime / iterations;
                double averagePerformance = size*size*(size+size)/averageTime/1e6;

                out.println("");
                out.println("=======================" +
                    ((iterations > 99) ? "==" : (iterations > 9) ? "=" : ""));
                out.println("Overall iteration(s): " + iterations);
                out.println("Overall elapsed time: " + overallTime + " seconds.");
                out.println("Average elapsed time: " + averageTime + " seconds.");
                out.println("Average performance:  " + averagePerformance + " MFLOPS");
                out.println("Best elapsed time:    " + bestTime + " seconds.");
                out.println("Best performance:     " + bestPerformance + " MFLOPS");
                out.println("Best iteration:       #" + bestIteration);

                return 0; // PASSED
            }

            private static double elapsedTime (
                    int i, PrintStream out, int size, int threads) {

                if (i > 0)
                    out.println("");
                out.println("Iteration #" + (i+1) + ":");

                out.print("Preparing A[" + size + "," + size + "]:");
                SquareMatrix A  = new SquareMatrix(size);
                SquareMatrix A2 = new SquareMatrix(size);
                out.println(" done.");

                out.print("Computing A*A with " + threads + " thread(s):");
                long mark1 = System.currentTimeMillis();
                A2.setSquareOf(A, threads);
                long mark2 = System.currentTimeMillis();
                out.println(" done.");

                double sec = (mark2 - mark1) / 1000.0;
                double perf = size*size*(size + size) / sec;
                out.println("Elapsed time: " + sec + " seconds");
                out.println("Performance:  " + perf/1e6 + " MFLOPS");

                return sec;
            }

            /**
             * This class computes <code>A*A</code> for square matrix <code>A</code>.
             */
            private static class SquareMatrix {
                private double value[][];

                /**
                 * New square matrix with random elements.
                 */
                public SquareMatrix (int size) {
                    value = new double[size][size];
                    for (int line = 0; line < size; line++)
                        for (int column = 0; column < size; column++)
                            value[line][column] = Math.random() * size;
                }

                /**
                 * Update <code>value[][]</code> of <code>this</code> matrix.
                 * @param threads Split computation into the given number of threads.
                 */
                public void setSquareOf (SquareMatrix A, int threads) {
                    if (this.value.length != A.value.length)
                        throw new IllegalArgumentException(
                            "this.value.length != A.value.length");

                    int size = value.length;
                    if ((size % threads) != 0)
                        throw new IllegalArgumentException("size%threads != 0");
                    int bunch = size / threads;

                    Thread task[] = new Thread[threads];
                    for (int t = 0; t < threads; t++) {
                        int line0 = bunch * t;
                        MatrixComputer computer =
                            new MatrixComputer(value, A.value, line0, bunch);
                        task[t] = new Thread(computer);
                    }

                    for (int t = 0; t < threads; t++)
                        task[t].start();

                    for (int t = 0; t < threads; t++)
                        if (task[t].isAlive())
                            try {
                                task[t].join();
                            } catch (InterruptedException exception) {
                                throw new RuntimeException(exception.toString());
                            }
                }

                /**
                 * Thread to compute a bunch of lines of matrix square.
                 */
                private static class MatrixComputer implements Runnable {
                    private double result[][];
                    private double source[][];
                    private int line0;
                    private int bunch;

                    /**
                     * Register a task for matrix multiplication.
                     */
                    public MatrixComputer (
                            double result[][], double source[][],
                            int line0, int bunch) {
                        this.result = result; // reference to resulting matrix value
                        this.source = source; // reference to matrix to be squared
                        this.line0 = line0;   // compute lines from line0 to ...
                        this.bunch = bunch;   // number of resulting lines to compute
                    }

                    /**
                     * Do execute the task just registered for <code>this</code> thread.
                     */
                    public void run () {
                        int line1 = line0 + bunch;
                        int size = result.length;
                        for (int line = line0; line < line1; line++)
                            for (int column = 0; column < size; column++) {
                                double sum = 0;
                                for (int i = 0; i < size; i++)
                                    sum += source[line][i] * source[i][column];
                                result[line][column] = sum;
                            }
                    }
                }
            }
        }


        ================
