JDK-7034403

proposal to optimise the performance of the Jar utility

    • Enhancement
    • Resolution: Unresolved
    • P4
    • tbd
    • 7
    • tools
    • None
    • jar
    • generic
    • generic

      The optimisations that I have made are:

      1. Allowing the jar utility to use other compression levels (currently it
      allows only store or the default level, which zlib treats as level 6)
      2. Multi-threading and pipelining the gathering of file information and
      the file access
      3. Multi-threading the compression and file writing

      A little background:
      As part of the development process where I work, the content of the working
      projects is regularly packaged into a jar for distribution to remote systems.
      This is a large and complex code base of 6 million LOC and growing. The jar file
      ends up at approx 100 MB compressed; uncompressed it is approx 245 MB, about 4-5
      times the size of rt.jar.

      I was looking at ways to improve the performance, as this activity occurs several
      times a day for dozens of developers.

      In essence, when creating a new jar file the jar utility is single-threaded
      and staged. Forgive me if this is an oversimplification.

      First it works out all of the files that are specified, buffering the file
      names (IO bound).
      Then it iterates through the files and, for each file, loads the file
      information and then the file content, sending it to a JarOutputStream (CPU
      bound or IO bound depending on the IO speed).

      The JarOutputStream either stores entries uncompressed (the 0 option, just
      store) or deflates them at the default level (which zlib treats as level 6), and
      the jar writing is single-threaded by the design of the JarOutputStream.
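
      As an illustration of the compression-level point, here is a minimal sketch
      (not the patch described in this report). It relies only on the fact that
      JarOutputStream inherits setLevel(int) from ZipOutputStream. The class and
      method names (JarLevelSketch, jarBytes) are hypothetical, and the content is
      dummy data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.zip.Deflater;

// Hypothetical helper: builds an in-memory jar at a chosen deflate level.
class JarLevelSketch {

    // level is any value accepted by ZipOutputStream.setLevel:
    // Deflater.DEFAULT_COMPRESSION (-1) or 0..9.
    static byte[] jarBytes(Map<String, byte[]> files, int level) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(out)) {
            jar.setLevel(level); // applies to DEFLATED entries written afterwards
            for (Map.Entry<String, byte[]> f : files.entrySet()) {
                jar.putNextEntry(new JarEntry(f.getKey()));
                jar.write(f.getValue());
                jar.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("demo/Zeros.bin", new byte[200_000]); // highly compressible dummy content
        System.out.println("level 1: " + jarBytes(files, Deflater.BEST_SPEED).length + " bytes");
        System.out.println("level 9: " + jarBytes(files, Deflater.BEST_COMPRESSION).length + " bytes");
    }
}
```

      How much size is traded for CPU time depends heavily on the content, so the
      figures for a real code base have to be measured rather than assumed.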

      Creating the jar took about 20 seconds on Windows with the help of an SSD, and
      considerably longer without one, and was CPU bound on one CPU core.

      ----
      The changes that I made were:
      1. Allow different compression levels. For us, a compression level of 1 increases
      the size of the jar to 110 MB but reduces the CPU load of compression to
      approx 30% of what it was (rough estimate).
      2. Pipelining the file access:
      2.1 One thread is started for each file root (-C on the jar command line).
      It scans for files and places the file information into a blocking queue (Q1),
      which I set to an arbitrary size of 200 items.
      2.2 A thread pool of 10 threads reads the file information from the queue
      (Q1) and buffers the file content up to a specified size (again I specified an
      arbitrary size limit of 25K per file), then places the buffered content into a
      queue (Q2), again of an arbitrary size of 10 items.
      2.3 One thread takes the file content from Q2, compresses or checksums it,
      and adds it to the JarOutputStream. This step is single-threaded due to
      the design of the JarOutputStream.
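
      The three-stage pipeline in 2.1 to 2.3 can be sketched roughly as follows.
      This is an illustrative reconstruction, not the code in production use.
      PipelinedJarSketch, Item and createJar are hypothetical names. Termination
      uses "poison pill" markers, which is an assumption about how the stages shut
      down. Unlike the description, each file is read whole rather than buffered
      up to 25K, and all stages share one cached thread pool.

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Stream;

class PipelinedJarSketch {
    // One item type for both queues: Q1 items carry a file, Q2 items carry content.
    static final class Item {
        final String name; final Path file; final byte[] content;
        Item(String name, Path file, byte[] content) {
            this.name = name; this.file = file; this.content = content;
        }
    }
    private static final Item END = new Item(null, null, null); // poison pill

    // Note: the writer stage closes 'out' when it finishes.
    static void createJar(List<Path> roots, OutputStream out) throws Exception {
        BlockingQueue<Item> q1 = new ArrayBlockingQueue<>(200); // file information
        BlockingQueue<Item> q2 = new ArrayBlockingQueue<>(10);  // buffered content
        int readers = 10;
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Stage 3: a single writer, because JarOutputStream is not concurrent.
            Future<?> writer = pool.submit(() -> {
                try (JarOutputStream jar = new JarOutputStream(out)) {
                    for (Item it = q2.take(); it != END; it = q2.take()) {
                        jar.putNextEntry(new JarEntry(it.name));
                        jar.write(it.content);
                        jar.closeEntry();
                    }
                }
                return null;
            });
            // Stage 2: readers load file content from disk in parallel.
            List<Future<?>> readJobs = new ArrayList<>();
            for (int i = 0; i < readers; i++) {
                readJobs.add(pool.submit(() -> {
                    for (Item it = q1.take(); it != END; it = q1.take()) {
                        q2.put(new Item(it.name, null, Files.readAllBytes(it.file)));
                    }
                    return null;
                }));
            }
            // Stage 1: one scanner per root (the -C arguments).
            List<Future<?>> scanJobs = new ArrayList<>();
            for (Path root : roots) {
                scanJobs.add(pool.submit(() -> {
                    try (Stream<Path> walk = Files.walk(root)) {
                        Iterator<Path> files = walk.filter(Files::isRegularFile).iterator();
                        while (files.hasNext()) {
                            Path p = files.next();
                            String name = root.relativize(p).toString().replace('\\', '/');
                            q1.put(new Item(name, p, null));
                        }
                    }
                    return null;
                }));
            }
            for (Future<?> f : scanJobs) f.get();          // all files queued
            for (int i = 0; i < readers; i++) q1.put(END); // one pill per reader
            for (Future<?> f : readJobs) f.get();          // all content queued
            q2.put(END);
            writer.get();
        } finally {
            pool.shutdownNow(); // also interrupts blocked stages if one failed
        }
    }
}
```

      Entry order in the resulting jar depends on thread scheduling, and duplicate
      entry names across roots would still be rejected by JarOutputStream. Error
      handling here is minimal.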

      Some other minor performance gain came from increasing the buffer on the
      output stream to reduce the IO load.
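
      One way to enlarge the output buffer is to wrap the file stream in a
      BufferedOutputStream before handing it to JarOutputStream, as in this small
      sketch. The 64 KiB size is an assumed value (the report does not say what
      size was used), and BufferedJarSketch and open are hypothetical names.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarOutputStream;

class BufferedJarSketch {
    // Assumed value for illustration only; tune by measurement.
    static final int BUFFER_BYTES = 64 * 1024;

    // Opens a jar for writing so that small writes are coalesced into
    // BUFFER_BYTES-sized writes to the underlying file.
    static JarOutputStream open(Path target) throws IOException {
        OutputStream raw = Files.newOutputStream(target);
        return new JarOutputStream(new BufferedOutputStream(raw, BUFFER_BYTES));
    }
}
```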

      The end result is that the process takes approx 5 seconds in the same
      configuration.

      The above has been in use in a production configuration for a few months now.

      As a home project I have completed some enhancements to the JarOutputStream and
      produced a JarWriter that allows multiple threads to work concurrently, deflating
      or calculating checksums. It seems to test OK for the test cases that I have
      generated, and successfully loads my quad-core home dev machine on all cores.
      Each thread allocates a buffer and compresses a file into that buffer, only
      blocking other threads when the buffer is written to the output (which is after
      the compression is complete, unless the file is too large to compress into the
      buffer).
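
      The concurrency pattern described here (compress into a private buffer, lock
      only for the final write) might look roughly like the sketch below. It is not
      the JarWriter itself: ParallelDeflateSketch, Deflated and deflateAll are
      hypothetical names, and a plain in-memory list stands in for the jar output,
      since writing real zip headers around pre-deflated data is outside what
      ZipOutputStream exposes.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

class ParallelDeflateSketch {
    // Result of compressing one file: what a zip writer would need for its headers.
    static final class Deflated {
        final String name; final long crc; final int rawSize; final byte[] deflated;
        Deflated(String name, long crc, int rawSize, byte[] deflated) {
            this.name = name; this.crc = crc; this.rawSize = rawSize; this.deflated = deflated;
        }
    }

    // Stand-in for the shared output; guarded by its own monitor.
    private final List<Deflated> written = new ArrayList<>();

    // Checksums and deflates outside the lock; only the append is serialized.
    Deflated add(String name, byte[] raw) {
        CRC32 crc = new CRC32();
        crc.update(raw, 0, raw.length);
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate, as used in zip
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            deflater.setInput(raw);
            deflater.finish();
            byte[] chunk = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                buf.write(chunk, 0, n);
            }
        } finally {
            deflater.end(); // release native zlib memory
        }
        Deflated result = new Deflated(name, crc.getValue(), raw.length, buf.toByteArray());
        synchronized (written) {
            written.add(result); // the only point where threads block each other
        }
        return result;
    }

    // Compresses all files on a fixed pool and returns them in completion order.
    static List<Deflated> deflateAll(Map<String, byte[]> files, int threads) {
        ParallelDeflateSketch sink = new ParallelDeflateSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> jobs = new ArrayList<>();
            for (Map.Entry<String, byte[]> f : files.entrySet()) {
                jobs.add(CompletableFuture.runAsync(() -> sink.add(f.getKey(), f.getValue()), pool));
            }
            for (CompletableFuture<Void> job : jobs) job.join();
        } finally {
            pool.shutdown();
        }
        synchronized (sink.written) {
            return new ArrayList<>(sink.written);
        }
    }
}
```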

      This JarWriter is not API compatible with the JarOutputStream; it is not a
      stream. It allows the programmer to write a record based on the file information
      and an input stream, and it is thread-safe. It is not a drop-in replacement for
      JarOutputStream.
      I am not an expert in the zip file format, but much of the code from
      ZipOutputStream is unchanged, just restructured.
      ---
      I do think that there is some scope for improvement that I have not looked at:
      a. Thresholding of file size for compression (very small files don't compress
      well).
      b. Some file types don't compress well (e.g. png, jpeg), as they have been
      compressed already.
      c. Using NIO to parallelise the loading of the file information or content.
      d. Some pre-charging of the deflater dictionary (e.g. a class file contains the
      strings of the class name and packages). This would reduce the file size, but
      it would make the format incompatible with zip and require changes to the JVM
      to be useful, and it is a long way from my comfort zone or skill set.
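
      Ideas a and b amount to choosing STORED instead of DEFLATED for some entries.
      A minimal sketch of such a rule follows. The 64-byte threshold and the
      extension list are assumptions made for illustration (the report gives no
      numbers and names only png and jpeg), and StoreOrDeflateSketch, chooseMethod
      and newEntry are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;

class StoreOrDeflateSketch {
    // Assumed threshold below which deflating is not worth the effort.
    static final long MIN_DEFLATE_BYTES = 64;
    // Assumed list of already-compressed formats ("jpg" added as an alias of jpeg).
    static final Set<String> PRECOMPRESSED = new HashSet<>(Arrays.asList("png", "jpg", "jpeg"));

    // Returns ZipEntry.STORED for tiny or already-compressed files, else ZipEntry.DEFLATED.
    static int chooseMethod(String entryName, long size) {
        if (size < MIN_DEFLATE_BYTES) {
            return ZipEntry.STORED;
        }
        int dot = entryName.lastIndexOf('.');
        String ext = dot < 0 ? "" : entryName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return PRECOMPRESSED.contains(ext) ? ZipEntry.STORED : ZipEntry.DEFLATED;
    }

    // Builds an entry ready for ZipOutputStream.putNextEntry.
    static ZipEntry newEntry(String name, byte[] content) {
        ZipEntry entry = new ZipEntry(name);
        int method = chooseMethod(name, content.length);
        entry.setMethod(method);
        if (method == ZipEntry.STORED) {
            // ZipOutputStream requires size and CRC-32 up front for STORED entries.
            CRC32 crc = new CRC32();
            crc.update(content, 0, content.length);
            entry.setSize(content.length);
            entry.setCompressedSize(content.length);
            entry.setCrc(crc.getValue());
        }
        return entry;
    }
}
```

      Because STORED entries need their checksum before the data is written, the
      rule fits naturally into a design where file content is already buffered in
      memory.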

      --
      What is the view of the readers? Could this, or at least some parts of it, be
      incorporated into Java 7, or is it too late in the dev cycle?

            sherman Xueming Shen