Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8282648

Weaken the InflaterInputStream specification in order to allow faster Zip implementations




      `java.util.zip.Inflater` is the Java wrapper class for zlib's inflater functionality. `Inflater::inflate(byte[] output, int off, int len)` currently calls zlib's native `inflate(..)` function and passes the address of `output[off]` and `len` to it via JNI.

      The specification of zlib's `inflate(..)` function (i.e. the [API documentation in the original zlib implementation](https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623f/zlib.h#L400)) doesn't give any guarantees with regard to usage of the output buffer. It only states that upon completion the function will return the number of bytes that have been written (i.e. "inflated") into the output buffer.

      The original zlib implementation only wrote as many bytes into the output buffer as it inflated. However, this is not a hard requirement and newer, more performant implementations of the zlib library like [zlib-chromium](https://chromium.googlesource.com/chromium/src/third_party/zlib/) or [zlib-cloudflare](https://github.com/cloudflare/zlib) can use more bytes of the output buffer than they actually inflate as a scratch buffer. See https://github.com/simonis/zlib-chromium for a more detailed description of their approach and its performance benefit.

      These new zlib versions can still be used transparently from Java (e.g. by putting them into the `LD_LIBRARY_PATH` or by using `LD_PRELOAD`), because they still fully comply to specification of `Inflater::inflate(..)`. However, we might run into problems when using the `Inflater` functionality from the `InflaterInputStream` class. `InflaterInputStream` is derived from from `InputStream` and as such, its `read(byte[] b, int off, int len)` method is quite constrained. It specifically specifies that if *k* bytes have been read, then "these bytes will be stored in elements `b[off]` through `b[off+`*k*`-1]`, leaving elements `b[off+`*k*`]` through `b[off+len-1]` **unaffected**". But `InflaterInputStream::read(byte[] b, int off, int len)` (which is constrained by `InputStream::read(..)`'s specification) calls `Inflater::inflate(byte[] b, int off, int len)` and directly passes its output buffer down to the native zlib `inflate(..)` method which is free to change the bytes beyond `b[off+`*k*`]` (where *k* is the number of inflated bytes).

      From a practical point of view, I don't see this as a big problem, because callers of `InflaterInputStream::read(byte[] b, int off, int len)` can never know how many bytes will be written into the output buffer `b` (and in fact its content can always be completely overwritten). It therefore makes no sense to depend on any data there being untouched after the call. Also, having used zlib-cloudflare productively for about two years, we haven't seen real-world issues because of this behavior yet. However, from a specification point of view it is easy to artificially construct a program which violates `InflaterInputStream::read(..)`'s postcondition if using one of the alterantive zlib implementations. A recently integrated JTreg test (test/jdk/jdk/nio/zipfs/ZipFSOutputStreamTest.java) "unintentionally" fails with zlib-chromium but could be easily fixed to run with alternative implementations as well.

      What should/can be done to solve this problem? I think we have three options:

      1. Relax the specification of `InflaterInputStream::read(..)` and specifically note in the API documentation that a call to `InflaterInputStream::read(byte[] b, int off, int len)` may write more than *k* characters into `b` where *k* is the returned number of inflated bytes. Notice that there's a precedence for this approach in `java.io.ByteArrayInputStream::read(..)` which "*unlike the overridden method of InputStream, returns -1 instead of zero if the end of the stream has been reached and len==0*".

      2. Tighten the specification of `Inflater::read(byte[] b, int off, int len)` to explicitly forbid any writes into the output array `b` beyond the inflated bytes.

      3. Change the implementation of `InflaterInputStream::read(..)` to call `Inflater::read(..)` with a temporary buffer and only copy the nu,ber of inflated bytes into `InflaterInputStream::read(..)`'s output buffer. I think we all agree that this is only a theoretic option because of its unacceptable performance regression.

      I personally favor option 1 but I'm interested in your opinions?


        Issue Links



              simonis Volker Simonis
              simonis Volker Simonis
              0 Vote for this issue
              6 Start watching this issue