JDK-8252739

Deflater.setDictionary(byte[], int off, int len) ignores the starting offset for the dictionary


    • b18
    • generic

      Apache Lucene recently changed its master branch to use the ability of Inflater/Deflater to take a custom dictionary. The code worked, but nightly testing has shown that under certain circumstances the compression does not work.

      Background information: Lucene's index file formats are in most cases handled through the Lucene class BytesRef, which acts like a pointer into a byte array that contains much more data than actually needed, addressed by offset and length (a buffer is loaded from disk and a BytesRef then points to a slice of it). Depending on the indexing process, the dictionary is sometimes not at the beginning of the underlying byte[], so Lucene uses Deflater#setDictionary(byte[], int ofs, int len), passing the slice of the much bigger byte array.

      The bug happens if: ofs > 0

      Checking the source code of Deflater and the JNI C code shows that the offset is passed down to the JNI code, but the native implementation completely ignores the ofs parameter: [https://github.com/openjdk/jdk/blob/1643bc3defa241aef2cad53d0f11076366c3620d/src/java.base/share/native/libzip/Deflater.c#L100-L111]

      We have a simple test case that shows the bug, see attached files.
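
A check along these lines could look as follows. This is a minimal sketch and not the attached test case: the class name, the sample data, and the DICT_OFFSET/DICT_LENGTH values are made up for illustration. It compresses the same input twice, once passing the dictionary as an offset/length slice and once as a standalone copy, and compares the results.

```java
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical reproducer sketch; names and values are illustrative only.
final class DictionaryOffsetCheck {
    static final int DICT_OFFSET = 16;
    static final int DICT_LENGTH = 64;

    // Builds a buffer whose first DICT_OFFSET bytes are filler, followed by the dictionary.
    static byte[] sampleData() {
        byte[] data = new byte[DICT_OFFSET + DICT_LENGTH];
        Arrays.fill(data, 0, DICT_OFFSET, (byte) '#');
        for (int i = 0; i < DICT_LENGTH; i++) {
            data[DICT_OFFSET + i] = (byte) ('a' + i % 26);
        }
        return data;
    }

    // Compresses input with the dictionary data[off, off + len), either as a slice or as a copy.
    static byte[] deflate(byte[] data, int off, int len, byte[] input, boolean copySlice) {
        Deflater deflater = new Deflater();
        try {
            if (copySlice) {
                deflater.setDictionary(Arrays.copyOfRange(data, off, off + len));
            } else {
                deflater.setDictionary(data, off, len);
            }
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length + 64];
            int n = 0;
            while (!deflater.finished()) {
                if (n == buf.length) {
                    buf = Arrays.copyOf(buf, buf.length * 2);
                }
                n += deflater.deflate(buf, n, buf.length - n);
            }
            return Arrays.copyOf(buf, n);
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        byte[] input = Arrays.copyOfRange(data, DICT_OFFSET, DICT_OFFSET + DICT_LENGTH);
        byte[] viaOffset = deflate(data, DICT_OFFSET, DICT_LENGTH, input, false);
        byte[] viaCopy = deflate(data, DICT_OFFSET, DICT_LENGTH, input, true);
        // The two outputs should be identical when the offset is honored.
        System.out.println("offset honored: " + Arrays.equals(viaOffset, viaCopy));
    }
}
```

If the offset is ignored, the two streams are expected to differ, because the dictionary the Deflater actually sees starts at index 0 instead of DICT_OFFSET.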

      WORKAROUND: Create a copy of the byte array slice.

      Code that does not work:
      deflater.setDictionary(data, DICT_OFFSET, DICT_LENGTH);

      Code that works:
      deflater.setDictionary(Arrays.copyOfRange(data, DICT_OFFSET, DICT_OFFSET + DICT_LENGTH));
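
To illustrate the workaround end to end, the following sketch compresses with a copied dictionary slice and then inflates the result again. The class and method names are hypothetical, and the Inflater is also given a standalone copy of the dictionary so the example does not depend on any offset handling.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper sketch around the copy workaround.
final class SliceDictionaryRoundTrip {

    static byte[] compress(byte[] data, int off, int len, byte[] input) {
        Deflater deflater = new Deflater();
        try {
            // Workaround: hand the Deflater a standalone copy of the slice.
            deflater.setDictionary(Arrays.copyOfRange(data, off, off + len));
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length + 64];
            int n = 0;
            while (!deflater.finished()) {
                if (n == buf.length) {
                    buf = Arrays.copyOf(buf, buf.length * 2);
                }
                n += deflater.deflate(buf, n, buf.length - n);
            }
            return Arrays.copyOf(buf, n);
        } finally {
            deflater.end();
        }
    }

    static byte[] decompress(byte[] compressed, byte[] dictionary, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int n = 0;
            while (n < out.length && !inflater.finished()) {
                int r = inflater.inflate(out, n, out.length - n);
                if (r == 0) {
                    if (!inflater.needsDictionary()) {
                        break; // no progress and no dictionary request: stop
                    }
                    inflater.setDictionary(dictionary);
                }
                n += r;
            }
            if (n != originalLength) {
                throw new IllegalStateException("inflated " + n + " bytes, expected " + originalLength);
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[96];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) ('a' + i % 26);
        }
        byte[] dictionary = Arrays.copyOfRange(data, 16, 80);
        byte[] input = Arrays.copyOfRange(data, 20, 70);
        byte[] packed = compress(data, 16, 64, input);
        byte[] unpacked = decompress(packed, dictionary, input.length);
        System.out.println("round trip ok: " + Arrays.equals(input, unpacked));
    }
}
```

The extra copy costs one allocation per dictionary, which is usually small compared to the compression work itself.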

      At Lucene we will use the workaround for the time being, but the code should really be fixed, as it may cause index corruption. Luckily we did not deploy the code to our users yet.

      We also checked whether the Deflater#setDictionary(ByteBuffer) method could serve as a workaround (by passing a ByteBuffer wrapping the byte array slice). After reading the source code, however, the Java part checks for direct buffers and only then passes them to the non-buggy version that takes a native address. If the ByteBuffer is a heap ByteBuffer, it calls the buggy method that ignores the offset, too.

      So the bug affects the following methods:
      - Deflater.setDictionary(byte[], int ofs, int len) (if ofs > 0)
      - Deflater.setDictionary(ByteBuffer) (if the ByteBuffer is a heap ByteBuffer and arrayOffset() + position() != 0)
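
For callers that receive the dictionary as a ByteBuffer, a guard like the one below avoids both affected paths. It is only a sketch under the assumptions stated in this report (direct buffers take the non-buggy native path, heap buffers do not); the SafeDictionary class and its methods are hypothetical, and Deflater.setDictionary(ByteBuffer) requires JDK 11 or later.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical guard sketch; not part of the JDK.
final class SafeDictionary {

    // Sets the dictionary from the buffer's remaining bytes and consumes the buffer.
    static void setDictionary(Deflater deflater, ByteBuffer dictionary) {
        if (dictionary.isDirect()) {
            // Direct buffers are passed by native address, per the analysis above.
            deflater.setDictionary(dictionary);
        } else {
            // Heap buffers: copy the remaining bytes into a fresh array that starts at index 0.
            byte[] copy = new byte[dictionary.remaining()];
            dictionary.get(copy); // advances position to limit
            deflater.setDictionary(copy);
        }
    }

    static byte[] compress(ByteBuffer dictionary, byte[] input) {
        Deflater deflater = new Deflater();
        try {
            setDictionary(deflater, dictionary);
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length + 64];
            int n = 0;
            while (!deflater.finished()) {
                if (n == buf.length) {
                    buf = Arrays.copyOf(buf, buf.length * 2);
                }
                n += deflater.deflate(buf, n, buf.length - n);
            }
            return Arrays.copyOf(buf, n);
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[80];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) ('a' + i % 26);
        }
        byte[] input = Arrays.copyOfRange(data, 16, 80);
        ByteBuffer heap = ByteBuffer.wrap(data, 16, 64); // position() == 16
        ByteBuffer direct = ByteBuffer.allocateDirect(64);
        direct.put(data, 16, 64);
        direct.flip();
        // Both buffers carry the same dictionary bytes, so the outputs should match.
        System.out.println(Arrays.equals(compress(heap, input), compress(direct, input)));
    }
}
```

Like the JDK method, the helper leaves the buffer's position equal to its limit in both branches.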

      The methods in Inflater seem correct.

      Other investigations:
      JDK 8 seems correct: [http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/45506343cb65/src/share/native/java/util/zip/Deflater.c#l116]

      Thanks to Adrien Grand for finding the issue and to Robert Muir for investigating it and finding the ignored offset parameter (both Lucene).

            lancea Lance Andersen
            uschindler Uwe Schindler