Details
Bug
Resolution: Fixed
P3
11.0.5, 13, 14, 15
b18
generic
Description
Apache Lucene's master branch recently started using the custom-dictionary support in Inflater/Deflater. The code generally worked, but nightly testing has shown that under certain circumstances the compression does not work.
Background information: Lucene's index file formats are in most cases handled through a Lucene class, BytesRef, which acts like a pointer into a byte array that contains much more data than is actually needed, using an offset and a length (a buffer is loaded from disk and then a BytesRef is used to point to a slice). Depending on the indexing process, the dictionary is sometimes not at the beginning of the underlying byte[], so Lucene uses Inflater#setDictionary(byte[], int ofs, int len), passing the slice within the much bigger byte array.
The bug happens if: ofs > 0
Checking the source code of Deflater and its JNI C code shows that the offset is passed down to the JNI code, but the implementation completely ignores the ofs parameter: [https://github.com/openjdk/jdk/blob/1643bc3defa241aef2cad53d0f11076366c3620d/src/java.base/share/native/libzip/Deflater.c#L100-L111]
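To make the slice idea concrete, here is a minimal sketch of an offset/length view over a larger array. ByteSlice is a hypothetical stand-in written for this illustration; it is not Lucene's BytesRef and makes no claim about how BytesRef is implemented.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical illustration only: a view of bytes[offset, offset + length).
final class ByteSlice {
    final byte[] bytes;   // shared backing array, usually much larger than the slice
    final int offset;     // index of the first valid byte
    final int length;     // number of valid bytes

    ByteSlice(byte[] bytes, int offset, int length) {
        // Rejects slices that do not fit inside the backing array (Java 9+).
        Objects.checkFromIndexSize(offset, length, bytes.length);
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    // Copies the valid bytes into a new array that starts at index 0.
    byte[] toArray() {
        return Arrays.copyOfRange(bytes, offset, offset + length);
    }
}
```

Any API that receives such a slice has to honor all three values; dropping the offset silently reads the wrong bytes from the start of the backing array.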
We have a simple test case that shows the bug, see attached files.
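The attached test case is not reproduced here. The following is an independent sketch of how the problem could be probed: it compresses the same input twice, once passing the dictionary as a slice with a non-zero offset and once as a standalone copy, and compares the results. The class name, method names, and sample data are assumptions made for this illustration. On an affected JDK the two outputs are expected to differ, because the zlib header carries the Adler-32 checksum of the dictionary that was actually used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical probe, not the test attached to this issue.
final class DeflaterOffsetProbe {

    // Compresses input with the dictionary taken from dict[off, off + len).
    static byte[] deflateWithDictionary(byte[] input, byte[] dict, int off, int len) {
        Deflater deflater = new Deflater();
        try {
            deflater.setDictionary(dict, off, len);
            deflater.setInput(input);
            deflater.finish();
            byte[] out = new byte[input.length + 64]; // ample for the small inputs used here
            int n = deflater.deflate(out);
            if (!deflater.finished()) {
                throw new IllegalStateException("output buffer too small");
            }
            return Arrays.copyOf(out, n);
        } finally {
            deflater.end();
        }
    }

    // Returns true if the slice and the copy produce identical compressed output.
    static boolean offsetIsHonored() {
        byte[] dict = "hello world ".getBytes(StandardCharsets.US_ASCII);
        int off = 16;
        byte[] data = new byte[off + dict.length];
        Arrays.fill(data, 0, off, (byte) '#');             // filler that differs from the dictionary
        System.arraycopy(dict, 0, data, off, dict.length); // dictionary sits at a non-zero offset
        byte[] input = "hello world hello world".getBytes(StandardCharsets.US_ASCII);

        byte[] viaOffset = deflateWithDictionary(input, data, off, dict.length);
        byte[] viaCopy = deflateWithDictionary(input, dict, 0, dict.length);
        return Arrays.equals(viaOffset, viaCopy);
    }

    public static void main(String[] args) {
        System.out.println("offset honored: " + offsetIsHonored());
    }
}
```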
WORKAROUND: Create a copy of the byte array slice.
Code that does not work:
deflater.setDictionary(data, DICT_OFFSET, DICT_LENGTH);
Code that works:
deflater.setDictionary(Arrays.copyOfRange(data, DICT_OFFSET, DICT_OFFSET + DICT_LENGTH));
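As a fuller illustration of the workaround, the sketch below compresses with a copied dictionary slice and then decompresses again, supplying the same slice to the Inflater (which, per this report, handles the offset correctly). The class and method names are hypothetical, and the fixed-size buffers are only suitable for small inputs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not Lucene code.
final class DictionaryCopyWorkaround {

    // Compresses input using the dictionary stored in buf[off, off + len).
    static byte[] compress(byte[] input, byte[] buf, int off, int len) {
        Deflater deflater = new Deflater();
        try {
            // Workaround: copy the slice so that no non-zero offset reaches Deflater.
            deflater.setDictionary(Arrays.copyOfRange(buf, off, off + len));
            deflater.setInput(input);
            deflater.finish();
            byte[] out = new byte[input.length + 64]; // ample for small inputs
            int n = deflater.deflate(out);
            if (!deflater.finished()) {
                throw new IllegalStateException("output buffer too small");
            }
            return Arrays.copyOf(out, n);
        } finally {
            deflater.end();
        }
    }

    // Decompresses data produced by compress(), using the same dictionary slice.
    static byte[] decompress(byte[] compressed, int originalLength, byte[] buf, int off, int len) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int n = inflater.inflate(out);          // returns 0 while a dictionary is required
            if (inflater.needsDictionary()) {
                inflater.setDictionary(buf, off, len);
                n = inflater.inflate(out);
            }
            return Arrays.copyOf(out, n);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt data or wrong dictionary", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] buf = "0123456789hello hello world".getBytes(StandardCharsets.US_ASCII);
        byte[] input = "hello world, hello world".getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = compress(input, buf, 10, 17);
        byte[] restored = decompress(compressed, input.length, buf, 10, 17);
        System.out.println("round trip ok: " + Arrays.equals(input, restored));
    }
}
```

The copy costs one extra allocation per dictionary, which is the price of never passing a non-zero offset to Deflater.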
At Lucene we will use the workaround for the time being, but the code should really be fixed, as it may cause index corruption. Luckily, we have not deployed the code to our users yet.
We also checked whether the Deflater#setDictionary(ByteBuffer) method could serve as a workaround (by passing a ByteBuffer wrapping the byte array slice). After reading the source code, however, we found that the Java part checks for direct buffers and only then passes to the non-buggy version, which takes a native address. If the ByteBuffer is a heap ByteBuffer, it calls the buggy method, which ignores the offset, too.
So the bug affects the following methods:
- Deflater.setDictionary(byte[], int ofs, int len) (if ofs > 0)
- Deflater.setDictionary(ByteBuffer) (if the ByteBuffer is a heap ByteBuffer and arrayOffset()/position() != 0)
The methods in Inflater seem correct.
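For callers that hold the dictionary as a ByteBuffer, one possible way to stay clear of both affected methods is sketched below: direct buffers are passed through unchanged, while the remaining bytes of any heap buffer are copied into a fresh array first. The class and method names are hypothetical, and Deflater.setDictionary(ByteBuffer) requires Java 11 or later.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical helper illustrating one way to avoid the heap-buffer path.
final class HeapBufferDictionarySketch {

    // Sets the dictionary from dict's remaining bytes without passing an offset to Deflater.
    static void setDictionarySafely(Deflater deflater, ByteBuffer dict) {
        if (dict.isDirect()) {
            // Direct buffers take the native-address path, reported above as unaffected.
            deflater.setDictionary(dict);
        } else {
            // Heap buffer: copy the remaining bytes into an array that starts at index 0.
            byte[] copy = new byte[dict.remaining()];
            dict.get(copy); // consumes the buffer, as setDictionary(ByteBuffer) would
            deflater.setDictionary(copy);
        }
    }

    // Compresses input with the given dictionary buffer.
    static byte[] compress(byte[] input, ByteBuffer dict) {
        Deflater deflater = new Deflater();
        try {
            setDictionarySafely(deflater, dict);
            deflater.setInput(input);
            deflater.finish();
            byte[] out = new byte[input.length + 64]; // ample for small inputs
            int n = deflater.deflate(out);
            if (!deflater.finished()) {
                throw new IllegalStateException("output buffer too small");
            }
            return Arrays.copyOf(out, n);
        } finally {
            deflater.end();
        }
    }
}
```

With this helper, a buffer created by ByteBuffer.wrap(data, DICT_OFFSET, DICT_LENGTH), whose position() is non-zero, should produce the same compressed output as a direct buffer holding the same bytes.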
Other investigations:
JDK 8 seems correct: [http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/45506343cb65/src/share/native/java/util/zip/Deflater.c#l116]
Thanks to Adrien Grand for finding the issue and to Robert Muir for investigating it and finding the ignored offset parameter (both Lucene).
Attachments
Issue Links
- relates to
  - JDK-8225189 assert(!JavaThread::current()->in_critical()) failed: Would deadlock (Closed)
  - JDK-8252976 JDK-8185582 causes SecurityException when accessDeclaredMembers is not given (Open)
  - JDK-8200527 Inflater/Deflater methods to inflate/deflate on byte buffers (Closed)