JDK-8323583

Allow ZipInputStream.readEnd to parse small Zip64 ZIP files

    • Type: CSR
    • Resolution: Approved
    • Priority: P4
    • 23
    • core-libs
    • None
    • behavioral
    • low
      The intended behavioral change proposed here is to extend the set of ZIP entries parsable by ZipInputStream to include "small Zip64 entries". Such entries meet the following criteria:

      1. They are clearly marked as using the Zip64 format, meaning that the LOC's 'compressed size' and 'uncompressed size' fields are set to 0xFFFFFFFF and that the LOC's extra field includes a valid 'Zip64 Extended Information Field'.

      2. They use 'streaming mode', meaning that the 'general purpose bit flag' 3 is set, that the Zip64 field's 'Original Size' and 'Compressed Size' are both set to zero, and that file data is followed by a 'Data Descriptor' containing the actual size values.

      3. Neither the compressed nor the uncompressed size of the entry data exceeds 4GB (0xFFFFFFFF).

      4. The Data Descriptor also uses the Zip64 format, meaning it represents size fields using 8 byte fields instead of the regular 4 byte fields.

      The change introduced here makes ZipInputStream assume that any entry meeting criteria 1-3 also meets criterion 4.

      It is conceivable but unlikely that ZIP files meeting criteria 1-3 but not criterion 4 exist. That is, a "small" entry is clearly marked as using the Zip64 format in the LOC header and extra field, but does not use the 8-byte Zip64 format in the Data Descriptor.

      If such ZIP files exist, they will be made unparsable by this change.

      The reasons such files are unlikely to exist in the wild include:

      - The file would be in clear violation of the APPNOTE.txt specification.
      - Testing shows that several external tools reject or misinterpret such files, including the "zipdetails" tool and the Python library "stream-unzip".
      - ZipOutputStream and ZipFileSystem cannot be used to produce such files, meaning they are unlikely to exist in the Java ecosystem.

      There is also an implementation robustness risk introduced by the parsing of potentially invalid extra data needed to check for valid Zip64 entries. Care has been taken to reduce this risk by use of defensive coding and explicit range checks, similar to how ZipEntry.setExtra0 already parses extra fields.
    • Java API
    • Implementation

      Summary

      Allow java.util.zip.ZipInputStream to parse entries using the Zip64 format where neither the compressed nor uncompressed file size exceeds the 4GB limit.

      Problem

      The compressed and uncompressed sizes of a ZIP entry are often not known until all entry data has been written by the client.

      If the producer cannot seek back in the ZIP stream to update the size fields in the LOC header, those fields are left as zero and the actual compressed and uncompressed file sizes are instead put in a 'Data Descriptor' record immediately following the file data.

      If the entry uses the Zip64 format, then the 'compressed size' and 'uncompressed size' fields are instead set to the magic marker value 0xFFFFFFFF and a Zip64 extra field is added with the 'Original Size' and 'Compressed Size' both set to zero.

      The 'Data Descriptor' record normally encodes size fields using 4-byte numbers. However, 8-byte numbers should be used instead when either the compressed or the uncompressed size exceeds 4GB, or if the entry uses the Zip64 format:

      4.3.9.2 When compressing files, compressed and uncompressed sizes 
            SHOULD be stored in ZIP64 format (as 8 byte values) when a 
            file's size exceeds 0xFFFFFFFF.   However ZIP64 format MAY be 
            used regardless of the size of a file.  When extracting, if 
            the zip64 extended information extra field is present for 
            the file the compressed and uncompressed sizes will be 8
            byte values.  

      ZipInputStream currently relies solely on the size information acquired from the Inflater when deciding how to parse the Data Descriptor record. The LOC is not consulted to see if the entry uses the Zip64 format.

      If an entry does use the Zip64 format, but neither the compressed nor the uncompressed size exceeds 4GB, then ZipInputStream currently fails to parse the Data Descriptor correctly and a ZipException is thrown instead:

      java.util.zip.ZipException: invalid entry size (expected 0 but got 6 bytes)
          at java.base/java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:616)
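
The two Data Descriptor layouts can be compared with a minimal sketch. It assumes the descriptor starts with the optional signature 0x08074b50 and that all fields are little-endian, as APPNOTE.txt describes. The class, record, and method names are hypothetical and are not part of the JDK; the sizes and CRC in main are made-up illustration values. Records require Java 16 or later.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: the class, record, and method names are hypothetical, not JDK code. */
final class DataDescriptorSketch {
    static final int EXTSIG = 0x08074b50; // optional Data Descriptor signature

    record Descriptor(long crc, long compressedSize, long uncompressedSize) {}

    /** Encodes a Data Descriptor (with signature) using 8-byte size fields. */
    static byte[] encodeZip64(long crc, long compressedSize, long uncompressedSize) {
        return ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(EXTSIG)
                .putInt((int) crc)
                .putLong(compressedSize)
                .putLong(uncompressedSize)
                .array();
    }

    /** Decodes a Data Descriptor, reading the sizes as 8-byte or 4-byte values. */
    static Descriptor parse(byte[] bytes, boolean zip64) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != EXTSIG) {
            throw new IllegalArgumentException("missing Data Descriptor signature");
        }
        long crc = buf.getInt() & 0xFFFFFFFFL;
        long csize = zip64 ? buf.getLong() : (buf.getInt() & 0xFFFFFFFFL);
        long size = zip64 ? buf.getLong() : (buf.getInt() & 0xFFFFFFFFL);
        return new Descriptor(crc, csize, size);
    }

    public static void main(String[] args) {
        // Illustration values: 8 compressed bytes, 6 uncompressed bytes, placeholder CRC
        byte[] dd = encodeZip64(0x12345678L, 8, 6);
        System.out.println(parse(dd, true));   // sizes read as 8-byte values
        System.out.println(parse(dd, false));  // same bytes read as 4-byte values
    }
}
```

When the same bytes are read with 4-byte fields, the upper half of the 8-byte compressed size is taken as the uncompressed size, so the parser sees 0 where 6 was written. This appears consistent with the "expected 0 but got 6 bytes" message above, although the sketch does not reproduce the actual readEnd code.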

      While ZipOutputStream does not use the Zip64 format when writing entries of an unknown size, other tools do produce such files, including Info-ZIP used in streaming mode:

      echo hello | zip -fd > hello.zip

      It would be useful to update ZipInputStream to allow parsing such valid ZIP files. Supporting these files could benefit OpenJDK testing as well, which currently relies on producing very large files to test Zip64.
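
As a rough illustration of how such an entry could be produced for testing without Info-ZIP or very large files, the sketch below hand-assembles a single "small Zip64" streaming entry in memory and feeds it to ZipInputStream. It is a sketch under stated assumptions, not the test used in OpenJDK: the class and method names are hypothetical, the timestamp is a placeholder, and the bytes contain only the LOC header, the deflated data, and the Data Descriptor (no central directory), so the result is not a complete ZIP file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative only: class and method names are hypothetical. */
final class SmallZip64Sketch {

    /** Builds LOC header + deflated data + Zip64 Data Descriptor for one entry. */
    static byte[] smallZip64Entry(String name, byte[] content) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);

        // Raw deflate (nowrap = true), the form used for ZIP entry data
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(content);
        deflater.finish();
        ByteArrayOutputStream deflated = new ByteArrayOutputStream();
        byte[] tmp = new byte[512];
        while (!deflater.finished()) {
            deflated.write(tmp, 0, deflater.deflate(tmp));
        }
        deflater.end();
        byte[] data = deflated.toByteArray();

        CRC32 crc = new CRC32();
        crc.update(content);

        ByteBuffer buf = ByteBuffer
                .allocate(30 + nameBytes.length + 20 + data.length + 24)
                .order(ByteOrder.LITTLE_ENDIAN);
        // Local file header (LOC), 30 fixed bytes
        buf.putInt(0x04034b50);         // LOC signature
        buf.putShort((short) 45);       // version needed to extract: 4.5 (Zip64)
        buf.putShort((short) 0x0008);   // general purpose flags: bit 3, Data Descriptor follows
        buf.putShort((short) 8);        // compression method: deflate
        buf.putShort((short) 0);        // last mod time (placeholder)
        buf.putShort((short) 0x0021);   // last mod date (placeholder: 1980-01-01)
        buf.putInt(0);                  // CRC-32, deferred to the Data Descriptor
        buf.putInt(0xFFFFFFFF);         // compressed size: Zip64 marker
        buf.putInt(0xFFFFFFFF);         // uncompressed size: Zip64 marker
        buf.putShort((short) nameBytes.length);
        buf.putShort((short) 20);       // extra field length
        buf.put(nameBytes);
        // Zip64 Extended Information extra field, sizes left as zero
        buf.putShort((short) 0x0001);   // header ID
        buf.putShort((short) 16);       // data size
        buf.putLong(0);                 // original size
        buf.putLong(0);                 // compressed size
        buf.put(data);
        // Data Descriptor with 8-byte size fields
        buf.putInt(0x08074b50);
        buf.putInt((int) crc.getValue());
        buf.putLong(data.length);
        buf.putLong(content.length);
        return buf.array();
    }

    /** Reads the first entry, returning its name and content or the failure message. */
    static String tryRead(byte[] zip) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry = in.getNextEntry();
            byte[] content = in.readAllBytes(); // the descriptor is parsed at end of entry data
            return entry.getName() + ": " + new String(content, StandardCharsets.UTF_8);
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] zip = smallZip64Entry("hello.txt", "hello\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(tryRead(zip));
    }
}
```

On a JDK that includes this change, the entry would be expected to read back normally; on earlier releases the same bytes would presumably fail with the ZipException shown above. The outcome depends on the runtime, so the sketch reports either result rather than asserting one.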

      Solution

      The solution is to update ZipInputStream such that it not only consults the number of compressed and uncompressed bytes read by the Inflater, but also inspects the LOC header to determine if it uses the Zip64 format. When an entry uses Zip64, ZipInputStream.readEnd should parse the Data Descriptor using 8-byte numbers instead of the regular 4-byte numbers.

      ZipInputStream.readLOC is a good decision point for determining whether to expect 4- or 8-byte numbers. This method has full access to the LOC header fields including the extra field where any Zip64 field is located.

      ZipInputStream is updated as follows:

      • A new boolean internal flag ZipInputStream.expect64BitDataDescriptor is added. The purpose of this field is to communicate the number format determined by readLOC to the readEnd method which is responsible for the actual parsing of the Data Descriptor record.
      • readLOC is updated to inspect the LOC and set expect64BitDataDescriptor to true if the LOC uses the Zip64 format; that is, if the compressed and uncompressed size fields are both 0xFFFFFFFF and the extra field contains a valid Zip64 extra field. To reduce changes in readLOC, this logic is mostly implemented in the new support methods expect64BitDataDescriptor and isZip64DataDescriptorField.
      • readEnd is updated to read 8-byte fields when the expect64BitDataDescriptor flag is true.
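
The decision described in the list above can be approximated as follows. This is a minimal sketch, not the JDK's code: the class and method names are hypothetical, the bit 3 check is taken from the streaming-mode criterion in the risk description, and treating a Zip64 extra field as "valid" when it carries at least 16 data bytes (two 8-byte sizes) is an assumption. The explicit range check mirrors the defensive parsing mentioned earlier.

```java
/** Illustrative only: hypothetical names, not the JDK implementation. */
final class Zip64LocSketch {
    static final long ZIP64_MAGIC = 0xFFFFFFFFL; // marker value in the LOC size fields
    static final int ZIP64_EXTRA_ID = 0x0001;    // header ID of the Zip64 extra field
    static final int FLAG_DATA_DESCRIPTOR = 0x08; // general purpose bit flag, bit 3

    /** Reads an unsigned 16-bit little-endian value. */
    static int get16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    /** Walks the extra field blocks looking for a plausible Zip64 field. */
    static boolean hasZip64ExtraField(byte[] extra) {
        if (extra == null) {
            return false;
        }
        int pos = 0;
        while (pos + 4 <= extra.length) {
            int headerId = get16(extra, pos);
            int dataSize = get16(extra, pos + 2);
            pos += 4;
            if (dataSize > extra.length - pos) {
                return false; // block claims more data than is present
            }
            // Assumption: "valid" means room for two 8-byte size values
            if (headerId == ZIP64_EXTRA_ID && dataSize >= 16) {
                return true;
            }
            pos += dataSize;
        }
        return false;
    }

    /** Decides whether the Data Descriptor should be read with 8-byte sizes. */
    static boolean expect64BitDescriptor(int flag, long locCsize, long locSize, byte[] extra) {
        return (flag & FLAG_DATA_DESCRIPTOR) != 0
                && locCsize == ZIP64_MAGIC
                && locSize == ZIP64_MAGIC
                && hasZip64ExtraField(extra);
    }

    public static void main(String[] args) {
        byte[] extra = new byte[20]; // header ID 0x0001, data size 16, zeroed sizes
        extra[0] = 0x01;
        extra[2] = 16;
        System.out.println(expect64BitDescriptor(0x08, ZIP64_MAGIC, ZIP64_MAGIC, extra));
        System.out.println(expect64BitDescriptor(0x08, 0, 0, extra));
    }
}
```

Keeping the check in a pure function of the LOC fields makes it easy to test separately from stream handling; the result would then be stored in a flag for the descriptor-parsing step to consult.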

      Specification

      The specification is not changed; this is purely an implementation and behavioral change.

            eirbjo Eirik Bjørsnøs
            eirbjo Eirik Bjørsnøs
            Alan Bateman