  1. CCC Migration Project
  2. CCC-4244499

ZipEntry() does not convert filenames from Unicode to platform

    • CSR
    • Resolution: Approved
    • P2
    • 7
    • core-libs
    • None
    • minimal
      (1) The use of the standard UTF-8 charset means that the non-shortest UTF-8 form of a BMP character is no longer accepted.

          -A desirable change, see http://ccc.sfbay.sun.com/4486841 for details.

      (2) Supplementary characters in file names and comments are now encoded in the standard 4-byte UTF-8 form. The jar tool and the java.util.jar and java.util.zip packages in earlier releases do not understand the 4-byte form.

          -This is a forward incompatibility that we cannot avoid if we want to update to the latest UTF-8 standard and make Java Jar/Zip files exchangeable with other standards-compliant Zip implementations. It is the price we have to pay.

          -The standard UTF-8 charset does accept a pair of 3-byte surrogates when decoding, so we remain backward-compatible.

          -A modified UTF-8 charset implementation can be added to our java.nio.charset repository should such a request or escalation come in (in which case you can use modified UTF-8 with the proposed APIs to generate an "old-style" Zip/Jar file).

      (3) The spec and implementation of ZipEntry.setComment(String) are proposed to change

      from

          Throws: IllegalArgumentException - if the length of the specified comment string is greater than 0xFFFF bytes

      to

          A note in API that states:
          ZIP entry comments have maximum length of 0xffff. If the length of the specified comment string is greater than 0xFFFF bytes, only the first 0xFFFF bytes are output to the ZIP file entry.

      The reason for this change is that ZipEntry itself no longer knows which encoding will be used until it is "bound" to a ZipOutputStream (the ZipInputStream and ZipFile classes do not have this setComment issue). It would be very rare for an application to depend on this IllegalArgumentException.

      We therefore view these incompatibilities as very low risk, to the point that we do not propose introducing a compatibility property/knob.
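
The point in (3), that the byte length of a comment depends on an encoding ZipEntry does not know, can be illustrated with a minimal sketch. The class and helper names below are hypothetical and are not part of the proposed API; the code only measures encoded lengths against the 0xFFFF limit and does not reproduce the JDK implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class CommentLengthDemo {
    // Maximum size of a ZIP entry comment, in bytes.
    static final int MAX_COMMENT_BYTES = 0xFFFF;

    // Number of bytes the comment occupies once encoded with the given charset.
    static int encodedLength(String comment, Charset charset) {
        return comment.getBytes(charset).length;
    }

    // True if the encoded comment fits in a ZIP entry without truncation.
    static boolean fitsInEntryComment(String comment, Charset charset) {
        return encodedLength(comment, charset) <= MAX_COMMENT_BYTES;
    }

    // Builds a string consisting of 'count' copies of 'c'.
    static String repeated(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // 40000 copies of U+00E9: one byte each in ISO-8859-1, two in UTF-8.
        String comment = repeated('\u00e9', 40000);
        System.out.println(fitsInEntryComment(comment, StandardCharsets.ISO_8859_1));
        System.out.println(fitsInEntryComment(comment, StandardCharsets.UTF_8));
    }
}
```

The same string is within the limit under one charset and over it under another, which is why a length check inside setComment(String) cannot be made before the entry is written.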
    • Java API
    • SE

      Summary

      To be filled in before a CSR is made publicly visible; concise one to two sentence summary of the proposed change.

      Problem

      There are two issues to address:

      (1) 4244499: ZipEntry() does not convert filenames from Unicode to platform

      The Zip specification historically did not specify the character encoding to be used for file names and comments; it supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in jar files. Our java.util.jar and java.util.zip implementation strictly follows the Jar specification and uses UTF-8 as the sole encoding for the file names and comments stored in Jar/Zip files.

      However, for normal (non-jar) Zip files, the convention used by other tools is to use either the IBM 437 or the platform encoding for file names. Applications that use the java.util.zip package to read/write normal zip files therefore fail (or produce unreadable files) if a file name contains a non-ASCII character, unless the platform encoding happens to be UTF-8.

      This has been the No.1 bug on our Top25 list for years.

      (2) 5030283: Incorrect implementation of UTF-8 in zip package

      The Jar specification clearly says "In JAR files, all file names must be encoded in the UTF-8 encoding". However, our Jar/Zip implementation handles the "UTF-8 encoding" inconsistently and incorrectly. It either assumes and uses an "ancient" form of UTF-8 that does not have the 4-byte form for supplementary characters (and accepts illegal non-shortest forms of UTF-8 byte sequences, see http://ccc.sfbay.sun.com/4486841), or relies on the JVM's modified UTF-8 (via (*env)->NewStringUTF/GetStringUTFLength/GetStringUTFRegion), which has the same limitation. As a consequence, file names (and comments) containing supplementary characters can be used, but cannot be exchanged with standards-compliant Zip implementations.
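
The difference between the two UTF-8 forms can be seen in a short sketch. It uses DataOutputStream.writeUTF, which is documented to write modified UTF-8, as a stand-in for the JNI string functions named above; that substitution, the class name and the helper names are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class Utf8FormsDemo {
    // U+1F600, a supplementary character: one code point, two Java chars.
    static final String SUPPLEMENTARY = new String(Character.toChars(0x1F600));

    // Standard UTF-8: a supplementary character becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8: each surrogate is encoded separately as 3 bytes.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            byte[] all = bytes.toByteArray();
            // writeUTF prefixes the data with a 2-byte length; drop it.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes with the standard UTF-8 charset; malformed input is replaced.
    static String decodeStandard(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(standardUtf8(SUPPLEMENTARY).length);
        System.out.println(modifiedUtf8(SUPPLEMENTARY).length);
        // 0xC0 0xAF is a non-shortest (overlong) encoding of '/'.
        byte[] overlongSlash = {(byte) 0xC0, (byte) 0xAF};
        System.out.println(decodeStandard(overlongSlash).equals("/"));
    }
}
```

The standard encoder emits a single 4-byte sequence for the supplementary character, while the modified form emits two 3-byte sequences, and the standard decoder does not turn the overlong bytes into a slash.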

      Related Info:

      (1)The latest PKWare ZIP File Format Specification/APPENDIX D - Language Encoding (EFS):

      ...If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding.  If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification.

      (2) The latest WinZip supports UTF-8 encoding and sets general purpose flag bit 11 when it uses UTF-8 for the file names and comments in its output Zip file.
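
As a rough sketch of what "general purpose bit 11" means at the byte level, the code below reads the 16-bit little-endian flag field at offset 6 of the first local file header and tests bit 11. It builds its input with the ZipOutputStream(OutputStream, Charset) constructor available since JDK 7 and assumes that ZipOutputStream sets the bit whenever its charset is UTF-8; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, for illustration only.
class ZipLanguageFlagDemo {
    // General purpose bit 11: language encoding flag (EFS).
    static final int EFS_BIT = 1 << 11;

    // Writes an in-memory Zip file with one small entry, using the given charset.
    static byte[] singleEntryZip(String name, Charset charset) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(bytes, charset)) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(1);
                zos.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Local file header layout: signature (4 bytes), version (2), flags (2).
    static int firstEntryFlags(byte[] zip) {
        return (zip[6] & 0xFF) | ((zip[7] & 0xFF) << 8);
    }

    static boolean usesUtf8Flag(byte[] zip) {
        return (firstEntryFlags(zip) & EFS_BIT) != 0;
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt";
        System.out.println(usesUtf8Flag(singleEntryZip(name, StandardCharsets.UTF_8)));
        System.out.println(usesUtf8Flag(singleEntryZip(name, StandardCharsets.ISO_8859_1)));
    }
}
```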

      Solution

      The proposal here is

      (1) to distinguish Jar and Zip files by providing a set of new constructors for ZipFile, ZipInputStream and ZipOutputStream that take a Charset parameter:

      ZipFile(java.io.File, int, java.nio.charset.Charset)
      ZipFile(java.io.File, java.nio.charset.Charset)
      ZipFile(java.lang.String, java.nio.charset.Charset)
      ZipInputStream(java.io.InputStream, java.nio.charset.Charset)
      ZipOutputStream(java.io.OutputStream, java.nio.charset.Charset)

      Applications that know which encoding is used in a particular Zip file can then specify a non-UTF-8 encoding when accessing Zip files that use one (usually the default encoding of the native platform), or even generate such Zip files when necessary. Meanwhile, the jar command line tool and the java.util.jar package continue to use UTF-8.

      (2)to set general purpose flag bit 11 of each Zip file entry ON when UTF-8 encoding is used to generate Zip files (via Jar/ZipOutputStream)

      (3) to use UTF-8 to decode the file names and comments if general purpose flag bit 11 is ON, regardless of the charset/encoding specified in the constructors.

      (4)to use standard UTF-8 charset to handle encoding/decoding for all Jar/Zip files.
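
A minimal sketch of how an application might use the Charset constructors from (1), and of the override described in (3), follows. It assumes a JDK in which these constructors are available (JDK 7 or later); the class name, method names and file name are hypothetical, and ISO-8859-1 stands in for a platform encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, for illustration only.
class ZipCharsetRoundTrip {
    // Writes an in-memory Zip file whose single entry name is encoded with 'charset'.
    static byte[] write(String entryName, Charset charset) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(bytes, charset)) {
                zos.putNextEntry(new ZipEntry(entryName));
                zos.write(1);
                zos.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the first entry name, decoding with 'charset' unless bit 11 says UTF-8.
    static String readFirstName(byte[] zip, Charset charset) {
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(zip), charset)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt";
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Same non-UTF-8 charset on both sides: the name survives the round trip.
        System.out.println(name.equals(readFirstName(write(name, latin1), latin1)));
        // Written as UTF-8 (bit 11 set): the reader's charset is ignored for the name.
        System.out.println(name.equals(
            readFirstName(write(name, StandardCharsets.UTF_8), latin1)));
    }
}
```

Reading a file written in a non-UTF-8 charset with a mismatched charset is deliberately not shown, since the outcome there depends on the bytes involved.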

      Specification

      See attached BlenderRev.html

            sherman Xueming Shen
            sherman Xueming Shen
            Alan Bateman