CSR
Resolution: Approved
P2
None
minimal
Java API
SE
Summary
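Add constructors that take a Charset to ZipFile, ZipInputStream and ZipOutputStream so that Zip files whose file names and comments are not encoded in UTF-8 can be read and written. Use the standard UTF-8 charset for all Jar/Zip files, and set and honor general purpose flag bit 11.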
Problem
Two issues to address
(1) 4244499: ZipEntry() does not convert filenames from Unicode to platform
The Zip specification has historically not specified the character encoding to be used for file names and comments; it has supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in jar files. Our java.util.jar and java.util.zip implementation strictly follows the Jar specification and uses UTF-8 as the sole encoding for the file names and comments stored in Jar/Zip files.
However, for normal (non-jar) Zip files, the convention used by other tools is to use either IBM 437 or the platform encoding for file names. Applications that use the java.util.zip package to read/write normal Zip files therefore fail (or produce unreadable files) if a file name contains a non-ASCII character, unless the platform encoding happens to be UTF-8.
This has been the No.1 bug on our Top25 list for years.
(2) 5030283: Incorrect implementation of UTF-8 in zip package
The Jar specification clearly says "In JAR files, all file names must be encoded in the UTF-8 encoding". However, our Jar/Zip implementation handles the "UTF-8 encoding" inconsistently and incorrectly. It either assumes an "ancient" form of UTF-8 that has no 4-byte form for supplementary characters (and accepts illegal non-shortest forms of UTF-8 byte sequences, see http://ccc.sfbay.sun.com/4486841), or relies on the JVM's modified UTF-8 (via (*env)->NewStringUTF/GetStringUTFLength/GetStringUTFRegion), which has the same limitation. As a consequence, file names (and comments) containing supplementary characters can be used, but cannot be exchanged with standards-compliant Zip implementations.
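The difference between the two UTF-8 forms in issue (2) can be seen with a small sketch. It is only an illustration, not the zip package's code: the class name Utf8Forms and its methods are hypothetical, and DataOutputStream.writeUTF is used as a stand-in for the JNI functions because it also emits modified UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class Utf8Forms {
    /** Standard UTF-8: a supplementary character becomes a single 4-byte sequence. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Modified UTF-8 as written by DataOutputStream.writeUTF (assumed here to
     * match what the JNI string functions produce), without the 2-byte length
     * prefix that writeUTF adds. Each surrogate is encoded as its own 3-byte
     * sequence, so a supplementary character takes 6 bytes.
     */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // U+20000 is a supplementary character (outside the Basic Multilingual Plane).
        String name = new String(Character.toChars(0x20000));
        System.out.println("standard UTF-8 bytes: " + standardUtf8(name).length);
        System.out.println("modified UTF-8 bytes: " + modifiedUtf8(name).length);
    }
}
```

A standards-compliant Zip tool expects the 4-byte form, which is why names written in the 6-byte form do not interchange.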
Related Info:
(1) The latest PKWare ZIP File Format Specification, APPENDIX D - Language Encoding (EFS):
...If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification.
(2) The latest WinZip supports UTF-8 encoding and sets general purpose flag bit 11 when it uses UTF-8 for the file names and comments in its output Zip file.
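As a rough illustration of where bit 11 lives, the sketch below reads the general purpose bit flag from the first eight bytes of a Zip local file header. The class and method names are hypothetical. The layout assumed in the comments (4-byte signature, 2-byte "version needed", 2-byte little-endian flags) follows the published Zip format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class, for illustration only.
class ZipLocalHeaderFlags {
    // The bytes 'P' 'K' 0x03 0x04 read as a little-endian int.
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;
    // General purpose bit 11: file name and comment are encoded in UTF-8.
    static final int LANGUAGE_ENCODING_FLAG = 1 << 11;

    /** Returns the 16-bit general purpose bit flag of a local file header. */
    static int generalPurposeFlags(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < 8 || buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("not a Zip local file header");
        }
        // Offsets 0-3: signature, 4-5: version needed, 6-7: flags.
        return buf.getShort(6) & 0xFFFF;
    }

    /** True if bit 11 is set, i.e. the name and comment are declared to be UTF-8. */
    static boolean usesUtf8Names(byte[] header) {
        return (generalPurposeFlags(header) & LANGUAGE_ENCODING_FLAG) != 0;
    }

    public static void main(String[] args) {
        // Hand-built header prefix with only bit 11 set (flags = 0x0800).
        byte[] header = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08};
        System.out.println("UTF-8 names: " + usesUtf8Names(header));
    }
}
```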
Solution
The proposal is:
(1) to distinguish Jar and Zip files by providing a set of new constructors for ZipFile, ZipInputStream and ZipOutputStream that take a Charset parameter:
ZipFile(java.io.File, int, java.nio.charset.Charset)
ZipFile(java.io.File, java.nio.charset.Charset)
ZipFile(java.lang.String, java.nio.charset.Charset)
ZipInputStream(java.io.InputStream, java.nio.charset.Charset)
ZipOutputStream(java.io.OutputStream, java.nio.charset.Charset)
An application that knows which encoding is used in a particular Zip file can then specify that non-UTF-8 encoding (usually the default encoding of the native platform) when accessing the file, or even generate such Zip files when necessary. Meanwhile, the Jar command line tool and the java.util.jar package continue to use UTF-8.
(2) to set general purpose flag bit 11 of each Zip file entry ON when UTF-8 encoding is used to generate Zip files (via Jar/ZipOutputStream),
(3) to use UTF-8 to decode the file names and comments if general purpose flag bit 11 is ON, regardless of the charset/encoding specified in the constructors, and
(4) to use the standard UTF-8 charset to handle encoding/decoding for all Jar/Zip files.
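A minimal sketch of how the proposed constructors might be used, assuming they behave as described in points (1) to (3). The class ZipCharsetRoundTrip, its methods and the sample entry name are hypothetical, and ISO-8859-1 merely stands in for a platform encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipCharsetRoundTrip {
    /** Writes a one-entry Zip archive in memory, encoding the entry name with cs. */
    static byte[] write(String entryName, Charset cs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf, cs)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(new byte[] {1, 2, 3});
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads the name of the first entry, asking the stream to decode names with cs. */
    static String readFirstName(byte[] zip, Charset cs) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            ZipEntry e = zis.getNextEntry();
            return e == null ? null : e.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Tests bit 11 of the first local header's flags (bit 3 of the byte at offset 7). */
    static boolean bit11(byte[] zip) {
        return (zip[7] & 0x08) != 0;
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt"; // contains one non-ASCII character
        Charset platform = StandardCharsets.ISO_8859_1; // stand-in for a platform encoding

        byte[] legacy = write(name, platform);
        byte[] utf8 = write(name, StandardCharsets.UTF_8);

        // Point (1): a matching charset recovers the name from a non-UTF-8 archive.
        System.out.println(name.equals(readFirstName(legacy, platform)));
        // Point (2): bit 11 is expected only on the UTF-8 archive.
        System.out.println(bit11(utf8) + " " + bit11(legacy));
        // Point (3): with bit 11 ON, UTF-8 is used even if another charset is passed.
        System.out.println(name.equals(readFirstName(utf8, platform)));
    }
}
```

The ZipFile constructors would be used the same way for archives on disk. Existing code that does not pass a Charset keeps using UTF-8.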
Specification
See attached BlenderRev.html
csr for: JDK-4244499 ZipEntry() does not convert filenames from Unicode to platform (Resolved)