-
Bug
-
Resolution: Not an Issue
-
P4
-
None
-
7
-
x86
-
windows_7
FULL PRODUCT VERSION :
jdk1.7.0
ADDITIONAL OS VERSION INFORMATION :
Windows 6.1.7601
A DESCRIPTION OF THE PROBLEM :
Problem Introduction:
No unzipping method that I have tried so far works with zip archives whose entry names contain non-ASCII characters. None of them recovers the real names of the zipped files. This is a well-known, old bug that is claimed to be fixed in Java 7. However, I have tried an early access version of Java 7 and also the Apache package org.apache.tools.zip.*, which is claimed to be a replacement for the pre-Java 7 zip utilities (java.util.zip.*) that adds support for file name encodings other than UTF-8. Neither Java 7 ea nor the Apache solution works. Below I go through 5 ways of reading the zipped file names using java.util.zip.
Problem description and 5 different attempts:
I have a zip file named åäö.zip. The name of the zip file itself causes no problems; it is encoded and decoded correctly. It is the names of the zipped files inside it that cause the problem. My zip file contains 2 zipped files named File_1_refäräns.pdf and File_2_dåvälöpment.pdf. I have tried the following methods, jars and JDKs:
1. Using jdk1.6.0_20 and java.util.zip:
The following code reads the selected zip file:
ZipFile zipFile = new ZipFile([myFilePath]);
for (java.util.Enumeration e = zipFile.entries(); e.hasMoreElements();) {
    ZipEntry zipentry = (ZipEntry) e.nextElement();
    String entryname = zipentry.getName();
    ETC...
Debugging shows the name of the first entry as “entryname = File_1_ref?r?ns.p”, whereas it should be File_1_refäräns.pdf. The file therefore cannot be unzipped correctly, since not even the file extension is read correctly.
2. Using jdk1.7.0 ea, java.util.zip, java.nio.charset.Charset, opening the ZipFile with UTF-8:
The following code reads the selected zip file:
Charset charsetUtf8 = Charset.forName("UTF-8");
ZipFile zipFile = new ZipFile([myFilePath], charsetUtf8);
for (java.util.Enumeration e = zipFile.entries(); e.hasMoreElements();) {
    ZipEntry zipentry = (ZipEntry) e.nextElement();
    String entryname = zipentry.getName();
    ETC...
This time an exception is thrown:
java.lang.IllegalArgumentException: MALFORMED[1]
at java.util.zip.ZipCoder.toString(ZipCoder.java:53)
at java.util.zip.ZipFile.getZipEntry(ZipFile.java:500)
at java.util.zip.ZipFile.access$800(ZipFile.java:53)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:482)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:452)
at se.aklagarmyndigheten.alba.library.contenttransfer.importcontent.SipFile.validateZipEntries(SipFile.java:278)
at se.aklagarmyndigheten.alba.library.contenttransfer.importcontent.SipFile.validate(SipFile.java:214)
at se.aklagarmyndigheten.alba.library.contenttransfer.importcontent.SipImportContainer.onNextComponent(SipImportContainer.java:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:613)
at com.documentum.web.form.FormProcessor.invokeMethod(FormProcessor.java:1630)
…
The red-marked reference in the stack trace points to this custom code:
ZipEntry zipentry = (ZipEntry) e.nextElement();
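The MALFORMED[1] message is consistent with a single byte that does not form valid UTF-8. As a minimal sketch, assuming the stored name contains the byte 0x84 where attempt 4 of this report shows %84 (the byte array and the class name are illustrative and not read from the actual archive), a strict UTF-8 decoder rejects such a name:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class Utf8StrictDecodeDemo {
    // Assumed raw bytes of the first entry name: ASCII plus 0x84 in place of each "ä".
    static final byte[] NAME_BYTES = {
        'F', 'i', 'l', 'e', '_', '1', '_', 'r', 'e', 'f',
        (byte) 0x84, 'r', (byte) 0x84, 'n', 's', '.', 'p', 'd', 'f'
    };

    // Returns true only if the bytes decode as UTF-8 without any error.
    // A new CharsetDecoder reports malformed input by default instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(NAME_BYTES)); // prints "false"
        System.out.println(isValidUtf8("File_1.pdf".getBytes(StandardCharsets.US_ASCII))); // prints "true"
    }
}
```

A lone 0x84 is a UTF-8 continuation byte with no lead byte, so under this assumption the exception would indicate that the names were not stored as UTF-8 in the first place.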
3. Using jdk1.7.0 ea, java.util.zip, java.nio.charset.Charset, opening the ZipFile with ISO-8859-1:
The following code reads the selected zip file:
Charset charsetLatin1 = Charset.forName("ISO-8859-1");
ZipFile zipFile = new ZipFile([myFilePath], charsetLatin1);
for (java.util.Enumeration e = zipFile.entries(); e.hasMoreElements();) {
    ZipEntry zipentry = (ZipEntry) e.nextElement();
    String entryname = zipentry.getName();
    ETC...
This time no exception is thrown. However, debugging shows the name of the first entry as “entryname = File_1_ref?r?ns.pdf”, whereas it should be File_1_refäräns.pdf. Compared to the first attempt (jdk1.6: “File_1_ref?r?ns.p”) this is slightly better: at least the full file extension, .pdf, is read.
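One possible reading of this result, assuming the stored byte for ä is 0x84 as the %84 in attempt 4 suggests: ISO-8859-1 maps every byte value to a character, so decoding cannot fail, but 0x84 becomes the control character U+0084, which debuggers and consoles often display as “?”. A small sketch (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

final class Latin1DecodeDemo {
    // Decodes raw entry-name bytes the way a ZipFile opened with ISO-8859-1 would.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "ref", the assumed byte 0x84, then "r".
        String decoded = decodeLatin1(new byte[] { 'r', 'e', 'f', (byte) 0x84, 'r' });
        char c = decoded.charAt(3);
        // prints "U+0084 control=true"
        System.out.println(String.format("U+%04X control=%b", (int) c, Character.isISOControl(c)));
    }
}
```

Under this assumption the “?” is only how the character is displayed; the original byte value is still recoverable from the string.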
4. Using jdk1.7.0 ea, java.util.zip, java.nio.charset.Charset, opening the ZipFile with ISO-8859-1 and URL-encoding the entry name with ISO-8859-1:
The following code reads the selected zip file:
Charset charsetLatin1 = Charset.forName("ISO-8859-1");
ZipFile zipFile = new ZipFile([myFilePath], charsetLatin1);
for (java.util.Enumeration e = zipFile.entries(); e.hasMoreElements();) {
    ZipEntry zipentry = (ZipEntry) e.nextElement();
    String entrynameEncodeUnsafeAscii = java.net.URLEncoder.encode(zipentry.getName(), "ISO-8859-1");
    String entryname = entrynameEncodeUnsafeAscii;
    ETC...
This time, debugging shows the file names as “entryname = File_1_ref%84r%84ns.pdf” and “entryname = File_2_d%86v%84l%94pment.pdf”, respectively. At least now the output distinguishes between å, ä and ö; in attempts 1-3 those characters were all shown as “?”.
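The values %86, %84 and %94 are the positions of å, ä and ö in DOS code page 437 (code page 850 uses the same positions for these three letters). Assuming the archive was written by a tool that stores names in that code page, a name that was decoded as ISO-8859-1 can be mapped back to its bytes and decoded again. The sketch below uses hypothetical class and method names and assumes the runtime ships the "IBM437" charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp437Recode {
    // Assumption: the entry names were written in code page 437.
    static final Charset CP437 = Charset.forName("IBM437");

    // ISO-8859-1 maps bytes to chars one-to-one, so encoding with it restores
    // the original bytes, which are then decoded with the assumed code page.
    static String latin1ToCp437(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP437);
    }

    public static void main(String[] args) {
        // \u0086, \u0084 and \u0094 stand for the characters reported as %86, %84 and %94.
        System.out.println(latin1ToCp437("File_2_d\u0086v\u0084l\u0094pment.pdf"));
    }
}
```

If the assumption holds, opening the ZipFile with that charset directly avoids the detour through ISO-8859-1.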
5. Using jdk1.7.0 ea, java.util.zip, java.nio.charset.Charset, opening the ZipFile with ISO-8859-1, then URL-encoding and URL-decoding the entry name with ISO-8859-1:
The following code reads the selected zip file:
Charset charsetLatin1 = Charset.forName("ISO-8859-1");
ZipFile zipFile = new ZipFile([myFilePath], charsetLatin1);
for (java.util.Enumeration e = zipFile.entries(); e.hasMoreElements();) {
    ZipEntry zipentry = (ZipEntry) e.nextElement();
    String entrynameEncodeUnsafeAscii = java.net.URLEncoder.encode(zipentry.getName(), "ISO-8859-1");
    String entryname = java.net.URLDecoder.decode(entrynameEncodeUnsafeAscii, "ISO-8859-1");
    ETC...
And then we’re back to entryname = “File_1_ref?r?ns.pdf”.
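This outcome is what one would expect from the API: URLDecoder.decode with a given charset undoes URLEncoder.encode with the same charset, so the round trip returns the string it started with. A brief sketch, with \u0084 standing in for the character that the debugger shows as “?” (an assumption based on attempt 4; the class name is hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlRoundTripDemo {
    // Percent-encodes and then decodes a name with the same charset.
    static String roundTrip(String name, String charsetName) {
        try {
            String encoded = URLEncoder.encode(name, charsetName);
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String name = "File_1_ref\u0084r\u0084ns.pdf";
        System.out.println(roundTrip(name, "ISO-8859-1").equals(name)); // prints "true"
    }
}
```

Changing the decoded name therefore requires decoding the bytes with a different charset than the one used to encode them.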
//EOF
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Please follow the 5 different approaches to reading the file name of a zipped file, described in the previous section.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Expected to get correct file names (with å, ä and ö) at unzipping.
ACTUAL -
å, ä and ö replaced by either "?" or "%84", "%86" etc.
ERROR MESSAGES/STACK TRACES THAT OCCUR :
N/A for 4 of the approaches, but for the second one (again, see problem description):
java.lang.IllegalArgumentException: MALFORMED[1]
at java.util.zip.ZipCoder.toString(ZipCoder.java:53)
at java.util.zip.ZipFile.getZipEntry(ZipFile.java:500)
at java.util.zip.ZipFile.access$800(ZipFile.java:53)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:482)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:452)
at se.aklagarmyndigheten.alba.library.contenttransfer.importcontent.SipFile.validateZipEntries(SipFile.java:278)
at
REPRODUCIBILITY :
This bug can be reproduced always.
---------- BEGIN SOURCE ----------
Out of 5 different approaches (please see description section for source code for all of them), I'm only displaying the one that caused an IllegalArgumentException to be thrown, here:
Charset charsetUtf8 = Charset.forName("UTF-8");
ZipFile zipFile = new ZipFile([myFilePath], charsetUtf8);
for (java.util.Enumeration e = zipFile.entries(); e.hasMoreElements();) {
    ZipEntry zipentry = (ZipEntry) e.nextElement();
    String entryname = zipentry.getName();
    ...ETC...
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
Sorry, no workaround.
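A possible approach, assuming the entry names were written in code page 437 (which the %84, %86 and %94 values in attempt 4 suggest, and which the ZIP format uses for names that are not flagged as UTF-8): pass that charset to the ZipFile constructor added in Java 7. To stay self-contained, the sketch below first writes a small sample archive in that code page and then reads it back. The class name, the helper methods and the temporary file are illustrative only, and the "IBM437" charset is assumed to be available in the runtime.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class Cp437ZipDemo {
    // Assumption: the archive's entry names use code page 437.
    static final Charset CP437 = Charset.forName("IBM437");

    // Writes a temporary archive whose entry names are encoded in CP437.
    static File writeSampleZip(String... names) {
        try {
            File zip = File.createTempFile("cp437-demo", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip), CP437)) {
                for (String name : names) {
                    out.putNextEntry(new ZipEntry(name));
                    out.write(1); // one byte of dummy content
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Lists the entry names, decoding them with the given charset.
    static List<String> entryNames(File zip, Charset charset) {
        List<String> names = new ArrayList<String>();
        try (ZipFile zipFile = new ZipFile(zip, charset)) {
            for (Enumeration<? extends ZipEntry> en = zipFile.entries(); en.hasMoreElements();) {
                names.add(en.nextElement().getName());
            }
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
        return names;
    }

    public static void main(String[] args) {
        // Unicode escapes for ä (\u00e4), å (\u00e5) and ö (\u00f6) keep the source ASCII-only.
        File zip = writeSampleZip("File_1_ref\u00e4r\u00e4ns.pdf",
                                  "File_2_d\u00e5v\u00e4l\u00f6pment.pdf");
        System.out.println(entryNames(zip, CP437));
    }
}
```

For a real archive, only entryNames would be needed, called with the path of the existing file. Whether IBM437, IBM850 or another code page is correct depends on the tool that created the archive.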
SUPPORT :
YES