Name: krT82822 Date: 01/29/2000
This RFE is related, but not identical, to 4297031. The issue is a
generalization of 4297031, and the remedy suggested here is different;
I'll explain why the mechanism in 4297031 isn't suitable for a general
solution.
The main issue is that Java jars can be successfully sent between platforms
but the cross-platform compatibility breaks down if those jars contain text
files (html documentation or Java source code, for example). Line separators
are the biggest problem. Windows uses \r\n, UNIX uses just \n,
Mac uses just \r. A text file jar'd on one platform and unjar'd on another
will appear either to have spurious control characters at the beginnings or
ends of lines, or to be squashed into one insanely long line. Since every
JRE knows its own platform's line.separator, it would be trivially easy to
convert files as they are extracted, but to do so the tool needs a reliable
way to know which files in the jar are text; conversions applied to non-text
files would corrupt them.
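For illustration, here is roughly what that trivial extraction-side conversion could look like; the class and method names are invented for this sketch, not part of any existing tool:

```java
// Sketch only: convert a text entry's line endings to the local
// line.separator while copying it out of an archive.
import java.io.*;

public class LineSeparatorConverter {
    // Copies text from 'in' to 'out', terminating each line with the
    // local platform's line.separator. BufferedReader.readLine() accepts
    // \n, \r\n, and \r alike, so any source convention is handled.
    public static void convert(Reader in, Writer out) throws IOException {
        String localSep = System.getProperty("line.separator");
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            out.write(line);
            out.write(localSep);
        }
        out.flush();
    }
}
```

The hard part, again, is not this loop but knowing reliably which entries it may safely be applied to.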
Mistranslated line endings don't always cause problems: for example, Java
sources can be jar'd from one platform to another and still compile because
javac always accepts all 3 styles of line ending regardless of platform. That
doesn't make the problem go away, though. The file can successfully compile
and still give you fits if you try to view it in your platform's text editor,
or run diffs against it.
A second cross-platform issue for text files is translation into the local
character encoding. Here again, Java's support for encodings in
InputStreamReader and OutputStreamWriter makes the conversion itself simple,
but the tool needs a reliable way of knowing which entries to convert and
which conversion to use.
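The conversion itself is a few lines with the existing stream classes; a sketch (names invented for illustration):

```java
// Sketch only: re-encode a byte stream from one charset to another
// using InputStreamReader/OutputStreamWriter.
import java.io.*;

public class EncodingConverter {
    public static void recode(InputStream in, String fromEnc,
                              OutputStream out, String toEnc)
            throws IOException {
        Reader r = new InputStreamReader(in, fromEnc);   // decode source bytes
        Writer w = new OutputStreamWriter(out, toEnc);   // encode target bytes
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            w.write(buf, 0, n);
        }
        w.flush();
    }
}
```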
There is no ZIP mechanism to address the character set issue,
ZIP being a "provincial" pre-i18n interchange standard.
The ZIP format (ftp://ftp.uu.net/pub/archiving/zip/doc/appnote-970311-iz.zip)
does include a mechanism for addressing the line separator problem, and that is

what 4297031 refers to. But here's why it's not quite enough.
=== The ZIP mechanism and why it isn't enough ===
The approach in the ZIP standard to autoconversion of text line endings
involves two fields in the zip directory for each entry: a single bit
text/binary flag, and an 8-bit source operating system code (the high byte
of the "version made by" field). If the bit indicates an entry is text, the
extracting program is to determine the line separator used on the source
operating system (based on the 8-bit OS code) and convert from that separator
to the local style. That approach requires the extracting zip/jar tool to know
something about OTHER operating systems' local conventions, and is much less
practical than the proven approach used in most of our networking protocols
including IP itself: define a standard canonical format to use in transit.
The tools at each endpoint only need to know how to convert between their
local conventions and the canonical form; they don't need to know what's on
the other end. If there are n platforms, you need at most n distinct
conversion schemes, not n^2 as in the ZIP approach.
But there's another, fatal problem with the ZIP scheme: autoconversion during
unzip relies on the text/binary bit to indicate which entries to convert, but
that bit isn't reliable. When building an archive, the common ZIP programs
compile statistics on the different byte values contained in the files, and
set the text bit for entries that look statistically similar to text! Binary
files occasionally satisfy that heuristic by chance, get flagged as text, and
are silently corrupted during unzip -a.
About 3% of Java class files look like text to pkzip, based on experience with
a moderately sized package, the ANTLR parser generator. About half a dozen
of its slightly-under-200 class files were marked as text. When the archive was
taken to a different platform and unzipped with autoconversion, those classes
became unusable. Other users of the same distribution jar on other platforms
had no problem, so many bug reports were filed as irreproducible before the
problem was understood. Clearly a scheme that can permit silent and unexpected
corruption of data is not suitable as a general solution.
=== Elements of a general solution ===
Here are the elements a general solution would need:
1. A way to encode in a jar file, unambiguously, which entries should be
subject to text conversions and which entries should not. This information
must not come from blind heuristics but be visible to and under the control
of the developer.
2. That information should be stored separately from the existing ZIP
text/binary bit and not confused with it, because the practice in existing
implementations has made the ZIP text bit unreliable.
3. A canonical form for line separators to be used when text files are stored
in the jar. When the jar is created, files marked as text are converted
from the local line.separator to the canonical form. During extraction,
files marked as text are converted from the canonical to the local form.
4. Either a canonical character encoding (utf-8?) to be used for text files
within the jar, or a way to identify the encoding used in the jar.
On jar creation, text files are read in the local encoding and written to
the jar in the jar encoding. The reverse happens on extraction.
Non-substitution mode should be used (or simulated): if a file contains any
character that cannot be represented in the in-jar encoding, or in the
receiving system's local encoding, an exception should be thrown rather
than silently substituting a question mark.
Relevant standards:
The new Jar File Specification accompanying the Java 1.3 SDK adds a new
per-entry attribute to the manifest.
(http://java.sun.com/products/jdk/1.3/docs/guide/jar/jar.html#Per-Entry
Attributes)
It is now possible, and conformant, to associate a Content-Type with each entry
in a jar. These attributes are visible to and controlled by the developer,
and permit unambiguous determination of which entries are to be treated as
text. So, we can use the Content-Type to address requirement 1. Content types
in the manifest are separate and distinct from the unreliable ZIP text bit,
satisfying requirement 2.
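A per-entry manifest section marking an entry as text might look like this (the entry name and charset are invented for illustration):

```
Name: doc/ReadMe.txt
Content-Type: text/plain; charset=iso-8859-1
```

With such an attribute present, a conforming tool can decide unambiguously whether and how to convert the entry.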
Requirement 3 is also addressed, because any entry whose type is marked
text/* becomes subject to the provisions of RFC2046
(ftp://ftp.isi.edu/in-notes/rfc2046.txt) defining the text type, and
section 4.1.1 of that RFC establishes CRLF as the canonical line ending for
all text types.
Requirement 4 is addressed because RFC2046 also establishes (4.1.2) a charset
parameter that applies to all text types, and the SDK 1.3 API docs finally
establish a straightforward mapping between charset names and encoding names.
(http://java.sun.com/products/jdk/1.3/docs/api/java/lang/package-summary.html#charenc)
(However, a bug in the jar specification could, if not corrected, result in
the existence of implementations that fail to parse content type parameters;
see [internal review id 100506].)
Content types in a manifest do not violate earlier jar standards; they would
be ignored but in no way make a jar file incompatible with earlier jar tools.
=== RFE (the moment you've been waiting for) ===
Enhance the jar tool to pay attention to Content-Type attributes in the manifest
both during creation/update and during extraction. Processing of entries with
any non-text type to be unchanged. For text types, if the type attribute
includes a charset parameter, for creation/update convert the content so:
(local file) -> InputStreamReader(,local default)
-> OutputStreamWriter(,specified charset) -> (archive)
and for extraction:
(archive) -> InputStreamReader(,specified charset)
-> OutputStreamWriter(,local default) -> (local file)
Perform line separator conversion between the Reader and Writer above;
for creation/update:
(local file) -> Reader -> (local line.separator -> CRLF) -> Writer -> (archive)
and for extraction:
(archive) -> Reader -> (CRLF -> local line.separator) -> Writer -> (local file).
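The creation/update path above could be sketched like so; the class and method names are invented for illustration, and this is a sketch of the proposed behavior, not existing jar-tool code:

```java
// Sketch only: archive a local text file, converting the local line
// convention to canonical CRLF and re-encoding into the jar charset.
import java.io.*;

public class TextEntryArchiver {
    public static void archive(InputStream localFile, OutputStream archive,
                               String jarCharset) throws IOException {
        // Read in the local default encoding (no charset argument).
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(localFile));
        // Write in the charset named by the Content-Type parameter.
        Writer writer = new OutputStreamWriter(archive, jarCharset);
        String line;
        // readLine() accepts \n, \r\n, or \r, so any local convention
        // maps cleanly to the canonical CRLF form.
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write("\r\n");
        }
        writer.flush();
    }
}
```

Extraction is the mirror image: an InputStreamReader in the specified charset, CRLF mapped to the local line.separator, and an OutputStreamWriter in the local default encoding.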
Entries with no content type specified to be treated as non-text, therefore
processed without change from current versions. So the enhancement should be
fully transparent to all jars that do not include content-type attributes with
text types. The processing applied to (correctly marked) text types is
appropriate and correct; nevertheless, an option to disable such processing
(treat all entries as non-text) will likely be useful in some situations.
Gravy: provide a way to indicate, on creation/update, that certain local files
are already in an encoding other than the platform default (create a different
InputStreamReader); or, on extract, that certain entries should be extracted
and stored in an encoding other than the platform default (create a different
OutputStreamWriter). Without the gravy, this can be achieved by multiple
invocations of the jar tool with -Dfile.encoding.
Question to be resolved: should line separator processing be applied to all
subtypes of text, or only to text/plain? RFC2046 specifies CRLF as the
canonical form for all text subtypes, but I'm not sure if that means the
conversion to local separator can safely be applied to all of them. Other
opinions should be sought.
Thanks for listening...
Chapman Flack Purdue CERIAS ###@###.###
(Review ID: 100510)
======================================================================