  1. JDK
  2. JDK-4308363

Jar tool support for text/* content types


    • jar
    • generic
    • generic

      Name: krT82822 Date: 01/29/2000


      This RFE is related, but not identical, to 4297031. The issue is a
      generalization of 4297031, and the remedy suggested here is different;
      I'll explain why the mechanism in 4297031 isn't suitable for a general
      solution.

      The main issue is that Java jars can be successfully sent between platforms
      but the cross-platform compatibility breaks down if those jars contain text
      files (html documentation or Java source code, for example). Line separators
      are the biggest problem. Windows uses \r\n, UNIX uses just \n,
      Mac uses just \r. A text file jar'd on one platform and unjar'd on another
      will appear either to have spurious control characters at the beginnings or
      ends of lines, or to be squashed into one insanely long line. Since every
      JRE knows its own platform's line.separator, it would be trivially easy to
      convert files as they are extracted, but to do so the tool needs a reliable
      way to know which files in the jar are text; conversions applied to non-text
      files would corrupt them.
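
      A minimal sketch of the conversion step described above, for illustration
      only. The class and method names are hypothetical, and it assumes the
      entry has already been decoded to a String. It accepts all three
      line-ending styles as input.

```java
final class LineEndings {
    /** Rewrites every CRLF, lone CR, or lone LF in text as the given separator. */
    static String convert(String text, String separator) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Treat CRLF as a single line ending by skipping the LF that follows.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(separator);
            } else if (c == '\n') {
                out.append(separator);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Converts to the running platform's own line separator. */
    static String toLocal(String text) {
        return convert(text, System.lineSeparator());
    }
}
```

      Applying convert to a class file or image would rewrite any byte pair that
      happens to look like a line ending, which is why the tool must know which
      entries are text before calling it.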

      Mistranslated line endings don't always cause problems: for example, Java
      sources can be jar'd from one platform to another and still compile because
      javac always accepts all 3 styles of line ending regardless of platform. That
      doesn't make the problem go away, though. The file can successfully compile
      and still give you fits if you try to view it in your platform's text editor,
      or run diffs against it.

      A second cross-platform issue for text files is translation into the local
      character encoding. Here again, Java's support for encodings in
      InputStreamReader and OutputStreamWriter makes the conversion itself simple,
      but the tool needs a reliable way of knowing which entries to convert and
      which conversion to use.

      There is no ZIP mechanism to address the character set issue; ZIP is a
      "provincial" pre-i18n interchange standard.
      The ZIP format (ftp://ftp.uu.net/pub/archiving/zip/doc/appnote-970311-iz.zip)
      does include a mechanism for addressing the line separator problem, and that is
      what 4297031 refers to. But here's why it's not quite enough.

      === The ZIP mechanism and why it isn't enough ===

      The approach in the ZIP standard to autoconversion of text line endings
      involves two fields in the zip directory for each entry: a single bit
      text/binary flag, and an 8-bit source operating system code (the high byte
      of the "version made by" field). If the bit indicates an entry is text, the
      extracting program is to determine the line separator used on the source
      operating system (based on the 8-bit OS code) and convert from that separator
      to the local style. That approach requires the extracting zip/jar tool to know
      something about OTHER operating systems' local conventions, and is much less
      practical than the proven approach used in most of our networking protocols
      including IP itself: define a standard canonical format to use in transit.
      The tools at each endpoint only need to know how to convert between their
      local conventions and the canonical form; they don't need to know what's on
      the other end. If there are n platforms you only need at most n distinct
      conversion schemes, not at most n^2 as in the ZIP approach.

      But there's another, fatal problem with the ZIP scheme: autoconversion during
      unzip relies on the text/binary bit to indicate which entries to convert, but
      that bit isn't reliable. When building an archive, the common ZIP programs
      compile statistics on the different byte values contained in the files, and
      set the text bit for entries that look statistically similar to text! Binary
      files occasionally satisfy that heuristic by chance, get flagged as text, and
      are silently corrupted during unzip -a.
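
      To see how a statistical guess can misfire, here is a deliberately
      simplified heuristic. It is NOT the algorithm any real ZIP program uses;
      the 90% threshold, the byte classes, and the names are all assumptions
      made up for this illustration.

```java
final class TextGuess {
    /**
     * Hypothetical heuristic: call the data "text" when at least 90% of its
     * bytes are printable ASCII or tab/CR/LF. Real tools may differ.
     */
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int printable = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 0x20 && v < 0x7F) || v == '\t' || v == '\r' || v == '\n') {
                printable++;
            }
        }
        // Integer form of printable / length >= 0.9
        return printable * 10 >= data.length * 9;
    }
}
```

      A binary file dominated by string constants can pass a test like this even
      though it starts with non-text bytes; once flagged, any byte sequence in
      it that resembles a line ending is fair game for conversion.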

      About 3% of Java class files look like text to pkzip, based on experience with
      a moderately sized package, the ANTLR parser generator. About half a dozen
      of its slightly-under-200 class files were marked as text. If the archive was
      taken to a different platform and unzipped with autoconversion, those classes
      became unusable. Other users of the same distribution jar on other platforms
      had no problem, so many bug reports were filed as irreproducible before the
      problem was understood. Clearly a scheme that can permit silent and unexpected
      corruption of data is not suitable as a general solution.

      === Elements of a general solution ===

      Here are the elements a general solution would need:

      1. A way to encode in a jar file, unambiguously, which entries should be
          subject to text conversions and which entries should not. This information
          must not come from blind heuristics but be visible to and under the control
          of the developer.

      2. That information should be stored separately from the existing ZIP
          text/binary bit and not confused with it, because the practice in existing
          implementations has made the ZIP text bit unreliable.

      3. A canonical form for line separators to be used when text files are stored
          in the jar. When the jar is created, files marked as text are converted
          from the local line.separator to the canonical form. During extraction,
          files marked as text are converted from the canonical to the local form.

      4. Either a canonical character encoding (utf-8?) to be used for text files
          within the jar, or a way to identify the encoding used in the jar.
          On jar creation, text files are read in the local encoding and written to
          the jar in the jar encoding. The reverse happens on extraction.
          Non-substitution mode should be used (or simulated): if a file contains any
          character that cannot be represented in the in-jar encoding, or in the
          receiving system's local encoding, an exception should be thrown rather
          than silently substituting a question mark.
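
      The non-substitution behavior in item 4 can be sketched with the
      java.nio.charset API, which postdates this report. The class and method
      names are hypothetical; the point is only that CodingErrorAction.REPORT
      makes the encoder throw instead of substituting.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictEncoding {
    /**
     * Encodes text in the given charset, throwing CharacterCodingException
     * if any character cannot be represented, rather than writing '?'.
     */
    static byte[] encode(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buf = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }
}
```

      By contrast, String.getBytes and a default OutputStreamWriter replace
      unmappable characters silently, which is the behavior item 4 wants to
      avoid.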

      Relevant standards:

      The new Jar File Specification accompanying the Java 1.3 SDK adds a new
      per-entry attribute to the manifest.
      (http://java.sun.com/products/jdk/1.3/docs/guide/jar/jar.html#Per-Entry Attributes)
      It is now possible, and conformant, to associate a Content-Type with each entry
      in a jar. These attributes are visible to and controlled by the developer,
      and permit unambiguous determination of which entries are to be treated as
      text. So, we can use the Content-Type to address requirement 1. Content types
      in the manifest are separate and distinct from the unreliable ZIP text bit,
      satisfying requirement 2.
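
      As an illustration of how a tool could read that attribute, the sketch
      below uses java.util.jar.Manifest. The helper names and the sample entry
      name are hypothetical; Manifest.getAttributes and Attributes.getValue are
      the standard calls.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

final class EntryContentTypes {
    /** Returns the Content-Type recorded for the entry, or null if there is none. */
    static String contentTypeOf(Manifest manifest, String entryName) {
        Attributes attrs = manifest.getAttributes(entryName);
        return attrs == null ? null : attrs.getValue("Content-Type");
    }

    /** Parses manifest text; a convenience for trying the lookup without a jar. */
    static Manifest parse(String manifestText) throws IOException {
        return new Manifest(
                new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
    }
}
```

      A manifest that marks one entry as text might contain a section such as
      "Name: docs/index.html" followed by
      "Content-Type: text/html; charset=utf-8". Entries with no section, or no
      Content-Type in their section, yield null.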

      Requirement 3 is also addressed, because any entry whose type is marked
      text/* becomes subject to the provisions of RFC2046
      (ftp://ftp.isi.edu/in-notes/rfc2046.txt) defining the text type, and
      section 4.1.1 of that RFC establishes CRLF as the canonical line ending for
      all text types.

      Requirement 4 is addressed because RFC2046 also establishes (4.1.2) a charset
      parameter that applies to all text types, and the SDK 1.3 API docs finally
      establish a straightforward mapping between charset names and encoding names.
      (http://java.sun.com/products/jdk/1.3/docs/api/java/lang/package-summary.html#charenc)
      (However, a bug in the jar specification could, if not corrected, result in
      the existence of implementations that fail to parse content type parameters;
      see [internal review id 100506].)
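
      A rough sketch of pulling the charset parameter out of a Content-Type
      value and mapping it to a Java Charset. This is a simplification, not a
      full RFC 2045 parameter parser: it assumes no semicolons inside quoted
      values and no comments. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class ContentTypeParser {
    /** True when the media type (ignoring parameters and case) is text/*. */
    static boolean isText(String contentType) {
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.startsWith("text/");
    }

    /**
     * Returns the Charset named by the charset parameter, or null if the
     * parameter is absent. Charset.forName throws for unknown or illegal names.
     */
    static Charset charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                String name = kv[1].trim();
                // Strip one pair of surrounding double quotes, if present.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return Charset.forName(name);
            }
        }
        return null;
    }
}
```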

      Content types in a manifest do not violate earlier jar standards; they would
      be ignored but in no way make a jar file incompatible with earlier jar tools.

      === RFE (the moment you've been waiting for) ===

      Enhance the jar tool to pay attention to Content-Type attributes in the manifest
      both during creation/update and during extraction. Processing of entries with
      any non-text type to be unchanged. For text types, if the type attribute
      includes a charset parameter, for creation/update convert the content so:

        (local file) -> InputStreamReader(,local default)
                     -> OutputStreamWriter(,specified charset) -> (archive)

      and for extraction:

        (archive) -> InputStreamReader(,specified charset)
                  -> OutputStreamWriter(,local default) -> (local file)

      Perform line separator conversion between the Reader and Writer above;
      for creation/update:

      (local file) -> Reader -> (local line.separator -> CRLF) -> Writer -> (archive)

      and for extraction:

      (archive) -> Reader -> (CRLF -> local line.separator) -> Writer -> (local file).
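
      The two pipelines above could look roughly like this. It is a sketch under
      assumptions, not a proposed patch to the jar tool: the class and method
      names are hypothetical, the streams stand in for the archive entry and the
      local file, and the plain InputStreamReader/OutputStreamWriter used here
      substitute on unmappable characters rather than throwing.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

final class TextEntryCopier {
    /** (local file) -> Reader -> (local separator -> CRLF) -> Writer -> (archive) */
    static void store(InputStream local, Charset localCharset, String localSeparator,
                      OutputStream archived, Charset entryCharset) throws IOException {
        Reader in = new InputStreamReader(local, localCharset);
        StringBuilder text = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            text.append(buf, 0, n);
        }
        Writer out = new OutputStreamWriter(archived, entryCharset);
        out.write(text.toString().replace(localSeparator, "\r\n"));
        out.flush();
    }

    /** (archive) -> Reader -> (CRLF -> local separator) -> Writer -> (local file) */
    static void extract(InputStream archived, Charset entryCharset,
                        OutputStream local, Charset localCharset,
                        String localSeparator) throws IOException {
        Reader in = new BufferedReader(new InputStreamReader(archived, entryCharset));
        Writer out = new BufferedWriter(new OutputStreamWriter(local, localCharset));
        boolean pendingCr = false;
        int c;
        while ((c = in.read()) != -1) {
            if (pendingCr) {
                pendingCr = false;
                if (c == '\n') {
                    out.write(localSeparator); // a complete CRLF
                    continue;
                }
                out.write('\r'); // lone CR is not canonical; pass it through
            }
            if (c == '\r') {
                pendingCr = true;
            } else {
                out.write(c);
            }
        }
        if (pendingCr) {
            out.write('\r');
        }
        out.flush();
    }
}
```

      Neither method closes the streams it is given; the caller owns them.
      store buffers the whole file in memory for brevity, which a real tool
      would probably avoid for large entries.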

      Entries with no content type specified to be treated as non-text, therefore
      processed without change from current versions. So the enhancement should be
      fully transparent to all jars that do not include content-type attributes with
      text types. The processing applied to (correctly marked) text types is
      appropriate and correct; nevertheless, an option to disable such processing
      (treat all entries as non-text) will likely be useful in some situations.

      Gravy: provide a way to indicate, on creation/update, that certain local files
      are already in an encoding other than the platform default (create a different
      InputStreamReader); or, on extract, that certain entries should be extracted
      and stored in an encoding other than the platform default (create a different
      OutputStreamWriter). Without the gravy, this can be achieved by multiple
      invocations of the jar tool with -Dfile.encoding.

      Question to be resolved: should line separator processing be applied to all
      subtypes of text, or only to text/plain? RFC2046 specifies CRLF as the
      canonical form for all text subtypes, but I'm not sure if that means the
      conversion to local separator can safely be applied to all of them. Other
      opinions should be sought.

      Thanks for listening...
      Chapman Flack Purdue CERIAS ###@###.###
      (Review ID: 100510)
      ======================================================================

            bristor Dave Bristor (Inactive)
            kryansunw Kevin Ryan (Inactive)