JDK-4100320

URLEncoder.encode() incorrect on non-ASCII platforms


    • Affects version: 1.2beta4
    • CPU: sparc
    • OS: solaris_2.5.1
    • Verification: Not verified




      Name: mf23781 Date: 12/18/97


      The URLEncoder.encode() method converts a String to its URL-encoded
      form:
         o regular alphanumeric characters are not changed.
         o non-alphanumeric characters are converted to %xx, where xx represents
           the ASCII hexadecimal value of the character.
         o spaces are converted to '+'.

      Hence, "abc+  def" (the input contains two spaces) becomes "abc%2b++def"

         URL-encoded characters are those from a portable subset of ASCII,
         destined to be used in URLs so that they can be correctly handled
         by computers around the world. (Ref: Java 1.1 Developer's Handbook)
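
         For concreteness, here is a minimal standalone sketch of that
         documented behavior on an ASCII platform (the class name is
         illustrative; the case of the hex digits varies between JDK releases):

             import java.net.URLEncoder;

             public class EncodeDemo {
                 public static void main(String[] args) {
                     // '+' is non-alphanumeric, so it is escaped to its ASCII
                     // value 0x2b; each of the two spaces becomes a literal '+'.
                     System.out.println(URLEncoder.encode("abc+  def")); // abc%2b++def
                 }
             }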
       
         Problem 1:
         As each character is handled (whether converted to hex or not),
         it is written to the ByteArrayOutputStream (out) using the out.write
         method. Once the whole string has been written to the stream,
         the stream is converted to a String, using the
         ByteArrayOutputStream.toString method:

              return out.toString();

         This method converts the stream's bytes to a String using the default
         local encoding. This is incorrect: the stream does not contain data in
         the default local encoding, but ASCII characters and/or hex escapes
         created by out.write. The result is that the encoded string returned
         from this method is garbage.
        
         Solution:
         Explicitly specify the encoding of the stream, rather than letting it
         default to the platform's local encoding:
              try {
                  return out.toString("8859_1");
              } catch (UnsupportedEncodingException e) {
                  System.out.println("Exception: " + e);
                  return null;
              }
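
         To see why naming the charset matters, here is a minimal standalone
         sketch (the class name and sample bytes are illustrative):

             import java.io.ByteArrayOutputStream;
             import java.io.UnsupportedEncodingException;

             public class ToStringDemo {
                 public static void main(String[] args) throws UnsupportedEncodingException {
                     ByteArrayOutputStream out = new ByteArrayOutputStream();
                     // encode() writes raw ASCII bytes such as '%', '2', 'b'
                     out.write('%');
                     out.write('2');
                     out.write('b');
                     // toString() decodes them with the platform default
                     // encoding (EBCDIC on OS/390), producing garbage there
                     System.out.println(out.toString());
                     // toString("8859_1") decodes the bytes as the ASCII
                     // they really are, on every platform
                     System.out.println(out.toString("8859_1"));
                 }
             }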


         Once Problem 1 is corrected, the output is:
            OS/390: abc%4e++def  (%4e is the EBCDIC value of '+')
            AIX:    abc%2b++def  (%2b is the ASCII value of '+')

         Problem 2:
         The above output is incorrect on any non-ASCII platform. The purpose
         of the URLEncoder.encode() method is to convert non-alphanumeric
         characters to their ASCII hexadecimal values, but on OS/390 we get
         EBCDIC hex values. The URLEncoder.encode() documentation states that
         it "converts to the external encoding before hex conversion", so we
         end up with the local encoding's hex value of the character.

         Solution:
         Explicitly specify the encoding of the writer, rather than letting it
         default to the platform's local encoding:
              OutputStreamWriter writer = null;
              try {
                  writer = new OutputStreamWriter(buf, "8859_1");
              } catch (UnsupportedEncodingException e) {
                  System.out.println("Exception: " + e);
              }
         This will ensure that an ASCII character is written to the buffer and
         subsequently converted into hex.
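
         Putting both fixes together, a minimal sketch of a corrected encode()
         might look like this (an illustration of the two fixes, not the
         actual JDK source):

             import java.io.ByteArrayOutputStream;
             import java.io.IOException;
             import java.io.OutputStreamWriter;

             public class FixedURLEncoder {
                 public static String encode(String s) {
                     try {
                         // scratch buffer + writer pinned to 8859_1 (Problem 2 fix)
                         ByteArrayOutputStream buf = new ByteArrayOutputStream();
                         OutputStreamWriter writer = new OutputStreamWriter(buf, "8859_1");
                         ByteArrayOutputStream out = new ByteArrayOutputStream();
                         for (int i = 0; i < s.length(); i++) {
                             char c = s.charAt(i);
                             if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                                     || (c >= '0' && c <= '9')) {
                                 out.write(c);
                             } else if (c == ' ') {
                                 out.write('+');
                             } else {
                                 // convert the char to its ASCII octet(s), then hex-escape
                                 writer.write(c);
                                 writer.flush();
                                 byte[] bytes = buf.toByteArray();
                                 buf.reset();
                                 for (int j = 0; j < bytes.length; j++) {
                                     out.write('%');
                                     out.write(Character.forDigit((bytes[j] >> 4) & 0xF, 16));
                                     out.write(Character.forDigit(bytes[j] & 0xF, 16));
                                 }
                             }
                         }
                         // decode the ASCII bytes as ASCII (Problem 1 fix)
                         return out.toString("8859_1");
                     } catch (IOException e) {
                         System.out.println("Exception: " + e);
                         return null;
                     }
                 }
             }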

      Circumvention:

       Temporarily change the default local encoding
       to ASCII around the call to URLEncoder.encode(),
       e.g.:

      Properties prop = System.getProperties();
      prop.put("file.encoding", "8859_1"); // temporarily switch to ASCII (8859_1)
      String result = URLEncoder.encode(s);
      prop.put("file.encoding", "Cp1047"); // reset to the local encoding

      ======================================================================

      A licensee has this to add:

      I have had a look at various RFCs concerning
      HTTP (RFC 2068),
      URIs (RFC 2396, http://www.ics.uci.edu/pub/ietf/uri/)
      and HTML (http://www.w3.org/MarkUp/). While all of these were
      first designed to use only the US-ASCII character encoding, there
      are moves afoot to extend the standards to encompass different
      encodings. Specifically, an http server can specify the
      character encoding of the message content, and an http client
      can request content in a particular character encoding from
      the server.

      However, URL encoding is more complicated, since there are the
      issues of both the external encoding, which is the same as the
      content encoding of an http message, for example, and the
      internal encoding, which is the translation of characters to
      octets (bytes) PRIOR to URL-encoding of the octets.
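
      A two-line demonstration of this internal-encoding ambiguity (the
      charset names are those used by the JDK; Cp1047 requires a JRE that
      ships the EBCDIC converters):

          import java.io.UnsupportedEncodingException;

          public class InternalEncodingDemo {
              public static void main(String[] args) throws UnsupportedEncodingException {
                  // The octet produced for '+' depends on the charset the URL
                  // producer used internally; the URL itself does not say which.
                  byte[] ascii  = "+".getBytes("8859_1"); // 0x2b -> escapes to %2b
                  byte[] ebcdic = "+".getBytes("Cp1047"); // 0x4e -> escapes to %4e
                  System.out.println(Integer.toHexString(ascii[0] & 0xff));  // 2b
                  System.out.println(Integer.toHexString(ebcdic[0] & 0xff)); // 4e
              }
          }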

      Section 2.1 of RFC 2396 says:

      " 2.1 URI and non-ASCII characters

         The relationship between URI and characters has been a source of
         confusion for characters that are not part of US-ASCII. To describe
         the relationship, it is useful to distinguish between a "character"
         (as a distinguishable semantic entity) and an "octet" (an 8-bit
         byte). There are two mappings, one from URI characters to octets, and
         a second from octets to original characters:

         URI character sequence->octet sequence->original character sequence

         A URI is represented as a sequence of characters, not as a sequence
         of octets. That is because URI might be "transported" by means that
         are not through a computer network, e.g., printed on paper, read over
         the radio, etc.

         A URI scheme may define a mapping from URI characters to octets;
         whether this is done depends on the scheme. Commonly, within a
         delimited component of a URI, a sequence of characters may be used to
         represent a sequence of octets. For example, the character "a"
         represents the octet 97 (decimal), while the character sequence "%",
         "0", "a" represents the octet 10 (decimal).

         There is a second translation for some resources: the sequence of
         octets defined by a component of the URI is subsequently used to
         represent a sequence of characters. A 'charset' defines this mapping.
         There are many charsets in use in Internet protocols. For example,
         UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
         of characters in the repertoire of ISO 10646.

         In the simplest case, the original character sequence contains only
         characters that are defined in US-ASCII, and the two levels of
         mapping are simple and easily invertible: each 'original character'
         is represented as the octet for the US-ASCII code for it, which is,
         in turn, represented as either the US-ASCII character, or else the
         "%" escape sequence for that octet.

         For original character sequences that contain non-ASCII characters,
         however, the situation is more difficult. Internet protocols that
         transmit octet sequences intended to represent character sequences
         are expected to provide some way of identifying the charset used, if
         there might be more than one [RFC2277]. However, there is currently
         no provision within the generic URI syntax to accomplish this
         identification. An individual URI scheme may require a single
         charset, define a default charset, or provide a way to indicate the
         charset used.

         It is expected that a systematic treatment of character encoding
         within URI will be developed as a future modification of this
         specification.

      "

      Of special significance to our issue is:

       "Internet protocols that
         transmit octet sequences intended to represent character sequences
         are expected to provide some way of identifying the charset used, if
         there might be more than one [RFC2277]. However, there is currently
         no provision within the generic URI syntax to accomplish this
         identification"

      As I read it, this states that there is no mechanism for
      specifying the internal character encoding within a URL. This
      is obvious from the syntax definition for URLs.

      Consequently, a consumer of a URL has to somehow know what the
      internal encoding of the URL producer was. There is no protocol
      for discovering it.

      Moreover, the internal encoding is not necessarily the same as
      the document encoding (external encoding). Consider the
      following scenario:

      An OS/390 machine serves an HTML document for a client
      request. The document is in EBCDIC, and the URLs within the
      document have been internally encoded using the EBCDIC
      character encoding. However, the client requests the document
      in ASCII. Consequently, the server auto-translates the
      document from EBCDIC to ASCII, and the document is transmitted
      back to the client in US-ASCII encoding. However, the URLs
      within the document were internally encoded in EBCDIC, and to
      decode them, the client needs to URL-decode the (ASCII-encoded)
      octets making up the URL, then run these decoded octets
      backwards through EBCDIC to get the original characters of the
      URL. There is no way for the client to tell that the URLs
      were internally encoded in EBCDIC, however, so it is going to
      use ASCII internally (its default encoding) and get some very
      funny-looking URLs.

      Note that the above is not a hypothetical situation. It is exactly
      how http servers and proxies are supposed to behave in the
      presence of requests and documents in different encodings.

      Now, as far as I am concerned, there are two sensible approaches
      to the issue of internal URL encoding/decoding:

      1) Always use the default platform encoding/decoding.
      2) Always use some standard character encoding/decoding.

      Let's examine these in turn:

      1) URLs generated and consumed on the same machine, or on
      machines with the same default encoding, will work OK. URLs
      generated and consumed on machines with disparate default
      encodings will not work. Older internet utility programs (mail,
      http clients, etc.) won't work unless the default encoding is
      US-ASCII, because they won't know to think about anything else.

      2) The obvious choice for a uniform encoding is US-ASCII, since
      it is the historical choice. Now, if all machines use the
      same encoding, then URLs will work across all machines,
      regardless of their default encoding. If we choose US-ASCII
      as the standard, then older internet programs will also
      work fine. However, non-ASCII characters cannot be handled.

      However, as far as I know, there is a better choice. UTF-8 can
      be used as a transparent extension to US-ASCII. Since all
      ASCII characters are identically mapped in UTF-8, it is
      backwards compatible with old programs, and it can also
      represent arbitrary Unicode characters, allowing URLs to
      contain any Unicode character.

      Hence, I would recommend, in the interests of maximum
      interoperability and minimum interference with existing
      systems, that URLEncoder.encode() use UTF-8 as its internal
      encoding on all platforms.
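
      As an editorial footnote to this recommendation: the two-argument
      encode overload that later shipped in JDK 1.4 lets the caller name
      the charset explicitly, so UTF-8 can be requested regardless of the
      platform default:

          import java.io.UnsupportedEncodingException;
          import java.net.URLEncoder;

          public class Utf8Encode {
              public static void main(String[] args) throws UnsupportedEncodingException {
                  // ASCII characters escape to the same %xx values as before ...
                  System.out.println(URLEncoder.encode("abc+  def", "UTF-8")); // abc%2B++def
                  // ... and non-ASCII characters become multi-octet UTF-8 escapes.
                  System.out.println(URLEncoder.encode("caf\u00e9", "UTF-8")); // caf%C3%A9
              }
          }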



      mick.fleming@Ireland 1998-12-10

            Assignee: mmcclosksunw Michael Mccloskey (Inactive)
            Reporter: miflemi Mick Fleming