Type: Bug
Resolution: Fixed
Priority: P3
Affects Version: 1.1.4
Fix Version: 1.2beta4
CPU: sparc
OS: solaris_2.5.1
Verification: Not verified
Name: mf23781 Date: 12/18/97
The URLEncoder.encode() method converts a String to its URL-encoded
form:
o regular alphanumeric characters are not changed;
o non-alphanumeric characters are converted to %xx, where xx is the
  ASCII hexadecimal value of the character;
o spaces are converted to '+'.
Hence, "abc+  def" (with two spaces) becomes "abc%2b++def".
URL-encoded characters are drawn from a portable subset of ASCII,
intended for use in URLs so that they can be handled correctly
by computers around the world. (Ref: Java 1.1 Developer's Handbook)
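For illustration, a minimal sketch of calling the method on an ASCII
platform (the class name is ours; the case of the hex digits may vary
by JDK level):
    import java.net.URLEncoder;

    public class EncodeDemo {
        public static void main(String[] args) {
            // '+' becomes %2b and each space becomes '+', so this
            // should print "abc%2b++def" on an ASCII platform.
            System.out.println(URLEncoder.encode("abc+  def"));
        }
    }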
Problem 1:
As each character is handled (whether converted to hex or not),
it is written to the ByteArrayOutputStream (out) with the out.write
method. Once the whole string has been written to the stream, the
stream is converted to a String with the ByteArrayOutputStream.toString
method:
    return out.toString();
The no-argument toString method interprets the stream's bytes
according to the platform's default local encoding. This is incorrect:
the stream does not contain data in the default local encoding, but
ASCII characters and hex escapes produced by out.write. The result is
that the encoded string returned from this method is garbage on any
platform whose default encoding is not ASCII-compatible.
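The pitfall can be reproduced in isolation; a minimal sketch (ours,
not the JDK source) showing the same bytes read back with the default
encoding versus an explicit one:
    import java.io.ByteArrayOutputStream;
    import java.io.UnsupportedEncodingException;

    public class ToStringPitfall {
        public static void main(String[] args) throws UnsupportedEncodingException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write('%');   // the ASCII bytes of the escape "%2b"
            out.write('2');
            out.write('b');
            // Interpreted via the platform default (e.g. Cp1047 on
            // OS/390), these ASCII bytes decode to unrelated characters:
            String viaDefault = out.toString();
            // Interpreted explicitly as ISO 8859-1, they decode correctly:
            String viaLatin1 = out.toString("8859_1");
            System.out.println(viaDefault + " vs " + viaLatin1);
        }
    }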
Solution:
Explicitly specify the encoding of the stream, rather than letting it
default to the platform's local encoding:
    try {
        // Name the encoding explicitly; the bytes in the stream are
        // ASCII regardless of the platform default.
        return out.toString("8859_1");
    } catch (UnsupportedEncodingException e) {
        System.out.println("Exception: " + e);
        return null;
    }
Once Problem 1 is corrected, the output is:
    OS/390: abc%4e++def   (%4e is the EBCDIC value of '+')
    AIX:    abc%2b++def   (%2b is the ASCII value of '+')
Problem 2:
The above output is still incorrect on any non-ASCII platform. The
purpose of URLEncoder.encode() is to convert non-alphanumeric
characters to their ASCII hexadecimal values, but on OS/390 we get
EBCDIC hexadecimal values. The documentation for URLEncoder.encode()
states that it "converts to the external encoding before hex
conversion", so we end up with the local encoding's hex value of
each character.
Solution:
Explicitly specify the encoding of the writer, rather than letting it
default to the platform's local encoding:
    OutputStreamWriter writer = null;
    try {
        // Name the encoding explicitly so each char is written to the
        // buffer as its ASCII byte, not its local-encoding byte.
        writer = new OutputStreamWriter(buf, "8859_1");
    } catch (UnsupportedEncodingException e) {
        System.out.println("Exception: " + e);
    }
This will ensure that an ASCII character is written to the buffer and
subsequently converted into hex.
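Putting both fixes together, a minimal sketch (ours, not the actual
JDK source) of an encode() that names the encoding at both conversion
points:
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    public class FixedURLEncoder {
        public static String encode(String s) {
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                // Fix 2: chars are converted to ASCII bytes, not local ones.
                OutputStreamWriter writer = new OutputStreamWriter(buf, "8859_1");
                for (int i = 0; i < s.length(); i++) {
                    char c = s.charAt(i);
                    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                            || (c >= '0' && c <= '9')) {
                        out.write(c);        // alphanumerics pass through
                    } else if (c == ' ') {
                        out.write('+');      // spaces become '+'
                    } else {
                        writer.write(c);     // anything else: char -> ASCII byte(s)
                        writer.flush();
                        byte[] bytes = buf.toByteArray();
                        for (int j = 0; j < bytes.length; j++) {
                            String hex = Integer.toHexString(bytes[j] & 0xff);
                            out.write('%');
                            if (hex.length() == 1) out.write('0');
                            for (int k = 0; k < hex.length(); k++) {
                                out.write(hex.charAt(k));
                            }
                        }
                        buf.reset();
                    }
                }
                // Fix 1: the ASCII bytes are read back as ASCII.
                return out.toString("8859_1");
            } catch (IOException e) {
                return null;   // should not happen: 8859_1 is always available
            }
        }
    }
With both fixes applied, encode("abc+  def") yields "abc%2b++def" on
either platform.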
Circumvention:
Temporarily change the default local encoding to ASCII around the
call to URLEncoder.encode(), e.g.:
    Properties prop = System.getProperties();
    prop.put("file.encoding", "8859_1");   // pretend to be ASCII
    result = URLEncoder.encode(s);
    prop.put("file.encoding", "Cp1047");   // reset to the local encoding
======================================================================
A licensee has this to add:
I have had a look at various of the RFCs concerning
HTTP (RFC 2068),
URIs (RFC 2396, http://www.ics.uci.edu/pub/ietf/uri/)
and HTML (http://www.w3.org/MarkUp/). While all of these were
first designed to use only the US-ASCII character encoding, there
are moves afoot to extend the standards to encompass different
encodings. Specifically, an HTTP server can specify the
character encoding of the message content, and an HTTP client
can request content in a particular character encoding from
the server.
However, URL encoding is more complicated, since there are the
issues of both the external encoding, which is the same as the
content encoding of an HTTP message, for example, and the
internal encoding, which is the translation of characters to
octets (bytes) PRIOR to URL-encoding of the octets.
Section 2.1 of RFC 2396 says:
" 2.1 URI and non-ASCII characters
The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:
URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.
A URI scheme may define a mapping from URI characters to octets;
whether this is done depends on the scheme. Commonly, within a
delimited component of a URI, a sequence of characters may be used to
represent a sequence of octets. For example, the character "a"
represents the octet 97 (decimal), while the character sequence "%",
"0", "a" represents the octet 10 (decimal).
There is a second translation for some resources: the sequence of
octets defined by a component of the URI is subsequently used to
represent a sequence of characters. A 'charset' defines this mapping.
There are many charsets in use in Internet protocols. For example,
UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
of characters in the repertoire of ISO 10646.
In the simplest case, the original character sequence contains only
characters that are defined in US-ASCII, and the two levels of
mapping are simple and easily invertible: each 'original character'
is represented as the octet for the US-ASCII code for it, which is,
in turn, represented as either the US-ASCII character, or else the
"%" escape sequence for that octet.
For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, if
there might be more than one [RFC2277]. However, there is currently
no provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single
charset, define a default charset, or provide a way to indicate the
charset used.
It is expected that a systematic treatment of character encoding
within URI will be developed as a future modification of this
specification.
"
Of special significance to our issue is:
"Internet protocols that
transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, if
there might be more than one [RFC2277]. However, there is currently
no provision within the generic URI syntax to accomplish this
identification"
As I read it, this states that there is no mechanism for
specifying the internal character encoding within a URL. This
is obvious from the syntax definition for URLs.
Consequently, a consumer of a URL has to somehow know what the
internal encoding of the URL producer was. There is no protocol
for discovering it.
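The two mappings can be made concrete; a sketch (a hypothetical
helper, not a JDK API) that percent-decodes URI characters to octets,
then maps the octets to characters with a charset the consumer can
only guess, since the URI itself cannot name it:
    import java.io.ByteArrayOutputStream;
    import java.io.UnsupportedEncodingException;

    public class TwoStepDecode {
        // Assumes well-formed input: every '%' is followed by two hex digits.
        public static String decode(String uriChars, String guessedCharset)
                throws UnsupportedEncodingException {
            ByteArrayOutputStream octets = new ByteArrayOutputStream();
            for (int i = 0; i < uriChars.length(); i++) {
                char c = uriChars.charAt(i);
                if (c == '%') {
                    // mapping 1: the URI characters "%xx" -> one octet
                    octets.write(Integer.parseInt(uriChars.substring(i + 1, i + 3), 16));
                    i += 2;
                } else {
                    // a plain URI character is its own US-ASCII octet
                    octets.write(c);
                }
            }
            // mapping 2: octet sequence -> original characters, via the charset
            return octets.toString(guessedCharset);
        }
    }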
Moreover, the internal encoding is not necessarily the same as
the document encoding (external encoding). Consider the
following scenario:
An OS/390 machine serves an HTML document for a client
request. The document is in EBCDIC, and the URLs within the
document have been internally encoded using the EBCDIC
character encoding. However, the client requests the document
in ASCII. Consequently, the server auto-translates the
document from EBCDIC to ASCII, and the document is transmitted
back to the client in US-ASCII encoding. However, the URLs
within the document were internally encoded in EBCDIC, and to
decode them, the client needs to URL-decode the (ASCII-encoded)
octets making up the URL, then run these decoded octets
backwards through EBCDIC to recover the original characters of
the URL. There is no way for the client to tell that the URLs
were internally encoded in EBCDIC, however, so it is going to
use ASCII internally (its default encoding) and get some very
funny-looking URLs.
Note that the above is not a hypothetical situation. It is exactly
how HTTP servers and proxies are supposed to behave in the
presence of requests and documents in different encodings.
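A sketch of the character-level mismatch in this scenario (assuming
both encodings are installed): the producer maps characters to octets
with Cp1047 (EBCDIC), while the consumer maps the octets back with
8859_1:
    import java.io.UnsupportedEncodingException;

    public class InternalEncodingMismatch {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String original = "abc+";
            byte[] producerOctets = original.getBytes("Cp1047");        // EBCDIC octets
            String consumerView = new String(producerOctets, "8859_1"); // read as ASCII
            // consumerView is not "abc+": EBCDIC 'a' (0x81) is a Latin-1
            // control character, and EBCDIC '+' (0x4e) reads back as 'N'.
            System.out.println(original + " -> " + consumerView);
        }
    }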
Now, as far as I am concerned, there are two sensible approaches
to the issue of internal URL encoding/decoding:
1) Always use the default platform encoding/decoding.
2) Always use some standard character encoding/decoding.
Let's examine these in turn:
1) URLs generated and consumed on the same machine, or on
machines with the same default encoding, will work OK. URLs
generated and consumed on machines with disparate default
encodings will not work. Older internet utility programs (mail,
HTTP clients, etc.) won't work unless the default encoding is
US-ASCII, because they won't know to consider anything else.
2) The obvious choice for a uniform encoding is US-ASCII, since
it is the historical choice. If all machines use the same
encoding, then URLs will work across all machines, regardless
of their default encoding. If we settle on US-ASCII, then older
internet programs will also work fine. However, non-ASCII
characters cannot be handled.
However, there is a better choice, as far as I know. UTF-8 can
be used as a transparent extension to US-ASCII. Since all
ASCII characters are identically mapped in UTF-8, it is
backwards compatible with old programs, and it can also
represent arbitrary Unicode characters, allowing URLs to
contain any Unicode character.
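A minimal sketch of what such a UTF-8-based encoder might look like
(illustrative only; the class name and structure are ours, not a
proposed JDK API):
    import java.io.UnsupportedEncodingException;

    public class Utf8URLEncoder {
        public static String encode(String s) throws UnsupportedEncodingException {
            StringBuffer result = new StringBuffer();
            byte[] octets = s.getBytes("UTF8");   // chars -> octets, always UTF-8
            for (int i = 0; i < octets.length; i++) {
                int b = octets[i] & 0xff;
                if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                        || (b >= '0' && b <= '9')) {
                    result.append((char) b);      // ASCII alphanumerics pass through
                } else if (b == ' ') {
                    result.append('+');           // spaces become '+'
                } else {
                    result.append('%');
                    String hex = Integer.toHexString(b);
                    if (hex.length() == 1) result.append('0');
                    result.append(hex);           // each octet becomes %xx
                }
            }
            return result.toString();
        }
    }
ASCII input encodes exactly as before, while a character such as
U+00E9 ('e' with an acute accent) becomes the two octets %c3%a9.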
Hence, in the interests of maximum interoperability and minimum
interference with existing systems, I would recommend that
URLEncoder.encode() use UTF-8 as its internal encoding on all
platforms.
mick.fleming@Ireland 1998-12-10
Relates to:
    JDK-4146652 - java.net.URLEncoder.java needs to support more forms of encode method (Closed)