- Bug
- Resolution: Fixed
- P4
- 1.2.0
- hopper
- x86
- windows_nt
- Verified
Name: dbT83986 Date: 03/01/99
The apparent case sensitivity of character set encoding names
is very annoying, and possibly incorrect. It is claimed that
the accepted names are taken from the IANA charset names, but the
IANA says that names are case insensitive (second paragraph of
http://www.isi.edu/in-notes/iana/assignments/character-sets).
Furthermore, it isn't clear which aliases for the various character
sets are supported. For example, the document referenced above
gives the following entry for the ASCII charset:
Name: ANSI_X3.4-1968
...
Alias: iso-ir-6
Alias: ANSI_X3.4-1986
Alias: ISO_646.irv:1991
Alias: ASCII
Alias: ISO646-US
Alias: US-ASCII (preferred MIME name)
Alias: us
Alias: IBM367
Alias: cp367
Alias: csASCII
But my tests show that only "ASCII" (but *not* "ascii") and "US-ASCII" (where "us-ascii" and even "US-ascii" *are* recognized) can be used to refer to the ASCII character set.
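On later Java releases that include the java.nio.charset package, the aliases a given runtime accepts can be listed directly. This is an assumption about the environment: that API postdates Java 1.2 and is not available on the release tested in this report. The sketch below is only an illustration, and the class name AliasLister and its two helper methods are made up for the example:

```java
import java.nio.charset.Charset;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the original report.
// Assumes a runtime that provides java.nio.charset.
public class AliasLister
{
    // Returns the canonical name the runtime reports for a charset name or alias.
    static String canonicalName( String name )
    {
        return Charset.forName( name ).name();
    }

    // Returns the aliases the runtime registers for the charset,
    // sorted so that the output order is stable.
    static Set<String> aliasesOf( String name )
    {
        return new TreeSet<String>( Charset.forName( name ).aliases() );
    }

    public static void main( String[] args )
    {
        System.out.println( "Canonical name: " + canonicalName( "us-ascii" ) );
        for ( String alias : aliasesOf( "US-ASCII" ) )
        {
            System.out.println( "  alias: " + alias );
        }
    }
}
```

The set of aliases printed depends on the runtime, so it is worth comparing against the IANA entry quoted above rather than assuming every alias is present.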
Attached is a small program that tries various encoding names
and prints whether each one is recognized. It tries various
mixed-case renderings and aliases for the US-ASCII, Unicode,
Big5, and Cp1252 encodings. On my machine running Java 1.2,
I get the following output:
Encoding "ANSI_X3.4-1968" NOT recognized
Encoding "iso-ir-6" NOT recognized
Encoding "ANSI_X3.4-1986" NOT recognized
Encoding "ISO_646.irv:1991" NOT recognized
Encoding "ASCII" recognized
Encoding "ascii" NOT recognized
Encoding "Ascii" NOT recognized
Encoding "ISO646-US" NOT recognized
Encoding "US-ASCII" recognized
Encoding "us-ascii" recognized
Encoding "US-Ascii" recognized
Encoding "us" NOT recognized
Encoding "IBM367" NOT recognized
Encoding "cp367" NOT recognized
Encoding "csASCII" NOT recognized
Encoding "Unicode" recognized
Encoding "UNICODE" NOT recognized
Encoding "unicode" NOT recognized
Encoding "Big5" recognized
Encoding "big5" recognized
Encoding "bIg5" recognized
Encoding "biG5" recognized
Encoding "bIG5" recognized
Encoding "Cp1252" recognized
Encoding "cp1252" NOT recognized
Encoding "CP1252" NOT recognized
The primary reason I find this annoying is that it gets in the way
when dealing with Transferables with the "text/plain" MIME type. I
want to be able to create an InputStreamReader using the encoding
described by the charset parameter of the MIME type. On my
machine these always come back in lower case, so I get encoding
names such as "ascii" and "unicode". Passing these strings to
the InputStreamReader constructor results in an exception, so I
have to change the strings to "ASCII" and "Unicode", and there
doesn't seem to be an easy way to know, in general, which letters
need to be capitalized to make the encoding name acceptable.
import java.io.*;

public class EncodingsTest
{
    public static final String ENCODE_STRING = "Encode me";

    public static void main( String[] args )
    {
        // Try various encoding names in mixed cases

        // Various forms of US-ASCII
        tryToEncode( "ANSI_X3.4-1968" );
        tryToEncode( "iso-ir-6" );
        tryToEncode( "ANSI_X3.4-1986" );
        tryToEncode( "ISO_646.irv:1991" );
        tryToEncode( "ASCII" );
        tryToEncode( "ascii" );
        tryToEncode( "Ascii" );
        tryToEncode( "ISO646-US" );
        tryToEncode( "US-ASCII" );
        tryToEncode( "us-ascii" );
        tryToEncode( "US-Ascii" );
        tryToEncode( "us" );
        tryToEncode( "IBM367" );
        tryToEncode( "cp367" );
        tryToEncode( "csASCII" );

        // Variants on Unicode
        tryToEncode( "Unicode" );
        tryToEncode( "UNICODE" );
        tryToEncode( "unicode" );

        // Variants on Big5
        tryToEncode( "Big5" );
        tryToEncode( "big5" );
        tryToEncode( "bIg5" );
        tryToEncode( "biG5" );
        tryToEncode( "bIG5" );

        // Variants of Cp1252
        tryToEncode( "Cp1252" );
        tryToEncode( "cp1252" );
        tryToEncode( "CP1252" );
    }

    // Reports whether String.getBytes accepts the given encoding name.
    public static void tryToEncode( String encoding )
    {
        try
        {
            // The encoded bytes are not needed; only success or failure matters.
            ENCODE_STRING.getBytes( encoding );
            System.out.println( "Encoding \"" + encoding + "\" recognized" );
        }
        catch( UnsupportedEncodingException e )
        {
            System.out.println( "Encoding \"" + encoding + "\" NOT recognized" );
        }
    }
}
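For the "text/plain" Transferable case described above, one possible workaround is a case-insensitive lookup table that maps whatever spelling arrives in the charset parameter to a spelling the runtime is known to accept. The sketch below is only an illustration: the class name EncodingNames and the method normalize are made up, the table contains just the spellings reported as recognized in the output above, and the code is written in current Java syntax rather than what Java 1.2 supports.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical workaround, not part of the original report.
public class EncodingNames
{
    // Lower-cased name -> spelling reported as recognized in the test output.
    private static final Map<String, String> ACCEPTED = new HashMap<String, String>();

    static
    {
        String[] known = { "ASCII", "US-ASCII", "Unicode", "Big5", "Cp1252" };
        for ( String name : known )
        {
            ACCEPTED.put( name.toLowerCase( Locale.ENGLISH ), name );
        }
    }

    // Maps a charset name in any letter case to an accepted spelling.
    // Names that are not in the table are returned unchanged.
    public static String normalize( String charset )
    {
        String known = ACCEPTED.get( charset.toLowerCase( Locale.ENGLISH ) );
        return known != null ? known : charset;
    }

    public static void main( String[] args )
    {
        System.out.println( normalize( "ascii" ) );   // prints "ASCII"
        System.out.println( normalize( "unicode" ) ); // prints "Unicode"
        System.out.println( normalize( "CP1252" ) );  // prints "Cp1252"
    }
}
```

The normalized name could then be passed to the InputStreamReader constructor in place of the raw charset parameter. A table like this only covers names someone has already tested, so it does not address the unsupported aliases.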
(Review ID: 54372)
======================================================================