Name: auR10023 Date: 02/20/2003
java.nio.charset.CharsetEncoder.isLegalReplacement(byte[]) returns true
for an unmappable byte sequence in UTF-16, UTF-16BE, and UTF-16LE. The Javadoc
for this method says:
...
Tells whether or not the given byte array is a legal replacement value for this encoder.
A replacement is legal if, and only if, it is a legal sequence of bytes in this encoder's charset; that is, it must be possible to decode the replacement into one or more sixteen-bit Unicode characters.
...
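The legality rule quoted above can be approximated by attempting a strict decode. The following is a minimal sketch, not necessarily how the JDK implements isLegalReplacement; the class name DecodableCheck and the method isDecodable are hypothetical names used only for illustration. It configures a CharsetDecoder to report errors instead of replacing them and treats any CharacterCodingException as "not decodable".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: reports whether a byte sequence decodes cleanly
// in the named charset. This is an illustration, not the JDK's own check.
class DecodableCheck {
    static boolean isDecodable(String charsetName, byte[] bytes) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   // Fail on bad input instead of substituting a replacement
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input was reported by the decoder
            return false;
        }
    }

    public static void main(String[] args) {
        // 0x0041 ('A') in big-endian byte order
        System.out.println(isDecodable("UTF-16BE", new byte[] { 0, 0x41 }));
        // A low surrogate (0xDC00) with no preceding high surrogate
        System.out.println(isDecodable("UTF-16BE", new byte[] { (byte) 0xdc, 0 }));
    }
}
```

Under this sketch, a replacement would count as legal only when isDecodable returns true for the encoder's charset.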
RFC 2781 describes the decoding process as follows:
...
Decoding of a single character from UTF-16 to an ISO 10646 character
value proceeds as follows. Let W1 be the next 16-bit integer in the
sequence of integers representing the text. Let W2 be the (eventual)
next integer following W1.
1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
of W1. Terminate.
2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
is in error and no valid character can be obtained using W1.
Terminate.
3) If there is no W2 (that is, the sequence ends with W1), or if W2
is not between 0xDC00 and 0xDFFF, the sequence is in error.
Terminate.
4) Construct a 20-bit unsigned integer U', taking the 10 low-order
bits of W1 as its 10 high-order bits and the 10 low-order bits of
W2 as its 10 low-order bits.
...
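The quoted RFC 2781 steps can be sketched in Java as follows. This is an illustrative sketch only: the class name Utf16DecodeSketch, the method decodeFirst, and the use of -1 as an error value are assumptions made for the example. The final addition of 0x10000 comes from the RFC's step 5, which lies beyond the excerpt quoted above.

```java
// Hypothetical sketch of the RFC 2781 single-character decoding steps.
class Utf16DecodeSketch {
    /**
     * Decodes the first character from a sequence of 16-bit integers.
     * Returns the character value U, or -1 if the sequence is in error.
     */
    static int decodeFirst(int[] units) {
        if (units.length == 0) {
            return -1; // nothing to decode (assumption: treated as an error)
        }
        int w1 = units[0];

        // Step 1: values outside the surrogate range stand for themselves
        if (w1 < 0xD800 || w1 > 0xDFFF) {
            return w1;
        }
        // Step 2: W1 must be a high surrogate (0xD800..0xDBFF)
        if (w1 > 0xDBFF) {
            return -1;
        }
        // Step 3: W2 must exist and be a low surrogate (0xDC00..0xDFFF)
        if (units.length < 2) {
            return -1;
        }
        int w2 = units[1];
        if (w2 < 0xDC00 || w2 > 0xDFFF) {
            return -1;
        }
        // Step 4: combine the 10 low-order bits of W1 and W2 into U'
        int uPrime = ((w1 & 0x3FF) << 10) | (w2 & 0x3FF);

        // Step 5 (from the RFC, not in the excerpt above): U = U' + 0x10000
        return uPrime + 0x10000;
    }

    public static void main(String[] args) {
        System.out.println(decodeFirst(new int[] { 0x0041 }));         // prints 65
        System.out.println(decodeFirst(new int[] { 0xD83D, 0xDE00 })); // prints 128512 (0x1F600)
        System.out.println(decodeFirst(new int[] { 0xDC00 }));         // prints -1
    }
}
```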
Here is the example:
------------- test.java --------------
import java.nio.charset.*;

public class test {

    // Each row: charset name, then the byte sequence passed to isLegalReplacement
    static Object[][] bSeqs = new Object[][] {
        {"UTF-16",   new byte[] { (byte)0xd8, 0, (byte)0xdc, 0 }},
        {"UTF-16BE", new byte[] { (byte)0xd8, 0, (byte)0xdc, 0 }},
        {"UTF-16LE", new byte[] { 0, (byte)0xd8, 0, (byte)0xdc }}
    };

    public static void main(String[] args) {
        CharsetEncoder en = null;

        for (int i = 0; i < bSeqs.length; i++) {
            String chrName = (String) bSeqs[i][0];
            try {
                en = Charset.forName(chrName).newEncoder();
            } catch (IllegalArgumentException e) {
                // Charset name is illegal or the charset is unsupported
                e.printStackTrace(System.out);
                return;
            }

            byte[] bArray = (byte[]) bSeqs[i][1];
            // Report every charset for which the sequence is accepted
            if (en.isLegalReplacement(bArray)) {
                System.out.println("isLegalReplacement(byte[] repl) should " +
                                   "return false with " + chrName);
            }
        }
    }
}
#java -version
java version "1.4.2-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-beta-b16)
Java HotSpot(TM) Client VM (build 1.4.2-beta-b16, mixed mode)
#java test
isLegalReplacement(byte[] repl) should return false with UTF-16
isLegalReplacement(byte[] repl) should return false with UTF-16BE
isLegalReplacement(byte[] repl) should return false with UTF-16LE
======================================================================