-
Bug
-
Resolution: Won't Fix
-
P4
-
generic
-
generic
The attached reproducer demonstrates the issue. The test converts two byte sequences from Windows-31J to UTF_16BE; the output is shown below:
Windows-31J : 81 E8 81 E8
UTF_16BE : 22 2C 22 2C
Windows-31J : 81 E8 81 E9 81 E8
UTF_16BE : 22 2C FF FD 9A 55 FF FD
The first sequence consists of two identical characters (“multiple integral”, U+222C).
This character appears in the code chart at
https://en.wikipedia.org/wiki/JIS_X_0208#Character_set_0x22_(row_number_2,_special_characters)
Its position is 2-74. The sequence is converted to 222C 222C, which is the expected result.
The second sequence consists of three characters at positions 2-74, 2-75, 2-74; that is, an “empty cell” (position 2-75) is inserted between the two characters. One option would be to convert the whole empty cell to a single replacement character (FFFD), so that the sequence is converted to 222C FFFD 222C. In the current behavior, however, only the first byte of the empty cell (81) is converted to FFFD. The following bytes E9 81 are then decoded as 9A55 and the trailing E8 as FFFD, so the sequence is converted to 222C FFFD 9A55 FFFD.
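The conversion described above can be approximated with a few lines of Java. This is only a minimal sketch, not the attached reproducer: the class name `Windows31JDecode` and the helpers `convert` and `toHex` are made up for illustration, and it assumes the runtime provides the `windows-31j` charset and that `new String(byte[], Charset)` substitutes FFFD for invalid input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Windows31JDecode {
    // Format bytes as space-separated upper-case hex, e.g. "22 2C".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode Windows-31J bytes (invalid input is replaced with U+FFFD),
    // then re-encode the resulting string as UTF-16BE and return it as hex.
    static String convert(byte[] windows31j) {
        String decoded = new String(windows31j, Charset.forName("windows-31j"));
        return toHex(decoded.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        byte[] twoChars = {(byte) 0x81, (byte) 0xE8, (byte) 0x81, (byte) 0xE8};
        byte[] withEmptyCell = {(byte) 0x81, (byte) 0xE8, (byte) 0x81, (byte) 0xE9,
                                (byte) 0x81, (byte) 0xE8};
        for (byte[] input : new byte[][] {twoChars, withEmptyCell}) {
            System.out.println("Windows-31J : " + toHex(input));
            System.out.println("UTF_16BE : " + convert(input));
        }
    }
}
```

Comparing the second `UTF_16BE` line against 222C FFFD 222C shows whether the empty cell was consumed as one unit or only its first byte was replaced.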
After digging into the source code, my understanding is that the current behavior was introduced as part of the patch for https://bugs.openjdk.java.net/browse/JDK-8008386
The specific change is in DoubleByte.java (http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/3b00bf85a6f5#l1.43). The fallback logic treats only the first byte as invalid if one of the following conditions is met: 1) the first byte is not a leading byte, 2) the second byte is a leading byte, 3) the second byte can be decoded as a single-byte character.
In the scenario above (with the empty cell), the second byte (E9) is a valid leading byte, and hence only the first byte is replaced with FFFD. It might make sense to slightly relax this check by dropping condition 2), so that the empty cell is treated as an invalid double-byte sequence.
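The three conditions and the proposed relaxation can be sketched as a small decision function. This is an illustrative model only, not the code in DoubleByte.java: the class `FallbackSketch`, the method `invalidLength`, and its boolean parameters are hypothetical stand-ins for the decoder's internal table lookups.

```java
public class FallbackSketch {
    /**
     * Returns how many bytes are treated as invalid when a two-byte
     * lookup fails: 1 means only the first byte is replaced, 2 means
     * the whole pair is replaced.
     *
     * @param firstIsLead    the first byte is a valid leading byte
     * @param secondIsLead   the second byte is a valid leading byte
     * @param secondIsSingle the second byte decodes as a single-byte character
     * @param relaxed        if true, condition 2) is skipped (the proposed change)
     */
    static int invalidLength(boolean firstIsLead, boolean secondIsLead,
                             boolean secondIsSingle, boolean relaxed) {
        if (!firstIsLead) {
            return 1;               // condition 1)
        }
        if (!relaxed && secondIsLead) {
            return 1;               // condition 2)
        }
        if (secondIsSingle) {
            return 1;               // condition 3)
        }
        return 2;                   // treat the pair as one invalid double-byte unit
    }

    public static void main(String[] args) {
        // Empty cell 81 E9: 81 is a leading byte, E9 is a leading byte,
        // and E9 is assumed not to decode as a single-byte character.
        System.out.println("current: " + invalidLength(true, true, false, false));
        System.out.println("relaxed: " + invalidLength(true, true, false, true));
    }
}
```

Under these assumptions, the current check consumes one byte for the empty cell, while the relaxed check consumes both, which corresponds to the 222C FFFD 222C result suggested above.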