Enhancement
Resolution: Unresolved
P4
None
6u23
x86
windows_xp
A DESCRIPTION OF THE REQUEST :
As described in http://www.ogcio.gov.hk/ccli/eng/hkscs/download/big5cmp.txt, 84 GCCS (HK Government Common Character Set) characters were not included in HKSCS because they had already been unified with standard Unicode characters by 1999.
"MS950_HKSCS" currently decodes the Big5 codes of these characters to PUA (Private Use Area) code points, and "Big5_HKSCS" decodes them as \uFFFD. The mappings should be revised as follows to align with the unification:
Decode Mapping:
Big5_Code Now-Decoded-As New-Decoded-As
x8E69 uE33A u7BB8
x8E6F uE340 u7C06
x8E7E uE34F u7CCE
x8EAB uE35A u7DD2
x8EB4 uE363 u7E1D
x8ECD uE37C u8005
x8ED0 uE37F u8028
x8F57 uE3C5 u83C1
x8F69 uE3D7 u84A8
x8F6E uE3DC u840F
x8FCB uE417 u89A6
x8FCC uE418 u89A9
x8FFE uE44A u8D77
x906D uE478 u90FD
x907A uE485 u92B9
x90DC uE4C5 u975C
x90F1 uE4DA u97FF
x91BF uE545 u9F16
x9244 uE589 u8503
x92AF uE5D2 u5159
x92B0 uE5D3 u515B
x92B1 uE5D4 u515D
x92B2 uE5D5 u515E
x92C8 uE5EB u936E
x92D1 uE5F4 u7479
x9447 uE6C6 u6D67
x94CA uE727 u799B
x95D9 uE7D3 u9097
x9644 uE7FD u975D
x96ED uE884 u701E
x96FC uE893 u5B28
x9B76 uEB40 u7201
x9B78 uEB42 u77D7
x9B7B uEB45 u7E87
x9BC6 uEB6E u99D6
x9BDE uEB86 u91D4
x9BEC uEB94 u60DE
x9BF6 uEB9E u6FB6
x9C42 uEBA9 u8F36
x9C53 uEBBA u4FBB
x9C62 uEBC9 u71DF
x9C68 uEBCF u9104
x9C6B uEBD2 u9DF0
x9C77 uEBDE u83CF
x9CBC uEC01 u5C10
x9CBD uEC02 u79E3
x9CD0 uEC15 u5A67
x9D57 uEC5B u8F0B
x9D5A uEC5E u7B51
x9DC4 uECA6 u62D0
x9EA9 uED28 u6062
x9EEF uED6E u75F9
x9EFD uED7C u6C4A
x9F60 uED9E u9B2E
x9F66 uEDA4 u9F17
x9FCB uEDE7 u50ED
x9FD8 uEDF4 u5F0C
xA063 uEE3E u880F
xA077 uEE52 u62CE
xA0D5 uEE8E u7468
xA0DF uEE98 u7162
xA0E4 uEE9D u7250
xFA5F uE01F u5029
xFA66 uE026 u507D
xFABD uE05B u5305
xFAC5 uE063 u5344
xFAD5 uE073 u537F
xFB48 uE0A5 u5605
xFBB8 uE0F3 u5A77
xFBF3 uE12E u5E75
xFBF9 uE134 u5ED0
xFC4F uE149 u5F58
xFC6C uE166 u60A4
xFCB9 uE191 u6490
xFCE2 uE1BA u6674
xFCF1 uE1C9 u675E
xFDB7 uE22C u6C9C
xFDB8 uE22D u6E1D
xFDBB uE230 u6E2F
xFDF1 uE266 u716E
xFE52 uE286 u732A
xFE6F uE2A3 u745C
xFEAA uE2BC u74E9
xFEDD uE2EF u7809
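Each row above is a whitespace-separated triple, so the table can be loaded mechanically when building a test or a workaround. The following is a minimal sketch only; the class name `MappingRow` and its field names are hypothetical and are not part of the JDK or of this request.

```java
public class MappingRow {
    public final byte[] big5;    // the two Big5 bytes, e.g. 0x8E 0x69
    public final char current;   // code point MS950_HKSCS decodes to today
    public final char proposed;  // code point this request asks for

    MappingRow(byte[] big5, char current, char proposed) {
        this.big5 = big5;
        this.current = current;
        this.proposed = proposed;
    }

    // Parse one table row such as "x8E69 uE33A u7BB8".
    public static MappingRow parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 3) {
            throw new IllegalArgumentException("expected 3 columns: " + line);
        }
        // Drop the leading 'x' or 'u' marker and read the rest as hex.
        int code = Integer.parseInt(f[0].substring(1), 16);
        byte[] bytes = {(byte) (code >> 8), (byte) code};
        char cur = (char) Integer.parseInt(f[1].substring(1), 16);
        char prop = (char) Integer.parseInt(f[2].substring(1), 16);
        return new MappingRow(bytes, cur, prop);
    }

    public static void main(String[] args) {
        MappingRow r = parse("x8E69 uE33A u7BB8");
        // Prints: bytes 8E 69, now uE33A, proposed u7BB8
        System.out.printf("bytes %02X %02X, now u%X, proposed u%X%n",
                r.big5[0] & 0xFF, r.big5[1] & 0xFF, (int) r.current, (int) r.proposed);
    }
}
```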
Encode Mapping:
Unicode Now-Encoded-As New-Encoded-As
uE33A x8E69 xBAE6
uE340 x8E6F xEDCA
uE34F x8E7E xA261
uE35A x8EAB xBAFC
uE363 x8EB4 xBFA6
uE37C x8ECD xAACC
uE37F x8ED0 xBFAE
uE3C5 x8F57 xB5D7
uE3D7 x8F69 xE3C8
uE3DC x8F6E xDB79
uE417 x8FCB xBFCC
uE418 x8FCC xA0D4
uE44A x8FFE xB05F
uE478 x906D xB3A3
uE485 x907A xF9D7
uE4C5 x90DC xC052
uE4DA x90F1 xC554
uE545 x91BF xF1E3
uE589 x9244 x9242
uE5D2 x92AF xA259
uE5D3 x92B0 xA25A
uE5D4 x92B1 xA25C
uE5D5 x92B2 xA25B
uE5EB x92C8 xA05F
uE5F4 x92D1 xE6AB
uE6C6 x9447 xD256
uE727 x94CA xE6D0
uE7D3 x95D9 xCA52
uE7FD x9644 x9CE4
uE884 x96ED x96EE
uE893 x96FC xE959
uEB40 x9B76 xEFF9
uEB42 x9B78 xC5F7
uEB45 x9B7B xF5E8
uEB6E x9BC6 xE8CD
uEB86 x9BDE xD0C0
uEB94 x9BEC xFD64
uEB9E x9BF6 xBF47
uEBA9 x9C42 xEBC9
uEBBA x9C53 xCDE7
uEBC9 x9C62 xC0E7
uEBCF x9C68 xDC52
uEBD2 x9C6B xF86D
uEBDE x9C77 xDB5D
uEC01 x9CBC xC95C
uEC02 x9CBD xAFB0
uEC15 x9CD0 xD4D1
uEC5B x9D57 xE07C
uEC5E x9D5A xB5AE
uECA6 x9DC4 xA9E4
uED28 x9EA9 xABEC
uED6E x9EEF xDECD
uED7C x9EFD xC9FC
uED9E x9F60 xF9C4
uEDA4 x9F66 x91BE
uEDE7 x9FCB xB9B0
uEDF4 x9FD8 x9361
uEE3E xA063 x8FB6
uEE52 xA077 xA9F0
uEE8E xA0D5 x947A
uEE98 xA0DF xDE72
uEE9D xA0E4 x9455
uE01F xFA5F xADC5
uE026 xFA66 xB0B0
uE05B xFABD xA55D
uE063 xFAC5 xA2CD
uE073 xFAD5 xADEB
uE0A5 xFB48 x9DEF
uE0F3 xFBB8 xB440
uE12E xFBF3 xC9DB
uE134 xFBF9 x9DFB
uE149 xFC4F xD8F4
uE166 xFC6C xA0DC
uE191 xFCB9 xBCB5
uE1BA xFCE2 xB4B8
uE1C9 xFCF1 xA7FB
uE22C xFDB7 xCB58
uE22D xFDB8 xB4FC
uE230 xFDBB xB4E4
uE266 xFDF1 xB54E
uE286 xFE52 x9975
uE2A3 xFE6F xB7EC
uE2BC xFEAA xA260
uE2EF xFEDD xCFF1
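Until the charsets themselves are changed, an application could apply the unification as a post-processing step after decoding with "MS950_HKSCS". The sketch below is an illustration under that assumption, not the proposed JDK change; the class `GccsUnifier` and the method `unify` are hypothetical names, and only four of the 84 rows are copied in.

```java
import java.util.HashMap;
import java.util.Map;

public class GccsUnifier {
    // PUA code point -> unified code point, copied from the decode table.
    // A complete workaround would list all 84 rows.
    private static final Map<Character, Character> PUA_TO_UNIFIED =
            new HashMap<Character, Character>();
    static {
        PUA_TO_UNIFIED.put((char) 0xE33A, (char) 0x7BB8); // Big5 x8E69
        PUA_TO_UNIFIED.put((char) 0xE340, (char) 0x7C06); // Big5 x8E6F
        PUA_TO_UNIFIED.put((char) 0xE34F, (char) 0x7CCE); // Big5 x8E7E
        PUA_TO_UNIFIED.put((char) 0xE01F, (char) 0x5029); // Big5 xFA5F
    }

    // Replace every listed PUA character with its unified counterpart;
    // all other characters pass through unchanged.
    public static String unify(String decoded) {
        StringBuilder out = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            Character unified = PUA_TO_UNIFIED.get(c);
            out.append(unified != null ? unified.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = "A" + (char) 0xE33A + "B";
        String unified = unify(decoded);
        // Prints: uE33A -> u7BB8
        System.out.printf("u%X -> u%X%n", (int) decoded.charAt(1), (int) unified.charAt(1));
    }
}
```

Once the text holds the unified code points, it no longer depends on the PUA entries of the encode table when it is written back out.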
JUSTIFICATION :
These 84 characters are special cases of unification. Some were unified with standard Unicode code points and some were unified with other HKSCS characters. (Please refer to Annex I of http://www.ogcio.gov.hk/ccli/eng/hkscs/download/ehkscs99.pdf)
The "Big5_HKSCS" encoding drops these characters when mapping, so data is silently lost in conversion tasks. The "MS950_HKSCS" encoding maps the characters to compatibility code points in the PUA and thus fails to conform to the specified unification.
By implementing the suggested encode and decode mapping entries, these two encodings will be more usable in data conversion tasks while conforming to the above-mentioned unification of the 84 characters.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
"MS950_HKSCS" and "Big5_HKSCS" to be implemented with the suggested encode and decode mappings.
Desired Test case print-out:
-----------------------------------------
x8E69 -> u7BB8 u7BB8
uE33A -> xBAE6 xBAE6
ACTUAL -
"MS950_HKSCS" - mapping to compatibility point irrespective of the unification
"Big5_HKSCS" - no mapping for these 84 characters
Actual Test case print-out:
-----------------------------------------
x8E69 -> uE33A uFFFD
uE33A -> x8E69 x3F
---------- BEGIN SOURCE ----------
public class A
{
    public static void main(String argv[]) throws Exception
    {
        byte b1[] = null, b2[] = null;
        String s1 = null, s2 = null;

        // Decode the Big5 byte pair x8E69 with both charsets.
        b1 = new byte[] {(byte)0x8E, (byte)0x69};
        s1 = new String(b1, "MS950_HKSCS");
        s2 = new String(b1, "Big5_HKSCS");
        System.out.printf("x8E69 -> u%X u%X%n", (int) s1.charAt(0), (int) s2.charAt(0));

        // Encode the PUA code point uE33A with both charsets.
        s1 = "\uE33A";
        b1 = s1.getBytes("MS950_HKSCS");
        b2 = s1.getBytes("Big5_HKSCS");

        // Mask each byte with 0xFF so that bytes above 0x7F are not sign-extended.
        int c1 = b1.length == 1 ? (b1[0] & 0xFF) : ((b1[0] & 0xFF) << 8) | (b1[1] & 0xFF);
        int c2 = b2.length == 1 ? (b2[0] & 0xFF) : ((b2[0] & 0xFF) << 8) | (b2[1] & 0xFF);
        System.out.printf("uE33A -> x%X x%X%n", c1, c2);
    }
}
---------- END SOURCE ----------
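One detail matters when printing encoded bytes as a hex code: Java's `byte` is signed, so a trail byte above 0x7F (as in the proposed xBAE6) is sign-extended when widened to `int` unless it is masked with 0xFF. The helper below is a small sketch of that masking; the class name `Big5Hex` and the method `toCode` are hypothetical.

```java
public class Big5Hex {
    // Combine one or two encoded bytes into a single int for hex printing.
    public static int toCode(byte[] b) {
        if (b.length == 1) {
            return b[0] & 0xFF;
        }
        return ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] proposed = {(byte) 0xBA, (byte) 0xE6};
        // Without masking, the negative trail byte overwrites the lead byte.
        int unmasked = 0xFFFF & ((proposed[0] << 8) | proposed[1]);
        // Prints: masked xBAE6, unmasked xFFE6
        System.out.printf("masked x%X, unmasked x%X%n", toCode(proposed), unmasked);
    }
}
```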