JDK-7006360

Improving support on 84 unified HKSCS characters in Big5_HKSCS and MS950_HKSCS


    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P4
    • None
    • 6u23
    • core-libs

      A DESCRIPTION OF THE REQUEST :
      As described in http://www.ogcio.gov.hk/ccli/eng/hkscs/download/big5cmp.txt, 84 GCCS (Hong Kong Government Common Character Set) characters were not included in the HKSCS because they had been unified with formal Unicode characters by 1999.

      "MS950_HKSCS" currently decodes the Big5 codes of these characters to Private Use Area (PUA) code points, and "Big5_HKSCS" decodes them as \uFFFD. It is suggested to revise the mappings as follows to align with the unification:

      Decode Mapping:
      Big5_Code Now-Decoded-As New-Decoded-As
      x8E69 uE33A u7BB8
      x8E6F uE340 u7C06
      x8E7E uE34F u7CCE
      x8EAB uE35A u7DD2
      x8EB4 uE363 u7E1D
      x8ECD uE37C u8005
      x8ED0 uE37F u8028
      x8F57 uE3C5 u83C1
      x8F69 uE3D7 u84A8
      x8F6E uE3DC u840F
      x8FCB uE417 u89A6
      x8FCC uE418 u89A9
      x8FFE uE44A u8D77
      x906D uE478 u90FD
      x907A uE485 u92B9
      x90DC uE4C5 u975C
      x90F1 uE4DA u97FF
      x91BF uE545 u9F16
      x9244 uE589 u8503
      x92AF uE5D2 u5159
      x92B0 uE5D3 u515B
      x92B1 uE5D4 u515D
      x92B2 uE5D5 u515E
      x92C8 uE5EB u936E
      x92D1 uE5F4 u7479
      x9447 uE6C6 u6D67
      x94CA uE727 u799B
      x95D9 uE7D3 u9097
      x9644 uE7FD u975D
      x96ED uE884 u701E
      x96FC uE893 u5B28
      x9B76 uEB40 u7201
      x9B78 uEB42 u77D7
      x9B7B uEB45 u7E87
      x9BC6 uEB6E u99D6
      x9BDE uEB86 u91D4
      x9BEC uEB94 u60DE
      x9BF6 uEB9E u6FB6
      x9C42 uEBA9 u8F36
      x9C53 uEBBA u4FBB
      x9C62 uEBC9 u71DF
      x9C68 uEBCF u9104
      x9C6B uEBD2 u9DF0
      x9C77 uEBDE u83CF
      x9CBC uEC01 u5C10
      x9CBD uEC02 u79E3
      x9CD0 uEC15 u5A67
      x9D57 uEC5B u8F0B
      x9D5A uEC5E u7B51
      x9DC4 uECA6 u62D0
      x9EA9 uED28 u6062
      x9EEF uED6E u75F9
      x9EFD uED7C u6C4A
      x9F60 uED9E u9B2E
      x9F66 uEDA4 u9F17
      x9FCB uEDE7 u50ED
      x9FD8 uEDF4 u5F0C
      xA063 uEE3E u880F
      xA077 uEE52 u62CE
      xA0D5 uEE8E u7468
      xA0DF uEE98 u7162
      xA0E4 uEE9D u7250
      xFA5F uE01F u5029
      xFA66 uE026 u507D
      xFABD uE05B u5305
      xFAC5 uE063 u5344
      xFAD5 uE073 u537F
      xFB48 uE0A5 u5605
      xFBB8 uE0F3 u5A77
      xFBF3 uE12E u5E75
      xFBF9 uE134 u5ED0
      xFC4F uE149 u5F58
      xFC6C uE166 u60A4
      xFCB9 uE191 u6490
      xFCE2 uE1BA u6674
      xFCF1 uE1C9 u675E
      xFDB7 uE22C u6C9C
      xFDB8 uE22D u6E1D
      xFDBB uE230 u6E2F
      xFDF1 uE266 u716E
      xFE52 uE286 u732A
      xFE6F uE2A3 u745C
      xFEAA uE2BC u74E9
      xFEDD uE2EF u7809
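
      Until the mappings change, a caller could approximate the proposed decode behaviour by post-processing text decoded with "MS950_HKSCS". The following is a minimal sketch, not part of the JDK: the class name HkscsUnifier and the method unify are hypothetical, and only the first three rows of the table above are included.

```java
import java.util.HashMap;
import java.util.Map;

public class HkscsUnifier {
    // PUA code point -> unified code point (first three rows of the decode table).
    private static final Map<Character, Character> PUA_TO_UNIFIED = new HashMap<>();
    static {
        PUA_TO_UNIFIED.put('\uE33A', '\u7BB8');
        PUA_TO_UNIFIED.put('\uE340', '\u7C06');
        PUA_TO_UNIFIED.put('\uE34F', '\u7CCE');
    }

    /** Replaces each listed PUA character with its unified counterpart. */
    public static String unify(String decoded) {
        StringBuilder out = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            char mapped = PUA_TO_UNIFIED.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = unify("A\uE33AB");
        System.out.printf("u%X%n", (int) s.charAt(1)); // prints u7BB8
    }
}
```

      A complete version would carry all 84 rows. Characters outside the map pass through unchanged.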

      Encode Mapping:
      Unicode Now-Encoded-As New-Encoded-As
      uE33A x8E69 xBAE6
      uE340 x8E6F xEDCA
      uE34F x8E7E xA261
      uE35A x8EAB xBAFC
      uE363 x8EB4 xBFA6
      uE37C x8ECD xAACC
      uE37F x8ED0 xBFAE
      uE3C5 x8F57 xB5D7
      uE3D7 x8F69 xE3C8
      uE3DC x8F6E xDB79
      uE417 x8FCB xBFCC
      uE418 x8FCC xA0D4
      uE44A x8FFE xB05F
      uE478 x906D xB3A3
      uE485 x907A xF9D7
      uE4C5 x90DC xC052
      uE4DA x90F1 xC554
      uE545 x91BF xF1E3
      uE589 x9244 x9242
      uE5D2 x92AF xA259
      uE5D3 x92B0 xA25A
      uE5D4 x92B1 xA25C
      uE5D5 x92B2 xA25B
      uE5EB x92C8 xA05F
      uE5F4 x92D1 xE6AB
      uE6C6 x9447 xD256
      uE727 x94CA xE6D0
      uE7D3 x95D9 xCA52
      uE7FD x9644 x9CE4
      uE884 x96ED x96EE
      uE893 x96FC xE959
      uEB40 x9B76 xEFF9
      uEB42 x9B78 xC5F7
      uEB45 x9B7B xF5E8
      uEB6E x9BC6 xE8CD
      uEB86 x9BDE xD0C0
      uEB94 x9BEC xFD64
      uEB9E x9BF6 xBF47
      uEBA9 x9C42 xEBC9
      uEBBA x9C53 xCDE7
      uEBC9 x9C62 xC0E7
      uEBCF x9C68 xDC52
      uEBD2 x9C6B xF86D
      uEBDE x9C77 xDB5D
      uEC01 x9CBC xC95C
      uEC02 x9CBD xAFB0
      uEC15 x9CD0 xD4D1
      uEC5B x9D57 xE07C
      uEC5E x9D5A xB5AE
      uECA6 x9DC4 xA9E4
      uED28 x9EA9 xABEC
      uED6E x9EEF xDECD
      uED7C x9EFD xC9FC
      uED9E x9F60 xF9C4
      uEDA4 x9F66 x91BE
      uEDE7 x9FCB xB9B0
      uEDF4 x9FD8 x9361
      uEE3E xA063 x8FB6
      uEE52 xA077 xA9F0
      uEE8E xA0D5 x947A
      uEE98 xA0DF xDE72
      uEE9D xA0E4 x9455
      uE01F xFA5F xADC5
      uE026 xFA66 xB0B0
      uE05B xFABD xA55D
      uE063 xFAC5 xA2CD
      uE073 xFAD5 xADEB
      uE0A5 xFB48 x9DEF
      uE0F3 xFBB8 xB440
      uE12E xFBF3 xC9DB
      uE134 xFBF9 x9DFB
      uE149 xFC4F xD8F4
      uE166 xFC6C xA0DC
      uE191 xFCB9 xBCB5
      uE1BA xFCE2 xB4B8
      uE1C9 xFCF1 xA7FB
      uE22C xFDB7 xCB58
      uE22D xFDB8 xB4FC
      uE230 xFDBB xB4E4
      uE266 xFDF1 xB54E
      uE286 xFE52 x9975
      uE2A3 xFE6F xB7EC
      uE2BC xFEAA xA260
      uE2EF xFEDD xCFF1
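
      Both tables use the same three-column layout, so their rows can be loaded into lookup maps and cross-checked. The sketch below is illustrative only: the class and method names are hypothetical, and it carries two sample rows from each table. It checks that the "Now-Encoded-As" column is the inverse of the "Now-Decoded-As" column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MappingTableCheck {
    /** Parses a token such as "x8E69" or "uE33A" by dropping its one-letter prefix. */
    public static int parseHex(String token) {
        return Integer.parseInt(token.substring(1), 16);
    }

    /** Maps the first column of each row to the column at index valueColumn. */
    public static Map<Integer, Integer> load(String[] rows, int valueColumn) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        for (String row : rows) {
            String[] cols = row.trim().split("\\s+");
            map.put(parseHex(cols[0]), parseHex(cols[valueColumn]));
        }
        return map;
    }

    /** Returns true if every decode entry has the matching inverse encode entry. */
    public static boolean isInverse(Map<Integer, Integer> decode, Map<Integer, Integer> encode) {
        for (Map.Entry<Integer, Integer> e : decode.entrySet()) {
            if (!e.getKey().equals(encode.get(e.getValue()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] decodeRows = {"x8E69 uE33A u7BB8", "x8E6F uE340 u7C06"};
        String[] encodeRows = {"uE33A x8E69 xBAE6", "uE340 x8E6F xEDCA"};
        Map<Integer, Integer> nowDecoded = load(decodeRows, 1);
        Map<Integer, Integer> nowEncoded = load(encodeRows, 1);
        System.out.println(isInverse(nowDecoded, nowEncoded)); // prints true
    }
}
```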


      JUSTIFICATION :
      These 84 characters are special cases of unification. Some were unified with formal Unicode code points and some were unified with other HKSCS characters. (Please refer to Annex I of http://www.ogcio.gov.hk/ccli/eng/hkscs/download/ehkscs99.pdf)

      The "Big5_HKSCS" encoding has no mapping for these characters, so they are lost in conversion (decoded as \uFFFD, encoded as x3F). This is disastrous for data conversion tasks. The "MS950_HKSCS" encoding maps the characters to the compatibility point and thus fails to conform to the specified unification.

      By implementing the suggested encode and decode mapping entries, these two encodings will be more usable in data conversion tasks while conforming to the above-mentioned unification of the 84 characters.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      "MS950_HKSCS" and "Big5_HKSCS" to be implemented with the suggested encode and decode mappings.

      Desired Test case print-out:
      -----------------------------------------
      x8E69 -> u7BB8 u7BB8
      uE33A -> xBAE6 xBAE6

      ACTUAL -
      "MS950_HKSCS" - mapping to compatibility point irrespective of the unification

      "Big5_HKSCS" - no mapping for these 84 characters

      Actual Test case print-out:
      -----------------------------------------
      x8E69 -> uE33A uFFFD
      uE33A -> x8E69 x3F

      ---------- BEGIN SOURCE ----------
      public class A
      {

      public static void main(String argv[]) throws Exception
      {
          // Decode the Big5 byte pair 0x8E69 with both charsets.
          byte[] b1 = new byte[] {(byte)0x8E, (byte)0x69};
          String s1 = new String(b1, "MS950_HKSCS");
          String s2 = new String(b1, "Big5_HKSCS");
          System.out.printf("x8E69 -> u%X u%X%n", (int) s1.charAt(0), (int) s2.charAt(0));

          // Encode the PUA code point U+E33A with both charsets.
          s1 = "\uE33A";
          b1 = s1.getBytes("MS950_HKSCS");
          byte[] b2 = s1.getBytes("Big5_HKSCS");
          // Mask each byte with 0xFF so that sign extension cannot corrupt the combined value.
          int c1 = b1.length == 1 ? (b1[0] & 0xFF) : ((b1[0] & 0xFF) << 8) | (b1[1] & 0xFF);
          int c2 = b2.length == 1 ? (b2[0] & 0xFF) : ((b2[0] & 0xFF) << 8) | (b2[1] & 0xFF);
          System.out.printf("uE33A -> x%X x%X%n", c1, c2);
      }

      }

      ---------- END SOURCE ----------
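
      Java bytes are signed, so joining two encoded bytes into one value for printing requires masking each byte with 0xFF. Otherwise a trail byte of 0x80 or above is sign-extended and overwrites the lead byte. The sketch below (the class name BytePair is hypothetical) shows the difference for the byte pair 0xBA 0xE6, the proposed encoding of uE33A.

```java
public class BytePair {
    /** Combines one or two encoded bytes into a single unsigned value. */
    public static int combine(byte[] b) {
        if (b.length == 1) {
            return b[0] & 0xFF;
        }
        return ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] pair = {(byte) 0xBA, (byte) 0xE6};
        // Without masking, the sign-extended trail byte overwrites the lead byte.
        int unmasked = 0xFFFF & ((pair[0] << 8) | pair[1]);
        System.out.printf("x%X x%X%n", combine(pair), unmasked); // prints xBAE6 xFFE6
    }
}
```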

            sherman Xueming Shen
            webbuggrp Webbug Group