JDK-5065634

Improve JISAutoDetect heuristics for guessing between EUC and SJIS


    • Type: Bug
    • Resolution: Won't Fix
    • Priority: P4
    • None
    • 6
    • core-libs

      If the input to JISAutoDetect is valid both as an EUC-encoded and as an
      SJIS-encoded byte stream, then the decoder must guess. The current
      algorithm guesses EUC if the text, decoded as EUC, contains more than one
      "regular" hiragana or more than one half-width katakana.
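
The rule described above can be sketched roughly as follows. This is only an illustration written from the description in this report, not the JDK's actual decoder code. The class and method names are hypothetical, the input is assumed to be the text already decoded as EUC, and the range used for "regular" hiragana (U+3041 to U+3093) is an assumption.

```java
// Hypothetical sketch of the absolute-count rule described in this report.
// It is NOT the JDK implementation.
final class AbsoluteCountGuess {

    // Assumed range for "regular" hiragana: U+3041 through U+3093.
    static boolean isHiragana(char c) {
        return c >= '\u3041' && c <= '\u3093';
    }

    // Half-width katakana and punctuation: U+FF61 through U+FF9F.
    static boolean isHalfwidthKatakana(char c) {
        return c >= '\uFF61' && c <= '\uFF9F';
    }

    // Returns true if the rule would guess EUC for text decoded as EUC.
    static boolean guessesEuc(String eucDecoded) {
        int hiragana = 0;
        int halfKana = 0;
        for (int i = 0; i < eucDecoded.length(); i++) {
            char c = eucDecoded.charAt(i);
            if (isHiragana(c)) {
                hiragana++;
            } else if (isHalfwidthKatakana(c)) {
                halfKana++;
            }
        }
        // Absolute thresholds: the input length plays no role.
        return hiragana > 1 || halfKana > 1;
    }
}
```

Under this sketch, two hiragana anywhere in the input are enough to choose EUC, whether the input is four bytes or four megabytes long.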

      This algorithm is poor because:
      - It does not take into account the number of input bytes. A good heuristic
        should deal with ratios, not absolute numbers of characters or bytes.
      - It disregards other common characters, such as "regular" katakana.
      - Half-width katakana is very unlikely to be mixed with full-width
        characters. In particular, half-width katakana mixed with full-width
        katakana is likely to occur only in didactic or humorous uses.
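
One possible shape for a ratio-based heuristic that addresses the three points above is sketched below. It is purely illustrative and is not a proposed patch. The class name, the scoring formula, the character ranges, and the tie-breaking rule are all assumptions. It scores each candidate decoding by the fraction of characters that are hiragana or full-width katakana, and subtracts the fraction of half-width katakana when they appear alongside other non-ASCII characters.

```java
// Hypothetical ratio-based scoring sketch; every name and weight is an assumption.
final class RatioGuess {

    static boolean isHiragana(char c) {
        return c >= '\u3041' && c <= '\u3093';
    }

    static boolean isFullwidthKatakana(char c) {
        return c >= '\u30A1' && c <= '\u30F6';
    }

    static boolean isHalfwidthKatakana(char c) {
        return c >= '\uFF61' && c <= '\uFF9F';
    }

    // Plausibility of one candidate decoding, as a ratio over its length.
    static double score(String decoded) {
        int n = decoded.length();
        if (n == 0) {
            return 0.0;
        }
        int common = 0; // hiragana plus full-width katakana
        int half = 0;   // half-width katakana
        int wide = 0;   // any other non-ASCII character, including the common ones
        for (int i = 0; i < n; i++) {
            char c = decoded.charAt(i);
            if (isHalfwidthKatakana(c)) {
                half++;
            } else if (c > 0x7F) {
                wide++;
                if (isHiragana(c) || isFullwidthKatakana(c)) {
                    common++;
                }
            }
        }
        // Penalize half-width katakana only when mixed with wide characters.
        boolean mixed = half > 0 && wide > 0;
        double penalty = mixed ? (double) half / n : 0.0;
        return (double) common / n - penalty;
    }

    // Guess EUC when its decoding scores at least as well as the SJIS one.
    // Breaking ties in favor of EUC is an arbitrary choice for this sketch.
    static boolean guessEuc(String asEuc, String asSjis) {
        return score(asEuc) >= score(asSjis);
    }
}
```

For example, the bytes A4 A2 A4 A4 decode as two hiragana (U+3042, U+3044) in EUC and as four half-width punctuation characters (U+FF64, U+FF62, U+FF64, U+FF64) in SJIS, so `RatioGuess.guessEuc("\u3042\u3044", "\uFF64\uFF62\uFF64\uFF64")` favors EUC. Because the score is a ratio, doubling the input does not change the outcome.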

      We can do much better than the current heuristics without resorting to
      expensive tests such as checking for kanji compounds.

      ###@###.### 2004-06-18

            sherman Xueming Shen
            martin Martin Buchholz