Bug
Resolution: Won't Fix
P4
None
6
generic
generic
If the input to JISAutoDetect is a valid byte stream in both EUC and SJIS,
then the decoder must guess. The current algorithm guesses EUC if the
input, decoded as EUC, contains more than one "regular" hiragana
or more than one half-width katakana.
This algorithm is poor because:
- It does not take into account the number of input bytes. A good heuristic
should deal with ratios, not absolute numbers of characters or bytes.
- It disregards other common characters, such as "regular" katakana.
- Half-width katakana is very unlikely to be mixed with full-width characters.
In particular, half-width katakana mixed with full-width katakana is likely
to occur only in didactic or humorous uses.
We can do much better than the current heuristics without resorting to
expensive tests such as checking for kanji compounds.
###@###.### 2004-06-18
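To make the ratio-based idea above concrete, here is a minimal sketch in Java. It is not the JISAutoDetect implementation and not a proposed patch. The class name KanaRatioGuess, the helper names, and the 0.25 cut-off are all hypothetical. It assumes the input has already been found valid in both encodings, reads it as EUC-JP, and measures what share of the multi-byte characters are full-width hiragana or katakana (lead byte 0xA4 or 0xA5).

```java
public class KanaRatioGuess {
    // Hypothetical cut-off (an assumption, not a tuned value): the minimum
    // share of full-width kana among multi-byte characters for an EUC guess.
    static final double MIN_KANA_RATIO = 0.25;

    /**
     * Reads the bytes as EUC-JP and returns the fraction of multi-byte
     * characters that are full-width hiragana (lead byte 0xA4) or
     * full-width katakana (lead byte 0xA5). Returns 0.0 if the input
     * contains no multi-byte characters.
     */
    static double eucKanaRatio(byte[] data) {
        int kana = 0;
        int total = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b == 0x8F && i + 2 < data.length) {
                // SS3 prefix: three-byte JIS X 0212 character, not kana
                total++;
                i += 3;
            } else if (b == 0x8E && i + 1 < data.length) {
                // SS2 prefix: half-width katakana, counted as non-kana here
                total++;
                i += 2;
            } else if (b >= 0xA1 && b <= 0xFE && i + 1 < data.length) {
                // Two-byte JIS X 0208 character
                total++;
                if (b == 0xA4 || b == 0xA5) {
                    kana++;
                }
                i += 2;
            } else {
                // ASCII or a stray byte: ignored by the ratio
                i++;
            }
        }
        return total == 0 ? 0.0 : (double) kana / total;
    }

    /** Guesses a charset name for input that is valid in both encodings. */
    static String guess(byte[] data) {
        return eucKanaRatio(data) >= MIN_KANA_RATIO ? "EUC-JP" : "Shift_JIS";
    }
}
```

Because the score is a ratio, it behaves the same for a ten-byte input and a ten-kilobyte one, and it counts full-width katakana as well as hiragana. A byte stream such as B1 B2 B3 B4, which is four half-width katakana in SJIS, contains no kana when read as EUC and therefore falls through to the SJIS guess.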