Type: Bug
Resolution: Won't Fix
Priority: P4
Fix Version/s: None
Affects Version/s: 6
Component/s: core-libs
Labels: martin
Subcomponent: java.nio.charsets
CPU: generic
OS: generic

If the input to JISAutoDetect is a valid byte stream in both EUC and SJIS,
the decoder must guess. The current algorithm guesses EUC if the input,
interpreted as EUC, contains more than one "regular" hiragana
or more than one halfwidth katakana.
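To make the ambiguity concrete, here is a minimal sketch (not taken from the JDK sources) showing one byte sequence that decodes without error under both charsets. The class name AmbiguousBytesDemo and the helper strictDecode are hypothetical; the sketch assumes the JDK's extended charsets "EUC-JP" and "Shift_JIS" are available at run time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical demo class; not part of JISAutoDetect. */
final class AmbiguousBytesDemo {

    /** Strictly decodes input; returns null if it is not valid in the charset. */
    static String strictDecode(byte[] input, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // the bytes are not valid in this charset
        }
    }

    public static void main(String[] args) {
        // Two EUC-JP hiragana (lead byte 0xA4); each byte is also a valid
        // single-byte halfwidth character in Shift_JIS.
        byte[] input = {(byte) 0xA4, (byte) 0xB3, (byte) 0xA4, (byte) 0xCF};
        for (String name : new String[] {"EUC-JP", "Shift_JIS"}) {
            String decoded = strictDecode(input, name);
            StringBuilder hex = new StringBuilder();
            decoded.codePoints().forEach(cp -> hex.append(String.format("U+%04X ", cp)));
            System.out.println(name + ": " + hex.toString().trim());
        }
    }
}
```

Under EUC-JP the four bytes are two fullwidth hiragana; under Shift_JIS they are four halfwidth characters. Validity alone therefore cannot decide between the two.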

This algorithm is poor because:
- It does not take the number of input bytes into account. A good heuristic
  should work with ratios, not absolute numbers of characters or bytes.
- It disregards other common characters, such as "regular"
  katakana.
- Halfwidth katakana is very unlikely to be mixed with fullwidth characters.
  In particular, halfwidth katakana mixed with fullwidth katakana is likely
  to occur only in didactic or humorous uses.

We can do much better than the current heuristics without resorting to
expensive tests such as checking for kanji compounds.
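As one possible illustration of a ratio-based heuristic (a sketch only, not a proposed patch and not the existing JISAutoDetect code), the input can be decoded under both interpretations and the fraction of fullwidth hiragana and katakana compared. The class KanaRatioGuess, its method names, and the tie-breaking rule are all hypothetical; halfwidth katakana is deliberately left out of the score because, as noted above, it rarely appears alongside fullwidth text. The sketch assumes the "EUC-JP" and "Shift_JIS" charsets are available.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of a ratio-based guess; not the JDK implementation. */
final class KanaRatioGuess {
    private static final Charset EUC = Charset.forName("EUC-JP");
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    /** Fraction of decoded code points that are fullwidth hiragana or katakana. */
    static double fullwidthKanaRatio(byte[] input, Charset cs) {
        // new String(byte[], Charset) replaces malformed input with U+FFFD,
        // which simply counts as a non-kana character here.
        String s = new String(input, cs);
        int total = s.codePointCount(0, s.length());
        if (total == 0) {
            return 0.0;
        }
        long kana = s.codePoints()
                .filter(cp -> (cp >= 0x3041 && cp <= 0x3096)    // hiragana
                           || (cp >= 0x30A1 && cp <= 0x30FA))   // katakana
                .count();
        return (double) kana / total;
    }

    /** Returns the charset name whose interpretation has the higher kana ratio. */
    static String guess(byte[] input) {
        double euc = fullwidthKanaRatio(input, EUC);
        double sjis = fullwidthKanaRatio(input, SJIS);
        // Assumption: ties fall back to Shift_JIS; this choice is arbitrary.
        return euc > sjis ? "EUC-JP" : "Shift_JIS";
    }

    public static void main(String[] args) {
        // EUC-JP bytes for a short run of hiragana (lead byte 0xA4)
        // followed by katakana (lead byte 0xA5).
        byte[] sample = {
            (byte) 0xA4, (byte) 0xB3, (byte) 0xA4, (byte) 0xEC,
            (byte) 0xA5, (byte) 0xC6, (byte) 0xA5, (byte) 0xB9
        };
        System.out.println(guess(sample));
    }
}
```

Because the score is a ratio over all decoded characters, it scales with input length, and it counts regular katakana as well as hiragana. A real fix would need tuning against representative text.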

###@###.### 2004-06-18

Assignee: Xueming Shen
Reporter: Martin Buchholz
Votes: 0
Watchers: 1
Created: 2004-06-18 16:36
Updated: 2024-10-09 13:23
Resolved: 2017-08-09 13:54
Imported: 15/Sep/12 1:26 PM
Indexed: 17/Jul/12 10:57 AM
