JDK-4422038

(cs) CharsetEncoder/Decoder needs API to control fallbacks

    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P4
    • None
    • 1.2.0, 1.4.0
    • core-libs
    • Fix Understood
    • generic
    • generic

      The CharsetEncoder and CharsetDecoder classes currently provide API to choose between two ways of handling unmappable input data: map to a substitution character or throw an exception. They should provide API for a third way: using fallback sequences.
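For reference, the two existing behaviours can be sketched against the `java.nio.charset` API as it exists today, where they correspond to `CodingErrorAction.REPLACE` and `CodingErrorAction.REPORT`. The class and method names below are illustrative only and are not part of this report.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): encodes text to US-ASCII with a
// caller-chosen action for unmappable characters.
final class UnmappableActions {
    public static String encodeAscii(String input, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(input));
            // Decode the resulting bytes again so the result is easy to inspect.
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // REPORT surfaces unmappable input as an exception; wrapped here
            // as an unchecked exception to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // REPLACE substitutes the encoder's replacement, '?' by default.
        System.out.println(encodeAscii("\u00A9 2001", CodingErrorAction.REPLACE)); // prints "? 2001"
        try {
            encodeAscii("\u00A9 2001", CodingErrorAction.REPORT);
        } catch (UncheckedIOException e) {
            System.out.println("unmappable: " + e.getCause());
        }
    }
}
```

Neither action produces "(C)" for U+00A9, which is the gap this request describes.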

      Fallback sequences are encoding-dependent output sequences that are not part of the core specification of a character encoding and generally cannot be mapped back, but are acceptable in many circumstances. For example, if mapping the Unicode character 00A9 COPYRIGHT SIGN to US-ASCII, "(C)" is a commonly used fallback sequence. If mapping the Unicode character 20AA NEW SHEQEL SIGN to ISO 8859-8, a desirable fallback sequence would be 0xF922E7 (HEBREW LETTER SHIN, QUOTATION MARK, HEBREW LETTER HET).

      Since fallback sequences are encoding-dependent, they cannot be provided by the client of the character converter. On the other hand, some clients (e.g., file system access) depend on perfect roundtrip conversion, and other clients (e.g., SGML-based applications) have their own fallback mechanism. Character converters therefore cannot blindly apply fallback sequences. A controlling API is needed.

      Preferably the API would take the form of a flag on the constructor that allows or disallows fallback sequences for the constructed character converter, so that implementations can choose between different tables at construction time.
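A minimal sketch of how such a constructor flag might behave, assuming a US-ASCII target. Everything here is hypothetical: the class `FallbackSketch`, its `allowFallbacks` parameter, and the two-entry table are illustrations, not an existing or proposed JDK API. A real converter would carry its fallback tables internally.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical converter wrapper; not part of the JDK.
final class FallbackSketch {
    // Illustrative fallback entries for US-ASCII only (assumed, not normative).
    private static final Map<Integer, String> ASCII_FALLBACKS = Map.of(
            0x00A9, "(C)",  // COPYRIGHT SIGN
            0x201D, "\"");  // RIGHT DOUBLE QUOTATION MARK

    private final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
    private final Map<Integer, String> table;

    // The flag selects the mapping table once, at construction time.
    public FallbackSketch(boolean allowFallbacks) {
        this.table = allowFallbacks ? ASCII_FALLBACKS : Map.of();
    }

    public byte[] encode(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);                            // directly mappable
            } else {
                out.append(table.getOrDefault(cp, "?"));  // fallback, else substitution
            }
        });
        // Every character in 'out' is ASCII at this point.
        return out.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "\u00A9 2001";
        System.out.println(new String(new FallbackSketch(true).encode(text), StandardCharsets.US_ASCII));  // prints "(C) 2001"
        System.out.println(new String(new FallbackSketch(false).encode(text), StandardCharsets.US_ASCII)); // prints "? 2001"
    }
}
```

A client that needs exact roundtrip conversion would construct the converter with fallbacks disallowed and see only the substitution behaviour.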


      Additional information from 4166607:

      As far as character set conversions go, the Java API is deficient, and
      seriously compromises its functionality.

      Missing Characters. There is no way to control what happens when a given
      converter does not support particular characters. The converters don't allow:
      + Customizable marker--use of anything but ? as a missing character marker.
      + Normalization--even if 0041 0300 could be represented by codes representing
        A-grave in the target set, they still don't convert to the A-grave codes.
      + Fallbacks--if 201D (curly right quotation mark) does not exist in a target
        set, it is represented by a ? instead of a reasonable fallback (").
      + Escapes--you can't have a missing character be represented by the standard
        \u201D notation.
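The normalization and escape items above can be approximated today on the client side. The sketch below assumes an ISO 8859-1 target and uses `java.text.Normalizer` for composition. The class and method names are illustrative, and the result is returned as text whose characters are all representable in ISO 8859-1, not as encoded bytes.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Illustrative client-side workaround (hypothetical name); not a converter feature.
final class MissingCharacterHandling {
    public static String prepareForLatin1(String input) {
        // NFC composes 0041 0300 into 00C0, which ISO 8859-1 can represent.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder();
        composed.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                // Write each UTF-16 code unit as a backslash-u escape
                // with four uppercase hex digits.
                for (char c : s.toCharArray()) {
                    out.append(String.format("\\u%04X", (int) c));
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(prepareForLatin1("A\u0300").length()); // prints 1 (composed A-grave)
        System.out.println(prepareForLatin1("\u201D"));           // prints the six-character escape for 201D
    }
}
```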

      Illegal Codes. There is no way to control what happens with illegal byte
      sequences. They are usually just skipped, with no warning.
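For comparison, the `java.nio.charset` API as it exists today lets the caller choose what a decoder does with malformed input through `CharsetDecoder.onMalformedInput`. The sketch below uses UTF-8 and a deliberately invalid byte. The class and method names are illustrative only.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): decodes UTF-8 with a
// caller-chosen action for malformed byte sequences.
final class IllegalSequenceHandling {
    public static String decodeUtf8(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // REPORT turns the illegal sequence into an exception instead of
            // skipping it; wrapped as unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in valid UTF-8
        System.out.println(decodeUtf8(bad, CodingErrorAction.IGNORE)); // prints "ab" (silently skipped)
        try {
            decodeUtf8(bad, CodingErrorAction.REPORT);
        } catch (UncheckedIOException e) {
            System.out.println("malformed: " + e.getCause());
        }
    }
}
```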

            Assignee: Unassigned
            Reporter: Norbert Lindenberg (Inactive)