JDK-4216686

RFE: Need way to identify which character encodings are reversible

    • Type: Enhancement
    • Resolution: Duplicate
    • Priority: P5
    • None
    • 1.2.0
    • core-libs



      Name: dbT83986 Date: 03/02/99


      Is there a way to identify which character encodings are reversible? The documentation on them at least should mention which ones are.

      ==================================
      REVIEW NOTE 3/10/99 - User responded with additional information

      Hi, David. Thanks for the reply.

      If you refer to the "DATA TRANSFER PROBLEMS ON WINDOWS" section in the
      'README' file in jdk1.1.7B, you'll see a paragraph discussing the
      occasional need to store byte data in a string. I use this sometimes to
      avoid large static initializers.

      [I pasted the relevant snippet below.]

      An easy way to convert bytes to a string is to use one of the
      encodings. (Please remember that this function is meant for converting
      bytes that have already been written out via an encoding).

      Note that I want to use this in kind of an unconventional way in that I
      want to start with bytes and convert to a string, then convert back.
      The bytes are the first step in this case.

      The notes below outline that only encodings that are "reversible" can
      convert in two directions. One example is ISO8859_1. This encoder,
      however, I think wastes space in the string by only using the lower
      half. Perhaps other encodings would give slightly better results, but
      I'm not sure which can go both ways.

      Of course, I realize this isn't the purpose, but it can reduce code size
      by not having to write byte->char packing/unpacking code. My solution
      up until now has been to gzip the byte array, then pack two bytes per
      char in a string.
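For illustration, the two-bytes-per-char packing described above could look roughly like the following. This is only a sketch: the class and method names (BytePacking, pack, unpack) are hypothetical, the gzip step is left out, and the leading parity char is an assumption added here so that odd-length arrays can be restored exactly.

```java
import java.util.Arrays;

public class BytePacking {
    // Packs two bytes into each char. The first char records whether the
    // original length was odd (1) or even (0); this header is an assumption
    // of the sketch, not something described in the report.
    public static String pack(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length / 2 + 2);
        sb.append((char) (data.length & 1));
        for (int i = 0; i < data.length; i += 2) {
            int hi = data[i] & 0xFF;
            // Pad the low half with zero when the array length is odd.
            int lo = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sb.append((char) ((hi << 8) | lo));
        }
        return sb.toString();
    }

    // Reverses pack(): reads the parity char, then splits each char
    // back into its high and low byte.
    public static byte[] unpack(String s) {
        int odd = s.charAt(0);
        int len = (s.length() - 1) * 2 - odd;
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = s.charAt(1 + i / 2);
            out[i] = (byte) ((i % 2 == 0) ? (c >> 8) : c);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] original = {1, (byte) 0xFF, 0, (byte) 0x80, 7};
        byte[] restored = unpack(pack(original));
        System.out.println(Arrays.equals(original, restored));
    }
}
```

A string built this way can contain unpaired surrogate chars, so it should not itself be passed through a character encoder; it is only meant to be unpacked again.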

      I guess the whole point to my ramblings is twofold:

      1) Have a method or something, such as isReversible(String) for the
      various encodings,
      and
      2) Maybe create a new encoding that packs bytes into chars, two bytes
      per char. We can call it "DAVID-SHAWN".
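No isReversible(String) method exists in the JDK; the following is only a sketch of how request (1) might be approximated in user code with java.nio.charset.Charset (which is newer than the release this report was filed against). The class name ReversibleCheck is hypothetical. The check only confirms that every single byte value survives a byte -> String -> byte round trip, which is the property the byte-storage trick needs; it is a necessary test, and may not be sufficient for stateful or multi-byte encodings.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class ReversibleCheck {
    // Hypothetical helper: returns true if each of the 256 byte values,
    // decoded and re-encoded on its own, comes back unchanged.
    public static boolean isReversible(String encodingName) {
        Charset cs = Charset.forName(encodingName);
        if (!cs.canEncode()) {
            return false; // decode-only charsets cannot round-trip
        }
        for (int b = 0; b < 256; b++) {
            byte[] original = {(byte) b};
            byte[] roundTrip = new String(original, cs).getBytes(cs);
            if (!Arrays.equals(original, roundTrip)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ISO8859_1", "US-ASCII", "UTF-8"}) {
            System.out.println(name + ": " + isReversible(name));
        }
    }
}
```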

      Thanks for your time,
      -Shawn

      P.S. Here is the README snippet:

      =======================================================================
                         DATA TRANSFER PROBLEMS ON WINDOWS
      =======================================================================

      A bug in the data transfer API (4032895) prevents most objects from
      being copied to the Win32 clipboard. A common workaround is to convert
      objects to a String representation, since String objects are not
      affected by this bug.

      One popular technique for converting an object to a string is to write
      the object into a ByteArrayOutputStream and convert the stream to a
      String with toString(). String.getBytes() reverses the process.

      There is a potential problem with this kind of byte/character
      conversion. Both toString() and getBytes() rely on a locale-specific
      character encoder to translate byte values to and from Unicode
      character values. Not all encoders assume a one-to-one relationship
      between byte values and character values. To ensure a reliable
      translation, do not rely on the default locale encoder. Explicitly
      specify an encoder that uses a reversible translation, such as
      ISO8859_1. Do this by passing the encoder name to toString() and
      getBytes():

          aString = aStream.toString("ISO8859_1");
          aByteArray = aString.getBytes("ISO8859_1");
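As a self-contained illustration of the two calls above, the sketch below pushes all 256 byte values through a ByteArrayOutputStream, converts to a String with ISO8859_1, and converts back. The class and method names (Latin1RoundTrip, roundTrip, allByteValues) are hypothetical and chosen only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Writes the bytes to a stream, converts the stream to a String, and
    // converts the String back to bytes, naming the encoder explicitly.
    public static byte[] roundTrip(byte[] data) {
        try {
            ByteArrayOutputStream aStream = new ByteArrayOutputStream();
            aStream.write(data, 0, data.length);
            String aString = aStream.toString("ISO8859_1");
            return aString.getBytes("ISO8859_1");
        } catch (UnsupportedEncodingException e) {
            // Not expected in practice: ISO-8859-1 is a required charset.
            throw new IllegalStateException(e);
        }
    }

    // Builds an array holding every byte value from 0x00 to 0xFF.
    public static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(all, roundTrip(all)));
    }
}
```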

      In previous releases, the need to use a reversible encoding was not
      apparent to most programmers. ISO8859_1 was the default encoder for
      western locales on both Solaris and Win32. A program's dependence on
      ISO8859_1 might not be apparent if the program was not tested under a
      non-western locale.

      JDK software running on Win32 machines uses Cp1252 (Windows Latin-1) as
      the default encoding for western locales. Cp1252 does not implement a
      reversible byte/character translation. It may appear to some
      programmers that 1.1.7 introduces an incompatibility. The real problem
      is a programming technique that unintentionally relies on the features
      of specific locales.
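To make the Cp1252 point concrete, the sketch below lists the byte values that do not survive a round trip through a given charset. The class and method names (LossyBytes, lossyBytes) are hypothetical, and the exact set reported for Cp1252 depends on the mapping table shipped with the runtime, so no particular output is assumed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LossyBytes {
    // Returns every byte value (0..255) that changes when decoded to a
    // String and encoded back with the same charset.
    public static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            byte[] back = new String(new byte[] {(byte) b}, cs).getBytes(cs);
            if (back.length != 1 || (back[0] & 0xFF) != b) {
                lossy.add(b);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + lossyBytes(StandardCharsets.ISO_8859_1));
        // Cp1252 is an optional charset, so check that it is available first.
        if (Charset.isSupported("Cp1252")) {
            System.out.println("Cp1252: " + lossyBytes(Charset.forName("Cp1252")));
        }
    }
}
```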

      (Review ID: 52380)
      ======================================================================

            nlindenbsunw Norbert Lindenberg (Inactive)
            dblairsunw Dave Blair (Inactive)