JDK-4209582

[Col] Collation keys are fragile and can't be versioned


    • Enhancement
    • Resolution: Won't Fix
    • P4
    • None
    • 1.2.0, 1.3.0
    • core-libs
    • b01
    • generic, x86
    • generic, windows_nt

      Name: clC74495 Date: 02/08/99


      It appears that the output of the java.text.CollationKey.toByteArray()
      method changed between 1.1 and 1.2. At least, I know this is true for
      the US locale.

      This is a serious issue for us. We have a pure Java database
      implementation that uses these byte arrays as sort keys in the
      B-trees that maintain secondary indexes. When the format changes, the
      old database files become invalid because the sort keys have changed.
      All secondary indexes must be rebuilt, and the database files cannot be
      shared between different versions of Java.

      The CollationKey is a powerful feature that allows us to support
      internationalized secondary index views. If the format changes
      frequently, it impacts us and our customers greatly.

      Questions for Sun:

      1) Why did the format of these CollationKeys change in JDK 1.2?

      2) Are the changes specific to a particular locale?

      3) Is there a way to generate old format CollationKeys?

      4) Does Sun place any importance on applications that need to persist
      CollationKeys?

      5) Is there a way to detect if the format has changed from one release
      to another? If not, can you please, please provide such a mechanism?

      Below is a code example showing that the CollationKey format changed
      from 1.1 to 1.2. The generated byte array has a different format in
      JDK 1.1 than in JDK 1.2.

          import java.text.CollationKey;
          import java.text.Collator;
          import java.text.RuleBasedCollator;
          import java.util.Locale;

          RuleBasedCollator collator =
              (RuleBasedCollator) Collator.getInstance(new Locale("en", "US", ""));
          collator.setStrength(Collator.TERTIARY);
          collator.setDecomposition(Collator.NO_DECOMPOSITION);

          CollationKey collationKey = collator.getCollationKey("3");
          // The byte content differs between 1.1 and 1.2.
          byte[] collationKeyArray = collationKey.toByteArray();
      (Review ID: 53523)
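
      Regarding question 5, one possible application-side workaround (not
      something the JDK provides) is to store a fingerprint of the keys
      generated for a fixed set of probe strings alongside the database, and
      to compare it when the files are opened. The sketch below assumes a
      hypothetical class CollationFingerprint and a hypothetical probe list.
      A matching fingerprint only suggests that the format is unchanged for
      the probes; it does not prove it for all input.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Collator;

final class CollationFingerprint {
    // Hypothetical probe strings; a real index would sample its own data.
    static final String[] PROBES = {"3", "a", "A", "abc", "\u00e9", "Z"};

    /** Returns a hex SHA-256 digest over the collation keys of all probes. */
    static String fingerprint(Collator collator) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new IllegalStateException(e);
        }
        for (String probe : PROBES) {
            byte[] key = collator.getCollationKey(probe).toByteArray();
            // Length-prefix each key so that key boundaries affect the digest.
            digest.update(ByteBuffer.allocate(4).putInt(key.length).array());
            digest.update(key);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

      The fingerprint would be written to the database header when the
      indexes are built. If a later run computes a different value for the
      same locale and collator settings, the indexes need to be rebuilt.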
      ======================================================================

      From 4302264, ###@###.###:

      A. Collation Versioning

      Problem

      Collation sort keys need to have long lives--they can be stored in database
      fields and retrieved and used for years to come. However, there are a
      number of factors that can cause the generation of sort keys to change over
      time, causing mismatches.

      1. The code generating the keys changes.
      2. There are updates to data for a given language.

      For #1, the problem is that a number of improvements can be made in the
      sort key construction in the future, both in terms of performance and size.
      A number of these are discussed in
      http://www.unicode.org/unicode/reports/tr10/tr10-2d3. We could probably get
      the Java sort key to half its current size, for example. This can make a
      tremendous difference to database storage requirements. Saying that the
      format is fixed for all time prevents us from making improvements, and from
      fixing bugs if we find them.

      For #2, we will always be making tweaks to the orderings for different
      languages as new data comes in. While the major languages are pretty
      stable, the less common ones are not as well attested--plus linguistic
      standards change subtly over time--look at the recent spelling reform in
      German.

      There is a strong, legitimate need for stability of sort keys across
      versions. Here is a proposal for handling this.

      Proposal

      A. Add API to Collator:


      /** Returns the version of the Collator's key format. */
      int getVersion(Locale desiredLocale);

      /** Gets a Collator for the desired version of the collation key format.
      If that version is not supported by the current collator, an
      IllegalArgumentException is thrown. */
      Collator getInstance(Locale desiredLocale, int version);

      /** Returns a list of all the supported collation key formats. The last
      item is the latest version supported. */
      static int[] getAvailableVersions(Locale desiredLocale);


      B. Add new rules to the rule interpreter:


      "{version <nnn>}" // sets the version of the collation data.


      C. Add new keys to the Locale resource bundles, allowing for multiple sets
      of collation rules


      {"Collator-1", <collation rules>},
      {"Collator-2", <collation rules>},
      {"Collator-Latest", "Collator-2"},


      Implementation

      Add to RuleBasedCollator two new private fields:

      short codeVersion;
      short dataVersion;

      Add to Collator a new field:

      private static final short latestCodeVersion;

      Any time the RuleBasedCollator code is changed so that it would affect the
      sortkey format (which should be very rarely!), then the old code is
      retained. The codeVersion is used to switch between the new and old code.
      latestCodeVersion is set to reflect the new latest version. The zero
      versions for each are the current 1.2 code.

      Any time the data is changed, the old data is retained under the old
      resource key. The new data gets a new version, and is entered with a new
      key. The "Collator-Latest" key points to the key for the latest one.

      The version number is ((dataVersion << 16) | codeVersion), although its
      composition is private. When getInstance is called with no version number,
      the latest code version and latest data version are used. When it is called
      with a specific version (presumably one that was derived and stored
      earlier), then those specific data and code versions are used. The data
      version is used to access the right resource. The code version is stored in
      the new collator, and used to switch processing as described above.

      The getAvailableVersions call returns the available combinations of
      resource versions and code versions. The code versions will be zero through
      latestCodeVersion. The data versions are those available for the given
      Locale's ResourceBundle. These are combined together to get all the
      versions. That is, if there are 3 resource versions and 2 code versions, it
      would return (hex) [00000000, 00000001, 00010000, 00010001, 00020000,
      00020001]
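
      The composition and enumeration described above can be sketched as
      follows. The class CollationVersions and its method names are
      hypothetical, and the sketch assumes that data versions are numbered
      consecutively from zero, which the proposal does not state explicitly.

```java
final class CollationVersions {
    /** Packs the two parts as ((dataVersion << 16) | codeVersion). */
    static int pack(int dataVersion, int codeVersion) {
        return (dataVersion << 16) | (codeVersion & 0xFFFF);
    }

    /** Extracts the data version from the upper 16 bits. */
    static int dataVersion(int version) {
        return version >>> 16;
    }

    /** Extracts the code version from the lower 16 bits. */
    static int codeVersion(int version) {
        return version & 0xFFFF;
    }

    /** Lists every data/code combination, ordered by data version first. */
    static int[] availableVersions(int dataVersionCount, int codeVersionCount) {
        int[] versions = new int[dataVersionCount * codeVersionCount];
        int i = 0;
        for (int d = 0; d < dataVersionCount; d++) {
            for (int c = 0; c < codeVersionCount; c++) {
                versions[i++] = pack(d, c);
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        // 3 resource versions and 2 code versions, as in the example above.
        for (int v : availableVersions(3, 2)) {
            System.out.printf("%08x%n", v);
        }
    }
}
```

      Because the list is ordered by data version and then by code version,
      the last element combines the latest data with the latest code.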

      Usage

      If someone wants to store sort keys over time, they first tag their
      database with the right Locale and latest version number at creation time.
      Whenever they generate new sort keys, they pass the Locale and version
      number in to getInstance. All of their sort keys will remain comparable. If
      they ever re-index the entire database field, then they can use the latest
      version to get the best results and bug fixes.
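
      Since Collator.getInstance(Locale, int) is only proposed here, the
      usage pattern can be approximated with a stand-in. The sketch below
      uses a hypothetical VersionedCollators registry that maps version
      numbers to made-up rule strings, leaves out the Locale for brevity, and
      builds collators with the existing RuleBasedCollator(String)
      constructor.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.TreeMap;

final class VersionedCollators {
    // Hypothetical rule sets standing in for versioned locale data.
    private static final TreeMap<Integer, String> RULES = new TreeMap<>();
    static {
        RULES.put(0x00000000, "< a < b < c");
        RULES.put(0x00010000, "< a < c < b");
    }

    /** Returns the newest version known to this registry. */
    static int latestVersion() {
        return RULES.lastKey();
    }

    /** Returns a collator for exactly the requested version. */
    static RuleBasedCollator getInstance(int version) {
        String rules = RULES.get(version);
        if (rules == null) {
            throw new IllegalArgumentException(
                    "Unsupported collation version: " + version);
        }
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Recorded in the database header at creation time.
        int storedVersion = latestVersion();
        byte[] first = getInstance(storedVersion).getCollationKey("b").toByteArray();
        // Later runs ask for the stored version, not the latest one.
        byte[] later = getInstance(storedVersion).getCollationKey("b").toByteArray();
        System.out.println(Arrays.equals(first, later));
    }
}
```

      Re-indexing with the latest version would mean calling latestVersion()
      again, regenerating every key, and updating the stored version number.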

      Issues

      People can also get into trouble if they compare sort keys built with
      different rules. If we wanted to remedy that, we could prefix all sort keys
      with a hashCode, as discussed in:

      http://www.unicode.org/unicode/reports/tr10/tr10-2d3#Catching Mismatches

      Upside: The CollationKey objects could throw exceptions when comparing keys
      built by different rules, if we wanted. Byte sort keys would at least not
      sort randomly; all of one language would come before all of another.

      Downside: Sort keys are 4 bytes longer. Also, the method is not fail-safe
      (although the chances of collisions are infinitesimal).
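
      A minimal sketch of the prefix idea follows, assuming the 4-byte
      hashCode is taken from the collator's rule string. That is one possible
      reading; the text above does not say what is hashed. The class
      TaggedSortKeys and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class TaggedSortKeys {
    /** Builds a collator from rules, wrapping the checked exception. */
    static RuleBasedCollator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the sort key prefixed with the 4-byte hash of the rules. */
    static byte[] taggedKey(RuleBasedCollator collator, String text) {
        byte[] key = collator.getCollationKey(text).toByteArray();
        return ByteBuffer.allocate(4 + key.length)
                .putInt(collator.getRules().hashCode())
                .put(key)
                .array();
    }

    /** True if both tagged keys carry the same rules hash. */
    static boolean sameRules(byte[] a, byte[] b) {
        return ByteBuffer.wrap(a).getInt() == ByteBuffer.wrap(b).getInt();
    }

    /** Compares tagged keys bytewise, rejecting keys from different rules. */
    static int compare(byte[] a, byte[] b) {
        if (!sameRules(a, b)) {
            throw new IllegalArgumentException(
                    "Sort keys were built with different rules");
        }
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF); // unsigned byte order
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

      A plain bytewise sort that skips the check would still group keys by
      their prefix, which is the "all of one language before all of another"
      behavior mentioned in the upside.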

      ======================================================================
      ###@###.### 11/2/04 18:21 GMT

            peytoia Yuka Kamiya (Inactive)
            clucasius Carlos Lucasius (Inactive)