  1. JDK
  2. JDK-4302264

Collation keys are fragile and can't be versioned


    • Type: Enhancement
    • Resolution: Duplicate
    • Priority: P5
    • None
    • 1.3.0
    • core-libs
    • generic
    • generic



      Name: mf23781 Date: 01/03/2000


      A. Collation Versioning

      Problem

      Collation sort keys need to have long lives--they can be stored in database
      fields and retrieved and used for years to come. However, there are a
      number of factors that can cause the generation of sort keys to change over
      time, causing mismatches.

      1. The code generating the keys changes.
      2. There are updates to data for a given language.

      For #1, the problem is that a number of improvements can be made in the
      sort key construction in the future, both in terms of performance and size.
      A number of these are discussed in
      http://www.unicode.org/unicode/reports/tr10/tr10-2d3. We could probably get
      the Java sort key to half its current size, for example. This can make a
      tremendous difference to database storage requirements. Saying that the
      format is fixed for all time prevents us from making improvements, and from
      fixing bugs if we find them.

      For #2, we will always be making tweaks to the orderings for different
      languages as new data comes in. While the major languages are pretty
      stable, the less common ones are not as well attested--plus linguistic
      standards change subtly over time--look at the recent spelling reform in
      German.

      There is a strong, legitimate need for stability of sort keys across
      versions. Here is a proposal for handling this.

      Proposal

      A. Add API to Collator:


      /** Returns the version of the Collator's key format */

      int getVersion(Locale desiredLocale);

      /** Gets a Collator that uses the desired version of the collation key
      format. If that version is not supported by the current collator, an
      IllegalArgumentException is thrown. */

      Collator getInstance(Locale desiredLocale, int version);

      /** Returns a list of all the supported collation key formats. The last
      item is the latest version supported. */

      static int[] getAvailableVersions(Locale desiredLocale);


      B. Add new rules to the rule interpreter:


      "{version <nnn>}" // sets the version of the collation data.


      C. Add new keys to the Locale resource bundles, allowing for multiple sets
      of collation rules


      {"Collator-1", <collation rules>},
      {"Collator-2", <collation rules>},
      {"Collator-Latest", "Collator-2"},


      Implementation

      Add to RuleBasedCollator two new private fields:

      short codeVersion;
      short dataVersion;

      Add to Collator a new field:

      private static final short latestCodeVersion;

      Any time the RuleBasedCollator code is changed in a way that would affect
      the sort key format (which should be very rare!), the old code is
      retained. The codeVersion is used to switch between the new and old code.
      latestCodeVersion is set to reflect the new latest version. Version zero
      of each is the current 1.2 code.

      Any time the data is changed, the old data is retained under the old
      resource key. The new data gets a new version, and is entered with a new
      key. The "Collator-Latest" key points to the key for the latest one.
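The resource-key scheme above can be sketched with a plain ListResourceBundle. This is only an illustration: the class names, the helper methods, and the two toy rule strings are hypothetical, and the real locale bundles would carry full collation rules.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

/** Hypothetical sketch of versioned collation data; not part of java.text. */
final class VersionedRules {

    /** Toy bundle holding two generations of rules plus the "latest" pointer. */
    static final class Data extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"Collator-1", "< a < b < c"},      // old data, retained under its old key
                {"Collator-2", "< c < b < a"},      // new data, entered under a new key
                {"Collator-Latest", "Collator-2"},  // points at the key of the latest data
            };
        }
    }

    /** Looks up the rules stored for one specific data version. */
    static String rulesFor(ResourceBundle bundle, int dataVersion) {
        return bundle.getString("Collator-" + dataVersion);
    }

    /** Follows the "Collator-Latest" indirection to the newest rules. */
    static String latestRules(ResourceBundle bundle) {
        return bundle.getString(bundle.getString("Collator-Latest"));
    }

    /** Builds a collator from rules, reporting bad rules as IllegalArgumentException. */
    static RuleBasedCollator collatorFor(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new Data();
        RuleBasedCollator old = collatorFor(rulesFor(bundle, 1));
        RuleBasedCollator latest = collatorFor(latestRules(bundle));
        // The two data versions order "a" and "b" differently.
        System.out.println(old.compare("a", "b"));
        System.out.println(latest.compare("a", "b"));
    }
}
```

Because the old rules stay reachable under "Collator-1", keys generated from them years ago can still be reproduced after "Collator-Latest" moves on.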

      The version number is ((dataVersion << 16) | codeVersion), although its
      composition is private. When getInstance is called with no version number,
      the latest code version and latest data version are used. When it is called
      with a specific version (presumably one that was derived and stored
      earlier), then those specific data and code versions are used. The data
      version is used to access the right resource. The code version is stored in
      the new collator, and used to switch processing as described above.

      The getAvailableVersions call returns the available combinations of
      resource versions and code versions. The code versions will be zero through
      latestCodeVersion. The data versions are those available for the given
      Locale's ResourceBundle. These are combined together to get all the
      versions. That is, if there are 3 resource versions and 2 code versions, it
      would return (hex) [00000000, 00000001, 00010000, 00010001, 00020000,
      00020001]
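The packing and the enumeration described above can be written out as a short sketch. The class and method names are hypothetical, and it assumes that data versions are numbered consecutively from zero, which the proposal implies but does not state outright.

```java
/** Hypothetical helper illustrating the proposed version numbering; not part of java.text. */
final class CollationVersions {
    private CollationVersions() {}

    /** Data version goes in the high 16 bits, code version in the low 16 bits. */
    static int pack(int dataVersion, int codeVersion) {
        return ((dataVersion & 0xFFFF) << 16) | (codeVersion & 0xFFFF);
    }

    /** Extracts the data version from a packed version number. */
    static int dataVersion(int version) {
        return version >>> 16;
    }

    /** Extracts the code version from a packed version number. */
    static int codeVersion(int version) {
        return version & 0xFFFF;
    }

    /**
     * Combines data versions 0..dataCount-1 with code versions
     * 0..latestCodeVersion. The last element pairs the latest data
     * with the latest code.
     */
    static int[] availableVersions(int dataCount, int latestCodeVersion) {
        int[] result = new int[dataCount * (latestCodeVersion + 1)];
        int i = 0;
        for (int d = 0; d < dataCount; d++) {
            for (int c = 0; c <= latestCodeVersion; c++) {
                result[i++] = pack(d, c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 resource versions and 2 code versions, as in the example above.
        for (int v : availableVersions(3, 1)) {
            System.out.printf("%08x%n", v);
        }
    }
}
```

Run with 3 data versions and a latest code version of 1, this prints the six hex values listed above, in the same order.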

      Usage

      If someone wants to store sort keys over time, they first tag their
      database with the right Locale and latest version number at creation time.
      Whenever they generate new sort keys, they pass the Locale and version
      number in to getInstance. All of their sort keys will remain comparable. If
      they ever re-index the entire database field, then they can use the latest
      version to get the best results and bug fixes.
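A rough sketch of that usage pattern follows. Since the proposed Collator.getInstance(Locale, int) does not exist, the getInstance method here is a stand-in that accepts only version 0 and maps it to whatever the running JDK provides; ColumnTag and sortKey are likewise hypothetical names for the metadata a database would keep.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch of pinning a collation version to a stored column. */
final class PinnedSortKeys {

    /** Stand-in for the proposed Collator.getInstance(Locale, int); only version 0 exists here. */
    static Collator getInstance(Locale locale, int version) {
        if (version != 0) {
            throw new IllegalArgumentException("Unsupported collation version: " + version);
        }
        return Collator.getInstance(locale);
    }

    /** Metadata recorded once, when the sort-key column is created. */
    static final class ColumnTag {
        final String localeTag;
        final int version;

        ColumnTag(Locale locale, int version) {
            this.localeTag = locale.toLanguageTag();
            this.version = version;
        }

        /** Re-creates the collator from the stored locale and version. */
        Collator collator() {
            return getInstance(Locale.forLanguageTag(localeTag), version);
        }
    }

    /** Generates a sort key using the locale and version stored with the column. */
    static byte[] sortKey(ColumnTag tag, String text) {
        return tag.collator().getCollationKey(text).toByteArray();
    }

    public static void main(String[] args) {
        ColumnTag tag = new ColumnTag(Locale.GERMAN, 0); // stored at creation time
        byte[] first = sortKey(tag, "Strasse");
        byte[] later = sortKey(tag, "Strasse");          // generated at some later date
        System.out.println(Arrays.equals(first, later));
    }
}
```

Re-indexing would amount to replacing the stored tag with the latest version and regenerating every key in the column.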

      Issues

      People can also get into trouble if they compare sort keys built with
      different rules. If we wanted to remedy that, we could prefix all sort keys
      with a hashCode, as discussed in:

      http://www.unicode.org/unicode/reports/tr10/tr10-2d3#Catching Mismatches

      Upside: The CollationKey objects could throw exceptions when comparing keys
      built by different rules, if we wanted. Byte sort keys would at least not
      sort randomly; all of one language would come before all of another.

      Downside: Sort keys are 4 bytes longer. Also, the method is not fail-safe
      (although the chances of collisions are infinitesimal).
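One way the hash prefix could look is sketched below. It assumes the hash is taken over the collator's rule string via String.hashCode(), which is only one possible choice; the class, the method names, and the toy rule strings are hypothetical.

```java
import java.nio.ByteBuffer;
import java.text.ParseException;
import java.text.RuleBasedCollator;

/** Hypothetical sketch of sort keys tagged with a hash of the rules that built them. */
final class TaggedSortKeys {

    /** Builds a collator from rules, reporting bad rules as IllegalArgumentException. */
    static RuleBasedCollator collator(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    /** Prefixes the raw key bytes with a 4-byte hash of the collator's rules. */
    static byte[] taggedKey(RuleBasedCollator collator, String text) {
        byte[] raw = collator.getCollationKey(text).toByteArray();
        return ByteBuffer.allocate(4 + raw.length)
                .putInt(collator.getRules().hashCode())
                .put(raw)
                .array();
    }

    /** Compares two tagged keys, refusing keys whose rule hashes differ. */
    static int compare(byte[] a, byte[] b) {
        if (ByteBuffer.wrap(a).getInt() != ByteBuffer.wrap(b).getInt()) {
            throw new IllegalArgumentException("Sort keys were built with different rules");
        }
        int n = Math.min(a.length, b.length);
        for (int i = 4; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF); // bytes compare as unsigned values
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        RuleBasedCollator v1 = collator("< a < b < c");
        RuleBasedCollator v2 = collator("< c < b < a");
        System.out.println(compare(taggedKey(v1, "a"), taggedKey(v1, "b")) < 0);
        try {
            compare(taggedKey(v1, "a"), taggedKey(v2, "a"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A plain byte-wise sort of such keys, without the check, would group all keys with the same prefix together, which matches the "would at least not sort randomly" point above.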
      (Review ID: 99520)

      ======================================================================

            nlindenbsunw Norbert Lindenberg (Inactive)
            miflemi Mick Fleming