Name: mf23781 Date: 01/03/2000
A. Collation Versioning
Problem
Collation sort keys need to have long lives--they can be stored in database
fields and retrieved and used for years to come. However, there are a
number of factors that can cause the generation of sort keys to change over
time, causing mismatches.
1. The code generating the keys changes.
2. There are updates to data for a given language.
For #1, the problem is that a number of improvements can be made in the
sort key construction in the future, both in terms of performance and size.
A number of these are discussed in
http://www.unicode.org/unicode/reports/tr10/tr10-2d3. We could probably get
the Java sort key to half its current size, for example. This can make a
tremendous difference to database storage requirements. Saying that the
format is fixed for all time prevents us from making improvements, and from
fixing bugs if we find them.
For #2, we will always be making tweaks to the orderings for different
languages as new data comes in. While the major languages are pretty
stable, the less common ones are not as well attested--plus linguistic
standards change subtly over time--look at the recent spelling reform in
German.
There is a strong, legitimate need for stability of sort keys across
versions. Here is a proposal for handling this.
Proposal
A. Add API to Collator:
/** Returns the version of the Collator's key format */
int getVersion(Locale desiredLocale);
/** Gets a Collator that uses the desired version of the collation key
format. If that version is not supported by the current collator, an
IllegalArgumentException is thrown. */
Collator getInstance(Locale desiredLocale, int version);
/** Returns a list of all the supported collation key formats. The last
item is the latest version supported. */
static int[] getAvailableVersions(Locale desiredLocale);
B. Add new rules to the rule interpreter:
"{version <nnn>}" // sets the version of the collation data.
C. Add new keys to the Locale resource bundles, allowing for multiple sets
of collation rules
{"Collator-1", <collation rules>},
{"Collator-2", <collation rules>},
{"Collator-Latest", "Collator-2"},
Implementation
Add to RuleBasedCollator two new private fields:
short codeVersion;
short dataVersion;
Add to Collator a new field:
private static final short latestCodeVersion;
Any time the RuleBasedCollator code is changed so that it would affect the
sortkey format (which should be very rarely!), then the old code is
retained. The codeVersion is used to switch between the new and old code.
latestCodeVersion is set to reflect the new latest version. Version zero
of each (code and data) denotes what ships in the current 1.2 code.
Any time the data is changed, the old data is retained under the old
resource key. The new data gets a new version, and is entered with a new
key. The "Collator-Latest" key points to the key for the latest one.
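The key indirection described above can be sketched as follows. This is only an illustration: the class name CollationRulesLookup, the DemoBundle contents, the placeholder rule strings, and the LATEST sentinel are all assumptions made for the example, not part of the proposal or of any existing locale data.

```java
import java.util.ListResourceBundle;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

/** Hypothetical sketch of resolving versioned collation rules from a bundle. */
final class CollationRulesLookup {
    /** Assumed sentinel meaning "use whatever Collator-Latest points to". */
    static final int LATEST = -1;

    private CollationRulesLookup() {}

    /** Demo bundle; the rule strings are placeholders, not real locale data. */
    static final class DemoBundle extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"Collator-1", "< a < b < c"},
                {"Collator-2", "< a < b < c < d"},
                {"Collator-Latest", "Collator-2"},
            };
        }
    }

    /** Returns the rules for the given data version, following Collator-Latest if asked. */
    static String rulesFor(ResourceBundle bundle, int dataVersion) {
        try {
            String key = (dataVersion == LATEST)
                    ? bundle.getString("Collator-Latest")   // indirection to the newest key
                    : "Collator-" + dataVersion;
            return bundle.getString(key);
        } catch (MissingResourceException e) {
            // Mirrors the proposal: an unsupported version is an illegal argument.
            throw new IllegalArgumentException("unsupported data version: " + dataVersion, e);
        }
    }
}
```

Because old data stays under its old key, a caller that pinned data version 1 keeps getting the same rules even after "Collator-Latest" moves on.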
The version number is ((dataVersion << 16) | codeVersion), although its
composition is private. When getInstance is called with no version number,
the latest code version and latest data version are used. When it is called
with a specific version (presumably one that was derived and stored
earlier), then those specific data and code versions are used. The data
version is used to access the right resource. The code version is stored in
the new collator, and used to switch processing as described above.
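For illustration, the packing of the two versions into one int might look like the sketch below. The class and method names are hypothetical, and the range check is an added assumption: since the fields are declared as short, the sketch rejects anything outside 0..Short.MAX_VALUE so that sign extension cannot corrupt the packed value.

```java
/** Hypothetical helper showing the ((dataVersion << 16) | codeVersion) layout. */
final class CollationVersion {
    private CollationVersion() {}

    /** Packs a data version and a code version into a single int. */
    static int pack(int dataVersion, int codeVersion) {
        if (dataVersion < 0 || dataVersion > Short.MAX_VALUE
                || codeVersion < 0 || codeVersion > Short.MAX_VALUE) {
            throw new IllegalArgumentException("version component out of range");
        }
        return (dataVersion << 16) | codeVersion;
    }

    /** Extracts the data version (upper 16 bits). */
    static int dataVersion(int packed) {
        return packed >>> 16;
    }

    /** Extracts the code version (lower 16 bits). */
    static int codeVersion(int packed) {
        return packed & 0xFFFF;
    }
}
```

Callers would treat the packed int as opaque, since the text says its composition is private; only the collator implementation would unpack it.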
The getAvailableVersions call returns the available combinations of
resource versions and code versions. The code versions will be zero through
latestCodeVersion. The data versions are those available for the given
Locale's ResourceBundle. These are combined together to get all the
versions. That is, if there are 3 resource versions and 2 code versions, it
would return (hex) [00000000, 00000001, 00010000, 00010001, 00020000,
00020001].
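A minimal sketch of that combination step, assuming the data versions for the Locale have already been discovered and are passed in as an ascending array (the class name AvailableVersions and the method combine are hypothetical):

```java
/** Hypothetical sketch of how getAvailableVersions could combine versions. */
final class AvailableVersions {
    private AvailableVersions() {}

    /**
     * Combines every data version with every code version 0..latestCodeVersion,
     * packing each pair as ((dataVersion << 16) | codeVersion).
     */
    static int[] combine(int[] dataVersions, int latestCodeVersion) {
        int codeCount = latestCodeVersion + 1;
        int[] result = new int[dataVersions.length * codeCount];
        int i = 0;
        for (int data : dataVersions) {
            for (int code = 0; code <= latestCodeVersion; code++) {
                result[i++] = (data << 16) | code;
            }
        }
        return result;
    }
}
```

With data versions {0, 1, 2} and latestCodeVersion 1, this yields the six values listed above, and the last element is the newest data version paired with the newest code version.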
Usage
If someone wants to store sort keys over time, they first tag their
database with the right Locale and latest version number at creation time.
Whenever they generate new sort keys, they pass the Locale and version
number in to getInstance. All of their sort keys will remain comparable. If
they ever re-index the entire database field, then they can use the latest
version to get the best results and bug fixes.
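The usage pattern might look like the following sketch. The proposed Collator.getInstance(Locale, int) does not exist in the JDK, so the example uses a hypothetical stand-in that supports only version 0 and delegates to the existing Collator.getInstance(Locale); the class name and the stored tag values are likewise assumptions.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

/** Hypothetical sketch of pinning a database to one Locale and version. */
final class PinnedCollationDemo {
    private PinnedCollationDemo() {}

    /** Stand-in for the proposed Collator.getInstance(Locale, int); supports version 0 only. */
    static Collator getInstance(Locale locale, int version) {
        if (version != 0) {
            throw new IllegalArgumentException("unsupported collation version: " + version);
        }
        return Collator.getInstance(locale);
    }

    /** Builds a sort key using the Locale and version the database was tagged with. */
    static CollationKey keyFor(Locale taggedLocale, int taggedVersion, String text) {
        return getInstance(taggedLocale, taggedVersion).getCollationKey(text);
    }

    public static void main(String[] args) {
        // Recorded once, when the database is created.
        Locale taggedLocale = Locale.ENGLISH;
        int taggedVersion = 0;

        // Keys generated at different times stay comparable because the
        // same Locale and version are passed in every time.
        CollationKey earlier = keyFor(taggedLocale, taggedVersion, "apple");
        CollationKey later = keyFor(taggedLocale, taggedVersion, "banana");
        System.out.println(earlier.compareTo(later) < 0);
    }
}
```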
Issues
People can also get into trouble if they compare sort keys built with
different rules. If we wanted to remedy that, we could prefix all sort keys
with a hashCode, as discussed in:
http://www.unicode.org/unicode/reports/tr10/tr10-2d3#Catching Mismatches
Upside: The CollationKey objects could throw exceptions when comparing keys
built by different rules, if we wanted. Byte sort keys would at least not
sort randomly; all of one language would come before all of another.
Downside: Sort keys are 4 bytes longer. Also, the method is not fail-safe
(although the chances of collisions are infinitesimal).
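A rough sketch of the prefixing idea, under stated assumptions: the tag is taken to be String.hashCode() of the rule string, written big-endian ahead of the key bytes, and the class TaggedSortKey with its methods is hypothetical. The real hash and layout would be a design decision.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: sort keys prefixed with a 4-byte hash of their rules. */
final class TaggedSortKey {
    private static final int TAG_LENGTH = 4;

    private TaggedSortKey() {}

    /** Prepends the hash of the rules (assumed: String.hashCode) to the key bytes. */
    static byte[] tag(String rules, byte[] sortKey) {
        return ByteBuffer.allocate(TAG_LENGTH + sortKey.length)
                .putInt(rules.hashCode())   // big-endian by default
                .put(sortKey)
                .array();
    }

    /** Returns true if both tagged keys carry the same 4-byte rule hash. */
    static boolean sameRules(byte[] a, byte[] b) {
        if (a.length < TAG_LENGTH || b.length < TAG_LENGTH) {
            throw new IllegalArgumentException("key too short to carry a tag");
        }
        for (int i = 0; i < TAG_LENGTH; i++) {
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }

    /** Compares two tagged keys bytewise (unsigned); rejects mismatched rules. */
    static int compare(byte[] a, byte[] b) {
        if (!sameRules(a, b)) {
            throw new IllegalArgumentException("sort keys built with different rules");
        }
        int n = Math.min(a.length, b.length);
        for (int i = TAG_LENGTH; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

A plain bytewise sort that skips the check would still group keys by their prefix, which is the "all of one language before all of another" behavior noted in the upside.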
(Review ID: 99520)
======================================================================
Duplicates: JDK-4209582 [Col] Collation keys are fragile and can't be versioned (Closed)