
Type: Enhancement
Resolution: Won't Fix
Priority: P4
Fix Version/s: None
Affects Version/s: 1.2.0, 1.3.0
Component/s: core-libs
Labels:
- hopper-waive
- licbug

Subcomponent: java.text
Introduced In Build: b01
Introduced In Version: 1.1.7
CPU: generic, x86
OS: generic, windows_nt

Name: clC74495 Date: 02/08/99

It appears that the output of the java.text.CollationKey.toByteArray()
method has changed from 1.1 to 1.2. At least I know this is true for the
US locale.

This is a serious issue for us. We have a pure Java database
implementation that uses these byte arrays as sort keys in the
B-trees that maintain secondary indexes. When this format changes, the
old database files become invalid because the sort keys have changed.
All secondary indexes must be rebuilt, and the database files cannot be
shared between different versions of Java.

The CollationKey is a powerful feature that allows us to support
internationalized secondary index views. If its format changes
frequently, it impacts us and our customers greatly.

Questions for Sun:

1) Why did the format of these CollationKeys change in JDK 1.2?

2) Are the changes specific to a particular locale?

3) Is there a way to generate old format CollationKeys?

4) Does Sun place any importance on applications that need to persist
CollationKeys?

5) Is there a way to detect if the format has changed from one release
to another? If not, can you please, please provide such a mechanism?

Below is a code example showing that the CollationKey format changed
from 1.1 to 1.2: the generated byte array differs between JDK 1.1 and
JDK 1.2.

    // Requires: import java.text.*; import java.util.Locale;
    RuleBasedCollator collator =
        (RuleBasedCollator) Collator.getInstance(new Locale("en", "US", ""));
    collator.setStrength(Collator.TERTIARY);
    collator.setDecomposition(Collator.NO_DECOMPOSITION);

    CollationKey collationKey = collator.getCollationKey("3");
    // The content of this array differs between 1.1 and 1.2.
    byte[] collationKeyArray = collationKey.toByteArray();
(Review ID: 53523)
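
Regarding question 5: in the absence of a built-in mechanism, an application could approximate format-change detection itself. The following is a minimal sketch, not an official mechanism. It hashes the key bytes produced for a few probe strings; the class name, the probe strings, and the use of SHA-256 are all assumptions made for illustration. A matching fingerprint does not prove that every key is unchanged, only that the probes are.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper: not part of the JDK.
final class CollationKeyFingerprint {
    // Arbitrary probe strings chosen for illustration only.
    static final String[] PROBES = {"3", "a", "A", "abc", "\u00e9"};

    /** Returns a hex SHA-256 digest of the key bytes generated for the probes. */
    static String fingerprint(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.NO_DECOMPOSITION);

        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        for (String probe : PROBES) {
            byte[] key = collator.getCollationKey(probe).toByteArray();
            digest.update((byte) key.length); // crude separator between keys
            digest.update(key);
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(fingerprint(Locale.US));
    }
}
```

An application could store this fingerprint in the database header when the indexes are built and recompute it on open; a mismatch would signal that the secondary indexes need rebuilding.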
======================================================================

From 4302264, ###@###.###:

A. Collation Versioning

Problem

Collation sort keys need to have long lives--they can be stored in database
fields and retrieved and used for years to come. However, there are a
number of factors that can cause the generation of sort keys to change over
time, causing mismatches.

1. The code generating the keys changes
2. There are updates to data for a given language.

For #1, the problem is that a number of improvements can be made in the
sort key construction in the future, both in terms of performance and size.
A number of these are discussed in
http://www.unicode.org/unicode/reports/tr10/tr10-2d3. We could probably get
the Java sort key to half its current size, for example. This can make a
tremendous difference to database storage requirements. Saying that the
format is fixed for all time prevents us from making improvements, and from
fixing bugs if we find them.

For #2, we will always be making tweaks to the orderings for different
languages as new data comes in. While the major languages are pretty
stable, the less common ones are not as well attested--plus linguistic
standards change subtly over time--look at the recent spelling reform in
German.

There is a strong, legitimate need for stability of sort keys across
versions. Here is a proposal for handling this.

Proposal

A. Add API to Collator:

/** Returns the version of the Collator's key format */

int getVersion(Locale desiredLocale);

/** Gets the desired version of the collation key format. If that version
is not supported by the current collator, an IllegalArgumentException is
thrown. */

Collator getInstance(Locale desiredLocale, int version);

/** Returns a list of all the supported collation key formats. The last
item is the latest version supported. */

static int[] getAvailableVersions(Locale desiredLocale);

B. Add new rules to the rule interpreter:

"{version <nnn>}" // sets the version of the collation data.

C. Add new keys to the Locale resource bundles, allowing for multiple sets
of collation rules

{"Collator-1", <collation rules>},
{"Collator-2", <collation rules>},
{"Collator-Latest", "Collator-2"},

Implementation

Add to RuleBasedCollator two new private fields:

short codeVersion;
short dataVersion;

Add to Collator a new field:

private static final short latestCodeVersion;

Any time the RuleBasedCollator code is changed so that it would affect the
sortkey format (which should be very rarely!), then the old code is
retained. The codeVersion is used to switch between the new and old code.
latestCodeVersion is set to reflect the new latest version. The zero
versions for each are the current 1.2 code.

Any time the data is changed, the old data is retained under the old
resource key. The new data gets a new version, and is entered with a new
key. The "Collator-Latest" key points to the key for the latest one.
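
The lookup described here might be sketched as follows. This is only an illustration under stated assumptions: a plain Map stands in for the Locale resource bundle, the class and method names are hypothetical, and the rule strings are placeholders rather than real collation data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the Locale resource bundle.
final class CollationRuleLookup {
    /**
     * Returns the collation rules stored under the given key. If the stored
     * value is itself the name of another key (as with "Collator-Latest"),
     * that one level of indirection is followed.
     */
    static String resolveRules(Map<String, String> bundle, String key) {
        String value = bundle.get(key);
        if (value == null) {
            throw new IllegalArgumentException("No collation data for " + key);
        }
        String target = bundle.get(value);
        return target != null ? target : value;
    }

    public static void main(String[] args) {
        Map<String, String> bundle = new HashMap<>();
        bundle.put("Collator-1", "< a < b < c");       // placeholder rules
        bundle.put("Collator-2", "< a < c < b");       // placeholder rules
        bundle.put("Collator-Latest", "Collator-2");   // alias to the newest data
        System.out.println(resolveRules(bundle, "Collator-Latest"));
    }
}
```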

The version number is ((dataVersion << 16) | codeVersion), although its
composition is private. When getInstance is called with no version number,
the latest code version and latest data version are used. When it is called
with a specific version (presumably one that was derived and stored
earlier), then those specific data and code versions are used. The data
version is used to access the right resource. The code version is stored in
the new collator, and used to switch processing as described above.
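
The packing of the two short fields into one version number can be sketched as below. The class and method names are hypothetical. One detail worth noting for a Java implementation: a negative short is sign-extended when promoted to int, so the sketch masks both fields before combining them.

```java
// Hypothetical helper illustrating ((dataVersion << 16) | codeVersion).
final class CollationVersion {
    /** Combines the data and code versions into a single int. */
    static int pack(short dataVersion, short codeVersion) {
        // Mask to 16 bits so sign extension cannot corrupt the other field.
        return ((dataVersion & 0xFFFF) << 16) | (codeVersion & 0xFFFF);
    }

    /** Extracts the data version from the upper 16 bits. */
    static short dataVersion(int version) {
        return (short) (version >>> 16);
    }

    /** Extracts the code version from the lower 16 bits. */
    static short codeVersion(int version) {
        return (short) version;
    }

    public static void main(String[] args) {
        int version = pack((short) 2, (short) 1);
        System.out.println(String.format("%08x", version));
    }
}
```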

The getAvailableVersions call returns the available combinations of
resource versions and code versions. The code versions will be zero through
latestCodeVersion. The data versions are those available for the given
Locale's ResourceBundle. These are combined together to get all the
versions. That is, if there are 3 resource versions and 2 code versions, it
would return (hex) [00000000, 00000001, 00010000, 00010001, 00020000,
00020001]
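
The combination step can be illustrated with a short sketch. It assumes, as in the example above, that data versions and code versions are each numbered consecutively from zero; the class and method names are hypothetical.

```java
// Hypothetical sketch of how getAvailableVersions could combine versions.
final class AvailableVersions {
    /**
     * Returns every (dataVersion, codeVersion) pair packed as
     * (dataVersion << 16) | codeVersion, ordered by data version and then
     * by code version, so the last element is the latest combination.
     */
    static int[] combine(int dataVersionCount, int codeVersionCount) {
        int[] versions = new int[dataVersionCount * codeVersionCount];
        int i = 0;
        for (int data = 0; data < dataVersionCount; data++) {
            for (int code = 0; code < codeVersionCount; code++) {
                versions[i++] = (data << 16) | code;
            }
        }
        return versions;
    }

    public static void main(String[] args) {
        // 3 resource versions and 2 code versions, as in the example above.
        for (int version : combine(3, 2)) {
            System.out.println(String.format("%08x", version));
        }
    }
}
```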

Usage

If someone wants to store sort keys over time, they first tag their
database with the right Locale and latest version number at creation time.
Whenever they generate new sort keys, they pass the Locale and version
number in to getInstance. All of their sort keys will remain comparable. If
they ever re-index the entire database field, then they can use the latest
version to get the best results and bug fixes.
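
A rough sketch of this usage pattern follows. The proposed methods do not exist in java.text.Collator, so the VersionedCollators interface below is a hypothetical stand-in for them, and SingleVersionStub is a placeholder implementation that knows only version 0 and delegates to the real Collator.getInstance(Locale).

```java
import java.text.Collator;
import java.util.Locale;

final class VersionedIndexSketch {
    /** Hypothetical stand-in for the proposed Collator methods; not a JDK API. */
    interface VersionedCollators {
        int[] getAvailableVersions(Locale locale);
        Collator getInstance(Locale locale, int version);
    }

    /** Placeholder implementation that supports only version 0. */
    static final class SingleVersionStub implements VersionedCollators {
        public int[] getAvailableVersions(Locale locale) {
            return new int[] {0};
        }
        public Collator getInstance(Locale locale, int version) {
            if (version != 0) {
                throw new IllegalArgumentException("Unsupported version " + version);
            }
            return Collator.getInstance(locale);
        }
    }

    /** Metadata the application persists alongside its index. */
    static final class IndexHeader {
        final Locale locale;
        final int version;
        IndexHeader(Locale locale, int version) {
            this.locale = locale;
            this.version = version;
        }
    }

    /** At creation time, tag the index with the locale and the latest version. */
    static IndexHeader createIndex(VersionedCollators collators, Locale locale) {
        int[] versions = collators.getAvailableVersions(locale);
        return new IndexHeader(locale, versions[versions.length - 1]);
    }

    /** Later, always generate keys with the locale and version stored in the header. */
    static byte[] sortKey(VersionedCollators collators, IndexHeader header, String text) {
        Collator collator = collators.getInstance(header.locale, header.version);
        return collator.getCollationKey(text).toByteArray();
    }
}
```

Re-indexing would amount to calling createIndex again and regenerating every key with the new header.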

Issues

People can also get into trouble if they compare sort keys built with
different rules. If we wanted to remedy that, we could prefix all sort keys
with a hashCode, as discussed in:

http://www.unicode.org/unicode/reports/tr10/tr10-2d3#Catching Mismatches

Upside: The CollationKey objects could throw exceptions when comparing keys
built by different rules, if we wanted. Byte sort keys would at least not
sort randomly; all of one language would come before all of another.

Downside: Sort keys are 4 bytes longer. Also, the method is not fail-safe
(although the chances of collisions are infinitesimal).
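
The hash-prefix idea could look roughly like the sketch below. It is an illustration only: the class name is hypothetical, and using String.hashCode() of RuleBasedCollator.getRules() as the 4-byte tag is an assumption, not something the proposal specifies.

```java
import java.text.RuleBasedCollator;

// Hypothetical sketch of sort keys prefixed with a 4-byte hash of the rules.
final class TaggedSortKey {
    /** Prefixes the key bytes with the given hash, most significant byte first. */
    static byte[] tag(int rulesHash, byte[] key) {
        byte[] tagged = new byte[4 + key.length];
        tagged[0] = (byte) (rulesHash >>> 24);
        tagged[1] = (byte) (rulesHash >>> 16);
        tagged[2] = (byte) (rulesHash >>> 8);
        tagged[3] = (byte) rulesHash;
        System.arraycopy(key, 0, tagged, 4, key.length);
        return tagged;
    }

    /** Builds a tagged key; the tag here is the hash of the collator's rules. */
    static byte[] tag(RuleBasedCollator collator, String text) {
        byte[] key = collator.getCollationKey(text).toByteArray();
        return tag(collator.getRules().hashCode(), key);
    }

    /** True if both keys carry the same 4-byte prefix. */
    static boolean sameRules(byte[] a, byte[] b) {
        if (a.length < 4 || b.length < 4) {
            return false;
        }
        for (int i = 0; i < 4; i++) {
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }

    /** Unsigned bytewise comparison that rejects keys built with different rules. */
    static int compare(byte[] a, byte[] b) {
        if (!sameRules(a, b)) {
            throw new IllegalArgumentException("Sort keys were built with different rules");
        }
        int n = Math.min(a.length, b.length);
        for (int i = 4; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

If the check in compare is skipped and tagged keys are compared as raw bytes, keys sharing a prefix still group together, which is the "all of one language before all of another" behavior described above.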

======================================================================
###@###.### 11/2/04 18:21 GMT

duplicates

JDK-4302264 Collation keys are fragile and can't be versioned

Closed
