Loading...

XML

Word

Printable

Type: Bug
Resolution: Fixed
Priority: P3
Fix Version/s: 7
Affects Version/s: 7
Component/s: core-libs
Labels:
- martin

Subcomponent:
java.nio.charsets
Resolved In Build:
b43
CPU:

generic
OS:

generic
Verification:
Not verified

Encoding is a tougher problem than decoding for single byte coders,
because decoding is easily implemented using a mapping table of size 256,
but for encoders it is generally unreasonable to use a table of size 64k.
Nevertheless, for most coders, it is possible to write table access code
that uses a small amount of storage (< 4kb), and uses only one read access
to this table for all (or at least most) characters to decode.

If the range of the mapping is small (e.g. ISO-8859-2) then we can
throw away the trailing unmapped part of the table, and compute an index
as follows:

int dl = dp + Math.min(src.remaining(),
                                   dst.remaining());
            while (dp < dl) {
                char c = sa[sp];
                int i = c & ((c - charToByteLen) >> 31);
                byte b = charToByte[i];
                if (b != 0 || c == 0) {
                    da[dp++] = b;
                    sp++;
                } else
                    return encodeArrayError(src, sp, dst, dp);
            }

For ISO-8859-2 this gives about a factor of 3 speedup for encoding long sequences.

If the characters fall into two widely separated ranges,
0 - hi1, lo2 - hi2,
we can have a table
containing the concatenation of the two ranges,
and compute the index thus:

int i = (c & ((c - hi1) >> 31)) + (c & ((c - lo2) ^ (c - hi2)) >> 31);

This should win because computation is becoming free, relative to memory access.

This technique can be extended to more ranges as well, depending on the coder.
For many coders, special values or ranges can be ignored on the first try.
Only if we get ZERO (normally unmappable character) do we check the rare ranges.
This is the way the pesky Euro character should be handled.
We can make the Euro 5 times as slow to make all other characters twice as fast.

The mapping table should be computed at runtime from the Decoder's table,
and stored in a byte array, not a wasteful char array, as is the current practice.
Here's the one for the easy case ISO-8859-2 (a single range starting at 0)

        private static final byte[] charToByte;
        private static final int charToByteLen;
        static {
            char[] byteToChar = Decoder.byteToCharTable.toCharArray();
            char maxChar = 0;
            for (char c : byteToChar)
                if (maxChar < c)
                    maxChar = c;
            charToByte = new byte[maxChar+1];
            for (int i = 0; i < byteToChar.length; i++)
                charToByte[byteToChar[i]] = (byte) (i - 0x80);
            charToByteLen = charToByte.length;
        }

Assignee:: Xueming Shen

Reporter:: Martin Buchholz

Votes:: 0 Vote for this issue

Watchers:: 0 Start watching this issue

Created:: 2007-12-14 18:11

Updated:: 2011-05-17 19:42

Resolved:: 2011-05-17 19:42

Imported:: 15/Sep/12 1:24 PM

Indexed:: 17/Jul/12 10:56 AM

Details

Description

Attachments

Activity

People

Dates