Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-6931812

A better implementation of sun.nio.cs.Surrogate.isBMP(int)

XMLWordPrintable

    • Icon: Enhancement Enhancement
    • Resolution: Fixed
    • Icon: P4 P4
    • 7
    • 7
    • core-libs
    • None

      ###@###.### is suggesting a better sun.nio.cs.Surrogate.isBMP(int) implementation, see blow for details.

      Am 03.03.2010 09:00, schrieb Martin Buchholz:
      > On Tue, Mar 2, 2010 at 15:34, Ulf Zibis<###@###.###> wrote:
      >
      >> Am 26.08.2009 20:02, schrieb Xueming Shen:
      >>
      >>> For example, the isBMP(int), it might be convenient, but it can be easily
      >>> archived by the one line code
      >>>
      >>> (int)(char)codePoint == codePoint;
      >>>
      >>> or more readable form
      >>>
      >>> codePoint< Character.MIN_SUPPLEMENTARY_COE_POINT;
      >>>
      >>>
      >> In class sun.nio.cs.Surrogate we have:
      >> public static boolean isBMP(int uc) {
      >> return (int) (char) uc == uc;
      >> }
      >>
      >> 1.) It's enough to have:
      >> return (char)uc == uc;
      >> better:
      >> assert MIN_VALUE == 0&& MAX_VALUE == 0xFFFF;
      >> return (char)uc == uc;
      >> // Optimized form of: uc>= MIN_VALUE&& uc<= MAX_VALUE
      >>
      >> 2.) Above code is compiled to (needs 16 bytes of machine code):
      >> 0x00b87ad8: mov %ebx,%ebp
      >> 0x00b87ada: and $0xffff,%ebp
      >> 0x00b87ae0: cmp %ebx,%ebp
      >> 0x00b87ae2: jne 0x00b87c52
      >> 0x00b87ae8:
      >>
      >> We could code:
      >> assert MIN_VALUE == 0&& (MAX_VALUE + 1) == (1<< 16);
      >> return (uc>> 16) == 0;
      >> // Optimized form of: uc>= MIN_VALUE&& uc<= MAX_VALUE
      >>
      > I agree that
      > return (uc>> 16) == 0;
      > is marginally better than my
      > return (int) (char) uc == uc;
      > (although I think the redundant cast to int
      > makes the code more readable).
      >

      Seems to be individual. I always stumble over superfluous casts by thinking about, what they have to do.

      > I approve such a change to isBMPCodePoint()
      > and inclusion of such a method in Character.
      >

      Pleased! Who could file the bug? I would provide the patch.

      >
      >
      >> is compiled to (needs only 9 bytes of machine code):
      >> 0x00b87aac: mov %ebx,%ecx
      >> 0x00b87aae: sar $0x10,%ecx
      >> 0x00b87ab1: test %ecx,%ecx
      >> 0x00b87ab3: je 0x00b87acb
      >> 0x00b87ab5:
      >>
      >> 1.) If we have:
      >> public static boolean isSupplementaryCodePoint(int codePoint) {
      >> assert MIN_SUPPLEMENTARY_CODE_POINT == (1<< 16)&&
      >> (MAX_SUPPLEMENTARY_CODE_POINT + 1) % (1<< 16) == 0;
      >> return (codePoint>> 16) != 0
      >> && (codePoint>> 16)< (MAX_SUPPLEMENTARY_CODE_POINT + 1>> 16);
      >> // Optimized form of: codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
      >> //&& codePoint<= MAX_SUPPLEMENTARY_CODE_POINT;
      >> }
      >>
      > Keep in mind that supplementary characters are extremely rare.
      >

      Yes, but many API's in the JDK are used rarely.
      Why should they waste memory footprint / perform bad, particularly if it doesn't cost anything.

      > Therefore the existing implementation
      >
      > return codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
      > && codePoint<= MAX_CODE_POINT;
      >
      > will almost always perform just one comparison against a constant,
      > which is hard to beat.
      >

      1. Wondering: I think there are TWO comparisons.
      2. Those comparisons need to load 32 bit values from machine code, against only 8 bit values in my case.
      3. The first of the 2 comparisons becomes outlined if compiled in combination with isBMPCodePoint(). (see below)


      > I'm not sure whether your code above gets the right answer for negative input.
      > Perhaps you need to do
      > (codePoint>>> 16)< ...
      >

      Oops, I'm afraid you are right, but fortunately it doesn't cost anything: sar becomes replaced by shr.

      > Martin
      >
      >
      >> and:
      >> if (Surrogate.isBMP(uc))
      >> ...;
      >> else if (Character.isSupplementaryCodePoint(uc))
      >> ...;
      >> else
      >> ...;
      >>
      >> we get (needs only 18 bytes of machine code):
      >> 0x00b87aac: mov %ebx,%ecx
      >> 0x00b87aae: sar $0x10,%ecx
      >> 0x00b87ab1: test %ecx,%ecx
      >> 0x00b87ab3: je 0x00b87acb
      >> 0x00b87ab5: cmp $0x11,%ecx
      >> 0x00b87ab8: jge 0x00b87ce6
      >> 0x00b87abe:
      >>
      >>

      BTW the compiled code of the existing code (needs *36* bytes of machine code):
        0x00b87a5c: test %ebp,%ebp
        0x00b87a5e: jl 0x00b87a68
        0x00b87a60: cmp $0x10000,%ebp
        0x00b87a66: jl 0x00b87a8d
        0x00b87a68: cmp $0x10000,%ebp
        0x00b87a6e: jl 0x00b87c63
        0x00b87a74: cmp $0x10ffff,%ebp
        0x00b87a7a: jg 0x00b87c63
        0x00b87a80:

      BTW 2: The code example is seen in one of the String constructors, where there the 2-comparison code is manually inlined instead of using Surrogate.isBMP().

      -Ulf

            sherman Xueming Shen
            sherman Xueming Shen
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

              Created:
              Updated:
              Resolved:
              Imported:
              Indexed: