JDK-4867170

Pattern doesn't work with composite character in CANON_EQ mode


    • b119
    • generic, x86
    • generic, windows
    • Verified

        (1) A pattern consisting only of a character class of composite
            characters throws an exception in CANON_EQ mode. The example
            below reproduces the problem.

        import java.util.regex.*;

        public class RegTest {

            public static void main(String args[]) {

                CharSequence inputStr = "ab\u1f82cd";
                String patternStr = "[\u1f80\u1f82]";

                Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
                Matcher matcher = pattern.matcher(inputStr);
                boolean matchFound = matcher.find();

                if (matchFound) {
                    System.out.println("<" + Integer.toString(matcher.start())
                            + ","
                            + Integer.toString(matcher.end())
                            + "> ");
                }
            }
        }

        (2) Replacing the pattern with
            String patternStr = "\u1f80\u1f82";
            also throws an exception.
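
        Cases (1) and (2) can be checked side by side with a small probe such
        as the one below. This is only a sketch: the class name CanonEqProbe
        and the helper tryPattern are made up for illustration, and because
        the report does not say which exception is thrown, the helper catches
        RuntimeException in general. On a JDK where the bug is fixed, no
        exception is expected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CanonEqProbe {

    // Hypothetical helper: compiles the pattern with CANON_EQ, searches the
    // input, and reports a match, no match, or the exception that was thrown.
    static String tryPattern(String patternStr, CharSequence input) {
        try {
            Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
            Matcher matcher = pattern.matcher(input);
            if (matcher.find()) {
                return "<" + matcher.start() + "," + matcher.end() + ">";
            }
            return "No Match";
        } catch (RuntimeException e) {
            // Assumption: the failure surfaces as an unchecked exception.
            return "Exception: " + e;
        }
    }

    public static void main(String[] args) {
        // Case (1): character class made of composite characters only.
        System.out.println(tryPattern("[\u1f80\u1f82]", "ab\u1f82cd"));
        // Case (2): the same two characters as a plain sequence.
        System.out.println(tryPattern("\u1f80\u1f82", "ab\u1f82cd"));
    }
}
```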


        (3) The pattern "[\u1f80-\u1f82]" does not match the input string
            "ab\u1f81cd" in CANON_EQ mode, although it does match the
            characters \u1f80 and \u1f82. The range needs to be iterated
            character by character, listing the "EquivalentAlternation" of
            each one, in CANON_EQ mode.

        import java.util.regex.*;
        public class RegTest {
            public static void main(String args[]) {

                CharSequence inputStr = "ab\u1f81cd";
                String patternStr = "[\u1f80-\u1f82]";

                Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
                Matcher matcher = pattern.matcher(inputStr);
                boolean matchFound = matcher.find();

                if (matchFound) {
                    System.out.println("<" + Integer.toString(matcher.start())
                            + ","
                            + Integer.toString(matcher.end())
                            + "> ");
                } else {
                    System.out.println("No Match");
                }

            }
        }
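
        The idea in (3), iterating over the range and listing equivalents for
        every character, could be sketched roughly as follows. This is not the
        code in java.util.regex.Pattern: the class RangeExpansionSketch and
        its methods are hypothetical, and java.text.Normalizer (NFD and NFC
        forms only) stands in for the full set of canonical equivalents, so
        partially composed forms such as 0x1f00 0x345 are not covered.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

class RangeExpansionSketch {

    // Expands the range [first-last] into a non-capturing alternation that
    // lists, for each code point, the code point itself plus its NFD and NFC
    // forms. Assumption: NFD/NFC are enough for illustration purposes.
    static String expandRange(int first, int last) {
        Set<String> alternatives = new LinkedHashSet<>();
        for (int cp = first; cp <= last; cp++) {
            String s = new String(Character.toChars(cp));
            alternatives.add(s);
            alternatives.add(Normalizer.normalize(s, Normalizer.Form.NFD));
            alternatives.add(Normalizer.normalize(s, Normalizer.Form.NFC));
        }
        StringBuilder sb = new StringBuilder("(?:");
        boolean firstAlternative = true;
        for (String alternative : alternatives) {
            if (!firstAlternative) {
                sb.append('|');
            }
            sb.append(Pattern.quote(alternative));
            firstAlternative = false;
        }
        return sb.append(')').toString();
    }

    // Searches the input for any character of the expanded range.
    // CANON_EQ is deliberately not used; the alternation does the work.
    static boolean findsInRange(int first, int last, CharSequence input) {
        return Pattern.compile(expandRange(first, last)).matcher(input).find();
    }

    public static void main(String[] args) {
        // Precomposed input from the report.
        System.out.println(findsInRange(0x1f80, 0x1f82, "ab\u1f81cd"));
        // The same input in fully decomposed (NFD) form.
        String decomposed = Normalizer.normalize("ab\u1f81cd", Normalizer.Form.NFD);
        System.out.println(findsInRange(0x1f80, 0x1f82, decomposed));
    }
}
```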

        (4) Though not critical, produceEquivalentAlternation() appears to
            create redundant patterns when dealing with multiple combining
            characters in CANON_EQ mode.

           for example

           pattern "\u1f80" will create
         (?: 0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80 | 0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80)

           and "\u1f82" will create
        (?: 0x3b1 0x313 0x300 0x345 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x3b1 0x313 0x345 0x300 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x3b1 0x345 0x313 0x300 | 0x1fb3 0x313 0x300 | 0x1f80 0x300 | 0x1f82)

           # Spaces have been added between the hexadecimal numbers.
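
        One way to avoid the duplicates in (4) would be to collect the
        alternatives in an insertion-ordered set before joining them. The
        sketch below only illustrates that idea on the six alternatives listed
        above for "\u1f80"; the class DedupAlternativesSketch and its methods
        are hypothetical and are not part of java.util.regex.Pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class DedupAlternativesSketch {

    // The six alternatives the report lists for the pattern "\u1f80",
    // including the repeated 0x1f80 entry.
    static List<String> reportAlternativesFor1f80() {
        return Arrays.asList(
                "\u03b1\u0313\u0345",
                "\u1f00\u0345",
                "\u1f80",
                "\u03b1\u0345\u0313",
                "\u1fb3\u0313",
                "\u1f80");
    }

    // Removes duplicates while keeping the first occurrence of each
    // alternative in its original position.
    static List<String> dedupe(List<String> alternatives) {
        return new ArrayList<>(new LinkedHashSet<>(alternatives));
    }

    // Joins the unique alternatives into a non-capturing group.
    static String buildAlternation(List<String> alternatives) {
        return "(?:" + String.join("|", dedupe(alternatives)) + ")";
    }

    public static void main(String[] args) {
        List<String> all = reportAlternativesFor1f80();
        System.out.println(all.size() + " alternatives before, "
                + dedupe(all).size() + " after");
    }
}
```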

              sherman Xueming Shen