Loading...

Type: Bug
Resolution: Fixed
Priority: P3
Fix Version/s: 15
Affects Version/s: 12, 13, 15
Component/s: core-libs
Labels:
- 15ea
- dcsaw
- regression
- reproducer-yes
- webbug

Subcomponent:
java.util.regex
Introduced In Build:
b24
Introduced In Version:

12
Resolved In Build:
b20
Verification:
Not verified

ADDITIONAL SYSTEM INFORMATION :
System/OS/runtime-independent

A DESCRIPTION OF THE PROBLEM :
the more advanced the current position in a string to be matched with grapheme regexes \X the longer it takes to find the next match. For longer strings performance quickly becomes unbearable.

The algorithm is overly complex (O(number of characters in input string) instead of O(1) per match) because Grapheme.nextBoundary always starts at the beginning of the whole input string in order to correctly identify pairs of regional indicator symbols.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Find all matches of \X in a not too short string and measure the time.
Find all matches of \X in a n times longer string than the first one and again measure the time.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
n times longer string results in n times longer processing time.
ACTUAL -
takes longer than that. much longer

---------- BEGIN SOURCE ----------
/*
* Copyright (c) 2020, Oracle and/or its affiliates. All rights reserved.
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
*
* This code is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 only, as
* published by the Free Software Foundation.
*
* This code is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* version 2 for more details (a copy is included in the LICENSE file that
* accompanied this code).
*
* You should have received a copy of the GNU General Public License version
* 2 along with this work; if not, write to the Free Software Foundation,
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*
* Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
* or visit www.oracle.com if you need additional information or have any
* questions.
*/

import static java.lang.Math.sqrt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.testng.annotations.*;
import static org.testng.Assert.*;

/**
* @test
* @bug 6202130
* @library ..
* @run testng/manual GraphemeTimePerCharTest
* @summary Measures {@link java.util.regex.Grapheme#nextBoundary} performance
* depending on input character sequence size.
*/
public class GraphemeTimePerCharTest {

    static final int STEPS = 10;
    static final int LIMIT = 1000000; // maximum string size tested

    static final Pattern CHARACTER_REGEX = Pattern.compile("\\X");

    void matchPerformance(String value) {
        Matcher charMatcher = CHARACTER_REGEX.matcher(value);
        while (charMatcher.find()) charMatcher.group();
    }

    @Test
    public void test() throws Exception {
        matchPerformance("warmup");

        double[][] strTimesByNChars = new double[STEPS][];
        double[][] chrTimesByNChars = new double[STEPS][];
        System.out.println("number of chars, time string [s], time per char [ns]");
        for (int i = 0; i < STEPS; i++) {
            int chunkSize = (i + 1) * LIMIT / STEPS;
            String value = "x".repeat(chunkSize);
            long start = System.nanoTime();
            matchPerformance(value);
            long end = System.nanoTime();
            long t = end - start;
            strTimesByNChars[i] = new double[] { chunkSize, t };
            double timePerChar = (double) t / (double) chunkSize;
            chrTimesByNChars[i] = new double[] { chunkSize, timePerChar };
            System.out.printf("%7d %6d %10.3f%n", chunkSize, (t / 1000000), timePerChar);
        }
        double cStr = correlate(strTimesByNChars);
        double cChr = correlate(chrTimesByNChars);
        System.out.printf("correlation factor between string size and total time (expected 1): %10.3f%n", cStr);
        System.out.printf("correlation factor between string size and time per character (expected 0): %10.3f%n", cChr);
        assertTrue(cStr > .5 && cChr < .5);
    }

    double correlate(double[][] data) {
        double sumx = 0, sumy = 0;
        for (int i = 0; i < data.length; i++) {
            sumx += data[i][0];
            sumy += data[i][1];
        }
        double avgx = sumx / data.length;
        double avgy = sumy / data.length;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < data.length; i++) {
            sxy += (data[i][0] - avgx) * (data[i][1] - avgy);
            sxx += (data[i][0] - avgx) * (data[i][0] - avgx);
            syy += (data[i][1] - avgy) * (data[i][1] - avgy);
        }
        return sxy / sqrt(sxx * syy);
    }

}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
No generally applicable workaround is known to me. For matching all \X after one another through all of a string, BreakIterator might do.

FREQUENCY : always

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

BreakIteratorCharacterInstanceTime.java
2020-03-15 22:22
0.6 kB
Andrew Wang
BreakIteratorCharacterInstanceTimePerChar.java
2020-03-15 22:22
2 kB
Andrew Wang
GraphemeTime.java
2020-03-15 22:22
0.6 kB
Andrew Wang
GraphemeTimePerChar.java
2020-03-15 22:22
2 kB
Andrew Wang

Details

Description

Attachments

Attachments

Activity

People

Dates