JDK-8258259: Unicode linebreak matching behavior is incorrect; backout JDK-8235812


    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • Fix Version/s: 16
    • Affects Version/s: 15, 16
    • Component/s: core-libs
    • None

        Bug JDK-8235812 changed how the Unicode linebreak pattern, \R, is matched. This change should be reverted.

        The problem stated in that bug report was that the pattern \R{2} did not match the string "\r\n". The fix changed the behavior so that the match succeeds. This *seemed* the correct thing to do, as the Pattern class spec has a definition for \R which is essentially

        -----
        \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
        -----

        and the behavior after the change conforms to that definition.
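
        As a rough illustration, the sketch below expands \R by hand into the definition quoted above and applies {2} to it. It does not use \R itself, so the result does not depend on which JDK build runs it. The class name SpecLinebreakDemo and the helper matchesTwice are hypothetical, chosen only for this example. The escapes are written with doubled backslashes so that the regex engine, rather than the Java compiler, interprets them.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class SpecLinebreakDemo {
    // The Pattern spec's definition of the linebreak matcher, expanded by hand.
    // Backslashes are doubled so the regex engine sees the escapes.
    static final String SPEC_DEF =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // True if the whole input is exactly two consecutive matches of SPEC_DEF.
    static boolean matchesTwice(String input) {
        return Pattern.matches("(?:" + SPEC_DEF + "){2}", input);
    }

    public static void main(String[] args) {
        // The first alternative consumes CR LF as one unit, the second repetition
        // then fails, and the engine backtracks to match CR and LF separately
        // through the character class.
        System.out.println(matchesTwice("\r\n")); // true
        // A single LF cannot satisfy two repetitions.
        System.out.println(matchesTwice("\n"));   // false
    }
}
```

        The match on "\r\n" succeeds only because backtracking lets the character class take the \r on its own.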

        The problem is that this definition of \R differs from the recommendation in TR18, which is

        -----
        (?:\u000D\u000A)|(?!\u000D\u000A)[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
        -----

        (Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)

        The salient difference is the negative lookahead (?!\u000D\u000A), which prevents the character-class alternative from matching a \r that is immediately followed by \n. Thus, under the TR18 recommendation the pattern \R{2} does NOT match the string "\r\n". Indeed, PCRE has this behavior.
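
        To make the difference concrete, here is a minimal sketch that applies {2} to both hand-expanded definitions over a few inputs. It is an approximation built from the two patterns quoted in this report, not the JDK's internal implementation of \R. The class name LinebreakCompareDemo and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two definitions quoted in this report.
class LinebreakCompareDemo {
    // Definition from the Pattern class spec (no lookahead).
    static final String SPEC_DEF =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // TR18-style definition: the negative lookahead stops the character
    // class from matching a CR that starts a CR LF pair.
    static final String TR18_DEF =
        "(?:\\u000D\\u000A)|(?!\\u000D\\u000A)[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // True if the whole input is exactly two consecutive matches of the definition.
    static boolean matchesTwice(String definition, String input) {
        return Pattern.matches("(?:" + definition + "){2}", input);
    }

    // Render CR and LF visibly for printing.
    static String show(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String[] inputs = { "\r\n", "\n\n", "\r\r", "\r\n\r\n" };
        for (String in : inputs) {
            System.out.printf("%-10s spec=%-5b tr18=%b%n",
                show(in), matchesTwice(SPEC_DEF, in), matchesTwice(TR18_DEF, in));
        }
    }
}
```

        Of these inputs, only "\r\n" separates the two definitions: the spec form accepts it as two linebreaks, while the TR18 form treats it as a single indivisible linebreak and rejects it.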

        This bug covers backing out the JDK-8235812 change. Follow-on bug JDK-8258119 covers further changes in this area. In particular, the Pattern spec's definition of \R should be revisited to see whether it should be adjusted to match TR18 more closely.

              Assignee: Stuart Marks (smarks)
              Reporter: Stuart Marks (smarks)