-
Bug
-
Resolution: Unresolved
-
P3
-
15
-
None
Bug JDK-8235812 changed how the Unicode linebreak pattern, \R, is matched. This change will be backed out by JDK-8258259.
The problem stated in JDK-8235812 was that the pattern \R{2} did not match the string "\r\n", and the fix changed the behavior so that the match succeeded. This *seemed* the correct thing to do, as the Pattern class spec has a definition for \R which is essentially
-----
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----
and the behavior after the change conforms to that definition.
The problem is that this definition of the \R pattern doesn't match the recommendation from TR18, which is
-----
(?:\u000D\u000A)|(?!\u000D\u000A)[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----
(Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)
The salient difference is the appearance of a negative lookahead pattern "?!" which causes the pattern not to match a \r if it's immediately followed by \n. Thus, the TR18 recommendation would have the pattern \R{2} NOT match the string "\r\n". Indeed, PCRE has this behavior.
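The difference can be checked without relying on how any particular JDK build implements \R, by compiling the two expressions quoted above explicitly. The following is a minimal sketch, not the Pattern implementation; the class name LinebreakDemo, the constants, and the helper matchesTwice are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

final class LinebreakDemo {
    // Alternation quoted from the Pattern class spec: CR LF, or any single
    // linebreak character. The doubled backslashes keep the escapes intact
    // so that the regex engine, not the Java compiler, interprets them.
    static final String SPEC_STYLE =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // TR18-style expression: the negative lookahead stops the single-character
    // branch from matching a CR that is immediately followed by LF.
    static final String TR18_STYLE =
        "(?:\\u000D\\u000A)|(?!\\u000D\\u000A)"
        + "[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Wraps the given linebreak expression in a non-capturing group, applies
    // the {2} quantifier, and requires the whole input to match.
    static boolean matchesTwice(String linebreak, String input) {
        return Pattern.compile("(?:" + linebreak + "){2}").matcher(input).matches();
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        System.out.println("spec-style: " + matchesTwice(SPEC_STYLE, crlf)); // true
        System.out.println("TR18-style: " + matchesTwice(TR18_STYLE, crlf)); // false
    }
}
```

With the spec-style expression the matcher backtracks and consumes \r and \n as two separate linebreaks, so the quantified match succeeds. With the TR18-style expression the lookahead rejects that split, so "\r\n" counts as a single linebreak and the {2} match fails.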
The Pattern spec's definition of \R should be revisited to see if it should be adjusted to match TR18 more closely. The test cases removed in the backout changeset JDK-8258259 should be revisited. The code changes should also be revisited. It seems odd that the implementation of \R doesn't simply expand to something more-or-less equivalent to the TR18 expression. It may be that there are special cases in the code to handle \R instead of treating it as a "macro" that is expanded to a more complicated sequence. It's not clear which is preferable.
- relates to
-
JDK-8235812 Unicode linebreak with quantifier does not match valid input
-
- Resolved
-
-
JDK-8258259 Unicode linebreak matching behavior is incorrect; backout JDK-8235812
-
- Resolved
-