-
Bug
-
Resolution: Fixed
-
P3
-
15, 16
-
None
-
b30
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8258760 | 17 | Stuart Marks | P3 | Resolved | Fixed | b03 |
JDK-8260176 | 16.0.1 | Stuart Marks | P3 | Resolved | Fixed | b03 |
Bug JDK-8235812 changed the behavior of matching of the Unicode linebreak pattern, \R. This change should be reverted.
The problem stated in that bug report was that the pattern \R{2} did not match the string "\r\n" and the fix changed the behavior so that a match was successful. This *seemed* the correct thing to do, as the Pattern class spec has a definition for \R which is essentially
-----
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----
and the behavior after the change conforms to that definition.
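As a minimal sketch of what conforming to that definition means, the spec's alternation can be written out as an explicit pattern and quantified with {2}. The class name `SpecLinebreakDemo`, the constant `SPEC_R`, and the helper `matchesTwice` are illustrative and not part of the JDK. The non-capturing group is an assumption about how the quantifier in \R{2} is meant to apply, namely to the whole alternation.

```java
import java.util.regex.Pattern;

class SpecLinebreakDemo {
    // Explicit form of the Pattern spec's definition of \R (illustrative constant).
    // The (?: ... ) wrapper makes a trailing quantifier apply to the whole alternation.
    static final String SPEC_R =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the entire input consists of exactly two linebreak matches.
    static boolean matchesTwice(String input) {
        return Pattern.matches(SPEC_R + "{2}", input);
    }

    public static void main(String[] args) {
        // The engine can backtrack from the CRLF branch and match \r and \n separately.
        System.out.println(matchesTwice("\r\n")); // true
        System.out.println(matchesTwice("\n"));   // false: only one linebreak
    }
}
```

Under this definition nothing stops the character class from consuming a lone \r that is followed by \n, which is why the quantified pattern succeeds on "\r\n".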
The problem is that this definition of the \R pattern doesn't match the recommendation from TR18, which is
-----
(?:\u000D\u000A)|(?!\u000D\u000A)[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----
(Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)
The salient difference is the appearance of a negative lookahead pattern "?!" which causes the pattern not to match a \r if it's immediately followed by \n. Thus, the TR18 recommendation would have the pattern \R{2} NOT match the string "\r\n". Indeed, PCRE has this behavior.
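The effect of the negative lookahead can be checked with a short sketch that uses the TR18 form quoted above as an explicit pattern. The class name `Tr18LinebreakDemo`, the constant `TR18_R`, and the helper methods are illustrative and not part of the JDK. The outer non-capturing group is added here, as an assumption, so that a quantifier applies to the whole alternation.

```java
import java.util.regex.Pattern;

class Tr18LinebreakDemo {
    // TR18 recommendation transliterated to Java regex syntax (illustrative constant).
    // (?!\u000D\u000A) prevents the class from matching a \r that is followed by \n.
    static final String TR18_R =
        "(?:(?:\\u000D\\u000A)|(?!\\u000D\\u000A)"
        + "[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the entire input is exactly one linebreak.
    static boolean matchesOnce(String input) {
        return Pattern.matches(TR18_R, input);
    }

    // True if the entire input consists of exactly two linebreaks.
    static boolean matchesTwice(String input) {
        return Pattern.matches(TR18_R + "{2}", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesOnce("\r\n"));      // true: CRLF is one linebreak
        System.out.println(matchesTwice("\r\n"));     // false: \r cannot split from \n
        System.out.println(matchesTwice("\r\n\r\n")); // true: two CRLF pairs
        System.out.println(matchesTwice("\n\r"));     // true: LF then a lone CR
    }
}
```

Evaluating `"\r\n".matches("\\R{2}")` on a given JDK build shows which of the two behaviors the built-in \R has in that build.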
This bug covers backing out the JDK-8235812 change. Follow-on bug JDK-8258119 covers further changes in this area. In particular, the Pattern spec's definition of \R should be revisited to see whether it should be adjusted to match TR18 more closely.
- backported by
  - JDK-8258760 Unicode linebreak matching behavior is incorrect; backout JDK-8235812 (Resolved)
  - JDK-8260176 Unicode linebreak matching behavior is incorrect; backout JDK-8235812 (Resolved)
- relates to
  - JDK-8258119 Linebreak pattern needs adjustment to conform to Unicode TR18 and PCRE (Open)
  - JDK-8235812 Unicode linebreak with quantifier does not match valid input (Resolved)