JDK-8258119

Linebreak pattern needs adjustment to conform to Unicode TR18 and PCRE



    • Bug
    • Resolution: Unresolved
    • P3
    • tbd
    • 15
    • core-libs
    • None


      The fix for JDK-8235812 changed how the Unicode linebreak pattern, \R, is matched. That change will be backed out by JDK-8258259.

      The problem stated in JDK-8235812 was that the pattern \R{2} did not match the string "\r\n"; the fix changed the behavior so that the match succeeded. This *seemed* the correct thing to do, as the Pattern class spec has a definition for \R which is essentially

          \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

      and the behavior after the change conforms to that definition.
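
      To illustrate, here is a minimal sketch that applies the expansion from the Pattern class documentation directly, rather than relying on \R itself, whose behavior differs between JDK releases. The class and method names are hypothetical, and the non-capturing group is added only so that the {2} quantifier applies to the whole alternation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
final class SpecLinebreakDemo {
    // Expansion of the linebreak matcher as documented in java.util.regex.Pattern,
    // wrapped in a non-capturing group (an assumption made for this sketch).
    static final String SPEC_LINEBREAK =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the whole input consists of exactly two linebreak matches.
    static boolean matchesTwice(String input) {
        return Pattern.matches(SPEC_LINEBREAK + "{2}", input);
    }

    public static void main(String[] args) {
        // The regex engine backtracks into the character class, so CR and LF
        // are each consumed as a separate linebreak.
        System.out.println(matchesTwice("\r\n"));     // true
        System.out.println(matchesTwice("\r\n\r\n")); // true
        System.out.println(matchesTwice("\n"));       // false, only one linebreak
    }
}
```

      Under this literal reading of the spec, "\r\n" can be consumed as two separate linebreaks, which is the behavior that JDK-8235812 introduced.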

      The problem is that this definition of the \R pattern doesn't match the recommendation from TR18, which is

          (?:\u000D\u000A|(?!\u000D\u000A)[\u000A-\u000D\u0085\u2028\u2029])

      (Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)

      The salient difference is the negative lookahead, (?!...), which prevents the single-character alternative from matching a \r that is immediately followed by \n. Thus, under the TR18 recommendation, the pattern \R{2} would NOT match the string "\r\n". Indeed, PCRE has this behavior.
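
      The effect of the lookahead can be checked with a small sketch. The expression below is a transliteration of the TR18 line-boundary recommendation into Java regex syntax and is an assumption of this example, as are the class and method names; it does not use \R itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
final class Tr18LinebreakDemo {
    // Assumed Java transliteration of the TR18 recommendation: CRLF as a unit,
    // or a single linebreak character that is not the start of a CRLF pair.
    static final String TR18_LINEBREAK =
        "(?:\\u000D\\u000A|(?!\\u000D\\u000A)[\\u000A-\\u000D\\u0085\\u2028\\u2029])";

    // True if the whole input is exactly one linebreak.
    static boolean matchesOnce(String input) {
        return Pattern.matches(TR18_LINEBREAK, input);
    }

    // True if the whole input consists of exactly two linebreaks.
    static boolean matchesTwice(String input) {
        return Pattern.matches(TR18_LINEBREAK + "{2}", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesOnce("\r\n"));      // true, CRLF is one linebreak
        System.out.println(matchesTwice("\r\n"));     // false, lookahead blocks a lone CR before LF
        System.out.println(matchesTwice("\r\n\r\n")); // true
        System.out.println(matchesTwice("\n\r"));     // true, LF then CR are two linebreaks
    }
}
```

      With the lookahead in place, backtracking can no longer split "\r\n" into two single-character matches, so the {2} form fails on that input.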

      The Pattern spec's definition of \R should be revisited to see if it should be adjusted to match TR18 more closely. The test cases removed in the backout changeset JDK-8258259 should be revisited. The code changes should also be revisited. It seems odd that the implementation of \R doesn't simply expand to something more-or-less equivalent to the TR18 expression. It may be that there are special cases in the code to handle \R instead of treating it as a "macro" that is expanded to a more complicated sequence. It's not clear which is preferable.


              smarks Stuart Marks