JDK-8191916

Regex: String with 4 byte UTF-8 characters is split incorrectly on empty string


      FULL PRODUCT VERSION :


      ADDITIONAL OS VERSION INFORMATION :
      macOS version 10.12.6

      A DESCRIPTION OF THE PROBLEM :
      This bug surfaces when an empty string is used as the delimiter to split a string that contains characters encoded as 4 bytes in UTF-8.
      For example, take the string to split, String str = "$¢€𐍈", whose characters have these UTF-8 encodings:
      $ -> 00100100
      ¢ -> 11000010 10100010
      € -> 11100010 10000010 10101100
      𐍈 -> 11110000 10010000 10001101 10001000
      When the following is executed:
      str.split("")
      It should generate
      [$, ¢, €, 𐍈]
      But it generates the following array
      [$, ¢, €, ?, ?]
      ? -> 00111111
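
      The bit patterns above can be checked with a short program. This is only a sketch: the class name CodeUnitInspection and the helper utf8Bits are made up for illustration and are not part of the original report. It shows that the 4-byte character occupies two UTF-16 code units (a surrogate pair) inside a Java String, and that a single surrogate taken on its own is encoded to UTF-8 as the replacement byte for '?', which matches the 00111111 pattern reported above. That the split falls between the two surrogates is an inference from this output, not something the report states.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitInspection {
    // Hypothetical helper: formats the UTF-8 bytes of s as space-separated 8-bit groups.
    public static String utf8Bits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad to 8 binary digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "$¢€𐍈" written with escapes so the source file encoding does not matter.
        String str = "$\u00A2\u20AC\uD800\uDF48";
        System.out.println(str.length());                        // 5 (UTF-16 code units)
        System.out.println(str.codePointCount(0, str.length())); // 4 (code points)
        System.out.println(utf8Bits("\uD800\uDF48"));            // 11110000 10010000 10001101 10001000
        System.out.println(utf8Bits("\uD800"));                  // 00111111 (lone surrogate becomes '?')
    }
}
```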

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      String to split: String str = "$¢€𐍈"
      $ -> 00100100
      ¢ -> 11000010 10100010
      € -> 11100010 10000010 10101100
      𐍈 -> 11110000 10010000 10001101 10001000
      When the following is executed:
      str.split("")
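
      A self-contained version of these steps might look like the following sketch. The class name SplitOnEmptyString and the helper hexUnits are assumptions added for illustration. The program prints each resulting element as UTF-16 code units in hex instead of as text, so the output does not depend on the console encoding. No expected output is given here because the result is exactly what this report is about.

```java
public class SplitOnEmptyString {
    // Hypothetical helper: renders each UTF-16 code unit of s as four hex digits.
    public static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "$¢€𐍈" written with escapes so the source file encoding does not matter.
        String str = "$\u00A2\u20AC\uD800\uDF48";
        String[] parts = str.split("");
        System.out.println("element count: " + parts.length);
        for (String part : parts) {
            // One line per array element, shown as code units.
            System.out.println(hexUnits(part));
        }
    }
}
```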

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      [$, ¢, €, 𐍈]

      ACTUAL -
      [$, ¢, €, ?, ?]
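
      For comparison, the expected array corresponds to one element per Unicode code point. A minimal sketch of producing that array without split, assuming Java 8 or later for String.codePoints(), is shown below. The class and method names are hypothetical, and this is offered as an illustration of the expected result, not as the fix for the regex behavior.

```java
import java.util.Arrays;

public class SplitByCodePoint {
    // Hypothetical helper: returns one String per code point, keeping surrogate pairs together.
    public static String[] splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // "$¢€𐍈" written with escapes so the source file encoding does not matter.
        String str = "$\u00A2\u20AC\uD800\uDF48";
        String[] parts = splitByCodePoint(str);
        System.out.println(parts.length); // 4
        System.out.println(Arrays.toString(parts));
    }
}
```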


      REPRODUCIBILITY :
      This bug can be reproduced always.

            sherman Xueming Shen
            webbuggrp Webbug Group