JDK-8132859

String's substring constructor chokes on UTF16 BE and LE.


    • Type: Bug
    • Resolution: Not an Issue
    • Priority: P4
    • None
    • 8-pool, 9
    • core-libs

      FULL PRODUCT VERSION :
      java version "1.8.0_51"
      Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
      Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)


      ADDITIONAL OS VERSION INFORMATION :
      Fedora 22
      4.0.4-301.fc22.x86_64


      A DESCRIPTION OF THE PROBLEM :
      I'm on a UTF-8 platform, testing the String constructors' ability to handle different charsets. Converting a byte array in UTF-16BE or UTF-16LE encoding to a String works fine if I use a constructor that converts the whole array. However, the constructors that extract a substring break.

      String( byte[], int, int, String csn )
      and
      String( byte[], int, int, Charset )
      both break. This is just one bug report, not two, since I figure the second one calls the first one, or vice versa, and therefore fixing one will fix the other.



      ADDITIONAL REGRESSION INFORMATION:
      I have no idea if this worked right in previous versions.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Each encoding is being used to do the same thing: isolate the word "cow" and print it to the screen. Only UTF-8 is successful.

      Each encoding can print the entire phrase "Moo cow". Slicing a substring is where the problem lies.
      ACTUAL -
      UTF8_substring is cow
      UTF16BE_substring is o�
      UTF16LE_substring is o�
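      To see where the "o" followed by a replacement character comes from, it helps to dump the encoded bytes. The sketch below is an illustration only: the class name Utf16ByteDump and the helper hex() are hypothetical and not part of the report. It relies on the documented meaning of the constructor parameters (offset is the index of the first byte to decode, length is the number of bytes to decode), so offset 4 and length 3 select three bytes, not three characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16ByteDump
{
        // Hypothetical helper: renders a byte array as space-separated hex pairs.
        public static String hex( byte[] bytes )
        {
                StringBuilder sb = new StringBuilder();
                for( byte b : bytes )
                {
                        if( sb.length() > 0 ) sb.append( ' ' );
                        sb.append( String.format( "%02X", b & 0xFF ) );
                }
                return sb.toString();
        }

        public static void main( String[] args )
        {
                // UTF-16BE uses two bytes per code unit and writes no byte-order mark.
                byte[] be = "Moo cow".getBytes( StandardCharsets.UTF_16BE );
                System.out.println( "all bytes : " + hex( be ) );

                // The same byte range the reproducer passes: offset 4, length 3.
                byte[] slice = Arrays.copyOfRange( be, 4, 4 + 3 );
                System.out.println( "bytes 4..6: " + hex( slice ) );

                // Two bytes decode to 'o'; the dangling third byte is malformed
                // input, which the String constructor replaces with U+FFFD.
                String decoded = new String( be, 4, 3, StandardCharsets.UTF_16BE );
                System.out.println( "decoded length: " + decoded.length() );
        }
}
```

      Under this reading, the slice covers one complete code unit plus half of the next one, which matches the output shown above for both byte orders.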

      REPRODUCIBILITY :
      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------


      import java.io.*;
      import java.nio.charset.*;

      public class StringBreaker
      {
              public static void main( String [] args )
              {
                      PrintStream p = System.out;
                      try
                      {
                              // UTF-8 works, and this is how the others SHOULD work too.
                              byte utf8_bytes[] = "Moo cow".getBytes( StandardCharsets.UTF_8 );
                              String UTF8_substring = new String( utf8_bytes, 4, 3, StandardCharsets.UTF_8 );
                              p.println( "UTF8_substring is " + UTF8_substring ); // prints "cow"

                              // UTF-16BE fails
                              byte utf16be_bytes[] = "Moo cow".getBytes( StandardCharsets.UTF_16BE );
                              String UTF16BE_substring = new String( utf16be_bytes, 4, 3, StandardCharsets.UTF_16BE );
                              // substring now holds the letter 'o' with visible garbage after it.
                              p.println( "UTF16BE_substring is " + UTF16BE_substring );

                              // UTF-16LE fails
                              byte utf16le_bytes[] = "Moo cow".getBytes( StandardCharsets.UTF_16LE );
                              String UTF16LE_substring = new String( utf16le_bytes, 4, 3, StandardCharsets.UTF_16LE );
                              // substring now holds the letter 'o' with visible garbage after it.
                              p.println( "UTF16LE_substring is " + UTF16LE_substring );

                              p.println( "The constructors that convert the whole byte array work fine. Only the substring constructors are broken." );
                              p.println( "UTF-16BE bytes printable: " +
                                      new String( utf16be_bytes, StandardCharsets.UTF_16BE ) );
                              p.println( "UTF-16LE bytes printable: " +
                                      new String( utf16le_bytes, StandardCharsets.UTF_16LE ) );
                      }
                      catch( Exception e )
                      {
                              p.println( "ERROR: Bad charset string." );
                              System.exit(1);
                      }
              }
      }

      ---------- END SOURCE ----------
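      For comparison, here is a variant of the calls above with the offset and length expressed in bytes. This is a sketch, not something taken from the report: the class ByteOffsetSlice and the helper sliceUtf16() are hypothetical names, and the factor of 2 assumes a BOM-free UTF-16BE or UTF-16LE array, where every code unit occupies exactly two bytes. As with String.substring, a range that cuts through a surrogate pair would still produce a broken character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOffsetSlice
{
        // Assumption: the array holds UTF-16BE or UTF-16LE data without a BOM.
        private static final int BYTES_PER_CODE_UNIT = 2;

        // Hypothetical helper: accepts an offset and count in UTF-16 code units
        // and converts them to the byte offset and byte length that
        // String( byte[], int, int, Charset ) expects.
        public static String sliceUtf16( byte[] bytes, int charOffset, int charCount, Charset utf16 )
        {
                return new String( bytes,
                        charOffset * BYTES_PER_CODE_UNIT,
                        charCount * BYTES_PER_CODE_UNIT,
                        utf16 );
        }

        public static void main( String[] args )
        {
                byte[] be = "Moo cow".getBytes( StandardCharsets.UTF_16BE );
                byte[] le = "Moo cow".getBytes( StandardCharsets.UTF_16LE );

                // Character offset 4, count 3 becomes byte offset 8, length 6.
                System.out.println( "UTF16BE_substring is " + sliceUtf16( be, 4, 3, StandardCharsets.UTF_16BE ) );
                System.out.println( "UTF16LE_substring is " + sliceUtf16( le, 4, 3, StandardCharsets.UTF_16LE ) );
        }
}
```

      The UTF-8 call in the reproducer happens to work with offset 4 and length 3 because every character in "Moo cow" is ASCII and therefore encodes to a single byte in UTF-8.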

      CUSTOMER SUBMITTED WORKAROUND :
      Decode each whole UTF-16BE or UTF-16LE byte array into a String instead of decoding a slice of it.

      Then take the substring from that String.

      Lastly, convert the substring back into UTF-16BE or UTF-16LE bytes.
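      The three workaround steps can be sketched roughly as follows. The class WholeDecodeWorkaround and the method sliceViaString() are hypothetical names introduced for illustration. Note that String.substring takes char indices and operates on the already decoded String, so a charset only comes into play in the decode and encode steps.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class WholeDecodeWorkaround
{
        // Hypothetical helper following the workaround: decode everything,
        // slice by char index, then re-encode the slice in the same charset.
        public static byte[] sliceViaString( byte[] encoded, int beginIndex, int endIndex, Charset cs )
        {
                String whole = new String( encoded, cs );               // step 1: decode the whole array
                String part = whole.substring( beginIndex, endIndex );  // step 2: char-based substring
                return part.getBytes( cs );                             // step 3: back to bytes
        }

        public static void main( String[] args )
        {
                byte[] le = "Moo cow".getBytes( StandardCharsets.UTF_16LE );
                byte[] cow = sliceViaString( le, 4, 7, StandardCharsets.UTF_16LE );

                // Compare against a direct encoding of the expected word.
                System.out.println( Arrays.equals( cow, "cow".getBytes( StandardCharsets.UTF_16LE ) ) );
        }
}
```

      This costs an extra copy of the full text compared with passing byte offsets directly, which may matter for large arrays.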

            sherman Xueming Shen
            webbuggrp Webbug Group