- Bug
- Resolution: Not an Issue
- P4
- None
- 1.4.0
- x86
- windows_2000
Name: nt126004 Date: 08/24/2001
java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1-b24)
Java HotSpot(TM) Client VM (build 1.3.1-b24, mixed mode)
We created a document in Germany (ISO-8859-1 encoding) that contains German
"umlautz" like ???????. On an NT box in Korea, this file is read into a String,
and .getBytes("ISO-8859-1") is then called on that String. The resulting byte
array contains ? in place of the German umlautz and, at its end, as many
null bytes as there are umlautz in the body. The same code works correctly in
Germany.
OK : Having "?" in the bytes.
Bug: Having null bytes at the end. In my opinion this should never happen.
This is the sample source:
1) Contents of the script file to read:
-----------------------------------------
My script contains umlautz like ??? or ??? or ? and more.
-----------------------------------------
2) The test case:
-----------------------------------------
package unibug;

import java.io.*;

/**
 * Demonstrate a .getBytes bug in Unicode environments.
 * @author Peter Holzwarth
 * @version 1.0
 */
public class UniBug {
    public static void main(String[] args) {
        try {
            String script = readFile("script.txt");
            byte[] buf = script.getBytes("ISO-8859-1");
            // doesn't change anything:
            // byte[] buf = script.getBytes();
            for (int i = 0; i < buf.length; i++) {
                System.out.println("byte [" + i + "] is " + buf[i]);
            }
        } catch (java.io.UnsupportedEncodingException uee) {
            // won't happen for "ISO-8859-1"
        } catch (java.io.IOException ioe) {
            // test file not there
            System.err.println("Couldn't read script: " + ioe);
        }
    }

    protected static String readFile(String filename) throws IOException {
        FileInputStream stream = new FileInputStream(filename);
        InputStreamReader isr = new InputStreamReader(stream);
        // partial fix, if we knew the locale of the source file:
        // InputStreamReader isr = new InputStreamReader(stream, "ISO-8859-1");
        char[] data = new char[stream.available()];
        isr.read(data);
        isr.close();
        stream.close();
        return new String(data);
    }
}
-----------------------------------------
3) The log output in Korea:
-----------------------------------------
byte [0] is 77
byte [1] is 121
byte [2] is 32
byte [3] is 115
byte [4] is 99
byte [5] is 114
byte [6] is 105
byte [7] is 112
byte [8] is 116
byte [9] is 32
byte [10] is 99
byte [11] is 111
byte [12] is 110
byte [13] is 116
byte [14] is 97
byte [15] is 105
byte [16] is 110
byte [17] is 115
byte [18] is 32
byte [19] is 117
byte [20] is 109
byte [21] is 108
byte [22] is 97
byte [23] is 117
byte [24] is 116
byte [25] is 122
byte [26] is 32
byte [27] is 108
byte [28] is 105
byte [29] is 107
byte [30] is 101
byte [31] is 32
byte [32] is 63
byte [33] is 63
byte [34] is 111
byte [35] is 114
byte [36] is 32
byte [37] is 63
byte [38] is 63
byte [39] is 111
byte [40] is 114
byte [41] is 32
byte [42] is 63
byte [43] is 97
byte [44] is 110
byte [45] is 100
byte [46] is 32
byte [47] is 109
byte [48] is 111
byte [49] is 114
byte [50] is 101
byte [51] is 46
byte [52] is 13
byte [53] is 10
byte [54] is 13
byte [55] is 10
byte [56] is 0
byte [57] is 0
byte [58] is 0
byte [59] is 0
byte [60] is 0
-----------------------------------------
(Review ID: 130521)
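A possible reading of the trailing null bytes, which this report does not itself confirm: readFile sizes its char[] with stream.available(), which is a byte count, and ignores the value returned by isr.read(data). If the platform decoder turns several bytes into one char, fewer chars than bytes are written, and the unused slots keep their default value '\u0000', which getBytes then encodes as 0. The sketch below reproduces that mechanism with UTF-8 standing in for the Korean default encoding (an assumption, chosen so the example behaves the same on any JVM). The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TrailingNulSketch {

    // Mirrors readFile from the test case: the buffer is sized by the number
    // of BYTES and the count returned by read() is ignored, so any slots the
    // decoder did not fill stay '\u0000'.
    public static String readIgnoringCount(byte[] fileBytes, Charset decodeWith) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), decodeWith)) {
            char[] data = new char[fileBytes.length];
            reader.read(data);
            return new String(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Keeps only the chars that were actually decoded, by honouring the
    // count returned from each read() call.
    public static String readUsingCount(byte[] fileBytes, Charset decodeWith) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), decodeWith)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three umlauts; in UTF-8 each one occupies two bytes (six bytes total).
        // UTF-8 is an assumed stand-in for a multi-byte platform encoding.
        byte[] fileBytes = "\u00e4\u00f6\u00fc".getBytes(StandardCharsets.UTF_8);

        String naive = readIgnoringCount(fileBytes, StandardCharsets.UTF_8);
        String fixed = readUsingCount(fileBytes, StandardCharsets.UTF_8);

        System.out.println("bytes in file:        " + fileBytes.length);
        System.out.println("chars, count ignored: " + naive.length());
        System.out.println("chars, count used:    " + fixed.length());

        byte[] buf = naive.getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < buf.length; i++) {
            System.out.println("byte [" + i + "] is " + buf[i]);
        }
    }
}
```

With the three-character input in main, the first method returns a six-character string whose last three chars are NUL, while the second returns the three characters unchanged.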
======================================================================
Comments
Name: nt126004 Date: 08/24/2001
(company - iO Software GmbH , email - ###@###.###)
===============
old Synopsis: String.getBytes("ISO-8859-1") doesn't work as expected in Unicode environment
I couldn't reproduce this exactly here. When I ran this program I got a bunch
of negative values for some of the bytes, but no null values at the end.
Looks similar to bug 4179049, but this one has the null characters at the
end.
Added script.zip.Z, which contains the sample file for the bug.
Tested against build 1.4.0-beta2-b77, so changed the release from 1.3.1 to 1.4b2.
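The negative values mentioned above are likely just the signed view of ISO-8859-1 codes: Java's byte type is signed, so any value from 0x80 to 0xFF prints as a negative number. A minimal sketch, with a hypothetical class name and sample characters that are not taken from script.txt:

```java
import java.nio.charset.StandardCharsets;

public class SignedByteSketch {

    // Converts a signed Java byte to its unsigned value in the range 0..255.
    public static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut: 0xE4, 0xF6, 0xFC in ISO-8859-1.
        byte[] buf = "\u00e4\u00f6\u00fc".getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < buf.length; i++) {
            System.out.println("byte [" + i + "] is " + buf[i]
                    + " (unsigned " + toUnsigned(buf[i]) + ")");
        }
    }
}
```

Printing buf[i] directly, as the test case does, shows the signed value; masking with 0xFF recovers the ISO-8859-1 code.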
======================================================================
###@###.### 2001-08-24
###@###.### 2003-01-09