Name: rlT66838 Date: 08/16/99
The readLine() methods of BufferedReader and DataInputStream
are defective when a line ends with a single CR and the
input source does blocking reads, such as a socket or
console input.
Also see bugs:
4094049
4072575 -- "closed, not reproducible"
The following bug is completely reproducible on any
platform, if done correctly. However, it is necessary to
ensure that the data appearing on the input stream has not
been massaged or converted along the way. A Socket is the
best way to ensure this.
The root of the problem is that the algorithm used in both
defective readLine() methods uses the "lookahead" technique
to determine if the line ends with a CR alone or a CRLF. In
the face of non-file lines ending with CR alone, this causes
the readLine() method to block after reading a CR, waiting
for the subsequent character to determine the fate of the
line it has already collected. On a socket, or with
interactive input, this can wait forever, or at least until
the data source or user gets fed up and sends any additional
character.
The correct way to handle line-ends can be found,
ironically enough, in the algorithm embodied in
LineNumberReader.readLine(), where the skipLF flag is used
to determine the fate of LF's received at any point.
Unfortunately, because LineNumberReader extends
BufferedReader, this algorithm is ineffective at preventing
the problem for LineNumberReader, since super.readLine() is
the source of LineNumberReader's data, and is thus at the
mercy of BufferedReader's defective algorithm.
Note that the problem WILL NOT appear when reading lines
from a file, nor from a sufficiently buffered source where
the post-CR character is already available. It WILL appear
when reading from a Socket's input stream, from
console-input on platforms where the line-end is CR alone,
and will also appear on any stream reading a serial-port or
parallel-port, or any other type of may-block-on-read input
stream.
The problem also WILL NOT appear on a Socket if the
input-stream is filtering CR's to LF's or CRLF's, nor if
the remote source of socket-input is filtering its output
to LF's or CRLF's. I suspect this is why 4072575 could not
be reproduced. If the problem cannot be reproduced with a
Telnet program, then a test client that sends exact CR-only
line-data would be needed.
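A minimal sketch of such a test client follows. The class name CrOnlyClient, the helper crOnlyBytes() and the host/port arguments are all hypothetical; the only point is that each line is followed by exactly one CR byte and that the connection then stays open and silent.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical test client; class name, host and port are illustrative only.
class CrOnlyClient {
    /** Encodes each line followed by a single CR (0x0D) and nothing else. */
    static byte[] crOnlyBytes(String... lines) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String line : lines) {
            byte[] text = line.getBytes(StandardCharsets.ISO_8859_1);
            buf.write(text, 0, text.length);
            buf.write('\r');
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: CrOnlyClient <host> <port>");
            return;
        }
        try (Socket socket = new Socket(args[0], Integer.parseInt(args[1]))) {
            OutputStream out = socket.getOutputStream();
            out.write(crOnlyBytes("first line", "second line"));
            out.flush();
            // Keep the connection open and silent so that nothing follows
            // the final CR. A lookahead-based readLine() on the server side
            // should still be waiting to return "second line" during this pause.
            Thread.sleep(10_000);
        }
    }
}
```

Pointed at a server that calls readLine() on the socket's input stream, the first line should come back promptly (its CR is followed by more data), while the second should not be returned until the pause ends and the socket closes.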
## The Poor Way to Detect Line-Ends ##
1) Read a byte.
2) If it's '\n' it's the end of the line, so
return the received line.
3) If it's '\r', it might be the first of a CR-LF pair,
so **READ ONE MORE BYTE** (the defect).
4) If that next byte is '\n', it's a CR-LF pair, so
return the received line.
5) If that next byte is NOT '\n', either push it
back or somehow keep it for the next line, and
return the received line.
6) Otherwise (the byte from step 1 is neither '\n' nor
'\r'), append it to the line being received.
7) Continue at 1.
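The steps above can be sketched as follows. This is an illustration only, not the JDK source: LookaheadLineReader and its helper methods are hypothetical names, and bytes are treated as one character each. The helper bytesReadForFirstLine() counts how many source bytes had to arrive before the first line could be returned, which makes the lookahead visible without a blocking stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the lookahead steps; not the JDK source.
class LookaheadLineReader {
    private final PushbackInputStream in;

    LookaheadLineReader(InputStream source) {
        this.in = new PushbackInputStream(source, 1);
    }

    /** Returns the next line without its terminator, or null at end of stream. */
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        while (true) {
            int b = in.read();                    // step 1
            if (b == -1) {
                return line.length() > 0 ? line.toString() : null;
            }
            if (b == '\n') {                      // step 2
                return line.toString();
            }
            if (b == '\r') {                      // step 3
                int next = in.read();             // the defect: may block here
                if (next != '\n' && next != -1) {
                    in.unread(next);              // step 5: keep it for the next line
                }
                return line.toString();           // steps 4 and 5
            }
            line.append((char) b);                // step 6
        }
    }

    /** Reads every line from the given text. */
    static List<String> readAll(String data) {
        List<String> lines = new ArrayList<>();
        try {
            LookaheadLineReader r = new LookaheadLineReader(
                    new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1)));
            for (String s = r.readLine(); s != null; s = r.readLine()) {
                lines.add(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Number of source bytes consumed before the first line is returned. */
    static int bytesReadForFirstLine(String data) {
        final int[] count = {0};
        InputStream counting = new FilterInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1))) {
            @Override public int read() throws IOException {
                int b = super.read();
                if (b != -1) count[0]++;
                return b;
            }
        };
        try {
            new LookaheadLineReader(counting).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count[0];
    }
}
```

For the input "a\rb", bytesReadForFirstLine() reports 3: the line "a" is not returned until the byte after the CR has been read. On a socket, that third read is where the caller hangs.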
## The Better Way to Detect Line-Ends ##
1) Read a byte.
2) If it's '\r', it's a line-end, so set a flag
that any immediately following LF should be eaten,
and return the received line.
3) If it's '\n' and the "eat LF" flag is set,
a line was previously returned, so eat
this LF and clear the flag.
4) If it's '\n' and the "eat LF" flag is clear,
then it's a line-end, so return the received line.
5) Otherwise handle this byte normally,
clearing the "eat LF" flag.
6) Continue at 1.
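A minimal sketch of these steps, under the same assumptions as any illustration here: SkipLfLineReader and its helpers are hypothetical names, bytes are treated as one character each, and this is not the LineNumberReader source. The "eat LF" flag is kept as a field so it survives between calls.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "eat LF" flag; not the JDK source.
class SkipLfLineReader {
    private final InputStream in;
    private boolean skipLF;   // true when the previous line ended with CR

    SkipLfLineReader(InputStream source) {
        this.in = source;
    }

    /** Returns the next line without its terminator, or null at end of stream. */
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        while (true) {
            int b = in.read();                    // step 1
            if (b == -1) {
                return line.length() > 0 ? line.toString() : null;
            }
            if (b == '\r') {                      // step 2: return at once, no lookahead
                skipLF = true;
                return line.toString();
            }
            if (b == '\n') {
                if (skipLF) {                     // step 3: second half of a CR-LF pair
                    skipLF = false;
                    continue;
                }
                return line.toString();           // step 4
            }
            skipLF = false;                       // step 5
            line.append((char) b);
        }
    }

    /** Reads every line from the given text. */
    static List<String> readAll(String data) {
        List<String> lines = new ArrayList<>();
        try {
            SkipLfLineReader r = new SkipLfLineReader(
                    new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1)));
            for (String s = r.readLine(); s != null; s = r.readLine()) {
                lines.add(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Number of source bytes consumed before the first line is returned. */
    static int bytesReadForFirstLine(String data) {
        final int[] count = {0};
        InputStream counting = new FilterInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1))) {
            @Override public int read() throws IOException {
                int b = super.read();
                if (b != -1) count[0]++;
                return b;
            }
        };
        try {
            new SkipLfLineReader(counting).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count[0];
    }
}
```

For the input "a\rb", bytesReadForFirstLine() reports 2: the line is returned as soon as the CR arrives, so a blocking source never has to supply another byte first.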
The "Poor Way" is the algorithm used in BufferedReader and
DataInputStream, although somewhat obscured by the internal
buffer management. The "Better Way" is the algorithm used
in LineNumberReader (but not LineNumberInputStream).
The "Better Way" would also have to be integrated with the
other methods of DataInputStream so that it would be
possible to read interspersed data types.
By the way, a subtle secondary latent bug is the
following...
If a line of data on a DataInputStream is terminated only by
a CR and is immediately followed by any data item whose first
byte happens to be 0x0A, that byte will erroneously be
consumed as the LF of a CRLF pair, when it is really the
first byte of the subsequent data item. All subsequent data
read from the stream will be out of sync.
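The following sketch tries to make the loss of sync concrete. CrBinaryDesyncDemo and roundTrip() are hypothetical names, and the demo assumes that DataInputStream.readLine() (deprecated, but still present) consumes an LF that follows a CR, as described above. An extra int is written as padding so the misaligned read does not run off the end of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo of a CR-only line followed by binary data.
class CrBinaryDesyncDemo {
    /**
     * Writes "text" + CR, then the given int, then reads the line and the
     * int back. Returns the int as seen by the reader.
     */
    @SuppressWarnings("deprecation") // DataInputStream.readLine() is deprecated
    static int roundTrip(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeBytes("text\r");   // line terminated by CR alone
            out.writeInt(value);        // big-endian, high byte first
            out.writeInt(0);            // padding for the misaligned case

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            in.readLine();
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.printf("wrote 0x%08X, read 0x%08X%n", 0x0A000001, roundTrip(0x0A000001));
        System.out.printf("wrote 0x%08X, read 0x%08X%n", 0x0B000001, roundTrip(0x0B000001));
    }
}
```

Under that assumption, an int whose high byte is 0x0A loses that byte to readLine() and is read back shifted by one byte, while an int with any other high byte survives intact.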
There is no fix for this, since both the Poor and Better
algorithms consume an LF that immediately follows a CR,
whether or not that byte was written as part of the line
terminator. That is, the data itself and the aggressive
LF-consumption result in a context- and data-sensitive
ambiguity.
The best work-around is to always write CRLF or LF-only
line-ends to DataOutput streams that mix lines with binary
data. At least this prevents the ambiguity from
appearing. There is no fix, short of overriding readLine()
with a CR-only version, for reading existing data that
consists of non-counted text-lines followed by binary data.
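As a sketch of the work-around on the writing side, the hypothetical helper writeLine() below always terminates a line with CRLF before any binary data is written. The class and method names are illustrative, and the read side uses the deprecated DataInputStream.readLine() only to show the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that never emits a CR-only line-end.
class CrlfLineWriter {
    /** Writes the line followed by CRLF, so a trailing LF is always present. */
    static void writeLine(DataOutput out, String line) throws IOException {
        out.writeBytes(line);
        out.writeBytes("\r\n");
    }

    /** Writes a CRLF line and an int, reads both back, returns the int read. */
    @SuppressWarnings("deprecation") // DataInputStream.readLine() is deprecated
    static int roundTrip(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeLine(out, "text");
            out.writeInt(value);

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            in.readLine();          // consumes "text", the CR and the real LF
            return in.readInt();    // first data byte is left untouched
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the LF that readLine() consumes is now a genuine part of the terminator, an int beginning with 0x0A should read back unchanged.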
(Review ID: 25509)
======================================================================
Duplicates:
JDK-4151072 java.io.BufferedReader, DataInputStream block unnecessarily on \r line breaks (Resolved)