Name: dgC58589 Date: 01/26/98
java.io.StreamTokenizer should be able to parse
"/" as a word constituent and strip C and/or C++
comments simultaneously.
My application parses ASCII files containing
market data with "/"-delimited dates; I can think
of other uses as well. The documentation for the
StreamTokenizer class is inadequate and gives no
hint that this combination won't work.
"/" should be allowed to be a word constituent, so that date
strings like "1/16/98" get parsed as words. Otherwise, if "/" is an
ordinary character (and " " is white space), there's no way to tell the
difference between "1/1" and "1 / 1", "1/ 1", or "1 /1".
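The ambiguity can be reproduced with a small test program. This is only a sketch: the class name SlashAmbiguityDemo and the "NUM:"/"WORD:"/"CHAR:" labels are made up for illustration, and the tokenizer is otherwise left at its defaults (numbers parsed, " " as white space). With "/" made ordinary, all four inputs should produce the same token list.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SlashAmbiguityDemo {
    // Tokenize the input with '/' as an ordinary character and
    // return a readable description of each token.
    static List<String> tokens(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.ordinaryChar('/'); // '/' is a comment character by default
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    out.add("NUM:" + st.nval);
                    break;
                case StreamTokenizer.TT_WORD:
                    out.add("WORD:" + st.sval);
                    break;
                default:
                    out.add("CHAR:" + (char) st.ttype);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // The white space around '/' is lost, so the four lists are identical.
        for (String s : new String[] {"1/1", "1 / 1", "1/ 1", "1 /1"}) {
            System.out.println("\"" + s + "\" -> " + tokens(s));
        }
    }
}
```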
The clause in the main loop of the tokenizer that begins
"if ((ctype & CT_ALPHA) != 0)", which parses words, appears before the
one that begins "if (c == '/' && (slashSlashCommentsP...". So if "/" is
set to be a word constituent, the word clause always consumes it first,
and there's no way the tokenizer can possibly parse C or C++ comments.
Anyway, I've already got a fix for the source code. I can
send it to you if you're interested.
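The following sketch shows the conflict from the caller's side. The class name SlashWordCommentDemo and the sample input are assumptions made for illustration. Digits are first made ordinary and then word constituents, so that a date is read as a word instead of a number; both comment styles are switched on, yet the "//" should come back as a word and the comment text should not be stripped.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SlashWordCommentDemo {
    // Tokenize with '/' and the digits as word constituents while
    // also requesting C and C++ comment stripping.
    static List<String> tokens(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.ordinaryChars('/', '9'); // clear comment flag on '/' and numeric flag on '0'..'9'
        st.wordChars('/', '9');     // '/' (47) and '0'..'9' (48..57) become word constituents
        st.slashSlashComments(true);
        st.slashStarComments(true);
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            } else {
                out.add(String.valueOf((char) st.ttype));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // The word clause sees "//" before the comment clause can.
        System.out.println(tokens("1/16/98 // trade date\n2/3/98"));
    }
}
```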
Fix diff against the JDK 1.1.5 FCS source code:
567,578c567
<
< // +++ Modified segment begins here:
< //
< if (specialSlash(c)) {
< return nextToken();
< }
< buf[0] = (char) c;
< c = peekc;
< int i = 1;
< //
< // +++ Modified segment ends here.
<
---
> int i = 0;
667,678d655
<
< // +++ Modified code segment begins here:
< //
< if (specialSlash(c)) {
< return nextToken();
< }
< //
< // +++ Modified code segment ends here.
< return ttype = c;
< }
<
< private boolean specialSlash(int c) throws java.io.IOException {
696,700c673,674
< if (c < 0) {
< String s =
< "reached eof while parsing C-style comment";
< throw new RuntimeException(s);
< }
---
> if (c < 0)
> return ttype = TT_EOF;
704c678
< return true;
---
> return nextToken();
708c682
< return true;
---
> return nextToken();
711c685
< return false;
---
> return ttype = '/';
713,715d686
< } else {
< peekc = read();
< return false;
716a688,689
> peekc = read();
> return ttype = c;
(Review ID: 23516)
======================================================================
mircea.oancea@canada 1998-02-25
More information from the client received on Fri, 20 Feb 1998 18:41:39
We've been parsing lots of files with my modified version of
StreamTokenizer, and we found a bug. My "fix" resulted in a failure to
parse one-character words. (The character after the first word
constituent was always treated as a word constituent.) The attached
version adds one more line and changes a "do" to a "while". It could
still use more testing (especially since I've only tested the features I
need).
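The one-character-word failure can be illustrated with a simplified model of the word loop. This is not the JDK source: the class WordLoopDemo, the helper isWordChar, and the string-index scanning are all assumptions made to keep the sketch short. Both methods expect a non-empty input whose first character is a word constituent.

```java
class WordLoopDemo {
    // Hypothetical classification: letters, digits and '/' are word constituents.
    static boolean isWordChar(int c) {
        return c >= 0 && (Character.isLetterOrDigit(c) || c == '/');
    }

    // Returns the character at pos, or -1 past the end of the input.
    static int charAt(String s, int pos) {
        return pos < s.length() ? s.charAt(pos) : -1;
    }

    // Models the first patch: the look-ahead character is appended
    // by the do-loop before it has been classified.
    static String firstWordDoWhile(String s) {
        StringBuilder buf = new StringBuilder();
        int pos = 0;
        buf.append(s.charAt(pos++));  // first constituent, already checked
        int c = charAt(s, pos++);     // look-ahead character
        do {
            buf.append((char) c);     // appended unconditionally
            c = charAt(s, pos++);
        } while (isWordChar(c));
        return buf.toString();
    }

    // Models the corrected patch: the look-ahead character is
    // classified before it is appended.
    static String firstWordWhile(String s) {
        StringBuilder buf = new StringBuilder();
        int pos = 0;
        buf.append(s.charAt(pos++));
        int c = charAt(s, pos++);
        while (isWordChar(c)) {
            buf.append((char) c);
            c = charAt(s, pos++);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + firstWordDoWhile("a b") + "]");
        System.out.println("[" + firstWordWhile("a b") + "]");
    }
}
```

In this model the do-loop version returns "a " (with the trailing space) for the input "a b", while the while-loop version returns "a".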
The following diff is against the patched version, that is, the original
source with the above patch applied.
574d573
< c = peekc;
576,577c575,576
< //
< // +++ Modified segment ends here.
---
> c = peekc;
> ctype = c < 0 ? CT_WHITESPACE : c < 256 ? ct[c] : CT_ALPHA;
579c578
< do {
---
> while ((ctype & (CT_ALPHA | CT_DIGIT)) != 0) {
588c587,589
< } while ((ctype & (CT_ALPHA | CT_DIGIT)) != 0);
---
> }
> //
> // +++ Modified segment ends here.