- Bug
- Resolution: Fixed
- P3
- 5.0
- beta
- x86
- linux_redhat_8.0, windows_nt

Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-2116943 | 5.0u4 | Peter Zhelezniakov | P3 | Resolved | Fixed | b03 |
I wrote a small program (below) to parse an HTML document. It produces different results under JDK 1.4.2_04-b05 and 1.5.0-beta2-b51.
To me, the result given by 1.4.2 seems to be the correct one.
--------------------------------------------------------------------------------------------------
import java.io.*;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HtmlParser {
    private static String translation = null;
    public static void main(String[] unused) throws Exception {
        BufferedReader in =
            new BufferedReader(
                new InputStreamReader(
                    new FileInputStream("/root/test_programs/saaj/temp.html")));
        ParserDelegator parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback =
            new HTMLEditorKit.ParserCallback() {
                // the translation will be in the div tag
                private boolean end_search = false;
                private boolean found_first_textarea = false;
                public void handleText(char[] data, int pos) {
                    if (found_first_textarea) {
                        translation = new String(data);
                    }
                }
                public void handleStartTag(HTML.Tag tag,
                        MutableAttributeSet attrSet, int pos) {
                    if (tag == HTML.Tag.DIV && end_search != true) {
                        found_first_textarea = true;
                    }
                }
                public void handleEndTag(HTML.Tag t, int pos) {
                    if (t == HTML.Tag.DIV && end_search != true) {
                        end_search = true;
                        found_first_textarea = false;
                    }
                }
            };
        parser.parse(in, callback, true);
        in.close();
        System.out.println("Result: " + translation);
    }
}
--------------------------------------------------------------------------------------------------
The HTML document used will be attached.
Here's a portion of an email conversation relevant to this issue:
--------------------------------------------------------------------------------------------------
Subject: [Fwd: [Fwd: Re: Fwd: Regression in HTML Parsing?]]
From: Rakesh Menon <###@###.###>
Date: Wed, 26 May 2004 16:21:37 +0530
To: Sreejith A K <###@###.###>
CC: Anita Jindal <###@###.###>, ###@###.###, Rakesh Menon <###@###.###>
Hi,
I confirmed. It's a regression bug.
Igor (###@###.###) and Scott (###@###.###) are investigating this further.
Thanks,
Rakesh
-------- Original Message --------
Subject: Re: Fwd: Regression in HTML Parsing?
Resent-Date: Wed, 26 May 2004 01:00:34 -0700
Resent-From: ###@###.###
Date: Tue, 25 May 2004 11:24:13 -0700
From: Scott Violet <###@###.###>
To: Anton Nashatyrev <###@###.###>
CC: ###@###.###
References: <20040525172747.GI12475@zaz>
<###@###.###>
Yuck. Any idea what fix caused this regression?
Thanks,
-Scott
On Tue, May 25, 2004 at 10:18:16PM +0400, Anton Nashatyrev wrote:
Hello Scott,
Here is a simplified HTML case:
<table border=1>
<tr><td>
aaa
<style> </style> bbb
</tr></td>
</table>
This HTML is displayed as follows:
-------
| aaa |
-------
| bbb |
-------
However, changing the <style></style> tag to any invalid tag fixes this:
-----------
| aaa bbb |
-----------
It looks like we use an incorrect recovery policy in Parser. A <STYLE></style>
tag shouldn't appear in <BODY> context, so in this case we should just
ignore it and not make any tag adjustments.
I think a bug should be filed on this.
Thank you.
Anton.
--------------------------------------------------------------------------------------------------
###@###.### 2004-05-26
###@###.### 2004-05-26
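The simplified case from the email can also be fed to the parser directly from a string, which removes the dependency on the attached file. The following is only a minimal sketch for comparing runs across JDK versions: the class name `StyleInBodyTrace` and the helper `traceEvents` are hypothetical names introduced here, not part of the original report, and the sketch records the parser callbacks rather than the rendered table layout described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StyleInBodyTrace {
    // Simplified HTML case from the email thread, reproduced as written
    // (including the swapped </tr></td> closing tags).
    static final String SIMPLIFIED_CASE =
        "<table border=1>\n"
        + "<tr><td>\n"
        + "aaa\n"
        + "<style> </style> bbb\n"
        + "</tr></td>\n"
        + "</table>\n";

    /** Parses the given HTML and returns one entry per parser callback, in order. */
    public static List<String> traceEvents(String html) throws IOException {
        final List<String> events = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start:" + t);
            }
            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end:" + t);
            }
            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("simple:" + t);
            }
            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }
        };
        // Same parse call as in the reporter's program, but reading from a string.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return events;
    }

    public static void main(String[] args) throws IOException {
        // Print the callback sequence so it can be diffed between JDK versions.
        for (String event : traceEvents(SIMPLIFIED_CASE)) {
            System.out.println(event);
        }
    }
}
```

Running the same class under 1.4.2 and 1.5.0 beta 2 and diffing the printed sequences should show where the tag structure around `<style>` diverges, for example in how many `start:tr` and `start:td` entries appear.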
- backported by: JDK-2116943 Regression in html parsing in tiger beta 2 (Resolved)
- duplicates: JDK-5053319 REGRESSION: <style> tag in body content breaks HTML structure (Closed)