JDK-5053272

Regression in html parsing in tiger beta 2

    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • 6
    • 5.0
    • client-libs
    • beta
    • x86
    • linux_redhat_8.0, windows_nt

        I wrote a small program (below) to parse an HTML document. It produces different results under JDK 1.4.2_04-b05 and 1.5.0-beta2-b51.
        To me, the result given by 1.4.2 seems to be the correct one.

        --------------------------------------------------------------------------------------------------
        import java.io.*;

        import javax.swing.text.MutableAttributeSet;
        import javax.swing.text.html.HTML;
        import javax.swing.text.html.HTMLEditorKit;
        import javax.swing.text.html.parser.ParserDelegator;

        public class HtmlParser {

            private static String translation = null;

            public static void main(String[] unused) throws Exception {

                BufferedReader in =
                    new BufferedReader(
                        new InputStreamReader(
                            new FileInputStream("/root/test_programs/saaj/temp.html")));

                ParserDelegator parser = new ParserDelegator();

                HTMLEditorKit.ParserCallback callback =
                    new HTMLEditorKit.ParserCallback() {

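                        // The translation is the text inside the first <div> element.
                        // end_search: set once the first </div> has been seen.
                        // found_first_textarea: true while inside the first <div>
                        // (despite the name, the tag being tracked is DIV).
                        private boolean end_search = false;
                        private boolean found_first_textarea = false;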

                        public void handleText(char[] data, int pos) {
                            if (found_first_textarea) {
                                translation = new String(data);
                            }
                        }

                        public void handleStartTag(HTML.Tag tag,
                                MutableAttributeSet attrSet, int pos) {
                            if (tag == HTML.Tag.DIV && end_search != true) {
                                found_first_textarea = true;
                            }
                        }

                        public void handleEndTag(HTML.Tag t, int pos) {
                            if (t == HTML.Tag.DIV && end_search != true) {
                                end_search = true;
                                found_first_textarea = false;
                            }
                        }
                    };

                parser.parse(in, callback, true);
                in.close();

                System.out.println("Result: " + translation);

            }
        }

        --------------------------------------------------------------------------------------------------

        The html document used will be attached.

        Here's a portion of an email conversation relevant to this issue:

        --------------------------------------------------------------------------------------------------
        Subject: [Fwd: [Fwd: Re: Fwd: Regression in HTML Parsing?]]
        From: Rakesh Menon <###@###.###>
        Date: Wed, 26 May 2004 16:21:37 +0530
        To: Sreejith A K <###@###.###>
        CC: Anita Jindal <###@###.###>, ###@###.###, Rakesh Menon <###@###.###>

        Hi,

        I confirmed it. It's a regression bug.
        Igor (###@###.###) and Scott (###@###.###) are investigating this further.

        Thanks,
        Rakesh

        -------- Original Message --------
        Subject: Re: Fwd: Regression in HTML Parsing?
        Resent-Date: Wed, 26 May 2004 01:00:34 -0700
        Resent-From: ###@###.###
        Date: Tue, 25 May 2004 11:24:13 -0700
        From: Scott Violet <###@###.###>
        To: Anton Nashatyrev <###@###.###>
        CC: ###@###.###
        References: <20040525172747.GI12475@zaz>
        <###@###.###>


        Yuck. Any idea what fix caused this regression?
        Thanks,

            -Scott

        On Tue, May 25, 2004 at 10:18:16PM +0400, Anton Nashatyrev wrote:
        Hello Scott,

           Here is a simplified HTML case:

        <table border=1>
          <tr><td>
         aaa
        <style> </style> bbb
          </tr></td>
        </table>

        This HTML is displayed as follows:
        -------
        | aaa |
        -------
        | bbb |
        -------

        However, changing the <style></style> tag to any invalid tag fixes this:
        -----------
        | aaa bbb |
        -----------

        It looks like we use an incorrect recovery policy in Parser. A <style></style>
        tag shouldn't appear in a <BODY> context, so in this case we should just
        ignore it and not make any tag adjustments.

        I think a bug should be filed on this.

        Thank you.
        Anton.
        --------------------------------------------------------------------------------------------------
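
The simplified case from the email can be run through the same parser entry point as the program above, without the attached file. The following is only a minimal sketch: the class name `StyleInBodyTrace`, the `trace` helper, and the event labels are hypothetical and not part of the original report. It feeds Anton's HTML to `ParserDelegator` from a string and records every callback in order, so the traces from two JDK versions can be compared. No expected output is given, because the trace depends on the JDK in use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the original report.
class StyleInBodyTrace {

    // Anton's simplified case, copied as-is (including the swapped </tr></td>).
    static final String SIMPLIFIED_CASE =
        "<table border=1>\n"
        + "  <tr><td>\n"
        + " aaa\n"
        + "<style> </style> bbb\n"
        + "  </tr></td>\n"
        + "</table>\n";

    // Parses the given HTML and returns one entry per parser callback, in order.
    static List<String> trace(String html) {
        final List<String> events = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("start " + tag);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                events.add("end " + tag);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("simple " + tag);
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Trimmed so that whitespace differences do not clutter the trace.
                events.add("text \"" + new String(data).trim() + "\"");
            }
        };
        try {
            // Same call as in the original program, but reading from a string.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        for (String event : trace(SIMPLIFIED_CASE)) {
            System.out.println(event);
        }
    }
}
```

Running this under each JDK and diffing the two outputs should show where the tag sequence around `style` diverges, for example whether extra end and start tags are reported between the "aaa" and "bbb" text events.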

        ###@###.### 2004-05-26

              idk Igor Kushnirskiy (Inactive)
              duke J. Duke