Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8167002

JAXP schema validator: Use HashSet instead of ArrayList for tracking XML IDs

XMLWordPrintable

    • b140
    • Not verified

      A colleague reported:

      ValidationState is used to validate XML ID and IDREF elements (among other types). To do so, it keeps data structures containing all IDs and IDREFs that have been seen in a document. The only methods ever called on the ID container are add() and contains(), so substituting HashSet for ArrayList makes no difference in behavior, while improving performance by orders of magnitude in large documents. No change is necessary/possible, however, for the IDREF container.This one is only ever iterated over -- never searched -- and order matters, so ArrayList is appropriate.

      On a test document with ~1.5M elements, ~330K IDs, and ~430K IDREFs, this
      change speeds up parsing (with validation enabled) by a factor of 26 (from
      21.4 minutes, ~800ms/element, to 49 seconds, ~31µs/element).

      There are further obvious algorithmic improvements possible for additional
      constant-factor gains, but this is simple, safe, and brings schema validation
      from O(n^2+mn) to O(n+m), where n is the number of IDs and m is the number of
      IDREFs.

            martin Martin Buchholz
            martin Martin Buchholz
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: