Streaming XML Filtering

Kristoffer Vinther
Christopher Mosses
Jørgen Iversen

Januar 2003

Abstract: This page contains a describtion of small XPath query implementation, a comparison with two other systems and some bencmarks.

Reader requirements: The reader is expected to have experience with XML in general, and XPath expressions in particular

Table of contents

  1. Motivation
  2. XML parsing and stream generation
  3. XPath Filter
  4. Comparing with similar tools
  5. Performance

Motivation

The motivation for using XML streaming can be found in the section SAX of the streaming survey. Our motivation for implementing a tool that handles simple XPath queries on XML streams is that it allows us to do queries very efficiently because of the simplicity of the algorithm.

XML parsing and stream generation

The basic_xmlistream class of our class library implements an XML parser, which generates a stream of XML tokens. A basic_xmlistream object wraps around an std::basic_istream object, that is assumed to be a textual version of the XML document. This object could be e.g. the std::cin object or an std::ifstream object.

Other than a link to the wrapped stream, a basic_xmlistream has a parser state, which defines which part of the document is currently being read, and a stack of unmatched start tags. xmlistreams are expected to be XML processors in the sense of the specification, and so should check that start and end tags match. We have decided against requiring xmlistreams to be validating parsers for several reasons. First of all many XML documents are known to be valid, so the extra overhead of actually checking validity is a wasted effort. Secondly validation functionality is easily implemented as a filter: when extracting elements from such a filter, it would simply pass-through the initial <?xml…?> declaration and keep extracting elements until it encounters the DOCTYPE element. Then parse the declaration (which is available through the get_doctype()->rest() function of the element) into a simple data structure and then for each of the rest of the elements in the stream, verify that they are allowed where they are, that they have the correct attributes, normalize them, translate entities and so on. If the source is invalid, simply throw an exception. No new elements needs to be created nor old ones deleted. This also provides great flixibility; one could have several different validation filters (DTD, XML Schema, DSD…) and apply one, some or all where appropriate. Finally, we obviously did not have the time to code it. The state parser state of the stream is one of the following:

The stream is initially in the PREXML state. It simply reads character by character from the wrapped stream, determining which of the ten types of elements is being read. If it is the begindoc element (corresponding to the <?xml…?> declaration) and the state is not PREXML a form_exception is thrown. Otherwise the state is changed to PREDOCTYPE. Similarly DOCTYPE declarations are only allowed when the state is PREDOCTYPE and the state is changed to PREDOC. When reading a start tag, the name of the tag is pushed on the stack. If the stack was empty, the state must be PREDOC and is changed to INDOC, if the state is not, a form_exception is thrown. End tags are similarly only allowed when the state is INDOC. The stack is popped and the result checked against the name of the end tag. If the stack became empty the state is changed to POSTDOC. If what is read is a <…/> tag the name is pushed on the stack and a starttag element is returned, just like with <…> tags, but the state is changed to EMPTYTAG. When the extraction operator is called and the state is EMPTYTAG the stack is popped and an endtag element is constructed with that name and returned. Comments, characters and processing instructions are allowed everywhere, except when in the PREXML state. On EOF the enddoc element is returned.

XPath Filter

Our XPath query tool is based on an XPath Filter. It is used in connection with xmlstreams as described in XML Streaming class library. The filter accepts the XPath expressions described by the following grammar:

Path ::= Path / Step | / Step
Step ::= Node [ Predicate ]
Node ::= text() | comment() | processing-instruction() | node() | Name | *
Predicate ::= Natural | @att | @att = value

The algorithm works as follows. While traversing the input stream, the state is updated and matching nodes are output. The state consists of

The state is updated every time we encounter a new node. If the XPath pointer is at the end of the XPath expression then the current node matches and is output.

Comparison with similar tools

We know of two other algorithms for doing streaming XPath Queries: XSQ and XAOS. There are probably other systems, but these are two of the most prominent.

XSQ

This system is more advanced than ours, since it also allows descendant-or-self axes, and a wider range of predicates. Some examples

The algorithm buffers part of the stream which is necessary because of the descendant-or-self axes and predicates stating that a node has a certain child node. To decide when to insert, query, remove from or clear the buffer, the algorithm uses hierarchically structured push down transducers (HPDT), which consist of smaller PDTs each with their own buffer.

XAOS

In this system one uses XPaths containing the backward axes parent and ancestor together with the forward axes child and descendant. Here are some examples of legal XPath expressions:

The algorithm uses X-dags and X-trees to represent XPaths. Their main trick is to rewrite the expressions to avoid the backward axes. For more info about XAOS the reader should consult the streaming survey

Comparison

The two systems are more advanced than our system since they allow more complicated XPath expressions. Unfortunately we have not been able to compare their performance, since the developers of the two systems would not provide us with an implementation and we didn't have time to implement the algorithms ourselves. Our guess is that our system is more efficient because it doesn't have to do any buffering or make complicated representations of the XPath expression. Furthermore our system also has a smaller memory footprint (no more than 1MB is ever used during streaming).

Performance

We've made benchmarks of XMLS compared to Apache's Xalan/Xerces and DOM4J (only available in Java) DOM-based implementations of XPath filters. We ran the following queries:

Q1: /dblp/inproceedings/title
Q4: /datasets/dataset[@subject="astronomy"]/reference/source/other/name
Q5: /root/Entry/Ref/Cite
Q6: /ProteinDatabase/ProteinEntry/protein/name

QueryData sourceInput sizeRel. outputsizeDOM (Ap.)DOM (D4J)Streaming
Q1 DBLP Computer Science Bibliography 133.862.701 12.5% 13:35.55 5:54.39 3:03.59
Q4 GSFC/NASA 25.050.310 0.06% 0:25.26 0:12.54 0:32.36
Q5 SwissProt 114.820.233 3.9% 5:15.12 8:30.07 2:27.69
Q6 Georgetown Protein Information Resource 716.853.016 1.8% [fault] [fault] 14:15.50

With these results:

Benchmark results

With queries Q1 and Q5 we clearly outperform the other implementations. The query Q6 was performed on a very large document which caused the computer to run out of memory when using Xalan/Xerces and DOM4J. With the query Q4, which is performed on a relatively small document, we are surprisingly outperformed by the other implementations. This is probably because we don't benefit from using streaming technology when the document is small document.