Kristoffer Vinther
Christopher Mosses
Jørgen Iversen
Januar 2003
Abstract: This page contains a describtion of small XPath query implementation, a comparison with two other systems and some bencmarks.
Reader requirements: The reader is expected to have experience with XML in general, and XPath expressions in particular
The motivation for using XML streaming can be found in the section SAX of the streaming survey. Our motivation for implementing a tool that handles simple XPath queries on XML streams is that it allows us to do queries very efficiently because of the simplicity of the algorithm.
The basic_xmlistream class of our class library implements an XML
parser, which generates a stream of XML tokens. A basic_xmlistream
object wraps around an std::basic_istream object, that is assumed
to be a textual version of the XML document. This object could be e.g. the std::cin
object or an std::ifstream object.
Other than a link to the wrapped stream, a basic_xmlistream has a
parser state, which defines which part of the document is currently being read,
and a stack of unmatched start tags. xmlistreams are expected to
be XML processors in the sense of the specification, and so should check that
start and end tags match. We have decided against requiring xmlistreams
to be validating parsers for several reasons. First of all many XML documents
are known to be valid, so the extra overhead of actually checking
validity is a wasted effort. Secondly validation functionality is easily
implemented as a filter: when extracting elements from such a filter, it would
simply pass-through the initial <?xml…?> declaration and keep
extracting elements until it encounters the DOCTYPE element. Then parse the
declaration (which is available through the get_doctype()->rest() function
of the element) into a simple data structure and then for each of the rest of
the elements in the stream, verify that they are allowed where they are, that
they have the correct attributes, normalize them, translate entities and so on.
If the source is invalid, simply throw an exception. No new elements needs to
be created nor old ones deleted. This also provides great flixibility; one
could have several different validation filters (DTD, XML Schema, DSD…)
and apply one, some or all where appropriate. Finally, we obviously did not
have the time to code it. The state parser state of the stream is one of the
following:
The stream is initially in the PREXML state. It simply reads character
by character from the wrapped stream, determining which of the ten types of
elements is being read. If it is the begindoc element
(corresponding to the <?xml…?> declaration) and the state is not PREXML
a form_exception is thrown. Otherwise the state is changed to PREDOCTYPE.
Similarly DOCTYPE declarations are only allowed when the state is PREDOCTYPE
and the state is changed to PREDOC. When reading a start tag, the name
of the tag is pushed on the stack. If the stack was empty, the state must be PREDOC
and is changed to INDOC, if the state is not, a form_exception
is thrown. End tags are similarly only allowed when the state is INDOC.
The stack is popped and the result checked against the name of the end tag. If
the stack became empty the state is changed to POSTDOC. If what is
read is a <…/> tag the name is pushed on the stack and a starttag
element is returned, just like with <…> tags, but the state is
changed to EMPTYTAG. When the extraction operator is called and the
state is EMPTYTAG the stack is popped and an endtag element
is constructed with that name and returned. Comments, characters and processing
instructions are allowed everywhere, except when in the PREXML state.
On EOF the enddoc element is returned.
Our XPath query tool is based on an XPath Filter. It is used in connection with xmlstreams as described in XML Streaming class library. The filter accepts the XPath expressions described by the following grammar:
Path ::= Path / Step | / Step
Step ::= Node [ Predicate ]
Node ::= text() | comment() | processing-instruction() | node() | Name | *
Predicate ::= Natural | @att | @att = value
The algorithm works as follows. While traversing the input stream, the state is updated and matching nodes are output. The state consists of
The state is updated every time we encounter a new node. If the XPath pointer is at the end of the XPath expression then the current node matches and is output.
We know of two other algorithms for doing streaming XPath Queries: XSQ and XAOS. There are probably other systems, but these are two of the most prominent.
This system is more advanced than ours, since it also allows descendant-or-self axes, and a wider range of predicates. Some examples
The algorithm buffers part of the stream which is necessary because of the descendant-or-self axes and predicates stating that a node has a certain child node. To decide when to insert, query, remove from or clear the buffer, the algorithm uses hierarchically structured push down transducers (HPDT), which consist of smaller PDTs each with their own buffer.
In this system one uses XPaths containing the backward axes parent and ancestor together with the forward axes child and descendant. Here are some examples of legal XPath expressions:
The algorithm uses X-dags and X-trees to represent XPaths. Their main trick is to rewrite the expressions to avoid the backward axes. For more info about XAOS the reader should consult the streaming survey
The two systems are more advanced than our system since they allow more complicated XPath expressions. Unfortunately we have not been able to compare their performance, since the developers of the two systems would not provide us with an implementation and we didn't have time to implement the algorithms ourselves. Our guess is that our system is more efficient because it doesn't have to do any buffering or make complicated representations of the XPath expression. Furthermore our system also has a smaller memory footprint (no more than 1MB is ever used during streaming).
We've made benchmarks of XMLS compared to Apache's Xalan/Xerces and DOM4J (only available in Java) DOM-based implementations of XPath filters. We ran the following queries:
Q1: /dblp/inproceedings/title
Q4: /datasets/dataset[@subject="astronomy"]/reference/source/other/name
Q5: /root/Entry/Ref/Cite
Q6: /ProteinDatabase/ProteinEntry/protein/name
| Query | Data source | Input size | Rel. outputsize | DOM (Ap.) | DOM (D4J) | Streaming |
|---|---|---|---|---|---|---|
| Q1 | DBLP Computer Science Bibliography | 133.862.701 | 12.5% | 13:35.55 | 5:54.39 | 3:03.59 |
| Q4 | GSFC/NASA | 25.050.310 | 0.06% | 0:25.26 | 0:12.54 | 0:32.36 |
| Q5 | SwissProt | 114.820.233 | 3.9% | 5:15.12 | 8:30.07 | 2:27.69 |
| Q6 | Georgetown Protein Information Resource | 716.853.016 | 1.8% | [fault] | [fault] | 14:15.50 |
With these results:

With queries Q1 and Q5 we clearly outperform the other implementations. The query Q6 was performed on a very large document which caused the computer to run out of memory when using Xalan/Xerces and DOM4J. With the query Q4, which is performed on a relatively small document, we are surprisingly outperformed by the other implementations. This is probably because we don't benefit from using streaming technology when the document is small document.