<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet type="text/xsl" href="xhtml.xsl"?>
<!DOCTYPE slidecollection SYSTEM "slides.dtd">

<s:slidecollection xmlns="http://www.w3.org/1999/xhtml" xmlns:s="http://www.daimi.au.dk/~kv/2002/iws/slides">
	<s:home href="index.html"/>
	<s:info href="info.html"/>
	<s:copyright>2002 Jørgen Iversen, Christopher Mosses &amp; Kristoffer Vinther</s:copyright>
	<s:title>The XML Revolution&#8212;Streaming Technologies for the Future Web</s:title>
	<s:h>XML Streaming</s:h>
	<s:p> - working with XML documents as streams</s:p>

	<s:section>
		<s:h>Overview</s:h>
		<s:slide>
			<s:h>The Project</s:h>
			<s:contents>XML Streaming</s:contents>
			<s:body>

			<p>Our project has two parts:</p>
			<ul>
				<li>Some lines of code<br/>- A class library for C++
				implementing streaming XML.</li>
				<li>Survey<br/>- A look at existing technologies</li>
			</ul>

			</s:body>
		</s:slide>
		<!--s:slide>
			<s:h>Working with XML streams</s:h>
			<s:contents>DOM, lazy trees, lazy flattened trees, XML tokens and fragments</s:contents>
			<s:body>
			</s:body>
		</s:slide>
		<s:slide>
			<s:h>Parsing strategies</s:h>
			<s:contents>push vs. pull</s:contents>
			<s:body>
			</s:body>
		</s:slide-->
		<s:slide>
			<s:h>When to use XML Streaming</s:h>
			<s:contents>is this for you?</s:contents>
			<s:body>

				<p>- whenever possible!</p>
				<p>It is not always possible, however. Here are some commonly used guidelines:</p>
				<p>You should probably use <strong>streams</strong> if:</p>
				<ul>
					<li>You are parsing large files.</li>
					<li>You want to be able to abort parsing.</li>
					<li>You only need to collect small amounts of data.</li>
					<li>You cannot afford the memory needed to build a DOM tree.</li>
				</ul>
				<p>You should probably use <strong>DOM</strong> if:</p>
				<ul>
					<li>You need random access to the whole file.</li>
					<li>You need to perform XSLT transformations.</li>
					<li>You need to do complex XML queries.</li>
					<li>You want to modify and save XML.</li>
				</ul>
				<p>You should probably use <strong>lazy DOM</strong> if:</p>
				<ul>
					<li>You need random access to a small part of the file.</li>
					<li>You are dealing with somewhat large files.</li>
					<li>You would like to use SAX, but you have to use DOM.</li>
				</ul>

			</s:body>
		</s:slide>
	</s:section>
	<s:section>
		<s:h>Our approach</s:h>
		<s:slide>
			<s:h>XMLS</s:h>
			<s:contents>a streaming XML class library implemented in C++</s:contents>
			<s:body>

				<p>
					Our streaming library implements a pull
					parser: the user initiates the reading and
					extraction of stream elements. Most current
					parsers are instead push parsers, which take
					control of the program flow once started and
					decide when user code is executed.
				</p>
				<p>
					A push parser is easily implemented using a pull parser.
				</p>
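				<p>
					A minimal sketch of that idea, assuming a hypothetical
					<tt>Token</tt>, <tt>PullSource</tt> and <tt>Handler</tt>
					(stand-ins for illustration, not the XMLS classes): the
					push parser is just a loop that pulls tokens and
					dispatches each one to a callback.
				</p>
				<table class="code">
					<tr><td><pre><![CDATA[
```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token model for illustration; not the XMLS element class.
enum class Kind { StartTag, Text, EndTag, EndDoc };

struct Token {
    Kind kind;
    std::string value;
};

// Pull side: the caller asks for the next token when it wants one.
// A vector stands in for a real tokenizer.
class PullSource {
public:
    explicit PullSource(std::vector<Token> tokens)
        : tokens_(std::move(tokens)), pos_(0) {}

    // Returns the next token, or EndDoc once the input is exhausted.
    Token next() {
        if (pos_ < tokens_.size()) return tokens_[pos_++];
        return Token{Kind::EndDoc, ""};
    }

private:
    std::vector<Token> tokens_;
    std::size_t pos_;
};

// Push side: SAX-like callbacks invoked by the parser.
struct Handler {
    virtual ~Handler() {}
    virtual void start_tag(const std::string& name) = 0;
    virtual void text(const std::string& data) = 0;
    virtual void end_tag(const std::string& name) = 0;
};

// The push parser: pull tokens in a loop and dispatch each to the handler.
void push_parse(PullSource& in, Handler& h) {
    for (Token t = in.next(); t.kind != Kind::EndDoc; t = in.next()) {
        switch (t.kind) {
            case Kind::StartTag: h.start_tag(t.value); break;
            case Kind::Text:     h.text(t.value);      break;
            case Kind::EndTag:   h.end_tag(t.value);   break;
            case Kind::EndDoc:   break; // not reached; the loop stops first
        }
    }
}

// Example handler that records the callbacks it receives.
struct Recorder : Handler {
    std::string log;
    void start_tag(const std::string& n) override { log += "<" + n + ">"; }
    void text(const std::string& d) override { log += d; }
    void end_tag(const std::string& n) override { log += "</" + n + ">"; }
};

// Feeds three tokens through the push parser and returns the recorded log.
std::string demo_push_over_pull() {
    std::vector<Token> tokens = {
        {Kind::StartTag, "a"}, {Kind::Text, "hi"}, {Kind::EndTag, "a"}
    };
    PullSource in(tokens);
    Recorder r;
    push_parse(in, r);
    return r.log;
}
```
					]]></pre></td></tr>
				</table>
				<p>
					The reverse direction is harder: turning callbacks into
					a pull interface generally needs a buffer or a second
					thread of control.
				</p>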
				<p>
					Streams are used to extract elements. In the case of
					XML streams, the elements are tokens of several
					different kinds:
				</p>
				<ul>
					<li>beginning of document</li>
					<li>processing instructions</li>
					<li>DOCTYPE declaration</li>
					<li>comments</li>
					<li>start tags</li>
					<li>character data</li>
					<li>entities</li>
					<li>CDATA sections</li>
					<li>end tags</li>
					<li>end of document</li>
				</ul>
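				<p>
					One way to model these token kinds in C++ is a tagged
					value. The sketch below is only an illustration: the
					names <tt>TokenKind</tt>, <tt>Token</tt> and
					<tt>kind_name</tt> are hypothetical and are not taken
					from XMLS.
				</p>
				<table class="code">
					<tr><td><pre><![CDATA[
```cpp
#include <string>

// Hypothetical names; one enumerator per token kind listed above.
enum class TokenKind {
    BeginDocument, ProcessingInstruction, Doctype, Comment, StartTag,
    CharacterData, Entity, CDataSection, EndTag, EndDocument
};

// A stream element: the kind of token plus its raw text
// (a tag name, character data, a comment body, ...).
struct Token {
    TokenKind kind;
    std::string text;

    bool is_enddoc() const { return kind == TokenKind::EndDocument; }
};

// Human-readable name of a token kind, e.g. for debugging output.
const char* kind_name(TokenKind kind) {
    switch (kind) {
        case TokenKind::BeginDocument:         return "beginning of document";
        case TokenKind::ProcessingInstruction: return "processing instruction";
        case TokenKind::Doctype:               return "DOCTYPE declaration";
        case TokenKind::Comment:               return "comment";
        case TokenKind::StartTag:              return "start tag";
        case TokenKind::CharacterData:         return "character data";
        case TokenKind::Entity:                return "entity";
        case TokenKind::CDataSection:          return "CDATA section";
        case TokenKind::EndTag:                return "end tag";
        case TokenKind::EndDocument:           return "end of document";
    }
    return "unknown"; // only reached for an invalid enumerator value
}
```
					]]></pre></td></tr>
				</table>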

			</s:body>
		</s:slide>
		<s:slide>
			<s:h>The class hierarchy</s:h>
			<s:contents>the structure of XMLS</s:contents>
			<s:body>

				<p style="text-align: center"><img src="classes.gif" alt="Class hierarchy"/></p>
				<p>Examples of other XML input streams:</p>
				<ul>
					<li>Socket/network based</li>
					<li>Source of binary objects</li>
				</ul>
				<p>Examples of other XML output streams:</p>
				<ul>
					<li>Socket/network based</li>
					<li>Binary object generator</li>
					<li>SAX event generator</li>
					<li>DOM tree builder</li>
					<li>GSM WAP configuration</li>
				</ul>
				<p>Examples of other XML filters:</p>
				<ul>
					<li>Validation</li>
					<li>Multithreaded buffer</li>
					<li>Sorting</li>
					<li>XML Encryption</li>
				</ul>

			</s:body>
		</s:slide>
		<s:slide>
			<s:h>Example</s:h>
			<s:contents>an XPath filter example</s:contents>
			<s:body>

			    <h2>Using the classes</h2>
				<table class="code">
					<tr><td><pre><![CDATA[
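#include <iostream>
// The XMLS headers must be included as well; their names are not given here.

using namespace xmls;

int main()
{
    xmlistdstream xin(std::cin);   // XML input stream on standard input
    xmlostdstream xout(std::cout); // XML output stream on standard output
    element elmt;

    try
    {
        // Pull one element at a time and copy it to the output
        for( xin >> elmt; !elmt.is_enddoc(); xin >> elmt )
            xout << elmt;
        xout << elmt; // Output the enddoc element
    }
    catch( const exception& e ) // catch by const reference to avoid a copy
    {
        std::cerr << "Exception caught in main(): " << e << std::endl;
        return 1;
    }

    return 0;
}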
					]]></pre></td></tr></table>

			    <h2>XPath algorithm</h2>
				<p>We can recognize a subset of XPath expressions, given by this grammar:</p>

				<tt>Path ::= Path / Step | / Step</tt><br/>
				<tt>Step ::= Node | Node [ Predicate ]</tt><br/>
				<tt>Node ::= text() | comment() | processing-instruction() | node() | Name | *</tt><br/>
				<tt>Predicate ::= Natural | @att | @att = value</tt><br/>

				<p>While traversing the input stream,
				the state is updated and matching nodes are
				sent to the output stream.</p>

				<p>The state consists of:</p>
				<ul>
					<li>A pointer to a step in the XPath</li>
					<li>A nesting level counter</li>
					<li>A stack of counters to keep track of positions at the different levels</li>
				</ul>
				<p>The state is updated every time we encounter a new node.</p>

				<p>When the XPath pointer reaches the end of the expression, the current node matches.</p>
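				<p>
					The state described above can be sketched as follows.
					This is a simplified illustration, not the XMLS
					implementation: <tt>Step</tt>, <tt>PathMatcher</tt> and
					<tt>Event</tt> are hypothetical names, and only
					<tt>Name</tt> and <tt>*</tt> node tests with an optional
					<tt>Natural</tt> predicate are handled (attribute
					predicates are left out).
				</p>
				<table class="code">
					<tr><td><pre><![CDATA[
```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One step of the path; position 0 means "no positional predicate".
struct Step {
    std::string name; // element name, or "*" for any element
    int position;
};

class PathMatcher {
public:
    explicit PathMatcher(std::vector<Step> steps)
        : steps_(std::move(steps)), depth_(0), matched_(0) {
        counters_.push_back(0); // counter for the document level
    }

    // Call on every start tag; returns true if the element matches the path.
    bool start(const std::string& name) {
        bool is_match = false;
        // Only a child of the matched prefix can extend the match.
        if (depth_ == matched_ && matched_ < steps_.size()) {
            const Step& s = steps_[matched_];
            if (s.name == "*" || s.name == name) {
                int pos = ++counters_.back(); // position among siblings
                if (s.position == 0 || s.position == pos) {
                    ++matched_;               // advance the XPath pointer
                    is_match = (matched_ == steps_.size());
                }
            }
        }
        ++depth_;
        counters_.push_back(0); // fresh counter for this element's children
        return is_match;
    }

    // Call on every end tag.
    void end() {
        if (depth_ == 0) return;                  // ignore unbalanced input
        counters_.pop_back();
        --depth_;
        if (matched_ > depth_) matched_ = depth_; // move the pointer back
    }

private:
    std::vector<Step> steps_;   // the parsed path
    std::size_t depth_;         // nesting level counter
    std::size_t matched_;       // pointer to the next step to match
    std::vector<int> counters_; // stack of position counters, one per level
};

struct Event { bool is_start; std::string name; };

// Runs /doc/sec[2]/slide over a small event sequence and counts the matches.
std::size_t demo_count_matches() {
    std::vector<Step> path = {{"doc", 0}, {"sec", 2}, {"slide", 0}};
    PathMatcher m(path);
    std::vector<Event> events = {
        {true, "doc"},
        {true, "sec"}, {false, "sec"},
        {true, "sec"}, {true, "slide"}, {false, "slide"},
                       {true, "slide"}, {false, "slide"}, {false, "sec"},
        {true, "sec"}, {true, "slide"}, {false, "slide"}, {false, "sec"},
        {false, "doc"}
    };
    std::size_t matches = 0;
    for (const Event& e : events) {
        if (e.is_start) { if (m.start(e.name)) ++matches; }
        else m.end();
    }
    return matches;
}
```
					]]></pre></td></tr>
				</table>
				<p>
					In the demo only the two slides inside the second
					<tt>sec</tt> element match. A real filter would also copy
					the whole subtree of each matching element to the output.
				</p>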
                                
				<p>Here is some code using the XPath filter (and the | operator):</p>
				<table class="code">
					<tr><td><pre><![CDATA[
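#include <iostream>
// The XMLS headers must be included as well; their names are not given here.

using namespace xmls;

int main()
{
    xmlistdstream xin(std::cin);
    xmlostdstream xout(std::cout);
    // Select slides with title="XMLS" in the second section
    XPath<char> xp("/slidecollection/section[2]/slide[@title=\"XMLS\"]");

    // Stream the input through the XPath filter to the output
    xin | xp | xout;

    return 0;
}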
					]]></pre></td></tr>
				</table>
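				<p>
					How such a <tt>|</tt> operator could be built is
					sketched below with hypothetical <tt>Source</tt>,
					<tt>Filter</tt> and <tt>Sink</tt> types over plain
					strings. This is an assumption about one possible
					design, not the way XMLS defines its operator.
				</p>
				<table class="code">
					<tr><td><pre><![CDATA[
```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stage types; a string stands in for one stream element.
struct Source { std::function<bool(std::string&)> next; };       // false at end of stream
struct Filter { std::function<bool(const std::string&)> keep; }; // true lets the item pass
struct Sink   { std::function<void(const std::string&)> put; };  // consumes an item

// source | filter yields a new source that skips rejected items.
Source operator|(Source src, Filter f) {
    return Source{[src, f](std::string& item) -> bool {
        while (src.next(item)) {
            if (f.keep(item)) return true;
        }
        return false;
    }};
}

// source | sink drains the pipeline; this is where the pull loop runs.
void operator|(Source src, Sink out) {
    std::string item;
    while (src.next(item)) out.put(item);
}

// Keeps only the items that look like tags.
std::vector<std::string> demo_pipeline() {
    std::vector<std::string> input = {"<a>", "text", "</a>"};
    std::size_t pos = 0;
    Source xin{[input, pos](std::string& item) mutable -> bool {
        if (pos == input.size()) return false;
        item = input[pos++];
        return true;
    }};
    Filter tags_only{[](const std::string& item) -> bool {
        return !item.empty() && item[0] == '<';
    }};
    std::vector<std::string> output;
    Sink xout{[&output](const std::string& item) { output.push_back(item); }};

    xin | tags_only | xout; // groups as (xin | tags_only) | xout
    return output;
}
```
					]]></pre></td></tr>
				</table>
				<p>
					Because <tt>|</tt> associates to the left, each filter
					wraps the source before it, and nothing is read until
					the sink starts pulling.
				</p>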

			</s:body>
		</s:slide>
		<s:slide>
			<s:h>Preliminary performance measurements</s:h>
			<s:contents>the true benefit of XML Streaming</s:contents>
			<s:body>

				<p>
					We have run <em>preliminary</em> benchmarks comparing XMLS
					with the DOM-based XPath implementations of
					<a href="http://apache.org/">Apache</a>'s
					<a href="http://xml.apache.org/">Xalan/Xerces</a> and
					<a href="http://dom4j.org">DOM4J</a> (available only in Java).
					The following queries were used:
				</p>
				<table style="width: 85%; vertical-align: middle">
					<tr><th>Query</th><th>Data source</th><th>Input size (bytes)</th><th>Rel. output size</th><th>DOM (Apache)</th><th>DOM (DOM4J)</th><th>Streaming</th></tr>
					<tr style="text-align: right"><td>Q1</td><td style="text-align: left"><a href="http://www.cs.washington.edu/research/xmldatasets/www/repository.html#dblp">DBLP Computer Science Bibliography</a></td><td>133.862.701</td><td>12.5%</td><td>13:35.55</td><td>5:54.39</td><td>3:03.59</td></tr>
					<tr style="text-align: right"><td>Q4</td><td style="text-align: left"><a href="http://www.cs.washington.edu/research/xmldatasets/www/repository.html#nasa">GSFC/NASA</a></td><td>25.050.310</td><td>0.06%</td><td>0:25.26</td><td>0:12.54</td><td>0:32.36</td></tr>
					<tr style="text-align: right"><td>Q5</td><td style="text-align: left"><a href="http://www.cs.washington.edu/research/xmldatasets/www/repository.html#SwissProt">SwissProt</a></td><td>114.820.233</td><td>3.9%</td><td>5:15.12</td><td>8:30.07</td><td>2:27.69</td></tr>
					<tr style="text-align: right"><td>Q6</td><td style="text-align: left"><a href="http://www.cs.washington.edu/research/xmldatasets/www/repository.html#pir">Georgetown Protein Information Resource</a></td><td>716.853.016</td><td>1.8%</td><td>[fault]</td><td>[fault]</td><td>14:15.50</td></tr>
				</table>
				<p style="padding-left: 25pt">
					Q1: <tt>/dblp/inproceedings/title</tt><br/>
					Q4: <tt>/datasets/dataset[@subject=&quot;astronomy&quot;]/reference/source/other/name</tt><br/>
					Q5: <tt>/root/Entry/Ref/Cite</tt><br/>
					Q6: <tt>/ProteinDatabase/ProteinEntry/protein/name</tt><br/>
				</p>
				<p>With these results:</p>
				<p style="text-align: center"><img src="benchchart.gif" alt="Benchmark results"/></p>

			</s:body>
		</s:slide>
	</s:section>
	<s:section>
		<s:h>Existing technologies</s:h>
		<s:slide>
			<s:h>Streaming APIs</s:h>
			<s:contents>what can we do now?</s:contents>
			<s:body>

				<p>Most streaming implementations are based on the SAX standard API.</p>
				<ul>
					<li>Xerces from Apache implements SAX and SAX2, and is a mature library</li>
					<li>Xalan, also from Apache, builds on top of
					Xerces (or some other SAX parser) to provide
					streaming XPath processing and XSLT transformation</li>
					<li>XAOS specializes in streaming XPath
					processing and claims to perform better than
					Xalan, but it offers no XSLT and supports only part of XPath</li>
					<li>MSXML4 from Microsoft implements SAX and SAX2 and is very mature</li>
					<li>Joost - an implementation of STX (Streaming
					Transformations for XML), a high-speed,
					low-memory alternative to XSLT</li>
					<li>XMLTK - UNIX command-line XML processing tools</li>
				</ul>

			</s:body>
		</s:slide>
		<s:slide>
			<s:h>Streaming XPath</s:h>
			<s:contents>how to handle XPaths in one pass</s:contents>
			<s:body>
			
				<h2>XSQ: Streaming XPath Queries</h2>
				<p>Allows child and descendant-or-self axes, and a wide range of predicates.</p>
				<ul>
					<li><tt>/Entry/Ref/Cite</tt></li>
					<li><tt>//article[@key]//title/text()</tt></li>
					<li><tt>/pub[year]/book[@id=3]/author</tt></li>
				</ul>
				<p>XSQ uses hierarchically structured pushdown transducers (HPDTs).</p>
				<ul>
					<li>Parts of the stream are buffered</li>
					<li>An automaton decides when to insert into, query, remove from or clear the buffer</li>
					<li>An HPDT consists of smaller PDTs, each with its own buffer</li>
				</ul>
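				<p>
					The buffering idea can be illustrated by hand for one
					simplified query, <tt>/pub[year]/book</tt>: a
					<tt>book</tt> cannot be output until a <tt>year</tt>
					child of the same <tt>pub</tt> has been seen. The sketch
					below is not an HPDT and not XSQ code; the event model
					and the name <tt>select_books</tt> are hypothetical.
				</p>
				<table class="code">
					<tr><td><pre><![CDATA[
```cpp
#include <string>
#include <vector>

// Hypothetical event model: the start and end of a pub element, and its
// year and book children in document order. Books carry a label.
enum class Ev { PubStart, Year, Book, PubEnd };

struct Event {
    Ev kind;
    std::string label;
};

// Returns the labels of the books selected by /pub[year]/book.
std::vector<std::string> select_books(const std::vector<Event>& events) {
    std::vector<std::string> output; // what reaches the output stream
    std::vector<std::string> buffer; // books whose predicate is undecided
    bool has_year = false;           // is [year] known to hold for this pub?

    for (const Event& e : events) {
        switch (e.kind) {
            case Ev::PubStart:
                has_year = false;
                buffer.clear();
                break;
            case Ev::Book:
                if (has_year) output.push_back(e.label); // decided: emit now
                else buffer.push_back(e.label);          // undecided: hold back
                break;
            case Ev::Year:
                has_year = true;
                // The predicate holds: flush the buffered books in order.
                output.insert(output.end(), buffer.begin(), buffer.end());
                buffer.clear();
                break;
            case Ev::PubEnd:
                buffer.clear(); // no year was seen: discard the candidates
                break;
        }
    }
    return output;
}

// Two pubs: the first has a year child, the second does not.
std::vector<std::string> demo_select_books() {
    std::vector<Event> events = {
        {Ev::PubStart, ""}, {Ev::Book, "A"}, {Ev::Year, ""},
        {Ev::Book, "B"}, {Ev::PubEnd, ""},
        {Ev::PubStart, ""}, {Ev::Book, "C"}, {Ev::PubEnd, ""}
    };
    return select_books(events);
}
```
					]]></pre></td></tr>
				</table>
				<p>
					Book A is held in the buffer until <tt>year</tt> arrives,
					book B is emitted at once, and book C is discarded when
					its <tt>pub</tt> ends without a <tt>year</tt>.
				</p>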
				<h2>An Algorithm for Streaming XPath Processing with Forward and Backward Axes</h2>
				<p>
					The algorithm allows the backward axes <em>parent</em> and <em>ancestor</em> together
					with the forward axes <em>child</em> and <em>descendant</em>. Predicates can
					only test for the existence of descendant nodes.
				</p>
				<ul>
					<li><tt>/Person/Adress/City</tt></li>
					<li><tt>//pub[year]/book/title</tt></li>
					<li><tt>//listitem/ancestor::category//name</tt></li>
					<li><tt>/X[parent::Y/Z]/W//Z</tt></li>
				</ul>
				<p>The algorithm uses X-dags and X-trees to represent XPaths.</p>
				<ul>
					<li>X-tree nodes represent node tests and edges represent axes</li>
					<li>X-dags are used to convert backwards constraints into forward constraints</li>
				</ul>

			</s:body>
		</s:slide>
		<!--s:slide>
			<s:h>XMLTK</s:h>
			<s:contents>UNIX command-line XML processing tools</s:contents>
			<s:body>
			</s:body>
		</s:slide>
		<s:slide>
			<s:h>NQXML, STX, WebLogic and others</s:h>
			<s:contents>works in progress</s:contents>
			<s:body>
			</s:body>
		</s:slide-->
	</s:section>
	<s:section>
		<s:h>Selected links</s:h>
		<s:slide>
			<s:h>Links to articles and more information</s:h>
			<s:body>

			    <p>General Parsers</p>
			    <ul>
					<li><a href="http://www.alphaworks.ibm.com/tech/xml4j">alphaWorks XML Parser for Java</a></li>
				    <li><a href="http://msdn.microsoft.com/library/default.asp?url=/nhp/default.asp?contentid=28000438">MSXML 4.0 SDK</a></li>
				    <li><a href="http://sourceforge.net/projects/nqxml/">SourceForge.net Project Info - NQXML</a></li>
					<li><a href="http://sourceforge.net/projects/streamdom/">SourceForge.net Project Info - Streaming XML -- DOM event processing</a></li>
					<li><a href="http://xml.apache.org/xerces2-j/">Xerces2 Java Parser Readme</a></li>
					<li><a href="http://sourceforge.net/projects/javaxmlstream/">XML Streaming Library (serialize objects)</a></li>
					<li><a href="http://sourceforge.net/projects/xmltk/">XML Toolkit</a></li>
					<li><a href="http://www.xmlhack.com/read.php?item=1594">xmlhack Streaming API for XML (StAX)</a></li>
				</ul>
				<p>Info</p>
				<ul>
			    	<li><a href="http://msdn.microsoft.com/library/default.asp?url=/nhp/default.asp?contentid=28000438">Introduction to the DOM (MSDN, with diagrams of typical DOM and SAX parsers)</a></li>
					<li><a href="http://www.saxproject.org/">Official SAX Website</a></li>
					<li><a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/xmlsdk/htm/sax_starter_4fg1.asp">What is SAX (MSDN)</a></li>
					<li><a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/xmlsdk/htm/sax_concepts_8kaa.asp">When Should I Use SAX (MSDN)</a></li>
					<li><a href="http://xml.apache.org/">xml.apache.org</a></li>
				</ul>
				<p>STX</p>
				<ul>
				    <li><a href="http://sourceforge.net/projects/joost/">SourceForge.net Project Info - Joost STX processor</a></li>
					<li><a href="http://sourceforge.net/projects/stx/">SourceForge.net Project Info - Streaming Transformations for XML (STX)</a></li>
				</ul>
				<p><a href="http://www.research.ibm.com/xaos/applications.html">IBM Research - XAOS (XML Analysis, Optimization, and Stuff)</a></p>

			</s:body>
		</s:slide>
	</s:section>
</s:slidecollection>

