XPath query tips

My new query page invites you to try writing your own queries, and a few adventurous souls have been doing just that. As I've mentioned before, I'm no world-class expert on this subject, but as I build up a corpus of searchable data on the one hand, and a set of canned and modifiable queries on the other, I'm learning. Indeed, one of my goals for the query page is to serve as a tutorial and playground, a place where folks (me included) can get ideas about what kinds of XHTML elements they might include in their own content, and how those elements could interact with XPath queries.

In the spirit of exploration and learning, here's a first installment of the tutorial. First, some background. The XPath expressions used in this search engine are embedded in an XSLT stylesheet. The stylesheet includes two XSLT templates. Here's the one that counts the number of results:

<xsl:template match="/">
<div>Results:
<xsl:value-of select="count(__QUERY__)"/>
</div>
<xsl:apply-templates />
<br clear="all"/>
<p>Entries searched: <xsl:value-of 
       select="count(//item)" /></p>
<p>Date of oldest entry searched: <xsl:value-of 
       select="//item[position()=last()]/date" /></p>
<p>Date of newest entry searched: <xsl:value-of 
       select="//item[position()=1]/date" /></p>
</xsl:template>

And here's the one that reduces the whole XML file to just matching elements:

<xsl:template match="__QUERY__" >
<p><b>
<a>
<xsl:attribute name="href">http://weblog.infoworld.com/udell/<xsl:value-of 
    select="ancestor::item/date" />.html#<xsl:value-of 
    select="ancestor::item/@num"/>
</xsl:attribute>
<xsl:value-of select="ancestor::item/title" />
</a> (<xsl:value-of select="ancestor::item/date" />)
</b>
<div>
<xsl:copy-of select="."/>
<xsl:if test="local-name(.)='blockquote' and @cite != ''">
Source: <xsl:value-of select="@cite"/>
</xsl:if>
</div>
<hr align="left" width="20%" />
</p>
</xsl:template>

In my forthcoming O'Reilly Network column I publish the script that implements the search engine, and discuss it in detail. But from the perspective of writing queries, here's what you need to know. First, the search script replaces __QUERY__, in both XSLT templates, with the text of an XPath pattern -- either a canned one, or one that you supply. Second, the XML file matched against the pattern has this simple structure:

<item num="a883">
<title>Server-based XPath search</title>
<date>2004/01/10</date>
<body>
<p>
...arbitrary XHTML content...
</p>
</body>
</item>
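
To make that structure concrete, here is a minimal sketch in Python, used purely for illustration (it is not the script that runs the search engine). It assumes the items sit inside a single wrapper element, here given the hypothetical name items, since only one item is shown above. The second sample item and the permalink helper are likewise assumptions made for the example. The sketch reproduces the three statistics the counting template reports and the link the data-reduction template assembles.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the entry shows only a single <item>, so the
# <items> root element and the first item below are assumptions.
SAMPLE = """\
<items>
  <item num="a890">
    <title>XPath query tips</title>
    <date>2004/01/19</date>
    <body><p>...arbitrary XHTML content...</p></body>
  </item>
  <item num="a883">
    <title>Server-based XPath search</title>
    <date>2004/01/10</date>
    <body><p>...arbitrary XHTML content...</p></body>
  </item>
</items>
"""

root = ET.fromstring(SAMPLE)
items = root.findall(".//item")         # roughly //item, in document order

entries_searched = len(items)           # count(//item)
newest = items[0].find("date").text     # //item[position()=1]/date
oldest = items[-1].find("date").text    # //item[position()=last()]/date


def permalink(item):
    """Build the link that the data-reduction template assembles
    from an item's date element and num attribute."""
    return "http://weblog.infoworld.com/udell/%s.html#%s" % (
        item.find("date").text, item.get("num"))
```

With these two sample items, permalink(items[1]) gives http://weblog.infoworld.com/udell/2004/01/10.html#a883, which is the date-plus-anchor form the template writes into each result's href.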

Third, the pattern is used, in the XSLT transformation, in two different ways. The counting template uses it in an XSLT select (<xsl:value-of select="count(__QUERY__)"/>), but the data-reduction template uses it in an XSLT match (<xsl:template match="__QUERY__" >).
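
The substitution step can be sketched in a few lines. This is an illustration in Python under stated assumptions, not the actual search script: the fill_in function is a hypothetical name, and the stylesheet is trimmed to just the two lines that carry the placeholder.

```python
# Trimmed-down stand-in for the stylesheet: only the two lines that
# contain the __QUERY__ placeholder are kept.
STYLESHEET = """\
<xsl:value-of select="count(__QUERY__)"/>
<xsl:template match="__QUERY__" >
"""


def fill_in(stylesheet, pattern):
    """Replace every __QUERY__ placeholder with the supplied pattern.

    No escaping is done here; a real script would have to deal with
    characters such as double quotes, & and < in the pattern.
    """
    return stylesheet.replace("__QUERY__", pattern)


filled = fill_in(STYLESHEET, "//blockquote[@cite]")
print(filled)
```

The same text lands in both places: once inside count(...) in a select attribute, and once on its own in a match attribute. That is why a query has to be acceptable in both roles.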

When I first wrote this entry, I used the term expression rather than pattern -- but really, the latter is correct. What's the difference between the two? Writing for MSDN Magazine, Aaron Skonnard explains:

Select does indeed expect an XPath expression, which is used to select a nodeset for further processing.
...
The match attribute, on the other hand, takes what's called a pattern. A pattern looks like an XPath expression because it shares the same syntax, but it's treated differently by the XSLT processor. A pattern is used for matching nodes in the tree against the specified criteria. [MSDN Magazine]

The XPath syntax you can use in a match pattern is more restrictive than the syntax you can use in a select expression. Since my XSLT stylesheet uses the syntax you supply in both contexts, your query is limited to the more restrictive flavor -- that is, it must be a pattern, not a full-blown expression.

Watching my search logs, I notice that the most common error is to supply something like this:

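count(//blockquote)

This fails because a pattern has to be a location path. The only functions it can begin with are id() and key(), and count() isn't one of them. Other functions can appear only inside a predicate. To count blockquotes, submit just //blockquote; the counting template already wraps your pattern in count().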
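
This kind of mistake could be caught before the stylesheet is ever run. The following Python sketch is a rough heuristic, not a pattern parser, and nothing in it comes from the actual search script; the function name and the regular expression are assumptions made for illustration. It only checks whether a query begins with a call that a match pattern cannot start with.

```python
import re

# Names that may legally be followed by "(" at the start of a pattern:
# the id() and key() functions, plus the XPath node tests.
ALLOWED_AT_START = {"id", "key", "text", "node", "comment",
                    "processing-instruction"}

# A name followed by "(" at the very beginning of the query.
LEADING_CALL = re.compile(r"\s*([A-Za-z_][\w.-]*)\s*\(")


def starts_with_disallowed_call(query):
    """Return True if the query begins with a function call that a
    match pattern cannot start with (a heuristic, not a parser)."""
    m = LEADING_CALL.match(query)
    return bool(m) and m.group(1) not in ALLOWED_AT_START


for q in ["count(//blockquote)",         # rejected: starts with count()
          "//blockquote",                # accepted: plain location path
          "//blockquote[count(p) > 1]",  # accepted: count() is in a predicate
          "id('a883')"]:                 # accepted: id() may start a pattern
    print(q, "->", "rejected" if starts_with_disallowed_call(q) else "accepted")
```

The check looks only at the start of the query, so it will miss other ways a query can fail to be a pattern. It does cover the common case above.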

Why restrict the XPath syntax to only what's valid for the match attribute of an XSLT template? Because that's what my little search engine does. It matches and displays a subset of the elements contained in my blog.


Former URL: http://weblog.infoworld.com/udell/2004/01/19.html#a890