Trailer

A nice trailer to start: Web 2.0... The Machine is Us/ing Us by Michael Wesch, Assistant Professor of Cultural Anthropology at Kansas State University.

The XML language

XML stands for Extensible Markup Language. This means that:

XML is simply a flexible format to mark up data with human-readable tags. It is worth stressing that an XML document is a text document and can be read and modified with any text editor. In particular, XML is not a:

An XML parser is a piece of software that reads an XML document and determines whether it is well-formed. A well-formed document adheres to the XML grammar rules. Note that an XML parser must not accept a malformed document, nor may it try to fix the document: it is required to report the errors.
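As a minimal sketch of this behaviour, the following Java fragment asks the parser bundled with the Java platform whether a piece of text is well-formed. The class and method names (WellFormedCheck, firstError) are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {
    /** Returns null if the text is well-formed XML, otherwise the parser's error message. */
    static String firstError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null; // no fatal error was raised: the document is well-formed
        } catch (SAXException e) {
            return e.getMessage(); // the parser reports the error; it does not repair the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For instance, firstError("<a><b/></a>") yields null, whereas firstError("<a><b></a>") yields an error message because the b element is never closed.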

XML is useful both for human beings and computers. Common scenarios in which XML can be used by people include:

The resulting documents are called text-centric documents. These are XML documents usually written by humans for other humans to read. They are semipermanent XML documents with a lot of text and little structure.

Common scenarios in which XML can be used by computers include:

The resulting documents are called data-centric documents. These are XML documents usually written by computers for other computers to read. They are transitory XML documents with a rich structure and a lot of raw data.

For reasons of interoperability, organizations may agree to use only certain tags. These tag sets are called XML applications. Examples are:

Namespaces are a mechanism that allows an XML document to mix different XML applications. They have two purposes in XML:

  1. To distinguish between elements and attributes from different vocabularies that have different meanings but happen to share the same name.
  2. To group all the related elements and attributes from a single XML application together so that software can recognize them.
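The first purpose can be illustrated with a small Java sketch in which a bibliography title and an XHTML title live in the same document. The bibliography namespace URI and the class name NamespaceDemo are hypothetical; only the XHTML namespace URI is a real one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NamespaceDemo {
    // Hypothetical namespace URI for the bibliography vocabulary (an assumption of this sketch).
    static final String BIB_NS = "http://example.org/bibliography";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Two elements share the local name "title" but belong to different vocabularies.
    static final String XML =
        "<b:book xmlns:b='" + BIB_NS + "' xmlns:h='" + XHTML_NS + "'>"
      + "<b:title>XML in a nutshell</b:title>"
      + "<h:html><h:head><h:title>Book page</h:title></h:head></h:html>"
      + "</b:book>";

    /** Counts the elements with local name "title" in the given namespace. */
    static int countTitles(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, "title").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Asking for titles in BIB_NS or in XHTML_NS finds one element each, so software can pick out the vocabulary it understands and ignore the other.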

Example

The following XML document contains a small bibliography:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bibliography>
  <article key="G03">
    <author>
      <name>Georg</name>
      <surname>Gottlob</surname>
    </author>
    <title>XPath processing in a nutshell</title>
    <year>2003</year>
    <journal>SIGMOD Records</journal>
  </article>
  <book key="HM04" isbn="0-596-00764-7">
    <author>
      <name>Elliotte</name>
      <name>Rusty</name>
      <surname>Harold</surname>
    </author>
    <author>
      <surname>Means</surname>
    </author>
    <title>XML in a nutshell</title>
    <year>2004</year>
    <cite item="G03"/>
    <publisher>O'Reilly</publisher>
  </book>
</bibliography>

Here is an extended version that includes the XHTML application and that uses namespaces.

The simplest way to parse a document is to load it into a web browser that understands XML. The browser will display the document if it is well-formed and report the errors otherwise. Alternatively, one can use a standalone XML parser such as the xmllint command-line tool, which is part of the libxml2 library developed for the GNOME project (but usable outside of the GNOME platform).

Schema languages: DTD

The markup permitted in a particular XML application can be documented in a schema. The most broadly supported schema language and the only one defined by the XML 1.0 specification is the Document Type Definition (DTD).

A DTD allows you to place some constraints on the structure an XML document takes. It lists all the elements, attributes, and entities the document uses and the context in which it uses them. DTDs say nothing about the data type of an element's content, and very little about the value of an attribute beyond a few lexical types such as ID and IDREF. For instance, you cannot say that price is a real number, or name is a string, or born is a date.

An XML document is said to be valid if it adheres to the definitions of the associated DTD. As a general rule, web browsers do not validate documents but only check them for well-formedness. If you are developing an application, you can use the parser's API to validate the document. If you are writing documents by hand, you can either use an online validator or download and run a local program.
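As a rough sketch of validation through a parser's API, the following Java fragment turns on DTD validation in the JAXP parser and collects the validity errors. To stay self-contained it assumes a DTD embedded in the document's DOCTYPE rather than an external file; the class name DtdValidation and the constant AUTHOR_DTD are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidation {
    // Internal DTD subset for a single author element (an assumption of this sketch).
    static final String AUTHOR_DTD =
        "<!DOCTYPE author [<!ELEMENT author (name*, surname+)>"
      + "<!ELEMENT name (#PCDATA)><!ELEMENT surname (#PCDATA)>]>";

    /** Returns the validity and well-formedness errors found; an empty list means valid. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check against the DTD, not only well-formedness
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored here */ }
                @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            problems.add(e.getMessage()); // fatal error: the document is not even well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }
}
```

Validity errors are recoverable, so the parser passes them to the error handler instead of throwing; without a handler that records them they are easy to miss.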

Example

The following is a DTD for the bibliography example:

<!ELEMENT bibliography     (article | book)*>

<!ELEMENT article          (author+, title, year, cite*, journal)>
<!ATTLIST article          key ID #REQUIRED>

<!ELEMENT book             (author+, title, year, cite*, publisher)>
<!ATTLIST book             key ID #REQUIRED
                           isbn CDATA #REQUIRED>

<!ELEMENT cite             EMPTY>
<!ATTLIST cite             item IDREF #REQUIRED>

<!ELEMENT author           (name*, surname+)>

<!ELEMENT name             (#PCDATA)>
<!ELEMENT surname          (#PCDATA)>
<!ELEMENT title            (#PCDATA)>
<!ELEMENT year             (#PCDATA)>
<!ELEMENT journal          (#PCDATA)>
<!ELEMENT publisher        (#PCDATA)>

Here is a DTD for the extended version.

An example of an online validator is the XML Validation Form of the Brown University Scholarly Technology Group. A command-line XML parser and validator is xmllint. For instance, you can validate our bibliography example bib.xml against a DTD bib.dtd with the following command:

xmllint --dtdvalid bib.dtd bib.xml

Schema languages: W3C XML Schema

DTD has some major drawbacks, in particular:

  1. it has no notion of type. The contents of leaf elements or attributes can be any character data or none at all. The assignment of a type to an element or an attribute adds semantics to that element or attribute;
  2. the referencing mechanism is too simple: for instance, it is not possible to restrict the scope of uniqueness for ID attributes to a fragment of the entire document. Also, only individual attributes can be used as keys;
  3. it is not described in XML notation, which would have been handy to manipulate schemas with XML tools, e.g., to check that a DTD is well-formed or to query the schemas.

XML Schema is a W3C Recommendation that addresses these problems. In particular, it contains a powerful type system that allows you to define simple and complex types and also to derive types from other types in the style of object-oriented programming languages. Types can be attached to elements and attributes, adding meaning to their interpretation. Unfortunately, this comes at a price: XML Schema is generally complicated to understand and hard to use for non-experts (in fact, the W3C specification is difficult to read even for XML experts!).

Example

The following is an XML Schema for the bibliography example:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  
  <!-- Element declarations   -->
  <xs:element name="author" type="authorType"/>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="surname" type="xs:string"/>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="year" type="xs:gYear"/>
  <xs:element name="cite" type="citeType"/>
  <xs:element name="journal" type="xs:string"/>
  <xs:element name="publisher" type="xs:string"/>
  <xs:element name="coreItem" type="coreItemType"/>
  <xs:element name="article" type="articleType"/>
  <xs:element name="book" type="bookType"/>  
  <xs:element name="bibliography" type="bibliographyType">
    <!-- Key constraints -->
    <xs:key name="primaryKey">
      <xs:selector xpath="*"/>
      <xs:field xpath="@key"/>
    </xs:key>
    <xs:keyref name="foreignKey" refer="primaryKey">
      <xs:selector xpath="*/cite"/>
      <xs:field xpath="@item"/>
    </xs:keyref> 
  </xs:element>
  
  <!-- Attribute declarations   -->
  <xs:attribute name="isbn" type="isbnType"/>
  <xs:attribute name="key" type="xs:string"/>
  <xs:attribute name="item" type="xs:string"/>
  
  <!-- Type definitions   -->
  <xs:simpleType name="isbnType">
    <xs:restriction base="xs:string">
      <xs:pattern value="\d-\d\d\d-\d\d\d\d\d-\d"/>
    </xs:restriction>
  </xs:simpleType>
  
  <xs:complexType name="citeType">
    <xs:attribute ref="item"/>
  </xs:complexType> 
  
  <xs:complexType name="authorType">
    <xs:sequence>    
      <xs:element ref="name" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="surname" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  
  <xs:complexType name="coreItemType">
    <xs:sequence>    
      <xs:element ref="author" maxOccurs="unbounded"/>
      <xs:element ref="title"/>
      <xs:element ref="year"/>
      <xs:element ref="cite" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute ref="key"/>
  </xs:complexType>
  
  <xs:complexType name="articleType">
    <xs:complexContent>
      <xs:extension base="coreItemType">
        <xs:sequence>
          <xs:element ref="journal"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  
  <xs:complexType name="bookType">
    <xs:complexContent>
      <xs:extension base="coreItemType">
        <xs:sequence>
          <xs:element ref="publisher"/>
        </xs:sequence>
        <xs:attribute ref="isbn"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  
  <xs:complexType name="bibliographyType">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element ref="article"/>
      <xs:element ref="book"/>
    </xs:choice>
  </xs:complexType>
  
</xs:schema>

Here is a Schema for the extended version.

Validating a document against an XML schema requires a validating parser that supports XML Schema such as the open source Xerces parser from the Apache Xerces Project. This is written in Java and includes, in the archive xercesSamples.jar, a command-line program jaxp.SourceValidator that can be used to validate. The syntax for jaxp.SourceValidator follows:

java jaxp.SourceValidator -i bib.xml -a bib.xsd
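Validation can also be done from a program. The following is a minimal sketch using the javax.xml.validation package of the Java platform, with a cut-down schema that keeps only the year declaration from the bibliography schema; the class name SchemaValidation is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaValidation {
    // Cut-down schema: only the typed year element of the bibliography schema.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='year' type='xs:gYear'/>"
      + "</xs:schema>";

    /** Returns null if the document is valid against XSD, otherwise the first error message. */
    static String validate(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no error handler set, the validator throws on the first validity error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here validate("<year>2004</year>") returns null, while validate("<year>last year</year>") returns an error message: this is the kind of type check that a DTD cannot express.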

Query languages: XPath

The XML Path Language (XPath) is a simple language for retrieving XML elements from a single XML document. XPath is used across different XML technologies: by itself as a simple query language for XML; in XQuery to retrieve XML elements that may be further processed to answer a query; in XSLT to select the elements to which template rules are applied when transforming an XML document; in W3C XML Schema to locate keys and key references; and in XPointer to point to particular XML elements in a linked XML document.

XPath views an XML document as a tree structure: elements are mapped to nodes, and subelements correspond to child nodes. You can convert your XML document into its tree representation using the XMLTree tool combined with GraphViz. For instance, our bibliography document is mapped to this tree.

Some examples of XPath queries follow:

All books in the bibliography

/bibliography/book

All books published in 2004

/descendant::book[year = "2004"]

All articles written by Georg Gottlob

/descendant::article[author[name = "Georg" and surname = "Gottlob"]]

All articles that follow, in the document, the one written by Georg Gottlob

/descendant::article[author[name = "Georg" and surname = "Gottlob"]]/following-sibling::article

The first author of the last bib item

/bibliography/*[position() = last()]/author[position() = 1]

The number of authors of the bib item with key HM04

count(/id("HM04")/author)

The bib items cited in bib item with key HM04

id(/id("HM04")/cite/@item)

You can try all the above queries with any XPath processor. An XPath processor is a piece of software that evaluates XPath queries. BaseX is a complete Java-based XPath processor with a nice graphical interface. It shows results in different formats including text, tree and tree map. Saxon is an XSLT and XQuery command-line processor. Since XPath is used in both XSLT and XQuery, you can try XPath queries with Saxon as well. For instance, to evaluate the query /descendant::article on the XML document bib.xml, run the following command:

java net.sf.saxon.Query -s bib.xml "{/descendant::article}"

The option -s sets the initial context node to the root of the given XML document. If you prefer to store the query in the file articles.xpl, then you can type the following command:

java net.sf.saxon.Query -s bib.xml articles.xpl
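XPath queries can also be evaluated from a program. The sketch below uses the javax.xml.xpath package of the Java platform on a shortened copy of the bibliography; the class name XPathDemo is made up for illustration. Note that this package implements XPath 1.0, so queries that rely on XPath 2.0 features, such as a path starting with /id("HM04"), may not be accepted.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathDemo {
    // Shortened copy of the bibliography document (some elements omitted).
    static final String BIB =
        "<bibliography>"
      + "<article key='G03'>"
      + "<author><name>Georg</name><surname>Gottlob</surname></author>"
      + "<title>XPath processing in a nutshell</title><year>2003</year>"
      + "</article>"
      + "<book key='HM04'>"
      + "<author><surname>Harold</surname></author><author><surname>Means</surname></author>"
      + "<title>XML in a nutshell</title><year>2004</year>"
      + "</book>"
      + "</bibliography>";

    /** Evaluates the query on BIB and returns the string value of the result. */
    static String evaluate(String query) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(query, new InputSource(new StringReader(BIB)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For example, evaluate("/descendant::book[year = \"2004\"]/title") returns the string "XML in a nutshell". When the query selects several nodes, this method returns the string value of the first one in document order only.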

Query languages: XQuery

The XML query language (XQuery) is a complete query language for XML databases. It stands to XML databases as SQL stands to relational ones. An XML database is a collection of (related) XML documents.

XQuery works on sequences, not on node sets as XPath 1.0 does. A sequence contains items which are either XML nodes or atomic values (like a string or a number). XPath and XQuery are related in that XPath expressions are used inside XQuery queries. Hence, we may consider XPath as a syntactic fragment of XQuery.

A typical expression in XQuery works as follows:

  1. load: one or more XML documents are loaded from the database;
  2. retrieve: XPath expressions are used to retrieve sequences of tree nodes from the loaded documents;
  3. process: the retrieved node sequences are processed with XQuery operations like filtering (creating a new sequence by selecting some of the items of the original one) and ordering (sorting the items of the sequence according to some criteria);
  4. construct: new sequences may be constructed and combined with the retrieved ones;
  5. output: a final sequence is returned as output.
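The five steps above can be mimicked in plain Java. The following sketch is not XQuery and does not use an XQuery processor: it only reproduces the load, retrieve, process, construct and output pattern with the XPath and DOM packages of the Java platform. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class QueryPipeline {
    /** Keys of the bib items with more than one author, ordered by year. */
    static List<String> multiAuthorKeysByYear(String xml) {
        NodeList nodes;
        try {
            // load + retrieve: parse the document and select the bib items with XPath
            XPath xpath = XPathFactory.newInstance().newXPath();
            nodes = (NodeList) xpath.evaluate("/bibliography/*",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
        List<Element> items = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            items.add((Element) nodes.item(i));
        }
        // process: filter out items with at most one author, then order by year
        items.removeIf(item -> item.getElementsByTagName("author").getLength() <= 1);
        items.sort(Comparator.comparing(QueryPipeline::yearOf));
        // construct + output: build and return a new sequence holding only the keys
        List<String> keys = new ArrayList<>();
        for (Element item : items) {
            keys.add(item.getAttribute("key"));
        }
        return keys;
    }

    /** Text of the first year element of the item (assumes every item has one). */
    static String yearOf(Element item) {
        return item.getElementsByTagName("year").item(0).getTextContent();
    }
}
```

An XQuery processor expresses the same pipeline declaratively, with for, where, order by and return clauses, instead of explicit loops and lists.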

Some examples of XQuery statements follow:

The title and the year of publication of all bib items with more than one author sorted by year

for $item in doc("bib.xml")/bibliography/*
where count($item/author) > 1
order by $item/year
return <item key = "{$item/@key}">
         {$item/title}
         {$item/year}
       </item>

The bib items that are cited by at least one other item

let $doc := doc("bib.xml")
for $item in $doc/bibliography/*
let $citation := for $c in $doc/descendant::cite
                 where $c/@item = $item/@key
                 return $c
where count($citation) > 0
return <item key = "{$item/@key}"/>

An XQuery processor is a piece of software that evaluates queries in XQuery. See the XQuery resources page for a list of XQuery processors. Saxon is a good example. To evaluate with Saxon the query contained in the file xquery.xql, type the following:

java net.sf.saxon.Query xquery.xql

Unfortunately, at the moment there is no standard language to update an XML document.

Stylesheet languages: XSLT

Extensible Stylesheet Language Transformations (XSLT) is an XML application to transform one XML document into another document in some format (such as XML, HTML, or plain text). Since one typical application of XSLT is to render the information contained in an XML document by mapping it into an HTML one, XSLT documents are also called XSLT stylesheets.

Example

Here is an example of an XSLT stylesheet that maps our bibliography XML document into an HTML document:

<?xml version="1.0"?> 
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  
  <xsl:template match="/">
    <html>
      <head>
        <title>A small bibliography</title>
      </head>
      <body>
        <h1>A small bibliography</h1>
        
        <xsl:if test="bibliography/article">
          <h2>Articles</h2>
          <ol>
            <xsl:apply-templates select="bibliography/article">                  
              <xsl:sort select="year" order="descending"/>
            </xsl:apply-templates>
          </ol>
        </xsl:if>
        
        <xsl:if test="bibliography/book">
          <h2>Books</h2>
          <ol>
            <xsl:apply-templates select="bibliography/book">                  
              <xsl:sort select="year" order="descending"/>
            </xsl:apply-templates>
          </ol>
        </xsl:if>
        
      </body>
    </html>
  </xsl:template>
    
  <xsl:template match="bibliography/article">
    <a name="{@key}"/>
    <li>
      <xsl:apply-templates select="author">
        <xsl:sort select="surname"></xsl:sort>
      </xsl:apply-templates>
      <b><xsl:apply-templates select="title"/></b>.
      <xsl:apply-templates select="journal"/>,
      <xsl:apply-templates select="year"/>.
      <xsl:if test="cite">
        References: <xsl:apply-templates select="cite"/>
      </xsl:if>
    </li>
  </xsl:template>
  
  <xsl:template match="bibliography/book">
    <a name="{@key}"/>
    <li>
      <xsl:apply-templates select="author">
        <xsl:sort select="surname"></xsl:sort>
      </xsl:apply-templates>
      <b><xsl:apply-templates select="title"/></b>,
      <xsl:apply-templates select="year"/>.
      <xsl:apply-templates select="publisher"/>.
      <xsl:apply-templates select="@isbn"/>.
      <xsl:if test="cite">
        References: <xsl:apply-templates select="cite"/>
      </xsl:if>
    </li>
  </xsl:template>
  
  <xsl:template match="author">
    <xsl:apply-templates select="name"/>
    <xsl:apply-templates select="surname"/> 
    <xsl:if test="position() != last()">, </xsl:if>  
    <xsl:if test="position() = last()">. </xsl:if>  
  </xsl:template>
  
  <xsl:template match="name">
    <xsl:apply-templates/>
    <xsl:value-of select="string(' ')"/>
  </xsl:template>
  
  <xsl:template match="cite">
    <a href="#{@item}">
      <xsl:apply-templates select="@item"/>
    </a>
  </xsl:template>

</xsl:stylesheet>

Here is an XSLT that works for the extended version.

An XSLT stylesheet is associated with an XML document using the processing instruction with target xml-stylesheet. For instance, we can associate the stylesheet contained in the file bib.xsl with our bibliography XML document by adding the following instruction to the XML document:

<?xml-stylesheet type="application/xml" href="bib.xsl"?>

An XSLT processor is a piece of software that takes an XML document and a corresponding XSLT stylesheet as input and outputs the result document by applying the stylesheet to the XML document. An XSLT processor can be built into a web browser, like Mozilla TransforMiiX, or it can be a standalone program like Saxon. With Saxon, we can apply bib.xsl to bib.xml with the following syntax:

java net.sf.saxon.Transform bib.xml bib.xsl

If the XML document contains the stylesheet processing instruction, you can use this syntax:

java net.sf.saxon.Transform -a bib.xml

Moreover, use option -snone to preserve the whitespace text nodes in the XML document and option -o file to send output to the named file.
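A stylesheet can also be applied from a program. The sketch below uses the javax.xml.transform package of the Java platform with a deliberately tiny stylesheet that lists titles by descending year. It declares version 1.0 because the processor bundled with the JDK typically implements XSLT 1.0, whereas the bibliography stylesheet above declares version 2.0. The class name TransformDemo is made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformDemo {
    // Tiny stylesheet: output the title of every bib item, most recent year first.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:apply-templates select='bibliography/*'>"
      + "<xsl:sort select='year' order='descending'/>"
      + "</xsl:apply-templates>"
      + "</xsl:template>"
      + "<xsl:template match='bibliography/*'>"
      + "<xsl:value-of select='title'/><xsl:text>;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies XSL to the given XML text and returns the result as a string. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

As in the bibliography stylesheet, the root template only decides which nodes to process and in what order; the template matching each bib item decides what to output for it.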

Stylesheet languages: CSS

The names of XML elements describe the meaning of their contents. However, they say nothing about the presentation of the content. CSS is a language for describing the appearance of elements in a document. It does not change the markup of an XML document but simply applies the presentation rules to the existing content.

Example

Here is an example of an XML document styled with CSS. In fact, multiple stylesheets have been associated with the document. If you are using Mozilla Firefox, you can switch between the stylesheets with the menu option View/Page Style.

The stylesheet for an XML document is specified with the processing instruction xml-stylesheet in the prolog of the XML document. For instance, a CSS stylesheet for our bibliography can be associated with the bibliography XML document using the following processing instruction:

<?xml-stylesheet type="text/css" href="bib.css"?>

Programming languages: JAXP

The Java platform (from version 5.0) offers powerful XML processing features, including:

org.xml.sax
This package defines the de facto standard Simple API for XML (SAX). This is an API for parsing XML documents that exploits the event-based model for XML processing.
org.w3c.dom
This package defines the W3C standard Document Object Model (DOM). This is an API for parsing XML documents that exploits the tree-based model for XML processing.
javax.xml.parsers
This package provides high-level interfaces for instantiating SAX and DOM parsers for parsing and, optionally, validating XML documents against a DTD.
javax.xml.validation
This package provides support for validation of XML documents against W3C XML Schema.
javax.xml.transform
This package supports the evaluation of XSLT programs for transforming XML documents.
javax.xml.xpath
This package supports the evaluation of XPath expressions for selecting nodes from XML documents.
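To give a flavour of the event-based model, here is a minimal SAX sketch that counts the elements with a given name. The parser calls startElement once for each opening tag it meets and never builds a tree in memory, in contrast to DOM. The class name SaxElementCounter is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxElementCounter {
    /** Counts the elements with the given (qualified) name using SAX events. */
    static int countElements(String xml, String name) {
        int[] count = {0}; // one-element array so the anonymous handler can update it
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        if (qName.equals(name)) {
                            count[0]++; // event: an opening tag with this name was read
                        }
                    }
                });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }
}
```

Because nothing is kept once an event has been handled, this style suits large data-centric documents; the tree-based DOM model is more convenient when the program needs to navigate or modify the document.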

Notice that the SAX and DOM packages are endorsed standards with respect to Java. This means that they are part of the Java platform but are not defined by Sun, which is why they have the org prefix.

Unfortunately, at the moment there is no support for XQuery in the Java platform.

For instance, you might build a Java application that exports our XML bibliography into a relational database and vice versa. Or you might create a Java web interface to search and possibly update the XML bibliography. These applications may take advantage of some XML technologies described above.

Caffè XML - Massimo Franceschet