FastInfoset Tutorial

Welcome

Welcome to the FastInfoset library tutorial. This tutorial will familiarize you with the basic concepts and techniques required for working with the Applied Informatics FastInfoset C++ library.

This tutorial assumes that you are familiar with basic POCO C++ Libraries programming techniques. You should also have read the FastInfoset Overview and be familiar with basic Fast Infoset and XML concepts.

Parsing Fast Infoset Documents

XML And Fast Infoset Parsing API Basics

There are two major types of XML (and therefore, Fast Infoset) parsing APIs: tree-based and event-based APIs.

Tree-based APIs map an XML or Fast Infoset document into an internal tree structure, then allow an application to navigate that tree. The Document Object Model (DOM) working group at the World Wide Web Consortium (W3C) maintains a recommended tree-based API for XML and HTML documents, and there are many such APIs from other sources.

An event-based API, on the other hand, reports parsing events (such as the start and end of elements) directly to the application through callbacks, and does not usually build an internal tree. The application implements handlers to deal with the different events, much like handling events in a graphical user interface. SAX is the best known example of such an API.

Tree-based APIs are useful for a wide range of applications, but they normally put a great strain on system resources, especially if the document is large. Furthermore, many applications need to build their own strongly typed data structures rather than using a generic tree corresponding to an XML document. It is inefficient to build a tree of parse nodes, only to map it onto a new data structure and then discard the original.

In both of those cases, an event-based API provides simpler, lower-level access to an XML document: you can parse documents much larger than your available system memory, and you can construct your own data structures using your callback event handlers.

Consider, for example, the following task: Locate the record element containing the word "Ottawa".

If your XML document were 20 MB in size (or even just 2 MB), it would be very inefficient to construct and traverse an in-memory parse tree just to locate this one piece of contextual information; an event-based interface would allow you to find it in a single pass using very little memory.
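The single-pass search can be sketched without any parser at all. The following is a minimal illustration, assuming a hypothetical RecordFinder class whose member functions mirror the shape of SAX-style callbacks. Only the element name record and the word "Ottawa" come from the task above; the class, its member names and the helper countOttawaRecords() are invented for this example and are not part of the FastInfoset library.

```cpp
#include <cassert>
#include <string>

// Hypothetical handler (not part of POCO or FastInfoset): its member
// functions mirror SAX-style callbacks so the logic can be read in
// isolation from any real parser.
class RecordFinder
{
public:
    explicit RecordFinder(const std::string& word): _word(word)
    {
        assert(!word.empty()); // an empty word would match every record
    }

    void startElement(const std::string& name)
    {
        if (name == "record")
        {
            _inRecord = true;
            _text.clear();
        }
    }

    void characters(const std::string& chunk)
    {
        // Only the text of the record currently being read is buffered.
        if (_inRecord) _text += chunk;
    }

    void endElement(const std::string& name)
    {
        if (name == "record")
        {
            if (_text.find(_word) != std::string::npos) ++_matches;
            _inRecord = false;
            _text.clear();
        }
    }

    int matches() const { return _matches; }

private:
    std::string _word;
    std::string _text;
    bool _inRecord = false;
    int _matches = 0;
};

// Feeds two hand-written records to the handler, as a parser would.
inline int countOttawaRecords()
{
    RecordFinder finder("Ottawa");
    finder.startElement("record");
    finder.characters("Toronto");
    finder.endElement("record");
    finder.startElement("record");
    finder.characters("Otta"); // text may arrive in several chunks
    finder.characters("wa");
    finder.endElement("record");
    return finder.matches();
}
```

Memory use is bounded by the text of one record rather than by the size of the whole document. The sketch does not handle record elements nested inside other record elements.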

To understand how an event-based API can work, consider the following sample document:

<?xml version="1.0"?>
<doc>
<para>Hello, world!</para>
</doc>

An event-based interface will break the structure of this document down into a series of linear events, such as these:

start document
start element: doc
start element: para
characters: Hello, world!
end element: para
end element: doc
end document

An application handles these events just as it would handle events from a graphical user interface: there is no need to cache the entire document in memory or secondary storage.

Finally, it is important to remember that it is possible to construct a parse tree using an event-based API, and it is possible to use an event-based API to traverse an in-memory tree.
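The first half of that statement, building a tree from events, can be illustrated with a stack of currently open elements. This is a minimal sketch under stated assumptions: Node, TreeBuilder and buildSampleTree() are hypothetical names made up for this example and have nothing to do with the POCO DOM classes, and the callbacks are simplified to take only an element name.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical tree node; not a POCO DOM class.
struct Node
{
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Hypothetical builder whose member functions mirror SAX-style callbacks.
class TreeBuilder
{
public:
    void startElement(const std::string& name)
    {
        auto node = std::make_unique<Node>();
        node->name = name;
        Node* raw = node.get();
        if (_open.empty())
            _root = std::move(node); // first element becomes the root
        else
            _open.back()->children.push_back(std::move(node));
        _open.push_back(raw);
    }

    void characters(const std::string& chunk)
    {
        if (!_open.empty()) _open.back()->text += chunk;
    }

    void endElement(const std::string& /*name*/)
    {
        assert(!_open.empty()); // start and end events must be balanced
        _open.pop_back();
    }

    const Node* root() const { return _root.get(); }

private:
    std::unique_ptr<Node> _root;
    std::vector<Node*> _open; // stack of currently open elements
};

// Replays the element and character events listed for the sample document.
inline std::string buildSampleTree()
{
    TreeBuilder builder;
    builder.startElement("doc");
    builder.startElement("para");
    builder.characters("Hello, world!");
    builder.endElement("para");
    builder.endElement("doc");
    const Node* root = builder.root();
    const Node* para = root->children[0].get();
    return root->name + "/" + para->name + ": " + para->text;
}
```

The reverse direction works the same way in principle: a recursive walk over such a tree can call startElement(), characters() and endElement() on any handler.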

The SAX2 API

SAX2, the Simple API for XML Version 2, is a common event-based interface implemented for many different XML parsers (and things that pose as XML parsers), just as ODBC is a common interface implemented for many different relational databases (and things that pose as relational databases). SAX2 was originally specified for the Java programming language, but the API can be implemented in other programming languages as well. The POCO XML library provides a C++ binding of the SAX2 API, and the FastInfoset library uses the same API.

Setting Up The Parser

To parse a Fast Infoset document using the SAX2 API, start by creating a subclass of Poco::XML::DefaultHandler:

class MyHandler: public Poco::XML::DefaultHandler
{
};

Next, we'll create an instance of Poco::FastInfoset::FastInfosetParser and pass it an instance of the MyHandler class.

MyHandler myHandler;
Poco::FastInfoset::FastInfosetParser parser;
parser.setContentHandler(&myHandler);

With the Fast Infoset parser set up, we can now parse a Fast Infoset document. The parser can read a Fast Infoset document from a stream, from a file (given its path), or from a buffer in memory. In our sample, we'll parse a Fast Infoset file, so we'll simply pass the path of the file (given on the command line) to the Fast Infoset parser. If the document is malformed, or if an I/O-related error occurs, the parser will throw an exception, so we'll wrap the parsing code in a try ... catch block.

try
{
    for (int i = 1; i < argc; i++)
    {
        parser.parse(argv[i]);
    }
}
catch (Poco::Exception& exc)
{
    std::cerr << exc.displayText() << std::endl;
}

Handling Events

Things get interesting when you start implementing methods to respond to XML parsing events (remember that we registered our class to receive XML parsing events in the previous section). The most important events are the start and end of the document, the start and end of elements, and character data. To find out about the start and end of the document, the client application implements the startDocument() and endDocument() methods in the MyHandler class:

void startDocument()
{
    std::cout << "Start document" << std::endl;
}

void endDocument()
{
    std::cout << "End document" << std::endl;
}

The startDocument() and endDocument() event handlers take no arguments. When the Fast Infoset parser finds the beginning of the document, it will invoke the startDocument() method once; when it finds the end, it will invoke the endDocument() method once.

These examples simply print a message to standard output, but your application can run arbitrary code in these handlers: most commonly, the code will build some kind of in-memory tree, produce output, populate a database, or extract information from the Fast Infoset stream.

The parser will signal the start and end of elements in much the same way, except that it will also pass some parameters to the startElement() and endElement() methods:

void startElement(const std::string& uri, 
                  const std::string& localName, 
                  const std::string& qname, 
                  const Poco::XML::Attributes& attrs)
{
    std::cout << "Start element" << std::endl;
    std::cout << "  uri:       " << uri << std::endl
              << "  localName: " << localName << std::endl
              << "  qname:     " << qname << std::endl;
    std::cout << "  Attributes: " << std::endl;
    for (int i = 0; i < attrs.getLength(); ++i)
    {
        std::cout << "    " << attrs.getLocalName(i) << " = " << attrs.getValue(i) << std::endl;
    }
}

void endElement(const std::string& uri, 
                const std::string& localName, 
                const std::string& qname)
{
    std::cout << "End element" << std::endl;
}

These methods print a message every time an element starts or ends, showing the element's namespace URI, local name and qualified name (qname). The qname contains the raw XML 1.0 name, which you must use for all elements that don't have a namespace URI.
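One way to apply this rule inside a handler is a small helper that decides which name to use. This is a minimal sketch, assuming a hypothetical free function elementName() that is not part of the POCO or FastInfoset APIs; the "{uri}localName" output format is just a common informal notation chosen for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the name a handler should work with.
// Without a namespace URI only the qname is meaningful; with one, the
// (uri, localName) pair identifies the element.
inline std::string elementName(const std::string& uri,
                               const std::string& localName,
                               const std::string& qname)
{
    if (uri.empty()) return qname;
    assert(!localName.empty()); // a namespaced element has a local name
    return "{" + uri + "}" + localName;
}
```

Called from startElement() with the three arguments it receives, the helper yields greeting for an element without a namespace and {http://www.appinf.com/sample/greeting}greeting for an element in that namespace.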

Finally, SAX2 reports regular character data through the characters() method; the following implementation will print all character data to the screen:

void characters(const char ch[], int start, int length)
{
    std::cout << std::string(ch + start, length) << std::endl;
}

Note that a SAX parser is free to chunk the character data any way it wants, so you cannot count on all of the character data content of an element arriving in a single characters event.
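A common way to cope with chunking is to append each chunk to a buffer and act on the text only when the element ends. The sketch below assumes a hypothetical TextCollector class; it deliberately does not derive from Poco::XML::DefaultHandler so that it compiles on its own, and its startElement() and endElement() are simplified to take only a qname. Only the characters() signature is copied from the handler shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler for elements that contain only text.
class TextCollector
{
public:
    void startElement(const std::string& /*qname*/)
    {
        _buffer.clear(); // start a fresh text run
    }

    void characters(const char ch[], int start, int length)
    {
        assert(start >= 0 && length >= 0);
        // Keep the chunk; do not act on it yet.
        _buffer.append(ch + start, static_cast<std::size_t>(length));
    }

    void endElement(const std::string& /*qname*/)
    {
        // Only now is the element's text known to be complete.
        if (!_buffer.empty()) _texts.push_back(_buffer);
        _buffer.clear();
    }

    const std::vector<std::string>& texts() const { return _texts; }

private:
    std::string _buffer;
    std::vector<std::string> _texts;
};

// Delivers "Hello, world!" in two chunks, as a parser is allowed to do.
inline std::vector<std::string> collectSample()
{
    const char data[] = "Hello, world!";
    TextCollector collector;
    collector.startElement("para");
    collector.characters(data, 0, 7); // "Hello, "
    collector.characters(data, 7, 6); // "world!"
    collector.endElement("para");
    return collector.texts();
}
```

The collector reports one string per text-only element regardless of how the parser split the data. Mixed content, where text and child elements alternate, would need a buffer per open element instead of a single one.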

The Complete Application

Here is the complete source code for the sample application:

#include "Poco/FastInfoset/FastInfosetParser.h"
#include "Poco/SAX/Attributes.h"
#include "Poco/SAX/DefaultHandler.h"
#include "Poco/Exception.h"
#include <iostream>
#include <string>


class MyHandler: public Poco::XML::DefaultHandler
{
public:
    void startDocument()
    {
        std::cout << "Start document" << std::endl;
    }

    void endDocument()
    {
        std::cout << "End document" << std::endl;
    }

    void startElement(const std::string& uri, 
                      const std::string& localName, 
                      const std::string& qname, 
                      const Poco::XML::Attributes& attrs)
    {
        std::cout << "Start element" << std::endl;
        std::cout << "  uri:       " << uri << std::endl
                  << "  localName: " << localName << std::endl
                  << "  qname:     " << qname << std::endl;
        std::cout << "  Attributes: " << std::endl;
        for (int i = 0; i < attrs.getLength(); ++i)
        {
            std::cout << "    " << attrs.getLocalName(i) << " = " << attrs.getValue(i) << std::endl;
        }
    }

    void endElement(const std::string& uri, 
                    const std::string& localName, 
                    const std::string& qname)
    {
        std::cout << "End element" << std::endl;
    }

    void characters(const char ch[], int start, int length)
    {
        std::cout << std::string(ch + start, length) << std::endl;
    }
};


int main(int argc, char* argv[])
{
    MyHandler myHandler;
    Poco::FastInfoset::FastInfosetParser parser;
    parser.setContentHandler(&myHandler);

    try
    {
        for (int i = 1; i < argc; i++)
        {
            parser.parse(argv[i]);
        }
    }
    catch (Poco::Exception& exc)
    {
        std::cerr << exc.displayText() << std::endl;
        return 1;
    }
    return 0;
}

Creating Fast Infoset Documents

Creating a Fast Infoset document is considerably simpler than parsing one. The programming interface for creating a Fast Infoset document is actually the same as the one for parsing, except that the direction of the events is reversed. Instead of registering an event handler class and waiting for callbacks from the parser, we "send" events to the Fast Infoset writer by calling the writer's methods.

First, we have to create an instance of Poco::FastInfoset::FastInfosetWriter. The writer always writes the generated Fast Infoset document to a stream, which we have to pass to the writer's constructor.

std::ofstream ostr("sample.fis", std::ios::binary);
Poco::FastInfoset::FastInfosetWriter writer(ostr);

We can now create a Fast Infoset document by calling startDocument(), startElement(), endElement(), characters() and the various other methods defined by Poco::FastInfoset::FISContentHandler. It is important to provide a matching call to endElement() for every call to startElement().
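One way to keep startElement() and endElement() calls balanced is to tie them to a C++ scope. The following is a sketch only: ElementScope, RecordingWriter and writeGreeting() are hypothetical names invented for this example, and RecordingWriter is a stand-in that merely records calls so the code runs without the FastInfoset library. The guard assumes nothing about the writer beyond the three-argument startElement() and endElement() calls used in this tutorial.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical RAII guard: emits startElement() on construction and the
// matching endElement() on destruction.
template <typename Writer>
class ElementScope
{
public:
    ElementScope(Writer& writer, const std::string& qname):
        _writer(writer), _qname(qname)
    {
        assert(!qname.empty());
        _writer.startElement("", "", _qname);
    }

    ~ElementScope()
    {
        _writer.endElement("", "", _qname);
    }

    ElementScope(const ElementScope&) = delete;
    ElementScope& operator = (const ElementScope&) = delete;

private:
    Writer& _writer;
    std::string _qname;
};

// Stand-in writer that records calls instead of producing a document.
struct RecordingWriter
{
    std::vector<std::string> calls;

    void startElement(const std::string&, const std::string&, const std::string& qname)
    {
        calls.push_back("start:" + qname);
    }

    void endElement(const std::string&, const std::string&, const std::string& qname)
    {
        calls.push_back("end:" + qname);
    }

    void characters(const std::string& text)
    {
        calls.push_back("text:" + text);
    }
};

inline std::vector<std::string> writeGreeting()
{
    RecordingWriter writer;
    {
        ElementScope<RecordingWriter> greeting(writer, "greeting");
        writer.characters("Hello, world!");
    } // endElement() is emitted here, when the guard goes out of scope
    return writer.calls;
}
```

With a real writer, endElement() may throw, and an exception must not escape a destructor, so production code would need to decide how to handle that case. Treat the guard as an illustration of the pairing rule rather than a finished utility.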

writer.startDocument();
writer.startElement("", "", "greeting");
writer.characters("Hello, world!");
writer.endElement("", "", "greeting");
writer.endDocument();

The example application will generate the Fast Infoset equivalent of the following XML document:

<greeting>Hello, world!</greeting>

In this example, we do not use XML namespaces. Therefore, we pass only the qname (third argument) to startElement() and endElement(), and leave the first two arguments, namespaceURI and localName, empty.

Following is the complete source code for the sample application:

#include "Poco/FastInfoset/FastInfosetWriter.h"
#include <fstream>


int main(int argc, char* argv[])
{
    std::ofstream ostr("sample.fis", std::ios::binary);
    Poco::FastInfoset::FastInfosetWriter writer(ostr);
    writer.startDocument();
    writer.startElement("", "", "greeting");
    writer.characters("Hello, world!");
    writer.endElement("", "", "greeting");
    writer.endDocument();

    return 0;
}

To add namespaces to our Fast Infoset document, we need to change the calls to startElement() and endElement() to include a namespaceURI and localName argument, as follows:

writer.startElement("http://www.appinf.com/sample/greeting", "greeting", "");
writer.characters("Hello, world!");
writer.endElement("http://www.appinf.com/sample/greeting", "greeting", "");

Furthermore, we can add attributes to an element by creating a Poco::XML::AttributesImpl object and passing it to startElement().

Poco::XML::AttributesImpl attrs;
attrs.addAttribute("", "language", "", "", "English");
writer.startElement("http://www.appinf.com/sample/greeting", "greeting", "", attrs);

Following is the complete source code for the sample with namespaces and attributes:

#include "Poco/FastInfoset/FastInfosetWriter.h"
#include "Poco/SAX/AttributesImpl.h"
#include <fstream>


int main(int argc, char* argv[])
{
    std::ofstream ostr("sample.fis", std::ios::binary);
    Poco::FastInfoset::FastInfosetWriter writer(ostr);
    writer.startDocument();
    Poco::XML::AttributesImpl attrs;
    attrs.addAttribute("", "language", "", "", "English");
    writer.startElement("http://www.appinf.com/sample/greeting", "greeting", "", attrs);
    writer.characters("Hello, world!");
    writer.endElement("http://www.appinf.com/sample/greeting", "greeting", "");
    writer.endDocument();

    return 0;
}

Converting Between XML And Fast Infoset

The FastInfoset library contains utility functions for converting an XML document into a Fast Infoset document and vice versa.

To convert an XML document into an equivalent Fast Infoset document, use the convertToFIS() method of the Poco::FastInfoset::Converter class.

std::ifstream istr("sample.xml");
std::ofstream ostr("sample.fis", std::ios::binary);
Poco::FastInfoset::Converter::convertToFIS(istr, ostr);

Similarly, the convertToXML() method is used to convert a Fast Infoset document into the corresponding XML document:

std::ifstream istr("sample.fis", std::ios::binary);
std::ofstream ostr("sample.xml");
Poco::FastInfoset::Converter::convertToXML(istr, ostr);

You can influence the conversion process and the outcome by passing various flags to the conversion methods. See the Poco::FastInfoset::Converter documentation for more information.

Acknowledgements

The text in this tutorial is partly based on the documentation from the SAX website, which has been placed in the public domain by its author.