FastInfoset User Guide

Introduction

This document contains detailed technical information for users of the Applied Informatics FastInfoset library. Readers are expected to have basic familiarity with XML and with XML programming concepts and techniques such as the SAX2 and DOM APIs.

Implementation Features

The Applied Informatics FastInfoset library supports the following Fast Infoset features:

Basic Features

  • Reading Fast Infoset documents from streams, files and memory
  • Writing Fast Infoset documents to streams and files
  • Converting XML documents into Fast Infoset documents
  • Converting Fast Infoset documents into XML documents

Dictionaries (Vocabularies)

  • Encoding Algorithm
  • Prefix
  • Namespace
  • Local Name
  • Other NCName
  • Other URI
  • Attribute Value
  • Other String
  • Character Content Chunk
  • Element Name
  • Attribute Name
  • Restricted Alphabet

Character Encodings

  • UTF-8
  • UTF-16 (parsing only)
  • Restricted Alphabet (parsing only)
  • Encoding Algorithm

Encoding Algorithms

  • hex
  • dword (base64)
  • short
  • int
  • long
  • bool
  • float
  • double
  • CDATA
  • UUID

Parsing Fast Infoset Documents

For parsing a Fast Infoset document, the Poco::FastInfoset::FastInfosetParser class is used. The class implements the Poco::XML::XMLReader interface, therefore the parser generates SAX2 events for the Fast Infoset document it parses.

XML Namespace Support

Namespace support is always enabled and cannot be disabled. Thus attempting to set the "http://xml.org/sax/features/namespaces" feature to a value other than true (the default) will fail. It is, however, possible to set the "http://xml.org/sax/features/namespace-prefixes" feature to false. In this case, no qualified name (qname) will be passed to the startElement() and endElement() methods of the Poco::XML::ContentHandler interface, resulting in a slightly better parsing speed.

Parser Performance

Parsing performance is also affected by the choice of the document vocabulary implementation used by the parser. The parser defaults to a simple vector-based document vocabulary, which provides the best parsing performance. Its only disadvantage is an implementation restriction: a vector-based vocabulary cannot subsequently be used for creating a Fast Infoset document with the Poco::FastInfoset::FastInfosetWriter. Since this is a rare use case, it is generally a good idea to use the vector-based document vocabulary with the parser.

The following code fragment shows how to configure the parser for best parsing performance, by disabling namespace prefixes and using the vector-based document vocabulary.

using Poco::FastInfoset::FastInfosetParser;
using Poco::FastInfoset::DocumentVocabulary;
using Poco::XML::XMLReader;

FastInfosetParser parser(DocumentVocabulary::VOC_VECTOR);
parser.setFeature(XMLReader::FEATURE_NAMESPACE_PREFIXES, false);

External Document Vocabulary

The parser can be configured to use an external document vocabulary by passing a Poco::FastInfoset::DocumentVocabulary object to the constructor. The document vocabulary object must have a URI set. When the parser processes a Fast Infoset document containing a reference to an external document vocabulary with the same URI, the supplied document vocabulary will be used.

If no document vocabulary has been given to the parser, the parser attempts to fetch a Fast Infoset or XML document from the external document vocabulary URI (using the default Poco::URIStreamOpener), and uses the document vocabulary obtained from parsing that document.

Handling Encoded Data

A Fast Infoset document can contain encoded character data. This means that, for example, a space-separated list of integer values in the document (either an attribute value or element content) is stored as a sequence of binary integer values in the Fast Infoset document, as opposed to a plain text representation. When such a document is parsed, the encoded values are decoded into XML-compatible character data and passed as such to the content handler. This behavior can be changed by configuring the parser with a subclass of Poco::FastInfoset::FISContentHandler instead of Poco::XML::ContentHandler. The FISContentHandler class extends the plain ContentHandler class with special handler methods for encoded data. Encoded data is then passed directly in binary form to the application, avoiding the overhead of decoding it into a textual representation.
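To make the difference between the two representations concrete, the following library-independent sketch packs a list of 32-bit integers into octets and also produces the space-separated text a plain ContentHandler would receive. The names packInts(), unpackInts() and toText() are hypothetical helpers, not part of the FastInfoset library, and the big-endian octet order is an assumption of this sketch rather than a statement about the library's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Pack each 32-bit signed integer as 4 octets, most significant octet first
// (big-endian order is an assumption of this sketch).
std::vector<unsigned char> packInts(const std::vector<std::int32_t>& values)
{
    std::vector<unsigned char> out;
    out.reserve(values.size() * 4);
    for (std::int32_t v : values)
    {
        const std::uint32_t u = static_cast<std::uint32_t>(v);
        out.push_back(static_cast<unsigned char>((u >> 24) & 0xFF));
        out.push_back(static_cast<unsigned char>((u >> 16) & 0xFF));
        out.push_back(static_cast<unsigned char>((u >> 8) & 0xFF));
        out.push_back(static_cast<unsigned char>(u & 0xFF));
    }
    return out;
}

// Reverse of packInts(): rebuild the integers from groups of 4 octets.
std::vector<std::int32_t> unpackInts(const std::vector<unsigned char>& octets)
{
    assert(octets.size() % 4 == 0); // precondition: whole 32-bit values only
    std::vector<std::int32_t> values;
    for (std::size_t i = 0; i + 4 <= octets.size(); i += 4)
    {
        const std::uint32_t u = (static_cast<std::uint32_t>(octets[i]) << 24)
                              | (static_cast<std::uint32_t>(octets[i + 1]) << 16)
                              | (static_cast<std::uint32_t>(octets[i + 2]) << 8)
                              |  static_cast<std::uint32_t>(octets[i + 3]);
        values.push_back(static_cast<std::int32_t>(u));
    }
    return values;
}

// The textual form: a space-separated list of decimal numbers.
std::string toText(const std::vector<std::int32_t>& values)
{
    std::ostringstream text;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        if (i > 0) text << ' ';
        text << values[i];
    }
    return text.str();
}
```

In the binary form every value occupies exactly four octets and can be handed to the application without parsing digits. The textual form must be formatted when it is produced and parsed again by the application, which is the overhead a FISContentHandler avoids.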

The next sample shows how to use the FISContentHandler class to handle encoded data in a Fast Infoset document.

#include "Poco/FastInfoset/FastInfosetParser.h"
#include "Poco/FastInfoset/FISContentHandler.h"
#include "Poco/SAX/Attributes.h"
#include "Poco/Exception.h"
#include "Poco/Types.h"
#include "Poco/UUID.h"
#include <cstddef>
#include <iostream>
#include <string>


class MyHandler: public Poco::FastInfoset::FISContentHandler
{
public:
    // ContentHandler
    void setDocumentLocator(const Poco::XML::Locator* loc)
    {
        // not used by FastInfoset parser
    }

    void startDocument()
    {
        std::cout << "Start document" << std::endl;
    }

    void endDocument()
    {
        std::cout << "End document" << std::endl;
    }

    void startElement(const std::string& uri, 
                      const std::string& localName, 
                      const std::string& qname, 
                      const Poco::XML::Attributes& attrs)
    {
        std::cout << "Start element" << std::endl;
        std::cout << "  uri:       " << uri << std::endl
                  << "  localName: " << localName << std::endl
                  << "  qname:     " << qname << std::endl;
        std::cout << "  Attributes: " << std::endl;
        for (int i = 0; i < attrs.getLength(); ++i)
        {
            std::cout << "    " << attrs.getLocalName(i) << " = " << attrs.getValue(i) << std::endl;
        }
    }

    void endElement(const std::string& uri, 
                    const std::string& localName, 
                    const std::string& qname)
    {
        std::cout << "End element" << std::endl;
    }

    void characters(const char ch[], int start, int length)
    {
        std::cout << std::string(ch + start, length) << std::endl;
    }

    void ignorableWhitespace(const char ch[], int start, int length)
    {
    }

    void processingInstruction(const std::string& target, const std::string& data)
    {
    }

    void startPrefixMapping(const std::string& prefix, const std::string& uri)
    {
    }

    void endPrefixMapping(const std::string& prefix)
    {
    }

    void skippedEntity(const std::string& name)
    {
    }

    // FISContentHandler
    void binaryData(const char* data, std::size_t size)
    {
        std::cout << "binary data (size " << size << ")" << std::endl;
    }

    void encodedData(Poco::Int16 value)
    {
        std::cout << "short: " << value << std::endl;
    }

    void encodedData(Poco::Int32 value)
    {
        std::cout << "int: " << value << std::endl;
    }

    void encodedData(Poco::Int64 value)
    {
        std::cout << "long: " << value << std::endl;
    }

    void encodedData(bool value)
    {
        std::cout << "bool: " << value << std::endl;
    }

    void encodedData(float value)
    {
        std::cout << "float: " << value << std::endl;
    }

    void encodedData(double value)
    {
        std::cout << "double: " << value << std::endl;
    }

    void encodedData(const Poco::UUID& value)
    {
        std::cout << "UUID: " << value.toString() << std::endl;
    }
};


int main(int argc, char* argv[])
{
    MyHandler myHandler;
    Poco::FastInfoset::FastInfosetParser parser;
    parser.setContentHandler(&myHandler);

    try
    {
        for (int i = 1; i < argc; i++)
        {
            parser.parse(argv[i]);
        }
    }
    catch (Poco::Exception& exc)
    {
        std::cerr << exc.displayText() << std::endl;
        return 1;
    }
    return 0;
}

Creating a DOM Tree

Creating a DOM tree from a Fast Infoset document is straightforward. All that is needed is to wire up the FastInfosetParser to a Poco::XML::DOMBuilder, which will then build the DOM tree from the SAX events received from the Fast Infoset parser.

Poco::FastInfoset::FastInfosetParser parser;
Poco::XML::DOMBuilder domBuilder(parser);
Poco::AutoPtr<Poco::XML::Document> pDoc = domBuilder.parse("sample.fis");

Creating Fast Infoset Documents

Creating a Fast Infoset document is considerably simpler than parsing one. The programming interface for creating a Fast Infoset document is actually the same as the one for parsing, except that the direction of the SAX events is reversed. Instead of registering an event handler class and waiting for callbacks from the parser, the application sends events to the Fast Infoset writer by calling the writer's methods.

The FastInfosetWriter

Fast Infoset documents are created with the Poco::FastInfoset::FastInfosetWriter class. The writer always writes the generated Fast Infoset document to a stream, which is passed to the writer's constructor.

std::ofstream ostr("sample.fis", std::ios::binary);
Poco::FastInfoset::FastInfosetWriter writer(ostr);

The content of the Fast Infoset document is created by calling startDocument(), startElement(), endElement(), characters() and the various other methods defined by Poco::FastInfoset::FISContentHandler. It is important to provide a matching call to endElement() for every call to startElement(), otherwise the generated Fast Infoset document will be invalid.

writer.startDocument();
writer.startElement("", "", "greeting");
writer.characters("Hello, world!");
writer.endElement("", "", "greeting");
writer.endDocument();

The above example will generate the Fast Infoset equivalent of the following XML document:

<greeting>Hello, world!</greeting>

The generated Fast Infoset document does not use XML namespaces. To add namespaces to a Fast Infoset document, the calls to startElement() and endElement() must be changed to include a namespaceURI and localName argument, as follows:

writer.startElement("http://www.appinf.com/sample/greeting", "greeting", "");
writer.characters("Hello, world!");
writer.endElement("http://www.appinf.com/sample/greeting", "greeting", "");

Furthermore, attributes can be added to an element by creating a Poco::XML::AttributesImpl object and passing it to startElement().

Poco::XML::AttributesImpl attrs;
attrs.addAttribute("", "language", "", "", "English");
writer.startElement("http://www.appinf.com/sample/greeting", "greeting", "", attrs);
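Because the writer relies on the application to pair every startElement() call with a matching endElement() call, a small scope guard can make that pairing automatic. The following is a minimal sketch of the idea, not part of the FastInfoset library: ElementScope and RecordingWriter are hypothetical names, and RecordingWriter is only a stand-in so that the sketch is self-contained. The three-argument startElement() and endElement() calls mirror the ones shown above, but using the guard with FastInfosetWriter has not been verified here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Calls startElement() on construction and the matching endElement() on
// destruction, so the two calls cannot get out of step.
template <typename Writer>
class ElementScope
{
public:
    ElementScope(Writer& writer, const std::string& uri,
                 const std::string& localName, const std::string& qname):
        _writer(writer), _uri(uri), _localName(localName), _qname(qname)
    {
        _writer.startElement(_uri, _localName, _qname);
    }

    ~ElementScope()
    {
        _writer.endElement(_uri, _localName, _qname);
    }

    ElementScope(const ElementScope&) = delete;
    ElementScope& operator=(const ElementScope&) = delete;

private:
    Writer& _writer;
    std::string _uri, _localName, _qname;
};

// Stand-in writer for this sketch only: records events as text and checks nesting.
class RecordingWriter
{
public:
    void startElement(const std::string&, const std::string&, const std::string& qname)
    {
        _open.push_back(qname);
        _log += "<" + qname + ">";
    }

    void endElement(const std::string&, const std::string&, const std::string& qname)
    {
        assert(!_open.empty() && _open.back() == qname); // must close the innermost element
        _open.pop_back();
        _log += "</" + qname + ">";
    }

    void characters(const std::string& text) { _log += text; }

    const std::string& log() const { return _log; }
    bool balanced() const { return _open.empty(); }

private:
    std::vector<std::string> _open;
    std::string _log;
};

// Nested scopes are closed automatically, innermost first.
std::string writeGreeting()
{
    RecordingWriter writer;
    {
        ElementScope<RecordingWriter> root(writer, "", "", "root");
        ElementScope<RecordingWriter> greeting(writer, "", "", "greeting");
        writer.characters("Hello, world!");
    }
    assert(writer.balanced());
    return writer.log();
}
```

One caveat of this design: the closing call runs in a destructor, so a writer whose endElement() can throw would terminate the program if it threw there. Code that must handle such errors should call endElement() explicitly instead.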

Writing Encoded And Binary Data

Encoded data (using the encoding algorithms defined by Fast Infoset) can be written to a Fast Infoset document by using the binaryData() and encodedData() member functions of FastInfosetWriter.

Raw binary data (octet strings) can be written to a Fast Infoset document by calling binaryData(), passing in a pointer to a character buffer and the buffer's length. The data is written using the dword encoding algorithm.

Other data, such as signed integers (16-bit, 32-bit and 64-bit), Boolean values, single and double precision floating-point numbers, and UUIDs, can be written to a Fast Infoset document with the encodedData() member function of the writer.

The following example demonstrates this.

#include "Poco/FastInfoset/FastInfosetWriter.h"
#include "Poco/Exception.h"
#include "Poco/Path.h"
#include <fstream>
#include <cmath>


int main(int argc, char** argv)
{
    std::ofstream ostr("sample.fis", std::ios::binary);

    Poco::FastInfoset::FastInfosetWriter writer(ostr);
    writer.startDocument();
    writer.startElement("", "", "root");
    writer.startElement("", "", "string");
    writer.characters("Hello, world!");
    writer.endElement("", "", "string");
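    writer.startElement("", "", "integer");
    // 42 is an int literal, so the overload taking a 32-bit integer is expected to be selected
    writer.encodedData(42);
    writer.endElement("", "", "integer");
    writer.startElement("", "", "float");
    // 4*atan(1.0) is pi as a double, so the value is written with the double overload
    writer.encodedData(4*std::atan(1.0));
    writer.endElement("", "", "float");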
    writer.endElement("", "", "root");
    writer.endDocument();

    return 0;
}
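Raw octets passed to binaryData() have no direct textual form in XML. The feature list above pairs this data with base64, which is the usual textual counterpart. The following library-independent sketch of standard base64 encoding (as defined in RFC 4648) illustrates the size difference between the two forms. The function base64Encode() is a hypothetical helper for illustration only, not a FastInfoset API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode size octets as base64 text: every group of 3 octets becomes
// 4 characters, and a final partial group is padded with '='.
std::string base64Encode(const unsigned char* data, std::size_t size)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string out;
    out.reserve(((size + 2) / 3) * 4);
    for (std::size_t i = 0; i < size; i += 3)
    {
        const std::size_t remaining = size - i;

        // Collect up to 3 octets into one 24-bit group.
        unsigned long group = static_cast<unsigned long>(data[i]) << 16;
        if (remaining > 1) group |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (remaining > 2) group |= data[i + 2];

        // Emit four 6-bit digits, replacing unused ones with padding.
        out.push_back(alphabet[(group >> 18) & 0x3F]);
        out.push_back(alphabet[(group >> 12) & 0x3F]);
        out.push_back(remaining > 1 ? alphabet[(group >> 6) & 0x3F] : '=');
        out.push_back(remaining > 2 ? alphabet[group & 0x3F] : '=');
    }
    assert(out.size() == ((size + 2) / 3) * 4); // text is always 4 characters per 3 octets
    return out;
}
```

The textual form is about one third larger than the raw octets, and it has to be decoded again by the reader. Storing the octets directly in the Fast Infoset document avoids both costs.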

Character Data Indexing

The writer supports automatic indexing of character data in element content and attribute values. Indexing means that a character string occurring more than once in the document is stored only once. Further occurrences of the string are replaced with a reference to the stored copy, thus reducing the size of the resulting Fast Infoset document.

Only character data element content up to a given maximum length (default is 7) will be indexed. The maximum length can be set with setMaxIndexedStringLength().
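The indexing scheme described above can be illustrated with a small, library-independent sketch. StringIndexer and Item are hypothetical names, the zero-based indices are a simplification, and measuring the length in bytes and treating the limit as inclusive are assumptions of this sketch. The writer's actual vocabulary implementation is not shown in this guide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One item of the toy output: either a literal string or a reference
// to a string that was stored earlier.
struct Item
{
    bool isReference;
    std::size_t index;    // meaningful only if isReference is true
    std::string literal;  // meaningful only if isReference is false
};

class StringIndexer
{
public:
    explicit StringIndexer(std::size_t maxIndexedLength = 7):
        _maxLength(maxIndexedLength)
    {
    }

    // Returns a literal for the first occurrence of a string (and for every
    // string longer than the limit), and a reference for repeated short strings.
    Item add(const std::string& s)
    {
        if (s.size() > _maxLength) return Item{false, 0, s}; // too long: never indexed

        auto it = _indexOf.find(s);
        if (it != _indexOf.end()) return Item{true, it->second, std::string()};

        const std::size_t index = _table.size();
        _table.push_back(s);
        _indexOf.emplace(s, index);
        return Item{false, 0, s};
    }

    // Looks up the string a reference stands for.
    const std::string& resolve(std::size_t index) const
    {
        assert(index < _table.size());
        return _table[index];
    }

    std::size_t size() const { return _table.size(); }

private:
    std::size_t _maxLength;
    std::vector<std::string> _table;                         // index -> string
    std::unordered_map<std::string, std::size_t> _indexOf;   // string -> index
};
```

Adding "red" twice yields one literal followed by one reference, while a string longer than the limit is emitted as a literal every time. Raising the limit trades a larger table for more opportunities to replace repeated strings.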

The writer defaults to using a hash table-based document vocabulary. While this is in many cases the fastest choice, it has some limitations due to hash table size restrictions. For very large Fast Infoset documents, the hash table could overflow. In this case it is a better choice to use the map-based vocabulary, by passing the appropriate value to the writer constructor.

Converting Between XML And Fast Infoset

The FastInfoset library contains utility functions for converting an XML document into a Fast Infoset document and vice versa.

To convert an XML document into an equivalent Fast Infoset document, the convertToFIS() method of the Poco::FastInfoset::Converter class is used.

std::ifstream istr("sample.xml");
std::ofstream ostr("sample.fis", std::ios::binary);
Poco::FastInfoset::Converter::convertToFIS(istr, ostr);

Similarly, the convertToXML() method is used to convert a Fast Infoset document into the corresponding XML document:

std::ifstream istr("sample.fis", std::ios::binary);
std::ofstream ostr("sample.xml");
Poco::FastInfoset::Converter::convertToXML(istr, ostr);

The conversion process and the resulting document can be controlled by passing various flags to the conversion methods. See the Poco::FastInfoset::Converter documentation for more information.