[dmh2000] - Building A Data Structure With XML and Expat Part 1

daveh at dmh2000 dot com

[home]

Get Firefox!

Developed with jEdit

bricks

Expat (2)

comment on this essay

Building A Data Structure With Expat

Part 1 : Introduction

Expat is Jim Clarks C based XML parser. It was one of the first XML parser implementations and has a solid reputation as a fast and efficient parser. The implementation has been kept up to date and the current versions are available on Sourceforge.

Expat is an event driven parser, like SAX. When it parses a document, it fires 'events' as entities in the XML input file are encountered. If you aren't familiar with expat, read about it at XML Programming with C++ before proceeding with this article.


You could say that there are two different types of application that would use expat for inputting an XML document:

  1. An XML centric application where you really care about the XML itself and the application is meant to do XML things. For example, a DOM (document object model) builder might use expat to read a document and build an in memory DOM representation of the document. Or an application might have a requirement to validate an XML document against a schema, and expat might be used to read the document and the schema. Such an application would be relatively complex and general purpose.
  2. An application that uses XML documents for data transfer, but doesn't really care about XML beyond its use as a data transport mechanism. This application wants to read the XML, extract its data and move on. This is the type of application that this article addresses.

There seem to be plenty of web resources about the mechanics of of how to use the expat parser API, but not a whole lot of tutorial on how to apply it to actually use an XML document. This article describes a design and implementation approach to using expat to parse specific XML documents for use in application.


XML is a way to represent structured data. As such, it is analogous to a 'struct' in C or C++, but not a class, since it has data only, not methods. However, XML isn't restricted to fixed format structures so it really is analogous to a tree. Depending on the schema (the rules for the document) one instance of a document will be different from another and still be 'proper'. In C or C++ one might implement a tree with nodes and pointers.

So overall, an XML document will have a 'struct' at the top level, and members of the top-level struct will be either values or dynamic data structures in themselves. This is where the DOM comes in with Nodes, Nodelists etc.

But expat doesn't build the data structure for you. You only get events and the data associated with events. So the trick is to go from the the events to the internal data structure. And for the type of application I am addressing, you don't need to go to a general purpose data structure like a DOM. Instead you can go right to whatever specific internal data structures that match the specific documents you care about.


What about schemas and DTD's? Since expat isn't a validating parser, it won't do the work for you. My approach in writing applications that use XML files for data transfer is to always write a schema first, and offline validate some samples of the documents against the schema using a validating parser such as Apache Xerces. You have to define the documents before you start writing them (you do don't you?) and the XML schema language works pretty well for that and maps to C structures quite nicely. DTD's seem to me a little harder to use for this purpose.


A little computer science refresher is necessary before going to the next step. In language theory, there is something known as the Chomsky language hierarchy. It defines the structure of languages and what algorithms are required to parse them. It turns out that an XML schema can nicely define either a Regular language or a Context-Free language. A regular language is parsable by a finite-state machine. It essentially has a 'flat' structure with no recursive nestings. A context-free language can contain recursive elements (NODES nested within NODES of the same type) and as such needs a stack to parse it. Regular languages are a subset of context-free languages, so if you can handle a context-free language, you can handle regular languages for free.

An XML document can be either flat or recursive (my terminology). A flat document would be handled by a finite-state machine. It would not have any recursive elements, but it may have lists.

In the next section I discuss the actual implementation for a 'flat' document.


comment on this essay