[dmh2000] - Building A Data Structure With XML and Expat Part 2

daveh at dmh2000 dot com

[home]

Get Firefox!

Developed with jEdit

bricks

Expat (1)

comment on this essay

Building A Data Structure With Expat

Part 2 : Flat Documents To Data Structures

A regular language is parsable by a finite-state machine. It essentially has a 'flat' structure with no recursive nestings. Therefore it can be represented in C/C++ as a 'simple' data structure with no pointer cycles. The data structure can still have structs, pointers and arrays in it. The following illustrate some flat documents of increasing complexity, and how to use expat to read the XML representation and de-serialize it to a concrete data item.

Sample Document 1

<doc1> 
    <a>abc</a>
    <b>1.0</b>
    <c>0</c>
</doc1>

This data could be represented by the C struct:

struct doc1_t {
    char  *a;
    double b;
    int    c;
};

 


Sample Document 2

A more complex example might have lists of elements.

<doc2>
    <a>abc</a>
    <b>1.0</b>
    <c>0</c>
    <array>
        <e>0</e>
        <e>7</e>
        <e>8</e>
        <e>3</e>
        <e>2</e>
   </array>
</doc2>

This could be represented by the C struct

struct doc2_t {
    char  *a;
    double b;
    int    c;
    int array[10]; 
};

The 'array' member could be a a fixed array if you know the size bound or dynamic data structure of any type suitable for containing a variable length vector of integers. It might be a linked list, a dynamically allocated array or a vector<int> in C++.


Sample Document 3

The next level (but still flat) might have a struct as a sub-element.

<doc3>
    <a>abc</a>
    <b>1.0</b>
    <c>0</c>
    <array>
        <e>0</e>
        <e>7</e>
        <e>8</e>
        <e>3</e>
        <e>2</e>
   </array>
   <sub>
       <x>0</x>
       <y>1</y>
       <z>xyz</z>
   </sub>
</doc3>

This could be represented by the C struct(s)

struct sub_t {
   int x;
   int y;
   char *z;
};
struct doc3_t {
   char *a;
   double b;
   int c;
   int array[10]; 
   struct sub_t sub;

   // or 'struct sub_t *sub' with dynamic allocation of the sub_t data at run time
};

Document 3

Using Attributes vs. Elements

An alternate XML representation of doc3, and in my opinion easier and more efficient to use in this type of application, would use XML attributes for the scalar values represented in the structure, and use elements for the structural elements. With the equivalent data and mapping to the same C structs, doc3 could look like this:

<doc3 a="abc" b="1.0" c="0">
   <array>
       <e val="0"/>
       <e val="7"/>
       <e val="8"/>
       <e val="3"/>
       <e val="2"/>
   </array>
   <sub x="0" y="1" z="xyz"/>
</doc3>

One of the newbie questions one asks when starting out with XML is 'when do I use attributes and when do I use elements?'. The answer is always, 'it depends'. Some of it is a matter of taste. Some of it is what the data looks like. Go to Google.com and search for 'xml attribute vs. element' and read some of the entries. There are a variety of opinions.

When using expat, a very pragmatic reason to use attributes rather than elements is that the expat StartElement event hands you all the scalar attributes in one event, so for a structure that has only scalar elements you can process it all in the StartElement event and do nothing much in the EndElement event. If you have elements for each scalar value, you need a Start and and End event and parsing code for each scalar element.

In fact, the attribute vs. element opinions I see on the web are divided pretty well into the pragmatists vs. the purists.


No Recursive Structures

Anyway, you can continue on in this fashion defining sub structures, arrays, values etc with one limitation : no direct or indirect recursion with arbitrary levels of nesting. XML will let you do it but the implementation technique shown in this section doesn't support it. That implementation would require a push-down automaton rather than a finite state machine. .

<doc4>
   <sub1 a="1">
       <sub2>
           <sub1 a="2">
           ... arbitrary level of recursion
           </sub1>
       </sub2>
   </sub1>
<doc4>
// C representation   
struct sub1;
struct sub2;

struct sub1 {
   int a;
   struct sub2 *s;
};

struct sub2 {
   struct sub1 *s;
};

struct doc4 {
   struct *sub1;
};

With this type of structure you need to build a tree and you need a stack to do it with.. We'll get to how to handle this in the next section. So at this point assume no recursion in the data structures at all.


HOW TO IMPLEMENT THE CODE!

Finally we get to the implementation part. So lets proceed with the structure from 'doc3' above. The following describes a step by step process (a design pattern) to implement the code. I find that a very systematic approach is important to generating correct implementations over time.


A. Write the schema OR code up a representative example.

First you need to design your document independent of the implementation. There are two ways : design by example and format design by schema (or DTD).

An easy but loose design mechanism is to design by example..

 DOCUMENTATION-BY-EXAMPLE


<doc3 a="abc" b="1.0" c="0">
    <!-- elements must be in the order specified -->
    <!-- one instance only of element 'array'-->
    <array>
    <!-- at least one and up to 256 instances of element 'e' -->
        <e val="0"/>
        <e val="7"/>
        <e val="8"/>
        <e val="3"/>
        <e val="2"/>
    </array>
    <!-- one instance only of element 'sub'-->
    <sub x="0" y="1" z="xyz"/>
</doc3>

If you want to stay more formal (and have a way to validate your documents), here is a schema representation of the same thing for doc3. A schema is a format representation of what constitutes a valid document. A schema is defined in the XML schema language, which itself is XML. I find schema's easier than DTD's but that's just me.

If you are not familiar with schemas, you should read up on them at the w3.org site.

The schema definition language is very broad and allows a lot of capability to define strange and wonderful document structures. For this type of application, I find it useful to avoid generality and be restrictive in how the documents are defined. For example, if you have a set of elements, with a schema you can say 'in any order' or 'in the specified order' and the XML parser will be happy. By restricting it to a specified order, you make the finite state machine simpler and easier to understand. More choices mean more states.


B. Design the finite state machine

Now we have a document definition lets figure out the finite state machine that will parse it. Looking at the document, we can see the following steps that will occur (coinciding with the expat events)

    start a doc3 and get its attributes a,b and c
       start an array and loop
            start element v and get its attribute e
            end element v
            insert element v into array
       until end of array (greater than 256 is an error)
       start a sub and get its attributes x,y and z
       end of sub
    end of doc3

The state diagram would look like this

Once you do this design, the state diagram tells you exactly what to do.


C. Implement it

There are two approaches to building a state machine to run under expat: input centered or state centered. The approach chosen determines how you divide up the work into functions. Either approach implements the same thing.

In general, a state machine is implemented by having a state variable, typically an integer that represents the current state. Then inputs come in. Based on the current state and the current input, an action is taken and a new state is determined for the next iteration.

You can implement state machines in several different ways. One approach is to use a table driven state machine. I like table driven state machines when they are automatically generated but not when I have to construct and maintain them by hand. I find that maintenance is tough, especially when you come back to it later. Usually other people have a hard time figuring it out from the code. The state centered design can be implemented using a table if that is desired. So I am proceeding with a non data-driven approach.

The classical switch implementation approach looks like this:

int state_machine(int state,int input) 
{
    // the current state determines the main section to execute
    switch(state) {
       case state1:
	       // within a state, the input determines the specific action
           switch(input) {
           case input1:
               state = state2;
               break;
		   ...
           }
       break;
       case state1:
       ...
       }
   return state;
}
   
state = begin
...
// iterate the state machine once
state = state_machine(state,getInput());
// repeat in a loop until done
...

The outer switch works on the state, the inner switch or if then else in each case works on the input. This models the state diagram most closely. This would be the state centered approach.

In the input centered approach, the outer switch is based on the input, and the inner switch on the current state. In a loop/switch environment, the resulting code appears awkward. However, in the expat situation, inputs come as expat 'event's, each with its own subroutine, so it naturally is input centered.

void StartEvent(...) 
{
    // you already know the input is a start event
    switch(state) {    
    case state1:
       state = state2;
       ...
       break;
    ...
    }
}

void EndEvent(...)
{
   // you already know the input is an end event
   switch(state) {
   case state1:
       ...
   }
}

This is efficient and fits the expat model well, but it doesn't provide an intuitive view to the state diagram that a state centered model does. So what I do is convert the events to inputs and call a central state machine

void StartEvent(...)
{
    // send to the main state machine with the event type specified
    state = state_machine(start_event,state);
}


void EndEvent(...) 
{
    // send to the main state machine with the event type specified
    state = state_machine(end_event,state);
}


// all work is done in the main state machine function
int state_machine(int input,int state)
{
    switch(state) {
    case state1:
       switch(input) {
       case input1:
       ...
       }
    case state2:
       switch(intput) {
       case input1:
       }
    ...
    }
return state;
}

Ok, on to the actual implementation. Looking at the state diagram, we see that we need 9 states and 2 events. I usually set up an enum or const ints.

enum state_t {
   start,
   start_doc3,
   start_array,
   start_v,
   end_v,
   end_array,
   start_sub,
   end_sub,
   end
};


enum event_t {
   start_element,
   end_element
   // .. add more here if additional expat events are used
};

Define a 'userData' type that has the data required by the state machine

typedef struct state_data_t {
   int     state;    // state variable    
   doc3_t *doc;      // build up the data structure here
};

Implement the StartElement and EndElement event functions according to the expat documentation:

static void startElement(void *userData, const char *name, const char **atts)
{ 
    // call common state machine
    state_machine(start,userData,name,atts);
}

static void endElement(void *userData, const char *name)
{
    // call common state machine 
    state_machine(end,userData,name,0);
}

Implement the state machine

static void state_machine(int event,void *userData,const char*name,const char    **atts)
   {
   state_data_t *sd = (state_data_t *)userData;
   int newstate;
   
   switch(sd->state) {
   // START STATE
   case start:
       switch(event) {
       case start_element:
            if (strcmp(name,"doc3") == 0) {
                // allocate a doc3 data element
                sd->doc = (doc3 *)malloc(sizeof(doc3)); // or new doc3, whatever
                // get attributes 
                doc3_attributes(sd->doc,atts);
                // set new state
                newstate = start_doc3;
            }
            else {
               // error or ignore
            }
           break;
       case end_element:
            // error or ignore
            break;
       default:
            // error or ignore
            break;
       }

    // I did the above section formally, with all the cases,
    // but it is easy to see you could condense it to
    /*
    if ((strcmp(name,"doc3")==0)&&(event == start_element)) {
        // good, process
        }
    else {
        // error or ignore
    }
    */
    break;

    // .. same structure for all other cases
	case end_sub:
        if ((strcmp(name,"doc3")==0)&&(event == end_element)) {
           // DONE. Expat will exit the parse function
           // and the user can use the built up data structure
           sd->state = end;
           }
        else {
           // error or ignore
        }
	break;	
    // default case is a fatal error, someone set the state to an invalid value
    default:
        // error
        break;
    }
    sd->state 
}

D. Extend it to another doc type

In the above implementation, there is only one document type. To handle more than one document type, you modify the state diagram to branch at the beginning event to separate state machines for each document type. A good way to handle it is to actually implement separate state machines for each document type and vector to them based on the initial tag.


E. Automate the Process?

I have tried to describe a systematic approach to building an expat based document to data structure parser. This methodology cries out for automation. It is eminently possible to parse the schema for a document and generate the code for the state machine automatically. Thats the homework for this paper. That approach is what is done at some level in marshalling objects as XML with SOAP, but instead of a schema it starts with the data structure. In Java its pretty easy to use reflection to figure out how to represent a data structure in XML.


comment on this essay