Jul 1

Boost Spirit Part I (Validating Against a Grammar)

Category: C++, Programming

I haven’t done a post in a while, but I thought the boost spirit library was worth an entry. Spirit provides a way of specifying and parsing a custom language grammar right inline in C++ without the need for a code generator like lex/yacc. Spirit makes it so easy to specify and use a grammar that you should never hand write even the most trivial parser again.

The full code for this example can be found here.

Although spirit is a very impressive tool, the provided documentation can be a bit schizophrenic because there are so many ways of processing the same input. As a result I’m going to do three or four articles that demonstrate the major features of the library and then let you get the details from the reference.

This article will demonstrate how to specify a grammar and then validate text against it. The following articles will show how to do something useful with the parsed data.

The BNF Grammar

The first step in using spirit is to describe the grammar you wish to parse. Spirit uses a combination of templates and operator overloading to create a syntax within C++ that looks a lot like BNF notation. So the first step in defining a spirit grammar is to describe the language you wish to parse in BNF. For the purposes of this article I’m going to use the following:

<prop_decl> ::= <identifier>, '=', <string_literal>
<prop_list> ::= <prop_decl>, {',', <prop_decl>}
<record_list> ::= <record>, {',', <record>}
<record> ::= <identifier>, <identifier>, '{',
            [',', <record_list>]
        ) |
<language> ::= [<record_list>]

A file matching this grammar would look something like this:

Person Bob{
    height="6 ft.",
    Address home{
        street="1234 applecourt rd",
        city="New York",
Person Sally{

The Spirit Grammar

Once you’ve defined you grammar you’ll need to translate it into a spirit grammar class. Spirit is similar to the BNF, but not identical. For our purposes the following apply (for a full listing of operators check out this documentation):

operator BNF Spirit
grouping (<expression>) (expression)
optional [<expression>] !expression
repeat {<expression>} *expression
follows <expression> , <expression> expression >> expression
or <expression> | <expression> expression | expression

Now create a class derived from boost::spirit::grammar and specify the rules as follows (full example here):

struct record_grammar:public grammar<record_grammar>{
    template<class ScannerT>
    struct definition{
        rule<ScannerT> identifier;
        rule<ScannerT> string_literal;
        rule<ScannerT> prop_decl;
        rule<ScannerT> prop_list;
        rule<ScannerT> record_list;
        rule<ScannerR> record;
        definition(const record_grammar &self){
            //identifier and string_literal are explained
            //in the next section
            identifier = ...;
            string_literal = ...;
            prop_decl = identifier >> ch_p('=') >> string_literal;
            prop_list = prop_decl >> *(ch_p(',') >> prop_decl);
            record_list = record >> *(ch_p(',') >> record);
            record = identifier >> identifier >> ch_p('{') >>
                        prop_list >> 
                        [ch_p(',') >> record_list)]
                    ) | record_list
                ) >>
        //the top level rule to use when
        //parsing input
        rule<ScannerT> const &start(){return record_list;}

Identifier and String Literal

The previous definition covers everything except for the identifier and string_literal rule definitions. These rules are special because they define tokens in terms of raw input. Specifically, whitespace is significant when parsing both.

For instance, “a b c” should be parsed as three separate identifiers rather than one identifier “abc”. Conversely identifier >> ch_p('{') should match "b {", "b {", and so on regardless of whitespace.

In the older lex/yacc style parsers there are actually two separate code generators: one to parse the raw input into tokens and one to verify the grammar in terms of those tokens. In modern tools like spirit though, both steps are specified in the same place.

The key to handling whitespace-sensitive rules is the use of the parser directive lexeme_d.

//match anything with a leading letter followed by 
identifier = lexeme_d[
    alpha_p >> *(alnum_p >> ch_p('_') )
//match anything starting with "
//and containing any characters until
//the second " is found (allow for the escape
//sequence \" so that "'s may be embedded
//in the string.
string_literal = lexeme_d[
    ch_p('"') >>
        (anychar_p - ch_p('"')) |
    ) >>

When the parser hits the lexeme_d[code] directive it stops ignoring any skip characters (typically whitespace) in the contained rule.

Using the Grammar

The final step is to actually use this grammar to parse and validate an input. In order to do this, you need an input, a grammar, and a rule describing what characters to skip when not in lexeme_d mode.

The following code reads in all input on STDIN into a string which is then validated against the record_grammar grammar class, skipping any whitespace between tokens:

int main(){
    string input;
    string temp;
    while(getline(cin, temp)){
        input += temp;
    record_grammar g;
    //space_p is a built-in rule that matches whitespace
    parse_info<> info = parse(input.c_str(), g, space_p);
        cout << "Error found at location: " << info.stop << endl;
        cout << "Success!" << endl;

Where Spirit Falls Short

Spirit is a great tool, but it does have some shortcomings. The first is that, like many domain specific languages, spirit's error messages are often very unintuitive. It can be extremely frustrating to figure out why your grammar won't compile. Additionally, and for much the same reason, boost will not tell you if you have defined an ambiguous grammar (i.e. where the same text can be matched by multiple rules).

The other big issue with spirit is the compile time. Even for moderately complicated grammars gcc will grind practically to a halt.

As a result I would suggest the following strategies when using spirit:

  1. Do not write the whole grammar first before compiling. Write it a rule at a time and check the compile at each step.
  2. Do the due diligence of writing out the BNF first rather than just jumping in.
  3. Unit test (not a bad idea in general) your grammar to make sure the edge conditions are correct.
  4. If you are going to use the same grammar in multiple places wrap it in a single class or function rather than re-compiling the boost::parse call in every instance.
  5. Move the boost::parse code into it's own cpp with only required headers included. This will reduce unnecessary recompiles.

Attaching Actions to the Grammar

Go on to the next article in order to learn how to do something useful with your grammar.

The Author

Michael Smit is a software engineer in Seattle, Washington who works for amazon


5 Comments so far

  1. [...] the last article we covered how to define a grammar in the boost spirit library. This article will show you how to [...]

  2. Rockin' Pete August 24th, 2008 4:26 pm

    Does your grammar compile? Or should the term ‘property_list’ actually be ‘prop_list’?

  3. mike August 24th, 2008 5:59 pm

    You are correct. The source code linked to from the top of the article works, but sometimes I tweak the excerpts and then mess them up:)

  4. [...] article expands the example from the previous two, here and here, to add more specific error [...]

  5. [...] Boost Spirit Part I (Validating Against a Grammar) Boost Spirit Part II (Attaching Actions to Your Grammar) Boost Spirit III (Adding Error Handling) [...]