Feb 19

Using Javacc

Category: Java, Programming

If you want to parse a custom language in java then Javacc is your tool. At the moment this is probably more a reminder to myself than a great article for anyone else to read, but as you find it useful feel free to take a look. More after the link…

What is Javacc?

Javacc is a parser/scanner generator. In plain English, javacc is what you use when you want to write a parser for some custom language. It basically replaces the abandoned jflex and cup tools which themselves are a re-tread of the venerable lex and yacc toolset.

Both these tools are comprised of a lexical analyzer which breaks up test input into a string of tokens based on regular expressions and a grammar parser which maps particular token patterns to fragments of code. The developer writes a set of configuration files which are in turn used to generate efficient source code that parses the specified language.

Javacc takes things to the next level by combining the lexical analysis and grammar parsing into a single tool with a single configuration file. It also replaces the old grammar specification used by these tools with something more closely approximating BNF making it possible to describe complex grammars in a concise manner.

In addition, unlike antler and jflex/cup, javacc creates a self-contained set of classes that do not have to be linked to a library.

An Example

All the source code for the following example is Here.

For an example I’m going to create a fictional language called the Example language. A snippet of code in the Example language looks something like this:

RecordType RecordName{
  property1="value"
  property2=12
  property3=0xFF
  property4=12.3
  RecordType RecordName{
     property="value"
  }
}
RecordType RecordName2{
  ...
}

Basically it is a simplified, human-readable XML. What I want to do is create a parser that can read a file in this format and create an object tree that describes it using two objects: RecordInfo and PropertyInfo

RecordInfo

package com.fourthmouse.grammar_example;
import java.util.List;
import java.util.LinkedList;
public class RecordInfo{
  private List<RecordInfo> children;
  private List<PropertyInfo> properties;
  private String name; 
  private String type; 
  public void setName(String name){
    this.name = name;
  }
  public String getName(){
    return name;
  }
  public void setType(String type){
    this.type = type;
  }
  public String getType(){
    return type;
  }
  public void setChildren(List<RecordInfo> children){
    this.children = children; 
  }
  public List<RecordInfo> getChildren(){
    return children;
  }
  public void setProperties(List<PropertyInfo> properties){
    this.properties = properties; 
  }
  public List<PropertyInfo> getProperties(){
    return properties; 
  }
};

PropertyInfo

package com.fourthmouse.grammar_example;
class PropertyInfo{
  private String name; 
  private Object value;
  public void setName(String name){
    this.name = name;
  }
  public String getName(){
    return name;
  }
  public void setValue(Object value){
    this.value= value;
  }
  public Object getValue(){
    return value;
  }
}

The Grammar File

Now we need to define a parser that will create a tree of the RecordInfo and PropertyInfo objects. The basic format of the grammar file is something like this:

PARSER_BEGIN(ParserNameHere)
//code to add to the generated parser class here
PARSER_END(ParserNameHere)
SKIP:
{
    //regular expressions describing text
    //to strip from input before parsing
}
TOKEN:
{
   //a list of regular expressions mapped to token names
}
ReturnType methodName() :
{
   //BNF definition
}

Defining the Tokens

The first step is to define the token types we are going to use. Basically we need to be able to identify the following:

  1. Names – bob, sally, _fran_
  2. Number literals – 5, 13.5, 0xFF
  3. String literals – “bob”, “blah de blah”
SKIP :
{
  "\r"|"\n"|"\t"|"\f"|" "
}
TOKEN :
{
  < HexNumber           : "0"["x","X"](["0"-"9","a"-"f","A"-"F"])+ >  |
  < DecimalNumber       : (["-","+"])?(["0"-"9"])+ > |
  < FloatingPointNumber : 
        (["+","-"])?(["0"-"9"])*(".")?(["0"-"9"])+((["E","e"])(["+","-"])?(["0"-"9"])+)? > |
  < StringLiteral       : "\""(~["\""])*"\"" > |
  < Null                : "null" > |
  < NameToken           : ["a"-"z","A"-"Z","_"] (["a"-"z","A"-"Z","_","0"-"9"])* >
}

Note that the tokens are matched in the order declared. This means that, for instance, the string “null” will be matched by the Null token expression and not the NameToken expression.

Defining the Grammar

Now that we’ve defined our basic tokens, it is time to define the grammar. This is done in terms of what look very much like method declarations:

returnType methodName():
{
  //block of java code which runs before any grammar defined in this method is parsed
}
{
  //BNF expression {
    code to execute when BNF expression is matched
  }
}

In our case we need to define what a Document looks like. The straight BNF would be something like:

Document:
RecordList
RecordList:
( RecordDeclaration )*
RecordDeclaration:
<NameToken> <NameToken> "{"
    PropertyList
    RecordList
"}"
PropertyList:
( PropertyDeclaration )*
PropertyDeclaration:
(<NameToken> "=" (<HexNumber> | <DecimalNumber> | <FloatingPointNumber> | <StringLiteral> | <Null>) )*

Which translates to something like this:

Document

List<RecordInfo> Document():
{
  List<RecordInfo> ret;
}
{
  ret = RecordList(){return ret;}
 <EOF>
 {
    return ret;
  }
}

RecordList

List<RecordInfo> RecordList():
{
  List<RecordInfo> ret = new LinkedList<RecordInfo>();
  RecordInfo record;
}
{
  ( record = RecordDeclaration(){ret.add(record);} )*
  {return ret;}
}

RecordDeclaration

RecordInfo RecordDeclaration():
{
  RecordInfo ret = new RecordInfo();
  List<PropertyInfo> propertyList;
  List<RecordInfo> recordList;
  Token t;
}
{
  t = <NameToken> {ret.setType(t.image);}
  t = <NameToken> {ret.setName(t.image);}
  "{"
  propertyList = PropertyList(){ret.setProperties(propertyList);}
  recordList = RecordList(){ret.setChildren(recordList);}
  "}"
  {
     return ret;
   }
}

PropertyList

List<PropertyInfo> PropertyList():
{
  List<PropertyInfo> ret = new LinkedList<PropertyInfo>();
  PropertyInfo property;
}
{
  (property = PropertyDeclaration(){ret.add(property);})*
  {
    return ret;
  }
}

PropertyDeclaration

PropertyInfo PropertyDeclaration():
{
  PropertyInfo ret = new PropertyInfo();
  Token t;
  Object obj;
}
{
  t = <NameToken>{ret.setName(t.image);}
  "="
  ( 
    t = <StringLiteral> {ret.setValue(t.image);} |
    <Null> {ret.setValue(null);} |
    obj = IntegerNumber() {ret.setValue(obj);} | //IntegerNumber converts integer string to Long object
    obj = FloatNumber(){ret.setValue(obj);}         //FloatNumber conversion float string to Double object
  )
  {
     return ret;
  }
}

Defining the Parser Object

The final step is to define the parser object itself. This allows the programmer to define the package, any imports used by the java code in the grammar, and any extra methods, etc that may be useful in the generated class.

PARSER_BEGIN(RecordParser)
package com.fourthmouse.grammar_example;
import java.util.List;
import java.util.LinkedList;
public class RecordParser{
  public static void main(String args[]) throws ParseException{
    RecordParser parser = new RecordParser(System.in);
    List<RecordInfo> records = parser.Document();
    //do something with the records here.....
  }
}
PARSER_END(RecordParser)

Building the Parser

As much as it pains me to say this, maven is probably the easiest way to build everything. I used the following pom:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.4thmouse</groupId>
  <artifactId>grammar_example</artifactId>
  <packaging>jar</packaging>
  <version>CURRENT</version>
  <name>grammar example</name>
  <url>http://4thmouse.com</url>
  <build>
    <plugins>
      <!-- support java 1.5 -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>javacc-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>javacc</id>
            <goals>
              <goal>javacc</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Lookahead

Now if you do build the previous you will see the following warning:

Warning: Choice conflict in (...)* construct at line 83, column 9.
         Expansion nested within construct and expansion following construct
         have common prefixes, one of which is: <NameToken>
         Consider using a lookahead of 2 or more for nested expansion.

If you are familiar with grammars then this will be obvious. If you aren’t this will probably be confusing. Basically what’s happening is that the most efficient grammar parsers only have to look at the very next token to figure out what bit of code to execute next. In this case if you have this:

RecordType RecordName{
    property="value"
    RecordType RecordName{
    }
}

It becomes the following symbols:

<NameToken> <NameToken> "{"
    <NameToken>"="<StringLiteral>
    <NameToken> <NameToken>"{"
    "}"
"}"
<EOF>

When the parser gets to the property list it just doesn’t know when the property list ends and the record list starts because in both cases the first token is so until it sees either an "=" or a it doen’t know which bit of grammar parser code to execute.

There are two ways to solve this problem. The first is to just modify the grammar so the problem goes away. The second is to tell the grammar parser to look ahead two tokens instead of one. In general the second option is frowned upon, but frankly I’m not too worried about efficiency here so here is what you do:

List<PropertyInfo> PropertyList():
{
  List<PropertyInfo> ret = new LinkedList<PropertyInfo>();
  PropertyInfo property;
}
{
  ( LOOKAHEAD(2) property = PropertyDeclaration(){ret.add(property);})*
  {
    return ret;
  }
}

The LOOKAHEAD directive tells the grammar parser to look ahead 2 tokens instead of one while parsing the property list. Problem solved.

Running the Parser

So now for the final step. Run the parser!

prompt$ mvn assembly:assembly
prompt$ cat somefile.rec | java -cp target/grammar_example-CURRENT-jar-with-dependencies.jar com.fourthmouse.grammar_example.RecordParser

The Author

Michael Smit is a software engineer in Seattle, Washington who works for amazon

Comments are off for this post

Comments are closed.