Saturday, March 05, 2011

4. Syntactic analysis of TL

This is part 4/9 of Creating your own programming language with ANTLR.



In this part, we're going to extend our combined grammar file from part 3 with parser rules. After the lexer has provided us with tokens, we're going to let the parser validate the order of these tokens, which is called the syntactic analysis of the language.

Project structure

Before we continue, we're going to setup a very basic project structure to keep things a bit orderly, and to make generating, compiling and running code easier.

The project will look like:

   +-- build/
   |     |
   |     +-- classes/
   +-- lib/
   |    |
   |    +-- antlr-3.2.jar
   +-- src/
   |    |
   |    +-- grammar/
   |    |      |
   |    |      +-- TL.g
   |    +-- main/
   |          | 
   |          +-- tl/  
   |               |
   |               +-- parser/
   |               +--
   +-- build.xml

The names that end with a / are directories, the others, not surprisingly, are files. You can re-use the files we used in 3. Lexical analysis of TL. There are some small modifications however:

In the combined grammar file TL.g we will tell ANTLR to put both the generated lexer and parser in the package lt.parser by adding this package declaration in their respective header sections:
grammar TL;

@parser::header {
  package tl.parser;

@lexer::header {
  package tl.parser;

  :  .* EOF
// the lexer rules here

The class will be in the package tl and it will import lt.parser.*, which will be the place where our generated lexer and parser are going to be placed:
package tl;

import tl.parser.*;
import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // create an instance of the lexer
    TLLexer lexer = new TLLexer(new ANTLRFileStream(""));
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    // create the parser
    TLParser parser = new TLParser(tokens);
    // invoke the entry point of our parser

We'll use Ant to do all the boring tasks such as generating, compiling and running (source) code. I assume you, as a Java developer (or Java enthusiast), either already have Ant installed, or are able to do so on your own.

The Ant build script we'll be using looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<project name="TinyLanguage" default="run">

  <path id="classpath">
    <pathelement location="build/classes/" />
    <pathelement location="src/main/" />
    <fileset dir="lib">
      <include name="*.jar" />

  <target name="clean">
    <delete dir="build/classes/" />
    <mkdir dir="build/classes/" />

  <target name="compile" depends="clean">
    <javac srcdir="src/main/" destdir="build/classes/" includeantruntime="false">
      <classpath refid="classpath" />

  <target name="generate" depends="clean">
    <echo>Generating the lexer and parser...</echo>
    <java classname="org.antlr.Tool" fork="true" failonerror="true">
      <arg value="-fo" />
      <arg value="src/main/tl/parser/" />
      <arg value="src/grammar/TL.g" />
      <classpath refid="classpath" />
    <antcall target="compile" />

  <target name="run" depends="generate">
    <echo>Running the main class...</echo>
    <java classname="tl.Main">
      <classpath refid="classpath" />


That's is. Now open a command prompt, navigate to the project root, and test if everything works by executing the run target:

ant run

which should produce the following output:

/Path/To/TinyLanguage$ ant run
Buildfile: /Path/To/TinyLanguage/build.xml

   [delete] Deleting directory /Path/To/TinyLanguage/build/classes
    [mkdir] Created dir: /Path/To/TinyLanguage/build/classes

     [echo] Generating the lexer and parser...

   [delete] Deleting directory /Path/To/TinyLanguage/build/classes
    [mkdir] Created dir: /Path/To/TinyLanguage/build/classes

    [javac] Compiling 3 source files to /Path/To/TinyLanguage/build/classes

     [echo] Running the main class...
     [java] Done!

Total time: 1 second

Parser rules

Let's start defining the parser rules at the highest level. A TL script is essentially a block of of code. A block of code is zero or more statements or function declarations optionally ending with return statement. A statement is either an assignment, a function call, an if-statement, a for-statement or a while-statement. Assignments, function calls and return statements all must end with a semi colon.

If we translate the above into ANTLR parser rules, we could end up with this:
  :  block EOF

  :  (statement | functionDecl)* (Return expression ';')?

  :  assignment ';'
  |  functionCall ';'
  |  ifStatement
  |  forStatement
  |  whileStatement

  :  Def Identifier '(' idList? ')' block End

  :  Identifier (',' Identifier)*

You can paste the rules into the TL.g file. They should come after the @header { ... } sections. You can mix lexer- and parser rules, but your grammar is more readable if you keep them separate. People usually write lexer rules at the end of the grammar file (remember that the order of the lexer rules is important!).

An assignment can be just a simple a = 12*5; (on the left an identifier, and on the right an expression), but, it could also look like: X[0][1] = false; where X points to a 2 dimensional list where the first list's second element must be set to false. Such a rule would look like this:
  :  Identifier indexes? '=' expression

  :  ('[' expression ']')+

An expression has a bit more rules. We start at the operators with the lowest precedence: the ternary condition (a = 1+2==3 ? true : false) and the in operator ('foo' in ['bar','foo']) and we then trickle down towards the operators with the highest precedence (the unary - and !).
  :  condExpr

  :  orExpr ( '?' expression ':' expression
            | In expression

  :  andExpr ('||' andExpr)*

  :  equExpr ('&&' equExpr)*

  :  relExpr (('==' | '!=') relExpr)*

  :  addExpr (('>=' | '<=' | '>' | '<') addExpr)*

  :  mulExpr (('+' | '-') mulExpr)*

  :  powExpr (('*' | '/' | '%') powExpr)*

  :  unaryExpr ('^' unaryExpr)*
  :  '-' atom
  |  '!' atom
  |  atom

  :  Null
  |  Number
  |  Bool
  |  lookup

The lookup rule can be various other rules with optional indexes after it:
  :  functionCall indexes?
  |  '(' expression ')' indexes?
  |  list indexes?
  |  Identifier indexes?
  |  String indexes?

The list rule shouldn't pose too many problems:
  :  '[' exprList? ']'

  :  expression (',' expression)*

A function call would look like:
  :  Identifier '(' exprList? ')'
  |  Println '(' expression? ')'
  |  Print '(' expression ')'
  |  Assert '(' expression ')'
  |  Size '(' expression ')'

As you can see, all of the function calls print(...), assert(...) and size(...) take exactly one parameter (or expression), println(...) takes an optional expression as a parameter, and a call to a custom defined function takes zero or more expressions as parameter.

Lastly, there are the if-, for-, and while statements. I've divided the if statement into a couple of separate sub-rules otherwise the entire rule would have been too big for my taste:
  :  ifStat elseIfStat* elseStat? End 

  :  If expression Do block

  :  Else If expression Do block

  :  Else Do block

  :  For Identifier '=' expression To expression Do block End 

  :  While expression Do block End

That's the complete parser grammar of TL! If you execute your the Ant target run so that the file gets parsed, there shouldn't appear any error messages on your screen, which means that the contents of the test script is first properly tokenized, and after that, the parser has correctly parsed these tokens (the file is syntactically valid).


So that's all the parser does: it checks if the token stream is syntactically valid. But our tokens are still just a flat, 1 dimensional list. It also contains tokens that are only needed to check the validity of the source but are making it hard to evaluate the expression(s). Let's say out input source is true && (false || true && (true || false));, our parser now just validates it as syntactic valid input, but the token stream is just a 1 dimensional list of tokens while it would be easier to evaluate such input if the token stream was structured like this:

I.e. the parenthesis and semi colon are gone, and every operator node has exactly two children that can be evaluated more easily.

The structuring of our token stream into an AST (abstract syntax tree) will be handled in part 5. Building an AST.

P.S. You can download the project containing all source code, build script, ANTLR jar etc. here.

8 comments: said...

Hi Bart

Many thanks for your great work in explaining ANTLR by example. I am new to ANTLR and I came across your work by searching for solution of my own parsing grammar problem (Excel-like expressions). I tried your sources from this page with ANTRLWorks 1.4.2 interpreter and it seems that this expression is not fully parsable (ran it starting from "expression" rule:


The result was that after the second closing parenthesis parsing was stopped so *4^5 was just ignored.

Still looking for solution... :)


Bart said...

Hi Taras, thanks for your kind words. It's hard to diagnose your problem without seeing the grammar you're working on. I suggest you post a question on Stackoverflow including the grammar and test-classes and explain what's going wrong.

Good luck!

Bart. said...

Besides my own grammar, what I actually meant was that grammar from the link above ( had problems interpreting this expression:

So, my question is if TL supports such syntax of expressions or not. What I could understand from reading the grammar, it should. But ANTRLWorks 1.4.2 Interpreter fails in the way I described above.


Bart said...

Hi Taras, no, (1+(2+-3^2))*4^5 by itself is not a valid statement in TL. It should be a return statement (return (1+(2+-3^2))*4^5;), or an assignment (n = (1+(2+-3^2))*4^5;), for example. Don't forget the semi-colon at the end! said...


return (1+(2+-3^2))*4^5;

and it works just fine if I start Interpreter from "block" or "parse" rules. Interestingly,


which normally should correspond to "expression" rule if run in Interpreter starting from "expression" doesn't work properly. Weird...

Bart said...

This is exactly why I didn't introduce ANTLRWorks in my tutorial: granted, it's a neat tool, but it has its own peculiarities, and a somewhat buggy interpreter, to be frank.

Yes, you are right, the expression is a valid one as you can see when you run this snippet:

TLLexer lexer = new TLLexer(new ANTLRStringStream("(1+(2+-3^2))*4^5"));

TLParser parser = new TLParser(new CommonTokenStream(lexer));


ANTLRWorks should also accept it... said...

Bart said...
This is exactly why I didn't introduce ANTLRWorks in my tutorial: granted

very valuable comment for newcomers...

Thanks, Bart

Anonymous said...

thank you