Thursday, March 10, 2011

9. Room for improvement


This is part 9/9 of Creating your own programming language with ANTLR.


Introduction


Welcome to the final part of the tutorial. If you want to try to finish (some of) the implementation of the TL-node classes on your own, make sure you're up to date with the latest source files developed so far: download them here. You'll want to test your implementation using the test script below.

Complete source


In order to test all functionality of TL, replace the contents of the test.tl file with the following source:

/*
    A script for testing Tiny Language syntax.
*/

// boolean expressions
assert(true || false);
assert(!false);
assert(true && true);
assert(!true || !false);
assert(true && (true || false));

// relational expressions
assert(1 < 2);
assert(666 >= 666);
assert(-5 > -6);
assert(0 >= -1);
assert('a' < 's');
assert('sw' <= 'sw');

// add
assert(1 + 999 == 1000);
assert([1] + 1 == [1,1]);
assert(2 - -2 == 4);
assert(-1 + 1 == 0);
assert(1 - 50 == -49);
assert([1,2,3,4,5] - 4 == [1,2,3,5]);

// multiply
assert(3 * 50 == 150);
assert(4 / 2 == 2);
assert(1 / 4 == 0.25);
assert(999999 % 3 == 0);
assert(-5 * -5 == 25);
assert([1,2,3] * 2 == [1,2,3,1,2,3]);
assert('ab'*3 == "ababab");

// power
assert(2^10 == 1024);
assert(3^3 == 27);

// for- and while statements
a = 0;
for i=1 to 10 do
  a = a + i;
end
assert(a == (1+2+3+4+5+6+7+8+9+10));

b = -10;
c = 0;
while b < 0 do 
  c = c + b;
  b = b + 1;
end
assert(c == -(1+2+3+4+5+6+7+8+9+10));

// if
a = 123;
if a > 200 do
  assert(false);
end

if a < 100 do
  assert(false);
else if a > 124 do
  assert(false);
else if a < 124 do
  assert(true);
else do
  assert(false);
end

if false do
  assert(false);
else do
  assert(true);
end

// functions
def twice(n)
  temp = n + n; 
  return temp; 
end

def squared(n) 
  return n*n; 
end

def squaredAndTwice(n) 
  return twice(squared(n)); 
end

def list()
  return [7,8,9];
end

assert(squared(666) == 666^2);
assert(twice(squared(5)) == 50);
assert(squaredAndTwice(10) == 200);
assert(squared(squared(squared(2))) == 2^2^2^2);
assert(list() == [7,8,9]);
assert(size(list()) == 3);
assert(list()[1] == 8);

// naive bubble sort
def sort(list)
  while !sorted(list) do
  end
end
def sorted(list)
  n = size(list);
  for i=0 to n-2 do
    if list[i] > list[i+1] do
      temp = list[i+1];
      list[i+1] = list[i];
      list[i] = temp;
      return false;
    end
  end
  return true;
end
numbers = [3,5,1,4,2];
sort(numbers);
assert(numbers == [1,2,3,4,5]);

// recursive calls
def fib(n)
  if n < 2 do
    return n;
  else do
    return fib(n-2) + fib(n-1);
  end
end
sequence = [];
for i = 0 to 10 do
  sequence = sequence + fib(i);
end
assert(sequence == [0,1,1,2,3,5,8,13,21,34,55]);

// lists and lookups, `in` operator
n = [[1,0,0],[0,1,0],[0,0,1]];
p = [-1, 'abc', true];

assert('abc' in p);
assert([0,1,0] in n);
assert(n[0][2] == 0);
assert(n[1][1] == n[2][2]);
assert(p[2]);
assert(p[1][2] == 'c');

The script above is just a series of assert calls that fail whenever the expression provided as a parameter evaluates to false. So, after running the script, you'll know that everything went okay if nothing was printed to the console.
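
If you want a head start on implementing the built-in assert yourself, here's a sketch of what such a node class could look like, in the style of the PrintlnNode from part 8 (this is just one possible version; the actual one is in the downloadable sources):
package tl.tree.functions;

import tl.TLValue;
import tl.tree.TLNode;

public class AssertNode implements TLNode {

  private TLNode expression;

  public AssertNode(TLNode e) {
    expression = e;
  }

  @Override
  public TLValue evaluate() {
    TLValue value = expression.evaluate();

    if(!value.isBoolean()) {
      throw new RuntimeException("illegal boolean expression: " + expression);
    }

    if(!value.asBoolean()) {
      // an assert only makes noise when it fails, so a successful
      // test run prints nothing to the console
      throw new RuntimeException("assertion failed: " + expression);
    }

    return TLValue.VOID;
  }
}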

If you don't want to implement all node classes yourself, download the complete implementation of TL here. Execute the Ant task run to execute the test.tl file.

And if C# is more to your liking: Bruce (see the comments below) translated/implemented TL in C#:

Bruce wrote: Hi Bart, this is great. I converted the whole thing to C#; if you are interested I can send you the project (VS2010, C# 4.0) so that others can download it from your blog.

The "test.tl" file is in the bin/Debug folder under the console app project, so that the executable can find it without a path.

One other note: I had to add a couple of custom list classes so that the interpreter would understand the value-semantics of list equality comparisons. Apparently Java considers List equality on an item-by-item basis, whereas .NET uses reference equality on the list object itself. The two classes are TLNodeList and TLValueList. They override ToString() and Equals() to match Java equality comparisons.
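
As a quick illustration of the Java behavior Bruce refers to (a standalone snippet, not part of the project): java.util.List implements equals(...) element by element, so two distinct list objects with equal contents compare equal.
import java.util.Arrays;
import java.util.List;

public class ListEqualsDemo {
  public static void main(String[] args) {
    List<Integer> a = Arrays.asList(1, 2, 3);
    List<Integer> b = Arrays.asList(1, 2, 3);

    // java.util.List compares the elements, not the references...
    System.out.println(a.equals(b)); // true

    // ...while a plain reference comparison (the .NET default) does not
    System.out.println(a == b);      // false
  }
}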

Download Bruce's VS project here


Improvements


First of all, realize that all source files posted here are for educational purposes only. In no way are they production worthy: there are no unit tests, I haven't refactored anything, and there are very few comments in the source.

That said, where to go from here?

Imagine the code were properly unit tested, had proper exception handling, and so on. The next step would then be to provide better/friendlier (error) messages to the user (programmer) when illegal syntax is used. To find out more about that, please see: ANTLR 3.0 Error Reporting and Recovery on the official ANTLR Wiki.
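
As a small first step in that direction, ANTLR 3 lets you override the parser's error-reporting hook in the @parser::members section. A minimal sketch using the standard BaseRecognizer methods could look like the snippet below (the wording and handling of the messages is entirely up to you):
@parser::members {
  @Override
  public void displayRecognitionError(String[] tokenNames, RecognitionException e) {
    // replace ANTLR's default console output with a friendlier message
    // (or collect all errors here and report them in one go)
    String header = getErrorHeader(e);
    String message = getErrorMessage(e, tokenNames);
    System.err.println("TL error " + header + " " + message);
  }
}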

After that, the language could be extended with some sort of structs/classes. To find out how to do that, get hold of Language Implementation Patterns, written by the same author as the (almost) mandatory book The Definitive ANTLR Reference: Terence Parr, the creator of ANTLR.

That's it for now. If you have questions about the source of this tutorial, or found an error, please post a message below any of my blog posts. If you are writing your own language using ANTLR and are stuck on that, please use the ANTLR mailing list, or post a question on StackOverflow, where I frequently answer ANTLR-related questions.

Bye!

PS. For clarity, download the complete sources here.

For an implementation of TL using ANTLR 4, see this GitHub project: https://github.com/bkiers/tiny-language-antlr4

Wednesday, March 09, 2011

8. Interpreting and evaluating TL II


This is part 8/9 of Creating your own programming language with ANTLR.


Introduction


Okay, welcome back. In this part, we're going to create custom tree nodes that will perform the actual evaluation of our TL source.

Before continuing, make sure your source files are up to date. If they're not, or you're unsure of it, download the most recent source files of this tutorial here.



Evaluation II


I won't handle all the different node types; I'll explain a couple of representative ones instead. Once I've explained the AddNode, for example, there is no need to also cover the SubNode. In the final part of this tutorial, I'll put a link to a zip file containing the entire project, including all custom nodes.

AddNode


Okay, the very first node we're going to handle is the AddNode. An AddNode always has two children: a lhs (left hand side) and a rhs (right hand side). This operation not only adds two numerical values together: we'll also support string concatenation and adding elements to a list using the + operator. This is what an implementation of such a node could look like:
// place this file in: src/main/tl/tree/
package tl.tree;

import tl.TLValue;
import java.util.List;

public class AddNode implements TLNode {

  private TLNode lhs;
  private TLNode rhs;

  public AddNode(TLNode lhs, TLNode rhs) {
    this.lhs = lhs;
    this.rhs = rhs;
  }

  @Override
  public TLValue evaluate() {

    TLValue a = lhs.evaluate();
    TLValue b = rhs.evaluate();

    // number + number
    if(a.isNumber() && b.isNumber()) {
      return new TLValue(a.asDouble() + b.asDouble());
    }

    // list + any
    if(a.isList()) {
      List<TLValue> list = a.asList();
      list.add(b);
      return new TLValue(list);
    }

    // string + any
    if(a.isString()) {
      return new TLValue(a.asString() + "" + b.toString());
    }

    // any + string
    if(b.isString()) {
      return new TLValue(a.toString() + "" + b.asString());
    }

    throw new RuntimeException("illegal expression: " + this);
  }

  @Override
  public String toString() {
    return String.format("(%s + %s)", lhs, rhs);
  }
}

As you can see, inside the evaluate method the left and right children are evaluated first, and based on the types they return, the addition produces the end result. If we "fall" all the way through, it must mean the expression is illegal, and we throw an exception. Pretty straightforward (I hope!).

Okay, now we'll need to edit our tree grammar file and make sure the node variable that is being returned is assigned an AddNode. This can be done as follows:
expression returns [TLNode node]
  :  ^(TERNARY expression expression expression)
  |  ^(In expression expression)
  |  ^('||' expression expression)
  |  ^('&&' expression expression)
  |  ^('==' expression expression)
  |  ^('!=' expression expression)
  |  ^('>=' expression expression)
  |  ^('<=' expression expression)
  |  ^('>' expression expression)
  |  ^('<' expression expression)
  |  ^('+' a=expression b=expression) {node = new AddNode($a.node, $b.node);}
  |  ^('-' expression expression)
  |  ^('*' expression expression)
  |  ^('/' expression expression)
  |  ^('%' expression expression)
  |  ^('^' expression expression)
  |  ^(UNARY_MIN expression)
  |  ^(NEGATE expression)
  |  Number
  |  Bool
  |  Null
  |  lookup           
  ;

Since there is more than one expression inside ^('+' expression expression), we can't reference a particular rule using $expression. That's why I named the left expression a and the right expression b. After the sub-rule, in plain code, I then created an instance of AddNode and assigned it to node, which is the object the expression rule returns.

In the entire grammar, I named the variable that is being returned node, but you can use anything you want there. Let's say I had named it TREE:

expression returns [TLNode TREE]
  :  ...
  ;

I'd simply need to make sure that inside the custom code I also used that variable name, like this:
expression returns [TLNode TREE]
  :  ...
  |  ^('+' a=expression b=expression) {TREE = new AddNode($a.TREE, $b.TREE);}
  |  ...         
  ;

AtomNode.java


Okay, to be able to perform a first small test, we'll also need to create a node that represents the "smallest" part of the tree (a leaf node) containing one of our data types: number, string, boolean or list (or null). Instead of creating four (or five) different nodes, we'll simply create a single node called AtomNode that wraps a TLValue around the object we pass to it.
package tl.tree;

import tl.TLValue;

public class AtomNode implements TLNode {

  private TLValue value;

  public AtomNode(Object v) {
    value = (v == null) ? TLValue.NULL : new TLValue(v);
  }

  @Override
  public TLValue evaluate() {
    return value;
  }

  @Override
  public String toString() {
    return value.toString();
  }
}
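
With AddNode and AtomNode in place, we can already evaluate a small expression tree by wiring it up by hand. The following hypothetical snippet is just to show how the nodes compose; the tree grammar will do this wiring for us:
package tl;

import tl.tree.AddNode;
import tl.tree.AtomNode;
import tl.tree.TLNode;

public class NodeDemo {
  public static void main(String[] args) {
    TLNode lhs = new AtomNode(1.0);
    TLNode rhs = new AtomNode(2.0);

    // evaluates (1.0 + 2.0) and prints 3.0
    System.out.println(new AddNode(lhs, rhs).evaluate());
  }
}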

Let's see how that works for a Number token:
expression returns [TLNode node]
  :  ^(TERNARY expression expression expression)
  |  ^(In expression expression)
  |  ^('||' expression expression)
  |  ^('&&' expression expression)
  |  ^('==' expression expression)
  |  ^('!=' expression expression)
  |  ^('>=' expression expression)
  |  ^('<=' expression expression)
  |  ^('>' expression expression)
  |  ^('<' expression expression)
  |  ^('+' a=expression b=expression) {node = new AddNode($a.node, $b.node);}
  |  ^('-' expression expression)
  |  ^('*' expression expression)
  |  ^('/' expression expression)
  |  ^('%' expression expression)
  |  ^('^' expression expression)
  |  ^(UNARY_MIN expression)
  |  ^(NEGATE expression)
  |  Number {node = new AtomNode(Double.parseDouble($Number.text));}
  |  Bool
  |  Null
  |  lookup           
  ;

BlockNode.java


We can almost do a first test of a script that actually evaluates something: we only need to implement a node that represents a block of code:
block returns [TLNode node]
  :  ^(BLOCK ^(STATEMENTS statement*) ^(RETURN expression?))
  ;

That's a bit different from our previous two nodes: this node is actually a collection of nodes. Here's what that could look like:
package tl.tree;

import tl.TLValue;

import java.util.ArrayList;
import java.util.List;

public class BlockNode implements TLNode {

  private List<TLNode> statements;
  private TLNode returnStatement;

  public BlockNode() {
    statements = new ArrayList<TLNode>();
    returnStatement = null;
  }

  public void addReturn(TLNode stat) {
    returnStatement = stat;
  }

  public void addStatement(TLNode stat) {
    statements.add(stat);
  }

  @Override
  public TLValue evaluate() {
    for(TLNode stat : statements) {
      TLValue value = stat.evaluate();
      if(value != TLValue.VOID) {
        // return early from this block if value is a return statement
        return value;
      }
    }

    // return VOID or returnStatement.evaluate() if it's not null
    return returnStatement == null ? TLValue.VOID : returnStatement.evaluate();
  }
}

As you can see, it starts as an empty code block; statements can be added to it through its addStatement(...) method, and a return statement can optionally be set using its addReturn(...) method. Let's integrate that into our tree grammar:
walk returns [TLNode node]
  :  block {node = $block.node;}
  ;

block returns [TLNode node]
@init {
  BlockNode bn = new BlockNode();
  node = bn;
  Scope scope = new Scope(currentScope);
  currentScope = scope;
}
@after {
  currentScope = currentScope.parent();
}
  :  ^(BLOCK 
        ^( STATEMENTS (statement  {bn.addStatement($statement.node);})* ) 
        ^( RETURN     (expression {bn.addReturn($expression.node);  })? )
      )
  ;

Before anything from the block rule is iterated over, a BlockNode is instantiated in the @init section and assigned to the node being returned. After that, the zero or more statements are added to this BlockNode, as well as the optional return statement (an expression, actually).

Note that I squeezed in the pushing and popping of scopes. Whenever we enter a new code block, we create a new current scope, and when we leave the code block, we pop back to the parent scope again.

Notice how I also assigned the return node of the walk rule to become the node of the block rule.

Okay, edit the test.tl file to contain the following:
return 5 + 2 + 3;

and execute the run Ant target. You will see the following being printed to the console:

$ ant run 

...

run:
     [echo] Running the main class...
     [java] 10.0

Yay! It works! :)

IfNode.java


Let's implement a slightly more difficult node: the IfNode. Such a node might look like this:
package tl.tree;

import tl.TLValue;
import java.util.ArrayList;
import java.util.List;

public class IfNode implements TLNode {

  private List<Choice> choices;

  public IfNode() {
    choices = new ArrayList<Choice>();
  }

  public void addChoice(TLNode e, TLNode b) {
    choices.add(new Choice(e, b));
  }

  @Override
  public TLValue evaluate() {

    for(Choice ch : choices) {
      TLValue value = ch.expression.evaluate();

      if(!value.isBoolean()) {
        throw new RuntimeException("illegal boolean expression " + 
            "inside if-statement: " + ch.expression);
      }

      if(value.asBoolean()) {
        return ch.block.evaluate();
      }
    }

    return TLValue.VOID;
  }

  private class Choice {

    TLNode expression;
    TLNode block;

    Choice(TLNode e, TLNode b) {
      expression = e;
      block = b;
    }
  }
}

As you can see, the contents of an IfNode are simply a list of choices, where a choice is a (boolean) expression with a block of code attached to it. When evaluating an if-statement, we simply iterate over these choices, and as soon as we encounter an expression that evaluates to true, the corresponding code block is executed and its return value is returned.

Let's simplify the tree grammar rules a bit. Currently, our ifStatement consists of four rules:
ifStatement returns [TLNode node]
  :  ^(IF ifStat elseIfStat* elseStat?)
  ;

ifStat
  :  ^(EXP expression block)
  ;

elseIfStat
  :  ^(EXP expression block)
  ;

elseStat
  :  ^(EXP block)
  ;

Although we could keep these four rules, it's easier to integrate them all into just one rule. Notice that in the tree grammar the rules ifStat and elseIfStat are identical, so ifStat elseIfStat* could also be written simply as: ifStat+. The four rules combined now look like:
ifStatement returns [TLNode node]
  :  ^(IF 
       (^(EXP expression block))+ 
       (^(EXP block))?
     )
  ;

And with the code mixed in, it looks like:
ifStatement returns [TLNode node]
@init  {
  IfNode ifNode = new IfNode();
  node = ifNode;
}
  :  ^(IF 
       (^(EXP expression b1=block){ifNode.addChoice($expression.node,$b1.node);})+ 
       (^(EXP b2=block)           {ifNode.addChoice(new AtomNode(true),$b2.node);})?
     )
  ;

Notice that in the else sub-rule, a true-atom is inserted.
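
That true-atom trick can be seen in action by wiring an IfNode up by hand. In this hypothetical snippet the "blocks" are plain AtomNodes (the real walker builds BlockNodes); the point is that evaluate() runs the first choice whose expression is true, so a final always-true choice behaves exactly like an else:
package tl;

import tl.tree.AtomNode;
import tl.tree.IfNode;
import tl.tree.TLNode;

public class IfDemo {
  public static void main(String[] args) {
    TLNode ifBlock = new AtomNode("if-block");
    TLNode elseBlock = new AtomNode("else-block");

    IfNode node = new IfNode();
    node.addChoice(new AtomNode(false), ifBlock);  // the `if` choice
    node.addChoice(new AtomNode(true), elseBlock); // the `else` choice: always true

    // the first choice whose expression evaluates to true wins
    System.out.println(node.evaluate()); // prints: else-block
  }
}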

LTNode.java


To be able to test the if statement in our language, we'll need to implement an expression node that evaluates to a boolean value. We'll implement the less-than comparison node, which looks like this:
package tl.tree;

import tl.TLValue;

public class LTNode implements TLNode {

  private TLNode lhs;
  private TLNode rhs;

  public LTNode(TLNode lhs, TLNode rhs) {
    this.lhs = lhs;
    this.rhs = rhs;
  }

  @Override
  public TLValue evaluate() {

    TLValue a = lhs.evaluate();
    TLValue b = rhs.evaluate();

    if(a.isNumber() && b.isNumber()) {
      return new TLValue(a.asDouble() < b.asDouble());
    }

    if(a.isString() && b.isString()) {
      return new TLValue(a.asString().compareTo(b.asString()) < 0);
    }

    throw new RuntimeException("illegal expression: " + this);
  }

  @Override
  public String toString() {
    return String.format("(%s < %s)", lhs, rhs);
  }
}

As you can see, only two numerical values or two string values can be compared to each other. Edit the tree grammar file to include this new node:
expression returns [TLNode node]
  :  ...
  |  ^('<' a=expression b=expression) {node = new LTNode($a.node, $b.node);}
  |  ...
  ;

PrintlnNode.java


Let's also create one of the built-in functions: the PrintlnNode. We'll place all the functions in their own package, tl.tree.functions, so create a folder in src/main/tl/tree/ called functions and put the following class in it:
package tl.tree.functions;

import tl.TLValue;
import tl.tree.TLNode;
import java.io.PrintStream;

public class PrintlnNode implements TLNode {

  private TLNode expression;
  private PrintStream out;

  public PrintlnNode(TLNode e) {
    this(e, System.out);
  }

  public PrintlnNode(TLNode e, PrintStream o) {
    expression = e;
    out = o;
  }

  @Override
  public TLValue evaluate() {
    TLValue value = expression.evaluate();
    out.println(value);
    return TLValue.VOID;
  }
}

And adjust the tree grammar accordingly:
functionCall returns [TLNode node]
  :  ...
  |  ^(FUNC_CALL Println expression?) {node = new PrintlnNode($expression.node);}
  |  ...
  ;

Also add the import statement: import tl.tree.functions.*; to the @header { ... } section of our tree grammar.
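
The grammar snippet above only shows the Println alternative. For the user-defined alternative, ^(FUNC_CALL Identifier exprList?), a node could look roughly like the sketch below. It leans on the Function class and the functions map from part 7, and the names are mine, so check the downloadable sources for the actual version:
package tl.tree;

import tl.Function;
import tl.TLValue;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FunctionCallNode implements TLNode {

  private String identifier;
  private List<TLNode> params;
  private Map<String, Function> functions;

  public FunctionCallNode(String id, List<TLNode> p, Map<String, Function> fns) {
    identifier = id;
    params = (p == null) ? new ArrayList<TLNode>() : p;
    functions = fns;
  }

  @Override
  public TLValue evaluate() {
    // functions are keyed on name plus number of parameters, see defineFunction(...)
    Function function = functions.get(identifier + params.size());

    if(function == null) {
      throw new RuntimeException("no such function: " + identifier);
    }

    // invoke a copy so that recursive calls don't share a single scope
    return new Function(function).invoke(params, functions);
  }
}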

And edit the rule statement from the tree grammar to look like this:
statement returns [TLNode node]
  :  assignment     {node = $assignment.node;}
  |  functionCall   {node = $functionCall.node;}
  |  ifStatement    {node = $ifStatement.node;}
  |  forStatement   {node = $forStatement.node;}
  |  whileStatement {node = $whileStatement.node;}
  ;

Now edit the test.tl file with the following source:
if 4 < 3 do
  println(1);
else if 4 < 5 do
  println(2);
else do
  println(3);
end

And if you now execute the run target again:
$ ant run 

...

run:
     [echo] Running the main class...
     [java] 2.0
     [java] VOID

You can see that the block inside else if is executed, and 2.0 is printed to the console. The VOID is printed because the script as a whole did not explicitly return a value, so that is correct.

AssignmentNode.java


Now a trickier node: the assignment node. Take a look at the tree grammar rule:
assignment returns [TLNode node]
  :  ^(ASSIGNMENT Identifier indexes? expression)
  ;

If there are no indexes, it's easy: we just let the scope assign (or re-assign) the identifier its new value, the evaluated expression.

But an identifier can have one or more indexes after it. For example, this is valid syntax:
arr = [[1,2,3], [4,5,6], [7,8,9]];
arr[2][0] = arr[2][0] * 6;
assert(arr == [[1,2,3], [4,5,6], [42,8,9]]);

in which case we'll first have to iterate over the indexes to get hold of the list [7,8,9] (stopping before the last index) and then re-assign index 0 to become arr[2][0] * 6, which equals 7 * 6.

The code could look like this:
package tl.tree;

import tl.Scope;
import tl.TLValue;

import java.util.ArrayList;
import java.util.List;

public class AssignmentNode implements TLNode {

  protected String identifier;
  protected List<TLNode> indexNodes;
  protected TLNode rhs;
  protected Scope scope;

  public AssignmentNode(String i, List<TLNode> e, TLNode n, Scope s) {
    identifier = i;
    indexNodes = (e == null) ? new ArrayList<TLNode>() : e;
    rhs = n;
    scope = s;
  }

  @Override
  public TLValue evaluate() {

    TLValue value = rhs.evaluate();

    if (value == TLValue.VOID) {
      throw new RuntimeException("can't assign VOID to " + identifier);
    }

    if (indexNodes.isEmpty()) { // a simple assignment
      scope.assign(identifier, value);
    }
    else { // a possible list-lookup and reassignment

      TLValue list = scope.resolve(identifier);

      // iterate up to `foo[x][y]` in case of `foo[x][y][z] = 42;`
      for (int i = 0; i < indexNodes.size() - 1 && list != null; i++) {
        TLValue index = indexNodes.get(i).evaluate();

        if (!index.isNumber() || !list.isList()) { // sanity checks
          throw new RuntimeException("illegal statement: " + this);
        }

        int idx = index.asLong().intValue();
        list = list.asList().get(idx);
      }
      // list is now pointing to `foo[x][y]` in case of `foo[x][y][z] = 42;`

      // get the value `z`: the last index, in `foo[x][y][z] = 42;`
      TLValue lastIndex = indexNodes.get(indexNodes.size() - 1).evaluate();

      if (!lastIndex.isNumber() || !list.isList()) { // sanity checks
        throw new RuntimeException("illegal statement: " + this);
      }

      // re-assign `foo[x][y][z]`
      List<TLValue> existing = list.asList();
      existing.set(lastIndex.asLong().intValue(), value);
    }

    return TLValue.VOID;
  }

  @Override
  public String toString() {
    return String.format("(%s[%s] = %s)", identifier, indexNodes, rhs);
  }
}

And some adjustments inside the tree grammar:
assignment returns [TLNode node]
  :  ^(ASSIGNMENT i=Identifier x=indexes? e=expression) 
     {node = new AssignmentNode($i.text, $x.e, $e.node, currentScope);}
  ;

expression returns [TLNode node]
  :  ...
  |  lookup {node = $lookup.node;}           
  ;

lookup returns [TLNode node]
  :  ^(LOOKUP functionCall indexes?)
  |  ^(LOOKUP list indexes?)
  |  ^(LOOKUP expression indexes?) 
  |  ^(LOOKUP i=Identifier x=indexes?) 
      {
        node = ($x.e != null)
          ? new LookupNode(new IdentifierNode($i.text, currentScope), $x.e)
          : new IdentifierNode($i.text, currentScope);
      }
  |  ^(LOOKUP String indexes?)
  ;

indexes returns [List<TLNode> e]
@init {e = new ArrayList<TLNode>();}
  :  ^(INDEXES (expression {e.add($expression.node);})+)
  ;

As you can see in the lookup rule, a variable lookup can, just as with an assignment, be as simple as fetching a variable from the scope, but it can also have a couple of indexes after it. Here's an implementation of both the LookupNode and the IdentifierNode:

LookupNode.java


package tl.tree;

import tl.TLValue;
import java.util.ArrayList;
import java.util.List;

public class LookupNode implements TLNode {

  private TLNode expression;
  private List<TLNode> indexes;

  public LookupNode(TLNode e, List<TLNode> i) {
    expression = e;
    indexes = i;
  }

  @Override
  public TLValue evaluate() {

    TLValue value = expression.evaluate();

    List<TLValue> indexValues = new ArrayList<TLValue>();

    for (TLNode indexNode : indexes) {
      indexValues.add(indexNode.evaluate());
    }

    for(TLValue index : indexValues) {

      if(!index.isNumber() || !(value.isList() || value.isString())) {
        throw new RuntimeException("illegal expression: " + 
            expression + "[" + index + "]");
      }

      int idx = index.asLong().intValue();

      if(value.isList()) {
        value = value.asList().get(idx);
      }
      else if(value.isString()) {
        value = new TLValue(String.valueOf(value.asString().charAt(idx)));
      }
    }

    return value;
  }
}

As you can see, I also added support for indexing a string literal: you can do ch = ("abc")[1]; after which ch equals "b".

IdentifierNode.java


package tl.tree;

import tl.Scope;
import tl.TLValue;

public class IdentifierNode implements TLNode {

  private String identifier;
  private Scope scope;

  public IdentifierNode(String id, Scope s) {
    identifier = id;
    scope = s;
  }

  @Override
  public TLValue evaluate() {
    TLValue value = scope.resolve(identifier);
    if(value == null) {
      throw new RuntimeException("no such variable: " + this);
    }
    return value;
  }

  @Override
  public String toString() {
    return identifier;
  }
}



Wrap up


If you now put the following source inside test.tl:
a = 21;
b = a;
println(a + b);

or:
a = 5;

if a < 4 do
  a = 1;
else if a < 0 do 
  a = 2;
else do
  a = 42;
end

println(a);

You'll see 42.0 being printed to the console in both cases after you execute the run Ant target.

That pretty much concludes the technical part of my tutorial about ANTLR and how to create a small, dynamically typed scripting language with it. The last part of this series includes a download link containing all sources (so including the missing node classes) and some recommendations on how to continue from here.

You can download the sources created up to this point by clicking here.

Tuesday, March 08, 2011

7. Interpreting and evaluating TL I


This is part 7/9 of Creating your own programming language with ANTLR.


Introduction


First, make sure your files are all up to date. If you're not sure they are, download all source files, ANTLR jar, build file etc. here so that you won't get any strange errors in this part of the tutorial.

This part of the tutorial makes the necessary preparations for actually evaluating the input source.



Preparation


Before actually evaluating the input source, we'll have to create a couple of supporting classes: a class that represents a function, a generic value class, and something that keeps track of our variables: a scope.

TLValue.java


Since TL is a dynamically typed language, we'll create a generic value class, called TLValue, that gets checked and operated on at runtime. Such a class could look like this (place it in src/main/tl/):
package tl;

import java.util.List;

public class TLValue implements Comparable<TLValue> {

  public static final TLValue NULL = new TLValue();
  public static final TLValue VOID = new TLValue();

  private Object value;

  private TLValue() {
    // private constructor: only used for NULL and VOID
    value = new Object();
  }

  public TLValue(Object v) {
    if(v == null) {
      throw new RuntimeException("v == null");
    }
    value = v;
    // only accept boolean, list, number or string types
    if(!(isBoolean() || isList() || isNumber() || isString())) {
      throw new RuntimeException("invalid type: " + v + " (" + v.getClass() + ")");
    }
  }

  public Boolean asBoolean() {
    return (Boolean)value;
  }

  public Double asDouble() {
    return ((Number)value).doubleValue();
  }

  public Long asLong() {
    return ((Number)value).longValue();
  }

  @SuppressWarnings("unchecked")
  public List<TLValue> asList() {
    return (List<TLValue>)value;
  }

  public String asString() {
    return (String)value;
  }

  @Override
  public int compareTo(TLValue that) {
    if(this.isNumber() && that.isNumber()) {
      if(this.equals(that)) {
        return 0;
      }
      else {
        return this.asDouble().compareTo(that.asDouble());
      }
    }
    else if(this.isString() && that.isString()) {
      return this.asString().compareTo(that.asString());
    }
    else {
      throw new RuntimeException("illegal expression: can't compare `" + 
          this + "` to `" + that + "`");
    }
  }

  @Override
  public boolean equals(Object o) {
    if(this == VOID || o == VOID) {
      throw new RuntimeException("can't use VOID: " + this + " ==/!= " + o);
    }
    if(this == o) {
      return true;
    }
    if(o == null || this.getClass() != o.getClass()) {
      return false;
    }
    TLValue that = (TLValue)o;
    if(this.isNumber() && that.isNumber()) {
      double diff = Math.abs(this.asDouble() - that.asDouble());
      return diff < 0.00000000001;
    }
    else {
      return this.value.equals(that.value);
    }
  }

  @Override
  public int hashCode() {
    return value.hashCode();
  }

  public boolean isBoolean() {
    return value instanceof Boolean;
  }

  public boolean isNumber() {
    return value instanceof Number;
  }

  public boolean isList() {
    return value instanceof List<?>;
  }

  public boolean isNull() {
    return this == NULL;
  }

  public boolean isVoid() {
    return this == VOID;
  }

  public boolean isString() {
    return value instanceof String;
  }

  @Override
  public String toString() {
    return isNull() ? "NULL" : isVoid() ? "VOID" : String.valueOf(value);
  }
}
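
Here are a few hypothetical lines showing how TLValue behaves at runtime, in particular the value-based number comparison in equals(...):
package tl;

public class TLValueDemo {
  public static void main(String[] args) {
    // numbers are compared by value (within a small tolerance)
    System.out.println(new TLValue(2).equals(new TLValue(2.0))); // true

    // strings use their natural ordering in compareTo(...)
    System.out.println(new TLValue("abc").compareTo(new TLValue("abd")) < 0); // true

    // NULL is a regular value; VOID is only used internally
    System.out.println(TLValue.NULL.isNull()); // true
  }
}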

Function.java


Remember that in part 5, Building an AST, we omitted function declarations from our AST. Let's implement that now by creating a Function class. We'll put it in the same package as our Main.java class: package tl;. But first, we'll need to adjust our combined grammar file TL.g a bit.

So instead of what we had:
functionDecl
  :  Def Identifier '(' idList? ')' block End { /* implemented later */ }
  ;

we'll make a call to the method defineFunction(...):
functionDecl
  :  Def Identifier '(' idList? ')' block End 
     {defineFunction($Identifier.text, $idList.tree, $block.tree);}
  ;

That method does not exist in our TLParser class, of course: we'll need to create it ourselves. You can add custom class variables and methods to a parser by putting them in the @parser::members section, which should be placed after the @header sections.

// tokens { ... }

@parser::header {
  package tl.parser;
  import tl.*; 
  import java.util.Map;
  import java.util.HashMap;
}

@lexer::header {
  package tl.parser;
}

@parser::members {
  public Map<String, Function> functions = new HashMap<String, Function>();
  
  private void defineFunction(String id, Object idList, Object block) {

    // `idList` is possibly null!  Create an empty tree in that case. 
    CommonTree idListTree = idList == null ? new CommonTree() : (CommonTree)idList;

    // `block` is never null
    CommonTree blockTree = (CommonTree)block;

    // The function name with the number of parameters after it, is the unique key
    String key = id + idListTree.getChildCount();
    functions.put(key, new Function(id, idListTree, blockTree));
  }
}

// rules ...

Notice that I also added a couple of extra imports in the @parser::header section. So, our parser now builds a Map of Functions for us, where each function is keyed on its name plus its number of parameters: def twice(n) ... end, for example, is stored under the key twice1. Let's create that class now:
package tl;

import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.CommonTreeNodeStream;
import tl.parser.TLTreeWalker;
import tl.tree.TLNode;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Function {

  private String id;
  private List<String> identifiers;
  private CommonTree code;
  private Scope scope;

  public Function(String i, CommonTree ids, CommonTree block) {
    id = i;
    identifiers = toList(ids);
    code = block;
    scope = new Scope();
  }

  public Function(Function original) {
    // Used for recursively calling functions
    id = original.id;
    identifiers = original.identifiers;
    code = original.code;
    scope = original.scope.copy();
  }

  public TLValue invoke(List<TLNode> params, Map<String, Function> functions) {

    if(params.size() != identifiers.size()) {
      throw new RuntimeException("illegal function call: " + identifiers.size() +
          " parameters expected for function `" + id + "`");
    }

    // Assign all expression parameters to this function's identifiers
    for(int i = 0; i < identifiers.size(); i++) {
      scope.assign(identifiers.get(i), params.get(i).evaluate());
    }

    try {
      // Create a tree walker to evaluate this function's code block
      CommonTreeNodeStream nodes = new CommonTreeNodeStream(code);
      TLTreeWalker walker = new TLTreeWalker(nodes, scope, functions);
      return walker.walk().evaluate();
    } catch (RecognitionException e) {
      // do not recover from this
      throw new RuntimeException("something went wrong, terminating", e);
    }
  }

  private List<String> toList(CommonTree tree) {
    List<String> ids = new ArrayList<String>();
    
    // convert the tree to a List of Strings
    for(int i = 0; i < tree.getChildCount(); i++) {
      CommonTree child = (CommonTree)tree.getChild(i);
      ids.add(child.getText());
    }
    return ids;
  }
}

Notice that a Function creates its own Scope instance, and that invoke(...) takes a list of TLNode instances. It also returns a TLValue by invoking evaluate() on what walker.walk() returns, which doesn't return anything at the moment; we'll change that later on.

Let's define TLNode and Scope next.

TLNode.java


Our tree now contains only CommonTree objects, but later on, we'll create custom node classes. All of our custom node classes implement the same interface: namely the TLNode interface:
package tl.tree;

public interface TLNode {
  tl.TLValue evaluate();
}

which I've placed in the tl.tree package. So create a folder called tree in the src/main/tl/ directory and save the interface in it.

Scope.java


Every code block, and every function, has its own local variable scope. Note that for-, while- and if-statements also have their own code block! So we'll create a class called Scope and place it in src/main/tl/ as well. This class keeps track of its parent scope to resolve variables it has no notion of itself, and it will store all declared variables (or ask a parent scope to re-assign a certain variable if it isn't declared in its own scope).
package tl;

import java.util.HashMap;
import java.util.Map;

public class Scope {

  private Scope parent;
  private Map<String, TLValue> variables;

  public Scope() {
    // only for the global scope, the parent is null
    this(null);
  }

  public Scope(Scope p) {
    parent = p;
    variables = new HashMap<String, TLValue>();
  }

  public void assign(String var, TLValue value) {
    if(resolve(var) != null) {
      // There is already such a variable, re-assign it
      this.reAssign(var, value);
    }
    else {
      // A newly declared variable
      variables.put(var, value);
    }
  }

  public Scope copy() {
    // Create a shallow copy of this scope. Used in case functions are
    // recursively called. If we didn't create a copy in such cases,
    // changing the variables would result in changes to the Maps from
    // other "recursive scopes".
    Scope s = new Scope();
    s.variables = new HashMap<String, TLValue>(this.variables);
    return s;
  }

  public boolean isGlobalScope() {
    return parent == null;
  }

  public Scope parent() {
    return parent;
  }

  private void reAssign(String identifier, TLValue value) {
    if(variables.containsKey(identifier)) {
      // The variable is declared in this scope
      variables.put(identifier, value);
    }
    else if(parent != null) {
      // The variable was not declared in this scope, so let
      // the parent scope re-assign it
      parent.reAssign(identifier, value);
    }
  }

  public TLValue resolve(String var) {
    TLValue value = variables.get(var);
    if(value != null) {
      // The variable resides in this scope
      return value;
    }
    else if(!isGlobalScope()) {
      // Let the parent scope look for the variable
      return parent.resolve(var);
    }
    else {
      // Unknown variable
      return null;
    }
  }
}
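
Here's a short, hypothetical demonstration of these scoping rules: assigning to a name that already exists in an outer scope re-assigns it there, while a new name is declared locally:
package tl;

public class ScopeDemo {
  public static void main(String[] args) {
    Scope global = new Scope();
    global.assign("x", new TLValue(1));

    Scope inner = new Scope(global);
    inner.assign("x", new TLValue(2)); // `x` exists: re-assigned in the global scope
    inner.assign("y", new TLValue(3)); // `y` is new: declared in the inner scope only

    System.out.println(global.resolve("x")); // 2
    System.out.println(global.resolve("y")); // null (only visible in `inner`)
  }
}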

TLTreeWalker.g


Inside our tree walker, we'll also have a reference to the Map of functions and we'll keep track of the current Scope our tree walker is in. Finally, I've let the entry point of our walker, the walk rule, return a TLNode:
tree grammar TLTreeWalker;

options {
  tokenVocab=TL;
  ASTLabelType=CommonTree;
}

@header {
  package tl.parser;
  import tl.*;
  import tl.tree.*;
  import java.util.Map;
  import java.util.HashMap;
}

@members {
  public Map<String, Function> functions = null;
  Scope currentScope = null;
  
  public TLTreeWalker(CommonTreeNodeStream nodes, Map<String, Function> fns) {
    this(nodes, null, fns);
  }
  
  public TLTreeWalker(CommonTreeNodeStream nds, Scope sc, Map<String, Function> fns) {
    super(nds);
    currentScope = sc;
    functions = fns;
  }
}

walk returns [TLNode node]
  :  block {node = null;}
  ;

// other rules ...

Main.java


We can now test all changes with the new Main class:
package tl;

import tl.parser.*;
import tl.tree.*;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // create an instance of the lexer
    TLLexer lexer = new TLLexer(new ANTLRFileStream("test.tl"));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
        
    // create the parser
    TLParser parser = new TLParser(tokens);
    
    // build the AST and wrap it in a node stream
    CommonTree tree = (CommonTree)parser.parse().getTree();
    CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
    
    // pass the reference to the Map of functions to the tree walker
    TLTreeWalker walker = new TLTreeWalker(nodes, parser.functions);
    
    // get the returned node 
    TLNode returned = walker.walk();
    System.out.println(returned == null ? "null" : returned.evaluate());
  }
}

That should make our project compilable again. Test everything by changing the contents of test.tl to some different sources. Again, if no errors are printed, you're okay. Of course, null is printed, since that's what our walk rule still returns.

In case you haven't been able to get things running properly so far, download the correct sources of the project developed until now, here.



Evaluation I


As you've seen, our walk rule now returns a TLNode. We're going to let all rules in our tree grammar return such an object (except some rules that exist only to support a single other rule, like idList or exprList, or the child rules of an if-statement).

Our tree grammar including the return types now looks like:
tree grammar TLTreeWalker;

options {
  tokenVocab=TL;
  ASTLabelType=CommonTree;
}

@header {
  package tl.parser;
  import tl.*;
  import tl.tree.*;
  import java.util.Map;
  import java.util.HashMap;
}

@members {
  public Map<String, Function> functions = null;
  Scope currentScope = null;
  
  public TLTreeWalker(CommonTreeNodeStream nodes, Map<String, Function> fns) {
    this(nodes, new Scope(), fns);
  }
  
  public TLTreeWalker(CommonTreeNodeStream nds, Scope sc, Map<String, Function> fns) {
    super(nds);
    currentScope = sc;
    functions = fns;
  }
}

walk returns [TLNode node]
  :  block
  ;

block returns [TLNode node]
  :  ^(BLOCK ^(STATEMENTS statement*) ^(RETURN expression?))
  ;

statement returns [TLNode node]
  :  assignment
  |  functionCall
  |  ifStatement
  |  forStatement
  |  whileStatement
  ;

assignment returns [TLNode node]
  :  ^(ASSIGNMENT Identifier indexes? expression)
  ;

functionCall returns [TLNode node]
  :  ^(FUNC_CALL Identifier exprList?)
  |  ^(FUNC_CALL Println expression?)
  |  ^(FUNC_CALL Print expression)
  |  ^(FUNC_CALL Assert expression)
  |  ^(FUNC_CALL Size expression)
  ;

ifStatement returns [TLNode node]
  :  ^(IF ifStat elseIfStat* elseStat?)
  ;

ifStat
  :  ^(EXP expression block)
  ;

elseIfStat
  :  ^(EXP expression block)
  ;

elseStat
  :  ^(EXP block)
  ;
   
forStatement returns [TLNode node]
  :  ^(For Identifier expression expression block)
  ;

whileStatement returns [TLNode node]
  :  ^(While expression block)
  ;

idList returns [java.util.List<String> i]
  :  ^(ID_LIST Identifier+)
  ;

exprList returns [java.util.List<TLNode> e]
  :  ^(EXP_LIST expression+)
  ;

expression returns [TLNode node]
  :  ^(TERNARY expression expression expression)
  |  ^(In expression expression)
  |  ^('||' expression expression)
  |  ^('&&' expression expression)
  |  ^('==' expression expression)
  |  ^('!=' expression expression)
  |  ^('>=' expression expression)
  |  ^('<=' expression expression)
  |  ^('>' expression expression)
  |  ^('<' expression expression)
  |  ^('+' expression expression)
  |  ^('-' expression expression)
  |  ^('*' expression expression)
  |  ^('/' expression expression)
  |  ^('%' expression expression)
  |  ^('^' expression expression)
  |  ^(UNARY_MIN expression)
  |  ^(NEGATE expression)
  |  Number
  |  Bool
  |  Null
  |  lookup           
  ;

lookup returns [TLNode node]
  :  ^(LOOKUP functionCall indexes?)
  |  ^(LOOKUP list indexes?)
  |  ^(LOOKUP expression indexes?) 
  |  ^(LOOKUP Identifier indexes?)
  |  ^(LOOKUP String indexes?)
  ;
  
list returns [TLNode node]
  :  ^(LIST exprList?)
  ;

indexes returns [java.util.List<TLNode> e]
  :  ^(INDEXES expression+)
  ;

The next step is to write custom TLNode classes and let these be returned by our tree grammar rules. That way, we simply invoke our top-most rule, walk, which returns the root of the tree, and when we call its evaluate() method, it recursively calls its children's evaluate() methods, and so on.

The second part of this evaluate-tutorial can be read here: 8. Interpreting and evaluating TL II.

Download the most recent source files of this tutorial here.

Monday, March 07, 2011

6. Creating a tree grammar


This is part 6/9 of Creating your own programming language with ANTLR.


Introduction


So far we've created a lexer to chop up the character stream into tokens: the lexical analysis. Using the parser, we've checked the syntactic validity of the lexer's token stream and let the parser create a proper AST structure. It's now time to prepare for the semantic analysis, which in our case is simply evaluating the input, or throwing an exception. For example, take the statement:
if 1+1 > true do
  // ...
end

which is lexically correct (they're all valid tokens) and syntactically correct (it adheres to the grammar of our relExpr rule), but semantically it makes no sense: we can't compare a numeric value with a boolean value.

It is now time to create some sort of walker (or iterator) that traverses the AST and evaluates the expressions/statements it encounters. You could do that by writing your own iterator to walk the AST, but ANTLR also provides a mechanism for this: you can let ANTLR generate such an iterator by writing a so-called tree grammar, in which you declare what the tree built by your parser can look like. In practice, you pretty much copy all parser rules from the combined grammar into a new tree grammar file, keeping only the tree shapes produced by the rewrite rules.

If you haven't already got the necessary files, you can download the project here, which contains all the source code developed so far, the ANTLR jar, build script etc.



Preparation


A tree grammar is used to create an iterator that walks the tree produced by the parser. Therefore, it resembles the parser grammar quite a bit, only it is a more compact version: it does not contain the tokens that have been omitted from the tree, nor any lexer rules.

You begin a tree grammar file with the keywords tree grammar followed by its name, which will be TLTreeWalker. After that, you declare what tokens are to be used and the type of tree-tokens that it can expect. These are both defined in the options { ... } section of the tree grammar. And like the combined grammar, we will also need to tell ANTLR to put the generated walker in the tl.parser package.

// file: TLTreeWalker.g

tree grammar TLTreeWalker; // the generated file is TLTreeWalker.java

options {
  tokenVocab=TL; // this means to use the TL.tokens file
  ASTLabelType=CommonTree;
}

@header {
  package tl.parser;
}

// rules here ...

Our Ant build file, build.xml will need to be extended so that it also generates the tree walker. Change the generate target into the following:
<target name="generate" depends="clean" description="Generates the lexer and parser.">
  <echo>Generating the lexer and parser...</echo>
  <java classname="org.antlr.Tool" fork="true" failonerror="true">
    <arg value="-fo" />
    <arg value="src/main/tl/parser/" />
    <arg value="src/grammar/TL.g" />
    <classpath refid="classpath" />
  </java>
  <echo>Generating the tree walker...</echo>
  <java classname="org.antlr.Tool" fork="true" failonerror="true">
    <arg value="-fo" />
    <arg value="src/main/tl/parser/" />
    <arg value="src/grammar/TLTreeWalker.g" />
    <classpath refid="classpath" />
  </java>
  <antcall target="compile" />
</target>

And of course, our main class will need to be adjusted so that the tree walker class gets instantiated and used:
package tl;

import tl.parser.*;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // create an instance of the lexer
    TLLexer lexer = new TLLexer(new ANTLRFileStream("test.tl"));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
        
    // create the parser
    TLParser parser = new TLParser(tokens);
    
    // walk the tree
    CommonTree tree = (CommonTree)parser.parse().getTree();
    CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
    TLTreeWalker walker = new TLTreeWalker(nodes);
    walker.walk();
  }
}



Tree grammar


First create a file called TLTreeWalker.g and save it in src/grammar/. The tree grammar would look like this:
tree grammar TLTreeWalker;

options {
  tokenVocab=TL;
  ASTLabelType=CommonTree;
}

@header {
  package tl.parser;
}

walk
  :  block
  ;

block
  :  ^(BLOCK ^(STATEMENTS statement*) ^(RETURN expression?))
  ;

statement
  :  assignment
  |  functionCall
  |  ifStatement
  |  forStatement
  |  whileStatement
  ;

assignment
  :  ^(ASSIGNMENT Identifier indexes? expression)
  ;

functionCall
  :  ^(FUNC_CALL Identifier exprList?)
  |  ^(FUNC_CALL Println expression?)
  |  ^(FUNC_CALL Print expression)
  |  ^(FUNC_CALL Assert expression)
  |  ^(FUNC_CALL Size expression)
  ;

ifStatement
  :  ^(IF ifStat elseIfStat* elseStat?)
  ;

ifStat
  :  ^(EXP expression block)
  ;

elseIfStat
  :  ^(EXP expression block)
  ;

elseStat
  :  ^(EXP block)
  ;
   
forStatement
  :  ^(For Identifier expression expression block)
  ;

whileStatement
  :  ^(While expression block)
  ;

idList
  :  ^(ID_LIST Identifier+)
  ;

exprList
  :  ^(EXP_LIST expression+)
  ;

expression
  :  ^(TERNARY expression expression expression)
  |  ^(In expression expression)
  |  ^('||' expression expression)
  |  ^('&&' expression expression)
  |  ^('==' expression expression)
  |  ^('!=' expression expression)
  |  ^('>=' expression expression)
  |  ^('<=' expression expression)
  |  ^('>' expression expression)
  |  ^('<' expression expression)
  |  ^('+' expression expression)
  |  ^('-' expression expression)
  |  ^('*' expression expression)
  |  ^('/' expression expression)
  |  ^('%' expression expression)
  |  ^('^' expression expression)
  |  ^(UNARY_MIN expression)
  |  ^(NEGATE expression)
  |  Number
  |  Bool
  |  Null
  |  lookup           
  ;

list
  :  ^(LIST exprList?)
  ;

lookup
  :  ^(LOOKUP functionCall indexes?)
  |  ^(LOOKUP list indexes?)
  |  ^(LOOKUP expression indexes?) 
  |  ^(LOOKUP Identifier indexes?)
  |  ^(LOOKUP String indexes?)
  ;

indexes
  :  ^(INDEXES expression+)
  ;

Notice that the doEndBlock rule is not part of this tree grammar. This is because the Do and End tokens were removed during the rewrite phase in the previous part of this tutorial. The rules that used the doEndBlock rule are now using the block rule instead.

Test the tree grammar with a couple of different inputs to see if it all works (no error messages means everything is okay). As usual, executing the Ant target run will be enough: all necessary files will be generated and compiled, and the main class will be executed.

Custom code will then be mixed into this tree grammar to evaluate the expressions and statements in the AST. But that will be done in the next part: 7. Interpreting and evaluating TL I.

P.S. You can download the project containing all source code created so far, build script, ANTLR jar etc. here.

Sunday, March 06, 2011

5. Building an AST


This is part 5/9 of Creating your own programming language with ANTLR.


Introduction


In this part we're going to continue with the combined grammar file that already contains all lexer- and parser rules. If you haven't already got the necessary files, you can download the project here, which contains all the source code, ANTLR jar, build script etc.



AST construction in ANTLR


By default, ANTLR's generated parser emits tokens of type CommonToken as a one-dimensional list. To tell our parser it should create CommonTree tokens, which can have parent and child tokens, we need to set the option output=AST in the options section of our grammar. This section must come directly after the grammar declaration:
grammar TL;

options { 
  output=AST;
}

@parser::header {
  package tl.parser;
}

@lexer::header {
  package tl.parser;
}

parse
  :  block EOF
  ;

// ...

Now edit the test.tl file and paste the following in it:
a = 2 + 5;
b = a * 2;

Also edit the Main.java file to look like this:
package tl;

import tl.parser.*;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // create an instance of the lexer
    TLLexer lexer = new TLLexer(new ANTLRFileStream("test.tl"));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
        
    // create the parser
    TLParser parser = new TLParser(tokens);
    
    // invoke the entry point of our parser and generate a DOT image of the tree
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

The code above will cause a DOT representation of the generated AST to be printed to the console. It will produce the following output:

digraph {

  n0 [label=""];
  n1 [label="a"];
  n2 [label="="];
  n3 [label="2"];
  n4 [label="+"];
  n5 [label="5"];
  n6 [label=";"];
  n7 [label="b"];
  n8 [label="="];
  n9 [label="a"];
  n10 [label="*"];
  n11 [label="2"];
  n12 [label=";"];
  n13 [label=""];

  n0 -> n1 // "" -> "a"
  n0 -> n2 // "" -> "="
  n0 -> n3 // "" -> "2"
  n0 -> n4 // "" -> "+"
  n0 -> n5 // "" -> "5"
  n0 -> n6 // "" -> ";"
  n0 -> n7 // "" -> "b"
  n0 -> n8 // "" -> "="
  n0 -> n9 // "" -> "a"
  n0 -> n10 // "" -> "*"
  n0 -> n11 // "" -> "2"
  n0 -> n12 // "" -> ";"
  n0 -> n13 // "" -> ""

}

There are many viewers that can generate images from such DOT files; I like graphviz-dev.appspot.com as a quick online tool. After you have pasted the output from Main into the graphviz-dev.appspot.com tool, you'll see an image like this appear:


This is still just a one-dimensional list of tokens (but now of type CommonTree). We'll have to do a bit more: ANTLR does not magically know which tokens need to become roots, and which tokens need to be omitted from our tree.



Tree operators and rewrite rules


There are two ways to create a proper AST in ANTLR:
  1. by using tree operators: ^ and !
  2. or by using rewrite rules: ... -> ^(...)

1. Tree operators

You can tell ANTLR to make a certain node the root by placing a ^ after it in your parser rules, and to exclude certain nodes from the tree by placing a ! after them.

Take the following parser rule for example:
bar
  :  A B C D
  ;

If we wanted to let B become the root, and wanted to exclude token D from the tree, we'd do that like this:
bar
  :  A B^ C D!
  ;

This will produce a tree with B as the root and A and C as its children: ^(B A C).


2. Rewrite rules

The second option to create an AST is to use rewrite rules. A rewrite rule is placed after a parser rule (but before the ';'!). It consists of an arrow, ->, followed by ^( ... ), where the first token (or rule) inside the parentheses becomes the root of the tree. Tokens that need to be removed from the AST are simply omitted from the rewrite rule.

To create the same tree as the example above using a rewrite rule, we'd write:
bar
  :  A B C D -> ^(B A C)
  ;

I.e.: B will be the root, and A and C its children (D is removed).

What to use?

The first option, using tree operators, may look easier to use. However, when your parser rules become larger, the (small) operators become less conspicuous. Also, rewrite rules are more flexible. If we wanted to let token C become the first child of parent token B in the previous example, tree operators couldn't do this, but a rewrite rule would simply look like this:
bar
  :  A B C D -> ^(B C A)
  ;

Another advantage is that the translation of a parser grammar into a tree grammar (the next part of this tutorial) is more straightforward. But you'll see that for yourself later on.

But sometimes it is more convenient to use tree operators than rewrite rules. For example, with an add expression:
addExpr
  :  multiplyExpr (('+' | '-') multiplyExpr)*
  ;

we would like to make the + or - the root of the expression. With rewrite rules, that would look like this:
addExpr
  :  (multiplyExpr -> multiplyExpr) 
     ( '+' e=multiplyExpr -> ^('+' $addExpr $e) 
     | '-' e=multiplyExpr -> ^('-' $addExpr $e)
     )*
  ;

which is far from trivial rewrite syntax! And I'll leave it unexplained, because in our grammar we'll go for the far less complex:
addExpr
  :  multiplyExpr (('+' | '-')^ multiplyExpr)*
  ;

Yes, that is correct: the single ^ produces the exact same tree as the complex-looking rewrite rule! For the expression 1+2+3, both produce the (left-associative) tree ^('+' ^('+' 1 2) 3).




Imaginary tokens


Sometimes a parser rule doesn't have an obvious root candidate. In that case, we can insert a so-called imaginary token. Take the block rule for example:
block
  :  (statement | functionDecl)* (Return expression ';')?
  ;

which matches zero or more statements or function declarations, optionally ending with a return statement. Marking the first statement as the root would make little sense. What we're going to do instead is create a couple of imaginary tokens, so that a block tree always looks the same. The first thing we need to do is put a tokens { ... } section directly below the options { ... } section (and above the header sections!):
grammar TL;

options { 
  output=AST;
}

tokens {
  BLOCK;
  RETURN;
  STATEMENTS;
}

@parser::header {
  package tl.parser;
}

// ...

Imaginary tokens should start with a capital, just like lexer rules. I find it convenient to make them distinguishable from lexer rules by using only capitals, instead of just capitalizing the first letter as I did with lexer rules.

Our block rule including a rewrite rule would look like:
block
  :  (statement | functionDecl)* (Return expression ';')? 
     -> ^(BLOCK ^(STATEMENTS statement*) ^(RETURN expression?))
  ;

This makes sure every block-tree will have exactly two children: STATEMENTS and RETURN.

If we didn't use STATEMENTS, all zero or more statements would be direct children of the root BLOCK node, making the tree harder to evaluate, since we'd always have to check whether we were at the last child, which could possibly be the return statement. Now we know that all zero or more statements are in the left subtree, and the optional return statement is in the right subtree.

You may have noticed that I removed the functionDecl from the tree. This is no accident: at the parsing stage, we're going to create a java.util.Map containing all functions, so that when evaluating our language, we can invoke functions before they've been defined in the source. We'll handle that at a later stage of this tutorial.

Note that you can't use a rewrite rule and tree operators in a single parser rule: it's always just one of the two.



A proper AST


Okay, having said all that, here's what our grammar looks like with tree operators and rewrite rules:
grammar TL;

options {
  output=AST;
}

tokens {
  BLOCK;
  RETURN;
  STATEMENTS;
  ASSIGNMENT;
  FUNC_CALL;
  EXP;
  EXP_LIST;
  ID_LIST;
  IF;
  TERNARY;
  UNARY_MIN;
  NEGATE;
  FUNCTION;
  INDEXES;
  LIST;
  LOOKUP;
}

@parser::header {
  package tl.parser;
}

@lexer::header {
  package tl.parser;
}

parse
  :  block EOF -> block
  ;

block
  :  (statement | functionDecl)* (Return expression ';')? 
     -> ^(BLOCK ^(STATEMENTS statement*) ^(RETURN expression?))
  ;

statement
  :  assignment ';'   -> assignment
  |  functionCall ';' -> functionCall
  |  ifStatement
  |  forStatement
  |  whileStatement
  ;

assignment
  :  Identifier indexes? '=' expression 
     -> ^(ASSIGNMENT Identifier indexes? expression)
  ;

functionCall
  :  Identifier '(' exprList? ')' -> ^(FUNC_CALL Identifier exprList?)
  |  Println '(' expression? ')'  -> ^(FUNC_CALL Println expression?)
  |  Print '(' expression ')'     -> ^(FUNC_CALL Print expression)
  |  Assert '(' expression ')'    -> ^(FUNC_CALL Assert expression)
  |  Size '(' expression ')'      -> ^(FUNC_CALL Size expression)
  ;

ifStatement
  :  ifStat elseIfStat* elseStat? End -> ^(IF ifStat elseIfStat* elseStat?)
  ;

ifStat
  :  If expression Do block -> ^(EXP expression block)
  ;

elseIfStat
  :  Else If expression Do block -> ^(EXP expression block)
  ;

elseStat
  :  Else Do block -> ^(EXP block)
  ;

functionDecl
  :  Def Identifier '(' idList? ')' block End {/* implemented later */}
  ;

forStatement
  :  For Identifier '=' expression To expression Do block End 
     -> ^(For Identifier expression expression block)
  ;

whileStatement
  :  While expression Do block End -> ^(While expression block)
  ;

idList
  :  Identifier (',' Identifier)* -> ^(ID_LIST Identifier+)
  ;

exprList
  :  expression (',' expression)* -> ^(EXP_LIST expression+)
  ;

expression
  :  condExpr
  ;

condExpr
  :  (orExpr -> orExpr) 
     ( '?' a=expression ':' b=expression -> ^(TERNARY orExpr $a $b)
     | In expression                     -> ^(In orExpr expression)
     )?
  ;

orExpr
  :  andExpr ('||'^ andExpr)*
  ;

andExpr
  :  equExpr ('&&'^ equExpr)*
  ;

equExpr
  :  relExpr (('==' | '!=')^ relExpr)*
  ;

relExpr
  :  addExpr (('>=' | '<=' | '>' | '<')^ addExpr)*
  ;

addExpr
  :  mulExpr (('+' | '-')^ mulExpr)*
  ;

mulExpr
  :  powExpr (('*' | '/' | '%')^ powExpr)*
  ;

powExpr
  :  unaryExpr ('^'^ unaryExpr)*
  ;
  
unaryExpr
  :  '-' atom -> ^(UNARY_MIN atom)
  |  '!' atom -> ^(NEGATE atom)
  |  atom
  ;

atom
  :  Number
  |  Bool
  |  Null
  |  lookup
  ;

list
  :  '[' exprList? ']' -> ^(LIST exprList?)
  ;

lookup
  :  functionCall indexes?       -> ^(LOOKUP functionCall indexes?)
  |  list indexes?               -> ^(LOOKUP list indexes?)
  |  Identifier indexes?         -> ^(LOOKUP Identifier indexes?)
  |  String indexes?             -> ^(LOOKUP String indexes?)
  |  '(' expression ')' indexes? -> ^(LOOKUP expression indexes?)
  ;

indexes
  :  ('[' expression ']')+ -> ^(INDEXES expression+)
  ;

Println  : 'println';
Print    : 'print';
Assert   : 'assert';
Size     : 'size';
Def      : 'def';
If       : 'if';
Else     : 'else';
Return   : 'return';
For      : 'for';
While    : 'while';
To       : 'to';
Do       : 'do';
End      : 'end';
In       : 'in';
Null     : 'null';

Or       : '||';
And      : '&&';
Equals   : '==';
NEquals  : '!=';
GTEquals : '>=';
LTEquals : '<=';
Pow      : '^';
Excl     : '!';
GT       : '>';
LT       : '<';
Add      : '+';
Subtract : '-';
Multiply : '*';
Divide   : '/';
Modulus  : '%';
OBrace   : '{';
CBrace   : '}';
OBracket : '[';
CBracket : ']';
OParen   : '(';
CParen   : ')';
SColon   : ';';
Assign   : '=';
Comma    : ',';
QMark    : '?';
Colon    : ':';

Bool
  :  'true' 
  |  'false'
  ;

Number
  :  Int ('.' Digit*)?
  ;

Identifier
  :  ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | Digit)*
  ;

String
@after {
  setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
  :  '"'  (~('"' | '\\')  | '\\' ('\\' | '"'))* '"' 
  |  '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
  ;

Comment
  :  '//' ~('\r' | '\n')* {skip();}
  |  '/*' .* '*/'         {skip();}
  ;

Space
  :  (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
  ;

fragment Int
  :  '1'..'9' Digit*
  |  '0'
  ;
  
fragment Digit 
  :  '0'..'9'
  ;

And to test this, run the Ant target with the command-line parameter -emacs, which causes Ant not to print all the [java] prefixes and other clutter to the console, making it easier to copy and paste the DOT source:
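
ant -emacs run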

Okay, the tree that is generated is a heck of a lot easier to interpret than the flat list of nodes! But before evaluating it, we first have to create some sort of iterator that traverses our freshly constructed AST, which is the subject of the next part: 6. Creating a tree grammar.

P.S. You can download the project containing all source code created so far, build script, ANTLR jar etc. here.

Saturday, March 05, 2011

4. Syntactic analysis of TL


This is part 4/9 of Creating your own programming language with ANTLR.


Index





Introduction


In this part, we're going to extend our combined grammar file from part 3 with parser rules. After the lexer has provided us with tokens, we're going to let the parser validate the order of these tokens, which is called the syntactic analysis of the language.



Project structure


Before we continue, we're going to set up a very basic project structure to keep things a bit orderly, and to make generating, compiling and running code easier.

The project will look like:

TL_ROOT
   |
   +-- build/
   |     |
   |     +-- classes/
   +-- lib/
   |    |
   |    +-- antlr-3.2.jar
   +-- src/
   |    |
   |    +-- grammar/
   |    |      |
   |    |      +-- TL.g
   |    +-- main/
   |          | 
   |          +-- tl/  
   |               |
   |               +-- parser/
   |               +-- Main.java
   +-- build.xml
   +-- test.tl

The names that end with a / are directories; the others, not surprisingly, are files. You can re-use the files we used in 3. Lexical analysis of TL. There are some small modifications, however:

In the combined grammar file TL.g we tell ANTLR to put both the generated lexer and parser in the package tl.parser by adding this package declaration to their respective header sections:
grammar TL;

@parser::header {
  package tl.parser;
}

@lexer::header {
  package tl.parser;
}

parse
  :  .* EOF
  ;
  
// the lexer rules here

The class Main.java will be in the package tl, and it will import tl.parser.*, which is where our generated lexer and parser will be placed:
package tl;

import tl.parser.*;
import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // create an instance of the lexer
    TLLexer lexer = new TLLexer(new ANTLRFileStream("test.tl"));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
        
    // create the parser
    TLParser parser = new TLParser(tokens);
    
    // invoke the entry point of our parser
    parser.parse();
    System.out.println("Done!");
  }
}

We'll use Ant to do all the boring tasks such as generating, compiling and running (source) code. I assume you, as a Java developer (or Java enthusiast), either already have Ant installed, or are able to install it on your own.

The Ant build script we'll be using looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<project name="TinyLanguage" default="run">

  <path id="classpath">
    <pathelement location="build/classes/" />
    <pathelement location="src/main/" />
    <fileset dir="lib">
      <include name="*.jar" />
    </fileset>
  </path>

  <target name="clean">
    <delete dir="build/classes/" />
    <mkdir dir="build/classes/" />
  </target>

  <target name="compile" depends="clean">
    <javac srcdir="src/main/" destdir="build/classes/" includeantruntime="false">
      <classpath refid="classpath" />
    </javac>
  </target>

  <target name="generate" depends="clean">
    <echo>Generating the lexer and parser...</echo>
    <java classname="org.antlr.Tool" fork="true" failonerror="true">
      <arg value="-fo" />
      <arg value="src/main/tl/parser/" />
      <arg value="src/grammar/TL.g" />
      <classpath refid="classpath" />
    </java>
    <antcall target="compile" />
  </target>

  <target name="run" depends="generate">
    <echo>Running the main class...</echo>
    <java classname="tl.Main">
      <classpath refid="classpath" />
    </java>
  </target>

</project>

That's it. Now open a command prompt, navigate to the project root, and test if everything works by executing the run target:

ant run

which should produce the following output:

/Path/To/TinyLanguage$ ant run
Buildfile: /Path/To/TinyLanguage/build.xml

clean:
   [delete] Deleting directory /Path/To/TinyLanguage/build/classes
    [mkdir] Created dir: /Path/To/TinyLanguage/build/classes

generate:
     [echo] Generating the lexer and parser...

clean:
   [delete] Deleting directory /Path/To/TinyLanguage/build/classes
    [mkdir] Created dir: /Path/To/TinyLanguage/build/classes

compile:
    [javac] Compiling 3 source files to /Path/To/TinyLanguage/build/classes

run:
     [echo] Running the main class...
     [java] Done!

BUILD SUCCESSFUL
Total time: 1 second



Parser rules


Let's start defining the parser rules at the highest level. A TL script is essentially a block of code. A block of code is zero or more statements or function declarations, optionally ending with a return statement. A statement is either an assignment, a function call, an if-statement, a for-statement or a while-statement. Assignments, function calls and return statements must all end with a semicolon.

If we translate the above into ANTLR parser rules, we could end up with this:
parse
  :  block EOF
  ;

block
  :  (statement | functionDecl)* (Return expression ';')?
  ;

statement
  :  assignment ';'
  |  functionCall ';'
  |  ifStatement
  |  forStatement
  |  whileStatement
  ;

functionDecl
  :  Def Identifier '(' idList? ')' block End
  ;

idList
  :  Identifier (',' Identifier)*
  ;

You can paste the rules into the TL.g file. They should come after the @header { ... } sections. You can mix lexer- and parser rules, but your grammar is more readable if you keep them separate. People usually write lexer rules at the end of the grammar file (remember that the order of the lexer rules is important!).



An assignment can be as simple as a = 12*5; (an identifier on the left, an expression on the right), but it could also look like X[0][1] = false;, where X points to a two-dimensional list whose first list's second element must be set to false. Such rules would look like this:
assignment
  :  Identifier indexes? '=' expression
  ;

indexes
  :  ('[' expression ']')+
  ;
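
To make this concrete, here's how the X[0][1] = false; example from above breaks down under these rules:

Identifier  ->  X
indexes     ->  [0][1]   (two '[' expression ']' groups)
expression  ->  false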



An expression takes a few more rules. We start with the operators with the lowest precedence, the ternary condition (a = 1+2==3 ? true : false) and the in operator ('foo' in ['bar','foo']), and then trickle down towards the operators with the highest precedence (the unary - and !).
expression
  :  condExpr
  ;

condExpr
  :  orExpr ( '?' expression ':' expression
            | In expression
            )?
  ;

orExpr
  :  andExpr ('||' andExpr)*
  ;

andExpr
  :  equExpr ('&&' equExpr)*
  ;

equExpr
  :  relExpr (('==' | '!=') relExpr)*
  ;

relExpr
  :  addExpr (('>=' | '<=' | '>' | '<') addExpr)*
  ;

addExpr
  :  mulExpr (('+' | '-') mulExpr)*
  ;

mulExpr
  :  powExpr (('*' | '/' | '%') powExpr)*
  ;

powExpr
  :  unaryExpr ('^' unaryExpr)*
  ;
  
unaryExpr
  :  '-' atom
  |  '!' atom
  |  atom
  ;

atom
  :  Null
  |  Number
  |  Bool
  |  lookup
  ;
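
To see how this cascade of rules encodes operator precedence, trace the input 1 + 2 * 3 through the rules (a quick sketch, not generated output):

addExpr
  mulExpr      matches 1 (falling through powExpr, unaryExpr and atom)
  '+'
  mulExpr      matches 2 * 3 as a whole

Because * is matched deeper in the cascade than +, it automatically binds more tightly.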



The lookup rule can be various other rules with optional indexes after it:
lookup
  :  functionCall indexes?
  |  '(' expression ')' indexes?
  |  list indexes?
  |  Identifier indexes?
  |  String indexes?
  ;



The list rule shouldn't pose too many problems:
list
  :  '[' exprList? ']'
  ;

exprList
  :  expression (',' expression)*
  ;



A function call would look like:
functionCall
  :  Identifier '(' exprList? ')'
  |  Println '(' expression? ')'
  |  Print '(' expression ')'
  |  Assert '(' expression ')'
  |  Size '(' expression ')'
  ;

As you can see, the function calls print(...), assert(...) and size(...) each take exactly one expression as parameter, println(...) takes an optional expression as parameter, and a call to a custom-defined function takes zero or more expressions as parameters.
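
For example, the following calls are all syntactically valid TL (myFunction standing in for any user-defined function):

println();
println('hi');
print('hi');
assert(1 < 2);
size([1,2,3]);
myFunction(1, 'two', [3]);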



Lastly, there are the if-, for- and while statements. I've divided the if statement into a couple of separate sub-rules, otherwise the entire rule would have been too big for my taste:
ifStatement
  :  ifStat elseIfStat* elseStat? End 
  ;

ifStat
  :  If expression Do block
  ;

elseIfStat
  :  Else If expression Do block
  ;

elseStat
  :  Else Do block
  ;

forStatement
  :  For Identifier '=' expression To expression Do block End 
  ;

whileStatement
  :  While expression Do block End
  ;




That's the complete parser grammar of TL! If you execute the Ant target run so that the file test.tl gets parsed, no error messages should appear on your screen. This means the contents of the test script are properly tokenized first, and the parser then correctly parses those tokens (the file is syntactically valid).



Next


So that's all the parser does: it checks whether the token stream is syntactically valid. But our tokens are still just a flat, one-dimensional list, and that list still contains tokens that are only needed to check the validity of the source, while making it harder to evaluate the expression(s). Let's say our input source is true && (false || true && (true || false));. Our parser currently just validates it as syntactically valid input, but it would be much easier to evaluate such input if the token stream were structured like this:

        &&
       /  \
   true    ||
          /  \
     false    &&
             /  \
         true    ||
                /  \
            true    false

I.e. the parentheses and semicolon are gone, and every operator node has exactly two children, which can be evaluated much more easily.
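
To get a feel for why such a structure is easier to work with, here's a minimal sketch (my own illustration, not part of the tutorial's sources; we won't actually build the AST until part 5) that recursively evaluates a boolean tree like the one above through ANTLR's Tree interface:

import org.antlr.runtime.tree.Tree;

public class BoolEval {

  // Recursively evaluate a tree whose operator nodes are '&&' or '||'
  // and whose leaves are the literals 'true' and 'false'.
  static boolean eval(Tree node) {
    if ("&&".equals(node.getText())) {
      return eval(node.getChild(0)) && eval(node.getChild(1));
    }
    if ("||".equals(node.getText())) {
      return eval(node.getChild(0)) || eval(node.getChild(1));
    }
    return Boolean.parseBoolean(node.getText());
  }
}

Each operator node simply evaluates its two children; no parentheses or semicolons get in the way.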

The structuring of our token stream into an AST (abstract syntax tree) will be handled in part 5. Building an AST.

P.S. You can download the project containing all source code, build script, ANTLR jar etc. here.

Friday, March 04, 2011

3. Lexical analysis of TL


This is part 3/9 of Creating your own programming language with ANTLR.


Index





Introduction


In this part we're going to write a lexical analyzer, or just lexer, for our TL language. Since TL is an uncomplicated language, we're going to write the lexer- and parser grammar in one combined ANTLR grammar as shown in part 2. Introduction to ANTLR.

Looking at part 1. TL specification, here's the list of tokens in the combined grammar file TL.g:
grammar TL;

parse
  :  (t=. 
          {System.out.printf("text: \%-7s  type: \%s \n", 
           $t.text, tokenNames[$t.type]);}
     )* 
     EOF
  ;
  
Println  : 'println';
Print    : 'print';
Assert   : 'assert';
Size     : 'size';
Def      : 'def';
If       : 'if';
Else     : 'else';
Return   : 'return';
For      : 'for';
While    : 'while';
To       : 'to';
Do       : 'do';
End      : 'end';
In       : 'in';
Null     : 'null';

Or       : '||';
And      : '&&';
Equals   : '==';
NEquals  : '!=';
GTEquals : '>=';
LTEquals : '<=';
Pow      : '^';
Excl     : '!';
GT       : '>';
LT       : '<';
Add      : '+';
Subtract : '-';
Multiply : '*';
Divide   : '/';
Modulus  : '%';
OBrace   : '{';
CBrace   : '}';
OBracket : '[';
CBracket : ']';
OParen   : '(';
CParen   : ')';
SColon   : ';';
Assign   : '=';
Comma    : ',';
QMark    : '?';
Colon    : ':';

Bool
  :  'true' 
  |  'false'
  ;

Number
  :  Int ('.' Digit*)?
  ;

Identifier
  :  ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | Digit)*
  ;

String
@after {
  setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
  :  '"'  (~('"' | '\\')  | '\\' .)* '"' 
  |  '\'' (~('\'' | '\\') | '\\' .)* '\'' 
  ;

Comment
  :  '//' ~('\r' | '\n')* {skip();}
  |  '/*' .* '*/'         {skip();}
  ;

Space
  :  (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
  ;

fragment Int
  :  '1'..'9' Digit*
  |  '0'
  ;
  
fragment Digit 
  :  '0'..'9'
  ;
There are a couple of things that might be confusing, and/or things I haven't mentioned before. I'll address those first before testing our lexer.


DOT meta character

In the parser rule parse, I've used the . (dot):
parse
  :  .* EOF
  ;
A dot can mean two things:
  • inside a lexer rule, it matches any character in the range \u0000..\uFFFF;
  • inside a parser rule, it matches any lexer rule.
Many people new to ANTLR think that when doing:
foo
  :  .
  ;

A : 'a';

B : 'b';
the rule foo will match any character. This is not the case: the rule foo matches either rule A or B (so only the character 'a' or 'b').

The rule C in:
C : . ;
does match any character in the range \u0000..\uFFFF.


Order of lexer rules

Notice that the first few lexer rules (Println through Null) are the so-called reserved words. When the same input matches two or more lexer rules, ANTLR lets the rule that is defined first win. So it's important that such reserved words come before a rule like Identifier: Identifier also matches the string 'println', and if it came first, no Println token would ever be created.
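
For example, if Identifier had been defined above the keyword rules:

Identifier
  :  ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | Digit)*
  ;

Println  : 'println'; // never created: 'println' already matches Identifier

then the input println would be tokenized as an Identifier, and a Println token would never be created.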


Custom code

Note that I used {skip();} in both the Comment and Space rules. Whenever the lexer matches either of those tokens, it is discarded immediately. These tokens will therefore not be available in parser rules. The skip() method is inherited from the Lexer class, which our TLLexer automatically extends.

I also used:
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
in the String rule. This is just a short notation for:
// get the text this token matched
String matched = getText();
       
// remove the leading and trailing quote
matched = matched.substring(1, matched.length()-1);
       
// replace all `\X` with `X`
matched = matched.replaceAll("\\\\(.)", "$1");
       
// set the new contents of this token
setText(matched);
where getText() and setText(String) are methods available inside lexer rules that read and replace the text of the token being created.

Of course, the latter code is far more readable, but personally, I like to keep my ANTLR grammar as lean as possible: I don't like scanning through many lines of Java code to find the ANTLR rules. But if you feel more comfortable with the longer version, by all means use it.


Fragment rules

Lastly, there are two rules that have the keyword fragment in front of them. You create fragment rules when you don't want such a rule to become a token of its own. These rules are therefore only available from other lexer rules. Think of a fragment as a sort of macro: you define the fragment rule Digit to match a single digit, and you can then use Digit from any other lexer rule instead of duplicating '0'..'9' in many lexer rules.

The Digit fragment might not be a very convincing example, but let's say we wanted to support hexadecimal character literals inside our lexer that look like \u005A (\u followed by four hexadecimal digits). We could do that like this:
HexChar
  :  '\\' 'u' ('0'..'9' | 'a'..'f' | 'A'..'F') ('0'..'9' | 'a'..'f' | 'A'..'F') 
              ('0'..'9' | 'a'..'f' | 'A'..'F') ('0'..'9' | 'a'..'f' | 'A'..'F')
  ;
But the following is far better:
HexChar
  :  '\\' 'u' HexDigit HexDigit HexDigit HexDigit         
  ;

fragment HexDigit
  :  '0'..'9' | 'a'..'f' | 'A'..'F'
  ;
And if we didn't make HexDigit a fragment, there would be the risk that a HexDigit would become a token itself, which is probably not desired.

Also notice that I placed two backslashes inside '\\'. Inside literals, the \ and ' need to be escaped with a backslash if you want to match them.


Testing the lexer

Let's create a test class for our lexer, called Main.java, and copy the following contents in it:
import org.antlr.runtime.*;  
  
public class Main {  
  public static void main(String[] args) throws Exception {  
    // create an instance of the lexer  
    TLLexer lexer = new TLLexer(new ANTLRFileStream(args[0]));  
          
    // wrap a token-stream around the lexer  
    CommonTokenStream tokens = new CommonTokenStream(lexer);  
          
    // create the parser  
    TLParser parser = new TLParser(tokens);  
     
    // invoke the 'parse' rule
    parser.parse();
  }  
}  
And also create a small test script called test.tl containing:
aaa = 1.333 + 'a\'b"'; 
b = /* comment 1 */ aaa^aaa; 
println('b=' + b); // comment 2
Now generate the lexer:

java -cp antlr-3.2.jar org.antlr.Tool TL.g

And since it's a combined grammar, the parser will also be generated, but the parser will only have a single rule, parse, that matches the entire token stream and prints the text and type of each token.

Compile all Java source files:

javac -cp antlr-3.2.jar *.java

And run the main class:

java -cp .:antlr-3.2.jar Main test.tl # *nix/Mac
java -cp .;antlr-3.2.jar Main test.tl # Windows


which will produce the following output:

text: aaa      type: Identifier 
text: =        type: Assign 
text: 1.333    type: Number 
text: +        type: Add 
text: a'b"     type: String 
text: ;        type: SColon 
text: b        type: Identifier 
text: =        type: Assign 
text: aaa      type: Identifier 
text: ^        type: Pow 
text: aaa      type: Identifier 
text: ;        type: SColon 
text: println  type: Println 
text: (        type: OParen 
text: b=       type: String 
text: +        type: Add 
text: b        type: Identifier 
text: )        type: CParen 
text: ;        type: SColon
As you can see, the white spaces and comments are no longer part of the token stream and the rest of the input is correctly tokenized.

Okay, that concludes the lexical analysis of our language. Next up is the more challenging part: the syntactic analysis, or less formally, the parsing of the tokens produced by the lexer. Continue reading: 4. Syntactic analysis of TL

Thursday, March 03, 2011

2. Introduction to ANTLR


This is part 2/9 of Creating your own programming language with ANTLR.


Index



Introduction


ANTLR is a so-called parser generator, sometimes also called a compiler-compiler. In a nutshell: given a grammar of a language, ANTLR can generate a lexer and parser for said language. I realize that a one-line definition isn't worth much, so we'll create a small CSV parser using ANTLR before tackling our more complex TL. To keep things simple, we won't be creating a CSV parser that adheres strictly to the CSV RFC, but it's not too hard to rewrite it so that it does. I'll leave that as an exercise for the reader (sorry, I just had to cram that remark in this tutorial somewhere: I won't do it again. Promised!).

The first step is deciding what the smallest building blocks of the language are. In a CSV file, a single value is one of:

  • a simple value: zero or more characters other than a comma, line break or a double quote;
  • a quoted value: a single double quote followed by zero or more characters other than a double quote and ending with a single double quote. If a double quote is to be part of the value, it can be escaped by placing a double quote before it (so that's a double double quote: "").

Each row in a CSV file is one or more values as described above, delimited by a comma.
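
For example, under these rules the following row contains exactly three values:

one,"two, ""quoted""",three

namely one, then two, "quoted" (the doubled quotes collapse into one, and the comma stays part of the value), and finally three.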

As you might have noticed, I underlined some of the words in my specification of a CSV file. These are the building blocks of the language. As you can see, some blocks are made up from other blocks, like:

  • file (one or more rows)
  • row (one or more values separated by commas)
  • simple value (see the list above)
  • quoted value (see the list above)

These are also known as non-terminals in parser theory, while the other blocks can't be broken down into smaller blocks:

  • comma
  • line break
  • double quote
  • character other than a double quote

which are, not surprisingly, called terminals.

However, in this example we'll be parsing values as terminals because it's easier to do so (as you'll see later on).


Lexing


The first step is to create a lexer (or scanner) for our CSV language. A lexer is a thing that, given a stream of single characters, produces a stream of tokens. For example, the string AA,bBb,"c""C" would be presented to the lexer as follows:

A A , b B b , " c " " C "

i.e., just an array of characters. It should (ideally) produce a stream of tokens that looks like:

  • AA
  • ,
  • bBb
  • ,
  • "c""C"


ANTLR lexer


First create a file called CSVLexer.g, which is the ANTLR grammar file. Every ANTLR grammar file must start with a grammar type and name. The name should correspond to the name of the file. So, inside CSVLexer.g, this is the first thing you type:
// CSVLexer.g
lexer grammar CSVLexer;

/*
the rest of the file here
*/
As you can see, the same single- and multi-line comments as in Java are allowed in ANTLR grammar files.

Now we're going to define our three token types (value, comma and line break) in our grammar file; from now on I'll call these definitions rules. In a lexer grammar, you always define your rules starting with a capital. This is not a convention but a requirement!

Okay, let's start with the easiest one: the comma. The lexer rule for a comma is this:
Comma 
  :  ','
  ;
Note that the indentation is my personal preference, but the following is equivalent:
Comma: ',';
The first just looks better :).

So, to emphasize, an ANTLR rule must always look like this:

RULE_NAME : RULE_CONTENTS ;

Now the rule for line breaks. A line break is one of the following:
  • A) "\r\n" on Windows
  • B) "\n" on Macs/*nix
  • C) "\r" on old Macs
This results in the following rule:
LineBreak 
  :  '\r' '\n'  // A
  |  '\n'       // B
  |  '\r'       // C
  ;
or, when using the operator ? to make something optional, you could use the slightly more compact rule:
LineBreak 
  :  '\r'? '\n'  // matches '\r\n' or '\n'
  |  '\r'
  ;
And the lexer rules for a CSV value could look like this:
SimpleValue
  :  ~(',' | '\r' | '\n' | '"')+
  ;

QuotedValue
  :  '"' ('""' | ~'"')* '"'
  ;
As you might have gathered from the example above, some characters have a special meaning inside ANTLR grammars. Here's a table of the ones you've seen so far:

character  meaning        example         matches
|          logical OR     'a' | 'b'       either 'a' or 'b'
?          optional       'a' 'b'?        either 'ab' or 'a'
*          zero or more   'a'*            nothing, 'a', 'aa', 'aaa', ...
+          one or more    'a'+            'a', 'aa', 'aaa', ...
~          negation       ~('a' | 'b')    any character (in the range \u0000..\uFFFF) except 'a' and 'b'
(...)      grouping       ('a' 'b')+      'ab', 'abab', 'ababab', ...

So, the part ('""' | ~'"')* from the lexer rule QuotedValue matches either two double quotes or any single character other than a double quote, repeated zero or more times.
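
For example, on the earlier input "c""C" the rule proceeds as follows:

"     matches the opening quote
c     matches ~'"'
""    matches '""' (the escaped quote)
C     matches ~'"'
"     matches the closing quote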

Our lexer grammar now looks like this:
lexer grammar CSVLexer;

Comma 
  :  ','
  ;

LineBreak
  :  '\r'? '\n'
  |  '\r'
  ;

SimpleValue
  :  ~(',' | '\r' | '\n' | '"')+
  ;

QuotedValue
  :  '"' ('""' | ~'"')* '"'
  ;
and let's say our CSV file looks like this:

value1,value2,"value3.1,"",value3.2"
"line
break",Bbb,end


i.e. two rows, each containing three values (note that "line and break" form one value: quoted values can contain line breaks!).

Now let's generate a lexer from the grammar we've created. To do this, download ANTLR v3.2 and put it in the same directory as the CSVLexer.g file. Generating a lexer from our CSVLexer.g file can be done by issuing the following command on the command line of your OS:

java -cp antlr-3.2.jar org.antlr.Tool CSVLexer.g

After doing so, two files have been generated: CSVLexer.java and CSVLexer.tokens. The first one is your actual lexer class that is able to tokenize the source into tokens. You can test this lexer with the following class:
import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // the input source
    String source = 
        "value1,value2,\"value3.1,\"\",value3.2\"" + "\n" + 
        "\"line\nbreak\",Bbb,end";
        
    // create an instance of the lexer
    CSVLexer lexer = new CSVLexer(new ANTLRStringStream(source));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    
    // when using ANTLR v3.3 or v3.4, un-comment the next line:
    //tokens.fill();

    // traverse the tokens and print them to see if the correct tokens are created
    int n = 1;
    for(Object o : tokens.getTokens()) {
      CommonToken token = (CommonToken)o;
      System.out.println("token(" + n + ") = " + token.getText().replace("\n", "\\n"));
      n++;
    }
  }
}

which, after compiling all .java files and running the Main class:

javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main


(on Windows, replace the ':' with a ';' in the last command)

results in the following output:

token(1) = value1
token(2) = ,
token(3) = value2
token(4) = ,
token(5) = "value3.1,"",value3.2"
token(6) = \n
token(7) = "line\nbreak"
token(8) = ,
token(9) = Bbb
token(10) = ,
token(11) = end



Parsing


After the lexer has created the tokens from the source, the tokens are passed to the parser. The parser performs the syntactic analysis. Some source might be lexically correct, but syntactically incorrect. Take the input a\n\nb: it is lexically correct, and will be tokenized as SimpleValue, LineBreak, LineBreak and SimpleValue, but empty lines are not allowed in CSV files, so syntactically the input is not correct.


ANTLR parser


To create an ANTLR parser grammar, we could create a file called CSVParser.g, but in case of rather small grammars, it's easier to create a so called combined grammar in which you can mix lexer- and parser rules. Instead of declaring either lexer grammar ... or parser grammar ... at the start, simply create a file called CSV.g and declare the grammar like this:
grammar CSV;

// ... parser- and lexer rules

in which case a CSVParser.java and CSVLexer.java are automatically generated ("Lexer" and "Parser" are automatically appended to the grammar name).

So rename CSVLexer.g into CSV.g and copy the following into it:
grammar CSV;

file
  :  row+ EOF
  ;

row
  :  value (Comma value)* (LineBreak | EOF)
  ;

value
  :  SimpleValue
  |  QuotedValue
  ;

Comma
  :  ','
  ;

LineBreak
  :  '\r'? '\n'
  |  '\r'
  ;

SimpleValue
  :  ~(',' | '\r' | '\n' | '"')+
  ;

QuotedValue
  :  '"' ('""' | ~'"')* '"'
  ;

As you can see, the four lexer rules are still the same, and there are now three parser rules (which must begin with a lower case letter!). The file rule, which is the entry point of our grammar, simply states that a file is one or more rows followed by the reserved ANTLR keyword EOF (meaning the end-of-file). And a row is one or more values separated by commas, ending with either a line break or the end-of-file.

Now edit Main.java with the following contents:
import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // the input source
    String source = 
        "value1,value2,\"value3.1,\"\",value3.2\"" + "\n" + 
        "\"line\nbreak\",Bbb,end";
    
    // create an instance of the lexer
    CSVLexer lexer = new CSVLexer(new ANTLRStringStream(source));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
        
    // create the parser
    CSVParser parser = new CSVParser(tokens);
    
    // invoke the entry point of our grammar
    parser.file();
  }
}

and generate a lexer and parser from the grammar file:

java -cp antlr-3.2.jar org.antlr.Tool CSV.g

Then compile all Java source files:

javac -cp antlr-3.2.jar *.java

and run the main class:

java -cp .:antlr-3.2.jar Main

(on Windows, replace the ':' with a ';' in the last command)

When running the main class, there shouldn't be any output printed to the console afterwards. This means that the parser didn't find any errors. Go ahead and edit the String source in the Main.java class so that it's no longer valid CSV, like:
String source = "a,\"b,c";

for example, and compile and run it. You will see the following error being printed to the console:

line 1:6 mismatched character '<EOF>' expecting '"'

which is pretty self-explanatory: the parser encountered the EOF while it expected a closing quote.


Actions in an ANTLR grammar


Okay, now that we have an ANTLR grammar that produces a lexer and parser, and we're able to instantiate these classes in our Java test class, it's time to let the parser do some actual work. We'd like the file rule to return a two-dimensional collection of strings. A row would then be a likely candidate to return a one-dimensional collection of strings, and the value rule would, not surprisingly, return a single string. You can let an ANTLR rule return an object by placing returns [AnObject obj] after the rule name. Let's apply that to our parser rules:
file returns [List<List<String>> data]
  :  row+ EOF
  ;

row returns [List<String> row]
  :  value (Comma value)* (LineBreak | EOF)
  ;

value returns [String val]
  :  SimpleValue
  |  QuotedValue
  ;

All the return values (data, row and val) are initialized to null, so we'll have to do a bit of work to assign actual values to them.

Let's start with the value rule. You can embed custom Java code in your grammar by wrapping braces around that code. So, if we wanted to set the value of val to "XYZ", we'd do that like this:
value returns [String val]
  :  SimpleValue {val = "XYZ";}
  |  QuotedValue {val = "XYZ";}
  ;

Notice that I did that twice: once for a SimpleValue and once for a QuotedValue. But of course, what we really want is the contents these tokens matched. You can get their contents in your Java code by typing a $ sign followed by the rule name and invoking the token's .text attribute:
value returns [String val]
  :  SimpleValue {val = $SimpleValue.text;}
  |  QuotedValue {val = $QuotedValue.text;}
  ;

And if you want to strip the start- and end-quotes from QuotedValue, and replace all "" with a single double quote, do something like this:
value returns [String val]
  :  SimpleValue {val = $SimpleValue.text;}
  |  QuotedValue 
     {
       val = $QuotedValue.text;
       val = val.substring(1, val.length()-1); // remove leading- and trailing quotes
       val = val.replace("\"\"", "\""); // replace all `""` with `"`
     }
  ;

Next is the row rule. This one returns a list of strings, and there's a * in there that matches zero or more value rules, so we can't assign a single return value here. At the very start of the rule, we instantiate the List<String> that will be returned inside an @init { ... } block. The code in such init-blocks is executed before anything in the rule itself is matched. After that, we fill the List<String> with the String val from each value rule:
row returns [List<String> list]
@init {list = new ArrayList<String>();}
  :  a=value {list.add($a.val);} (Comma b=value {list.add($b.val);})* (LineBreak | EOF)
  ;

As you can see in the example above, since there are two value sub-rules in there, we can't reference the String val as $value.val: the parser would not know which value sub-rule we mean. So I attached the labels a and b to these sub-rules, and referenced a and b instead.

Lastly, we will also initialize a List<List<String>> for the file rule and add the return values of the one or more row rules to it:
file returns [List<List<String>> data]
@init {data = new ArrayList<List<String>>();}
  :  (row {data.add($row.list);})+ EOF
  ;

For completeness' sake, here's the final CSV grammar:
grammar CSV;

file returns [List<List<String>> data]
@init {data = new ArrayList<List<String>>();}
  :  (row {data.add($row.list);})+ EOF
  ;

row returns [List<String> list]
@init {list = new ArrayList<String>();}
  :  a=value {list.add($a.val);} (Comma b=value {list.add($b.val);})* (LineBreak | EOF)
  ;

value returns [String val]
  :  SimpleValue {val = $SimpleValue.text;}
  |  QuotedValue 
     {
       val = $QuotedValue.text;
       val = val.substring(1, val.length()-1); // remove leading- and trailing quotes
       val = val.replace("\"\"", "\""); // replace all `""` with `"`
     }
  ;

Comma
  :  ','
  ;

LineBreak
  :  '\r'? '\n'
  |  '\r'
  ;

SimpleValue
  :  ~(',' | '\r' | '\n' | '"')+
  ;

QuotedValue
  :  '"' ('""' | ~'"')* '"'
  ;

which can be tested with the class:
import org.antlr.runtime.*;
import java.util.List;

public class Main {
  public static void main(String[] args) throws Exception {
    // the input source
    String source = 
        "aaa,bbb,ccc" + "\n" + 
        "\"d,\"\"d\",eee,fff";
    
    // create an instance of the lexer
    CSVLexer lexer = new CSVLexer(new ANTLRStringStream(source));
        
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
        
    // create the parser
    CSVParser parser = new CSVParser(tokens);
    
    // invoke the entry point of our grammar
    List<List<String>> data = parser.file();
    
    // display the contents of the CSV source
    for(int r = 0; r < data.size(); r++) {
      List<String> row = data.get(r);
      for(int c = 0; c < row.size(); c++) {
        System.out.println("(row=" + (r+1) + ",col=" + (c+1) + ") = " + row.get(c));
      }
    }
  }
}

And after running the main class, the following is printed to the console:

(row=1,col=1) = aaa
(row=1,col=2) = bbb
(row=1,col=3) = ccc
(row=2,col=1) = d,"d
(row=2,col=2) = eee
(row=2,col=3) = fff


which are the expected values.



This concludes our introduction to ANTLR. Next up is part 3 of this tutorial where we'll be looking at the lexical analysis of our TL language.

Continue reading part 3. Lexical analysis of TL.