Day 1 with Esprima

Esprima is a JS parser that outputs an AST. There are plenty of articles on parsers and ASTs. Today, I'm going to take a look at how Esprima builds the AST.

What is 2+2?

2+2
{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "BinaryExpression",
                "operator": "+",
                "left": {
                    "type": "Literal",
                    "value": 2,
                    "raw": "2"
                },
                "right": {
                    "type": "Literal",
                    "value": 2,
                    "raw": "2"
                }
            }
        }
    ]
}

Lets look at two things:

  1. How are Binary Expressions parsed
  2. How did Esprima determine "2+2" was a binary expression?

1. How are Binary Expressions parsed?

Here's a simplified version of the binary expression parser.

Order of operations:

  1. get left value (1)
  2. get operator (+)
  3. get right value (2)
  4. return a new expression node
function parseBinaryExpression() {
    // get left value
    var marker = lookahead;
    var left = inheritCoverGrammar(parseUnaryExpression);

    // get operator
    var oparator = lookahead;
    var prec = binaryPrecedence(oparator, state.allowIn);
    oparator.prec = prec;

    // get right value
    var right = isolateCoverGrammar(parseUnaryExpression);

    // return node
    var node = new WrappingNode(marker)
    node.finishBinaryExpression(operator, left, right);
    return node;
}

Lets take a closer look at how the left value is fetched. By the way, left is actually quite simple. We're basically looking at the literal number 2.

{
    "type": "Literal",
    "value": 2,
    "raw": "2"
}

Here's the code that finds the left value.

marker = lookahead;
left = inheritCoverGrammar(parseUnaryExpression);

The marker (lookahead) value is the token that we'll parse. json { "type": 6, "value": 2, "lineNumber": 1, "lineStart": 0, "start": 0, "end": 1 }

The inherit cover grammar... parse unary expression... business is Esprima's strategy for parsing the left value. We'll see in a second that, this expands out really quickly as Esprima searches for what the left value really is.

inheritCoverGrammar(parseUnaryExpression);

Here's Esprima's call stack as it searches for the left value's value.

.9. finishLiteral
.8. parsePrimaryExpression (esprima.js:3155)
.7. inheritCoverGrammar (esprima.js:2610)
.6. parseLeftHandSideExpressionAllowCall (esprima.js:3324)
.5. inheritCoverGrammar (esprima.js:2610)
.4. parsePostfixExpression (esprima.js:3398)
.3. parseUnaryExpression (esprima.js:3427)
.2. inheritCoverGrammar (esprima.js:2610)
.1. parseBinaryExpression (esprima.js:3547)

Here's how Esprima narrow's the search:

  1. parseUnaryExpression
  2. parsePostfixExpression
  3. parseLeftHandSideExpressionAllowCall
  4. parsePrimaryExpression
  5. finishLiteral

parsePrimaryExpression finally figures out that our left value is a literal value.

The way it does it is really neat. There's a specific order it checks things:

  1. parenthesis, bracket, or curly
  2. identifiers
  3. literals
  4. keywords like function, this, or class
  5. give up and throw an error like this Uncaught Error: Line 1: Unexpected token ILLEGAL
function parsePrimaryExpression() {
    if (match('(')) return inheritCoverGrammar(parseGroupExpression);
    if (match('[')) return inheritCoverGrammar(parseArrayInitialiser);
    if (match('{')) return inheritCoverGrammar(parseObjectInitialiser);

    type = lookahead.type;
    if (type === Token.Identifier) {
        return node.finishIdentifier(lex().value);

    } else if (type === Token.StringLiteral || type === Token.NumericLiteral) {
        return node.finishLiteral(lex());

    } else if (type === Token.Keyword) {
        if (matchKeyword('function')) return parseFunctionExpression();
        if (matchKeyword('this')) return node.finishThisExpression();
        if (matchKeyword('class')) return parseClassExpression();

    throwUnexpectedToken(lex());
}

2. How did Esprima determine "2+2" was a binary expression?

Alright, so we now know a little bit about Esprima parsed the "2+2" expression. Let's see how Esprima figured out 2+2 was in fact a binary expression.

Here's the call stack leading up to parseBinaryExpression


11. parseBinaryExpression (esprima.js:3593)
10. inheritCoverGrammar (esprima.js:2610)
9. parseConditionalExpression (esprima.js:3604)
8. parseAssignmentExpression (esprima.js:3800)
7. isolateCoverGrammar (esprima.js:2592)
6. parseExpression (esprima.js:3844)
5. parseStatement (esprima.js:4639)
4. parseStatementListItem (esprima.js:3888)
3. parseScriptBody (esprima.js:5319)
2. parseProgram (esprima.js:5335)
1. parse (esprima.js:5520)

The first three steps setup the state for parsing statements. Statements and expressions are two of the main guys, so that's where it gets fun.

  1. parse
  2. parseProgram
  3. parseScriptBody
  4. parseStatementListItem

parseStatementListItem basically says, if the statement begins with a keyword, parse it, otherwise parse a plain vanilla statement.

function parseStatementListItem() {
  if (lookahead.type === Token.Keyword) {
    switch (lookahead.value) {
    case 'export': return parseExportDeclaration();
    case 'import': return parseImportDeclaration();
    case 'let': return parseLexicalDeclaration({inFor: false});
    case 'function': return parseFunctionDeclaration(new Node());
    case 'class': return parseClassDeclaration();
    }
  }

  return parseStatement();
}

and parseStatement is where the really good stuff happens

Order of operations

  1. look for punctuation
  2. look for keywords
  3. treat it as an expression
  4. along the way, swallow the trailing semicolon if it's there (yum)

Our statement in this case, is just the expression 2+2, so we skip the punctuation and the keywords and jump directly to parseExpression.

function parseStatement() {
  if (type === Token.Punctuator) {
      switch (lookahead.value) {
      case ';': return parseEmptyStatement(node);
      case '(': return parseExpressionStatement(node);
      default: break;
      }
  } else if (type === Token.Keyword) {
      switch (lookahead.value) {
      case 'break': return parseBreakStatement(node);
      case 'continue': return parseContinueStatement(node);
      case 'debugger': return parseDebuggerStatement(node);
      case 'do': return parseDoWhileStatement(node);
      case 'for': return parseForStatement(node);
      case 'function': return parseFunctionDeclaration(node);
      case 'if': return parseIfStatement(node);
      case 'return': return parseReturnStatement(node);
      case 'switch': return parseSwitchStatement(node);
      case 'throw': return parseThrowStatement(node);
      case 'try': return parseTryStatement(node);
      case 'var': return parseVariableStatement(node);
      case 'while': return parseWhileStatement(node);
      case 'with': return parseWithStatement(node);
      default: break;
      }
  }

  var expr = parseExpression();
  consumeSemicolon();
  return node.finishExpressionStatement(expr);
}

If you expect parseExpression to be meaty, then you haven't internalized the zen of a parser. Looking at parse expression, it does two things of note:

  1. it looks to parse an assignment expression
  2. it checks to see if there's actually a list of assignment expressions (think var a = 2, b= 3, c)
function parseExpressions() {
  var expressions = [];
  while (startIndex < length) {
    if (!match(',')) break;
    lex();
    expressions.push(isolateCoverGrammar(parseAssignmentExpression));
  }

  return new WrappingNode(startToken).finishSequenceExpression(expressions);
}

function parseExpression() {
  return match(',')
    ? parseExpressions()
    : isolateCoverGrammar(parseAssignmentExpression);
}

The last two steps parseAssignmentExpression and parseConditionalExpression don't give us too many clues as to how Esprima knew our expression was a binary expression. What I gather here is that, when presented with an expression, Esprima will parse it as an assignment, conditional, binary expression in that order. I'm not sure why that is and certainly there are exceptions to that rule, but here it looks like that's the case.

Here's a small section of the these two function's call sites.

function parseAssignmentExpression() {
  var expr = parseConditionalExpression();
  return expr;
}

function parseConditionalExpression() {
    var expr = inheritCoverGrammar(parseBinaryExpression);
    return expr;
}

Closing

I think it's fun looking into how "2+2" is interpreted. We asked two basic questions, how are Binary Expressions parsed and how did Esprima determine "2+2" was a binary expression? In both cases, the answers yielded tons of information about how Esprima thinks about javascript.

I think this investigation raised more questions than it answered. In subsequent posts, I'd like to look into how esprima deals with tokens, markers, and look aheads. And in that vein, what the whole inheritCoverGrammar business is about.

There are also plenty of interesting meta questions we can ask about how Esprima handles state and shares closure data.

Ofcourse, we can also just ask esprima to parse something slightly more complicated like (var a = 2+2 or function sum(a,b) {return a + b;}).

NOTE: All of the credit goes to the Esprima team
for writing readable code. Also, please forgive me
for simplifying the code examples and perhaps butchering
some of the explanations.

16 Jul 2015