Day 2 with Esprima

Yesterday, we looked at how Esprima parses 2+2. Today, we’re going to go one step further and see how Esprima parses sum.

function sum(a,b) {
    return a + b;
}

Like yesterday, this post is just a polished version of my notes. The transitions will be a bit rough and there are probably some incomplete thoughts, but hopefully you’ll be able to see how Esprima goes about the work of parsing a function.

How do you add two numbers?

So lets start from the beginning.

esprima.parse("function sum(a,b) {return a + b;}")

When you parse the sum function you get an output like this:

{
    "type": "Program",
    "body": [
        {
            "type": "FunctionDeclaration",
            "id": {
                "type": "Identifier",
                "name": "sum"
            },
            "params": [
                {
                    "type": "Identifier",
                    "name": "a"
                },
                {
                    "type": "Identifier",
                    "name": "b"
                }
            ],
            "defaults": [],
            "body": {
                "type": "BlockStatement",
                "body": [
                    {
                        "type": "ReturnStatement",
                        "argument": {
                            "type": "BinaryExpression",
                            "operator": "+",
                            "left": {
                                "type": "Identifier",
                                "name": "a"
                            },
                            "right": {
                                "type": "Identifier",
                                "name": "b"
                            }
                        }
                    }
                ]
            },
            "generator": false,
            "expression": false
        }
    ]
}

Notes:

  • FunctionDeclaration - declarations take the form function foo() {}
  • type.name is set to sum
  • params are an array of identifiers (a, b)
  • body is a ReturnStatement with an argument BinaryExpression.

Call Stack

Alright, now that we see the output, lets take a look at how it was generated. In this case, if we set a breakpoint at parseFunctionDeclaration the call stack is pretty simple.

5. parseFunctionDeclaration
4. parseStatementListItem
3. parseScriptBody
2. parseProgram
1. parse

This shows us that, the function declaration is just a StatementItem found by parseStatementListItem.

function parseStatementListItem() {
  if (lookahead.type === Token.Keyword) {
    switch (lookahead.value) {
    case 'export': return parseExportDeclaration();
    case 'import': return parseImportDeclaration();
    case 'let': return parseLexicalDeclaration({inFor: false});
    case 'function': return parseFunctionDeclaration(new Node());
    case 'class': return parseClassDeclaration();
    }
  }

  return parseStatement();
}

Looking at parseStatementListItem, all it’s doing is comparing the lookahead’s value to different types. It’s not surprising that it finds the function declaration.

How is a function parsed?

Here’s a simplified version of parseFunctionDeclaration We’ll look at how four things are handled: (names, params, strict mode, body).

function parseFunctionDeclaration(node) {

    if (!match('(')) {
        token = lookahead;
        var id = parseVariableIdentifier();
    }

    var params = parseParams();

    var previousStrict = strict;
    var body = parseFunctionSourceElements();
    strict = previousStrict;

    return node.finishFunctionDeclaration(id, params, body);
}

How is the function name parsed?

Function names are parsed the same way other variable identifiers are. There are a couple edge cases to watch out for like yield and strict mode, but basically Esprima lexes, finds the next token and returns the node.

function parseVariableIdentifier() {
    var token = lex();
    if (token.type !== Token.Identifier) throwUnexpectedToken(token);
    return node.finishIdentifier(token.value);
}

How are function params parsed?

When we’re parsing params function sum (a,b) {return a + b}, we’re looking for the a and b identifiers inside of the parens. We also know that they’re separated by commas.

Parse params order of operations is as follows:

  1. first check for a ‘(‘
  2. check for the second ‘)’ if it finds it return early
  3. parse params
function parseParams() {
    options = { params: [] };
    expect('(');

    if (match(')')) return options;
    while (parseParam(options));

    return options;
}

Are you surprised that parseParams delegates responsiblity to parseParam? You shouldn’t be, parsers will always break the problem into the smallest possible unit!

Findings:

  • Params are just variable identifiers. Remember function names? Same thing!
  • options is a variable that is shared between parseParams and parseParam. Interesting choice sir…
  • parseParam checks for the last paren ).

function parseParam(param) {
    token = lookahead;
    var param = parseVariableIdentifier();
    options.params.push(param);

    return !match(')');
}

What’s going on with strict modes?

When parseFunctionDeclaration parses the body it first saves the current strict mode. It then, parses the function body and resets the strict mode.

function parseFunctionDeclaration(node) {

    if (!match('(')) {
        token = lookahead;
        var id = parseVariableIdentifier();
    }

    var params = parseParams();

    var previousStrict = strict;
    var body = parseFunctionSourceElements();
    strict = previousStrict;

    return node.finishFunctionDeclaration(id, params, body);
}

The reason for this is pretty cool. Remember that each function block can define it’s own strict mode preference, well, by keeping a reference to the wrapping block’s strict mode preference we can make sure it doesn’t get clobbered.

function() { 'use strict' }

How is the function body parsed?

parseFunctionSourceElements will do a couple things to set specific function body fields, but the gist of it is parsing statements, which is a lot like parsing a JS script body.

Order of operations:

  • check for a left curly {
  • start parsing statements and checking for the right curly }
function parseFunctionSourceElements() {
    expect('{');
    var body = [];

    while (startIndex < length) {
        token = lookahead;
        if (match('}')) break;

        body.push(parseStatementListItem());
    }

    return node.finishBlockStatement(body);
}

You might expect us to jump into how Esprima parses a statement, but I promise you that if we do that then this will be the blog post that never ends. In the spirit of the parser, if you’re curious, I point you to yesterday’s post on how 2+2 is parsed. If you’ve seen one statement parsed, you’ve seen them all :) Not really, but ya know we got to save something for tomorrow.

Overview

Alright, so that finishes today’s journey into how sum is parsed.

 function sum(a,b) {return a + b;}

We looked at how sum’s name, params, and body were parsed. Along the way, we saw the importance of breaking a big problem down into small pieces and the value of always checking for small grammatical hints (“(“, “)”, “,”, “{“, “}”).

17 Jul 2015