Demystifying Compiler Principles: From Tokens to AST
Have you ever wondered how your editor "understands" your code? Why do different words appear in different colors? How can tools like ESLint precisely identify issues in your code? Or even, how does Babel transform your JSX code into plain JavaScript and then downgrade it to older versions of JS? If you're curious about the answers to these questions, keep reading!
Compiler principles are a crucial field in computer science, focusing on how high-level programming languages are transformed into machine-executable code. In this process, lexical analysis (Tokenization) and syntactic analysis (AST generation) are two key steps.
- Lexical Analysis: Breaking down the source code string into meaningful Tokens (e.g., keywords, identifiers, operators, etc.).
- Syntactic Analysis: Converting the sequence of Tokens into an Abstract Syntax Tree (AST), a tree-like structure representing the syntactic structure of the code.
This article will guide you through implementing a simple JS lexical analyzer and parser using TypeScript, capable of parsing variable declaration statements, and help you understand how they work.
A Token is the smallest meaningful unit of source code, representing elements like keywords, identifiers, operators, and numbers in a language. For example, in JavaScript, the following code:
```js
let x = 42;
```
Can be broken down into the following Tokens:
- `let`: Keyword
- `x`: Identifier
- `=`: Operator
- `42`: Number
- `;`: Punctuation
The lexical analyzer (Lexer) breaks down the source code string into a sequence of Tokens. Here's how it works:
- Read Characters One by One: The Lexer starts from the first character of the source code, reading characters sequentially.
- Identify Tokens: Based on the character type (letter, digit, symbol, etc.), the Lexer identifies the Token type.
- Generate Tokens: Combines the identified Token type and its corresponding value into a Token object.
Here’s a simple lexical analyzer implemented in TypeScript:
```ts
// Define Token types
enum TokenType {
  Keyword = "Keyword", // Keywords like let
  Identifier = "Identifier", // Identifiers like x
  Number = "Number", // Numbers like 42
  Operator = "Operator", // Operators like =, +
  Punctuation = "Punctuation", // Punctuation like ;
}

// Define Token structure
type Token = {
  type: TokenType;
  value: string;
};

// Simple Lexical Analyzer
function tokenize(code: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;

  // Regex matching rules
  const regex = {
    whitespace: /\s/,
    number: /\d/,
    identifier: /[a-zA-Z_]/, // simplified: digits aren't allowed in names, so `x1` would split into two tokens
    operator: /[=+]/,
    punctuation: /[;]/,
  };

  while (pos < code.length) {
    let char = code[pos];

    // Skip whitespace
    if (regex.whitespace.test(char)) {
      pos++;
      continue;
    }

    // Match numbers
    if (regex.number.test(char)) {
      let num = "";
      while (pos < code.length && regex.number.test(code[pos])) {
        num += code[pos];
        pos++;
      }
      tokens.push({ type: TokenType.Number, value: num });
      continue;
    }

    // Match identifiers or keywords
    if (regex.identifier.test(char)) {
      let id = "";
      // The pos < code.length guard matters: without it, test(code[pos])
      // coerces undefined to the string "undefined" (which matches the
      // regex) and the loop never terminates at the end of input
      while (pos < code.length && regex.identifier.test(code[pos])) {
        id += code[pos];
        pos++;
      }
      // Check if it's a keyword
      if (id === "let") {
        tokens.push({ type: TokenType.Keyword, value: id });
      } else {
        tokens.push({ type: TokenType.Identifier, value: id });
      }
      continue;
    }

    // Match operators
    if (regex.operator.test(char)) {
      tokens.push({ type: TokenType.Operator, value: char });
      pos++;
      continue;
    }

    // Match punctuation
    if (regex.punctuation.test(char)) {
      tokens.push({ type: TokenType.Punctuation, value: char });
      pos++;
      continue;
    }

    // Throw error for unknown characters
    throw new Error(`Unknown character: ${char}`);
  }

  return tokens;
}

// Test the Lexical Analyzer
const code = `
let x = 42;
let y = x + 10;
`;
const tokens = tokenize(code);
console.log(tokens);
```
The code might seem lengthy, but the Token parsing logic is linear and doesn’t involve complex calls. It reads characters one by one and generates corresponding Tokens based on their type. You can copy the code into your editor, run it, and debug it while observing how the data changes. This hands-on approach will help you quickly grasp the flow of data and understand how lexical analysis works.
While debugging, focus on:
- Character Reading and Matching: Observe how `pos` advances and what `char` holds at each step (the quick trace below helps).
- Token Generation: See how the `tokens` array is populated step by step.
- Regex Matching: Verify that each regex correctly identifies its character class.
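If you'd rather watch these values change without a step debugger, a throwaway trace works too. A minimal sketch (paste it as the first line inside `tokenize`'s `while` loop, and delete it when you're done):

```ts
// Temporary debugging aid for the loop in tokenize (not part of the final lexer)
console.log(`pos=${pos}, char=${JSON.stringify(code[pos])}, tokens so far: ${tokens.length}`);
```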
Hands-on practice is the best way to learn programming, so don’t hesitate to try it out!
Running the above code will produce the following `tokens` array:
```js
[
  { type: "Keyword", value: "let" },
  { type: "Identifier", value: "x" },
  { type: "Operator", value: "=" },
  { type: "Number", value: "42" },
  { type: "Punctuation", value: ";" },
  { type: "Keyword", value: "let" },
  { type: "Identifier", value: "y" },
  { type: "Operator", value: "=" },
  { type: "Identifier", value: "x" },
  { type: "Operator", value: "+" },
  { type: "Number", value: "10" },
  { type: "Punctuation", value: ";" }
]
```
An Abstract Syntax Tree (AST) is a tree representation of source code, where each node represents a syntactic structure (e.g., expressions, statements). For example, the following code:
```js
let x = 42;
```
Could have an AST like this:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "x" },
          "init": { "type": "Literal", "value": 42 }
        }
      ],
      "kind": "let"
    }
  ]
}
```
The parser converts a sequence of Tokens into an AST. Here's how it works:
- Recursive Parsing: Reads Tokens one by one and recursively builds the AST based on syntax rules.
- Syntax Rules: Generates corresponding AST nodes based on language syntax rules (e.g., variable declarations, expressions).
Here’s a simple parser implemented in TypeScript:
```ts
// Define AST node types
interface Program {
  type: "Program";
  body: Statement[];
}

interface VariableDeclaration {
  type: "VariableDeclaration";
  declarations: VariableDeclarator[];
  kind: "let" | "const" | "var";
}

interface VariableDeclarator {
  type: "VariableDeclarator";
  id: Identifier;
  init: Expression | null;
}

interface Identifier {
  type: "Identifier";
  name: string;
}

interface Literal {
  type: "Literal";
  value: number | string;
}

interface BinaryExpression {
  type: "BinaryExpression";
  operator: string;
  left: Expression;
  right: Expression;
}

// Define expression types
type Expression = Identifier | Literal | BinaryExpression;

// Define statement types
type Statement = VariableDeclaration;

// Parser
class Parser {
  private tokens: Token[];
  private pos: number = 0;

  constructor(tokens: Token[]) {
    this.tokens = tokens;
  }

  // Get current Token
  private currentToken(): Token {
    return this.tokens[this.pos];
  }

  // Consume current Token
  private consume(type: TokenType): Token {
    const token = this.currentToken();
    if (token.type === type) {
      this.pos++;
      return token;
    } else {
      throw new Error(`Expected token type: ${type}, but got ${token.type}`);
    }
  }

  // Parse the entire program
  public parse(): Program {
    const body: Statement[] = [];
    while (this.pos < this.tokens.length) {
      body.push(this.parseStatement());
    }
    return { type: "Program", body };
  }

  // Parse a statement
  private parseStatement(): Statement {
    if (
      this.currentToken().type === TokenType.Keyword &&
      this.currentToken().value === "let"
    ) {
      return this.parseVariableDeclaration();
    }
    throw new Error(`Unknown statement starting with ${this.currentToken().value}`);
  }

  // Parse a variable declaration
  private parseVariableDeclaration(): VariableDeclaration {
    this.consume(TokenType.Keyword); // Consume 'let'
    const id = this.parseIdentifier();
    this.consume(TokenType.Operator); // Consume '=' (we only check the token type here, for simplicity)
    const init = this.parseExpression();
    this.consume(TokenType.Punctuation); // Consume ';'
    return {
      type: "VariableDeclaration",
      declarations: [
        {
          type: "VariableDeclarator",
          id,
          init,
        },
      ],
      kind: "let",
    };
  }

  // Parse an identifier
  private parseIdentifier(): Identifier {
    const token = this.consume(TokenType.Identifier);
    return { type: "Identifier", name: token.value };
  }

  // Parse an expression (a left-associative chain of binary operations)
  private parseExpression(): Expression {
    let left = this.parsePrimaryExpression();
    // The bounds check guards against running past the last token
    // on malformed input (e.g., a missing ';')
    while (
      this.pos < this.tokens.length &&
      this.currentToken().type === TokenType.Operator &&
      ["+", "-", "*", "/"].includes(this.currentToken().value)
    ) {
      const operator = this.consume(TokenType.Operator).value;
      const right = this.parsePrimaryExpression();
      left = {
        type: "BinaryExpression",
        operator,
        left,
        right,
      };
    }
    return left;
  }

  // Parse a primary expression (identifier or literal)
  private parsePrimaryExpression(): Expression {
    const token = this.currentToken();
    if (token.type === TokenType.Identifier) {
      return this.parseIdentifier();
    } else if (token.type === TokenType.Number) {
      const value = Number(this.consume(TokenType.Number).value);
      return { type: "Literal", value };
    }
    throw new Error(`Unknown primary expression: ${token.value}`);
  }
}

// Test the Parser
const parser = new Parser(tokens);
const ast = parser.parse();
console.log(JSON.stringify(ast, null, 2));
```
This code is relatively simple, handling only variable declarations. However, a declaration's initializer can take multiple forms: a direct literal (e.g., `let x = 42;`) or a more complex expression (e.g., `let y = x + 10;`). Our implementation handles both cases.
While learning this code, focus on:
- AST Node Types: We define various AST node types like `Program` (the root node), `VariableDeclaration`, and `Identifier`. These node types form the basic structure of the AST and are key to understanding syntax trees.
- Recursive Parsing: We recursively parse the Token sequence and convert it into an AST. Recursive parsing is the core idea of syntactic analysis, elegantly handling nested syntactic structures (e.g., operations within expressions); see the associativity example below.
- Syntax Rules: The code strictly follows JavaScript syntax rules, such as the format of variable declarations (`let variable = expression;`) and the expression parsing logic (e.g., binary operations like `x + 10`). Understanding these rules helps you grasp how Tokens are mapped to AST nodes.
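To make the recursive-parsing point concrete: because `parseExpression` folds each newly seen operator into `left`, operators associate to the left. Feeding `let z = 1 + 2 + 3;` through the pipeline gives `z` an `init` shaped like this:

```json
{
  "type": "BinaryExpression",
  "operator": "+",
  "left": {
    "type": "BinaryExpression",
    "operator": "+",
    "left": { "type": "Literal", "value": 1 },
    "right": { "type": "Literal", "value": 2 }
  },
  "right": { "type": "Literal", "value": 3 }
}
```

That is `(1 + 2) + 3`. Two caveats worth noticing: the parser checks for `+`, `-`, `*`, and `/`, but our lexer's operator regex only covers `=` and `+`, so you'd widen it to something like `/[=+\-*/]/` before testing other operators; and every operator gets the same precedence here, so `1 + 2 * 3` would also parse as `(1 + 2) * 3`. Real parsers layer precedence rules on top of this loop.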
By running this code, you can see how variable declaration statements are transformed from Token sequences into ASTs. Try running the code and debugging it to observe how the data changes at each step. This will deepen your understanding of syntactic analysis.
Running the above code will produce the following `ast`:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "x" },
          "init": { "type": "Literal", "value": 42 }
        }
      ],
      "kind": "let"
    },
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "y" },
          "init": {
            "type": "BinaryExpression",
            "operator": "+",
            "left": { "type": "Identifier", "name": "x" },
            "right": { "type": "Literal", "value": 10 }
          }
        }
      ],
      "kind": "let"
    }
  ]
}
```
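One detail worth noticing in this output: `declarations` is an array, yet our parser always emits exactly one element. As a sketch of a first extension, here is a hypothetical drop-in replacement for the `parseVariableDeclaration` method that handles comma-separated declarators like `let a = 1, b = 2;` (it assumes you also widen the lexer's punctuation regex to `/[;,]/` so commas become tokens):

```ts
// Sketch: parseVariableDeclaration with support for `let a = 1, b = 2;`.
// Assumes the lexer now emits ',' as a Punctuation token.
private parseVariableDeclaration(): VariableDeclaration {
  this.consume(TokenType.Keyword); // Consume 'let'
  const declarations: VariableDeclarator[] = [];
  while (true) {
    const id = this.parseIdentifier();
    this.consume(TokenType.Operator); // Consume '='
    const init = this.parseExpression();
    declarations.push({ type: "VariableDeclarator", id, init });
    if (this.currentToken().value !== ",") break; // no more declarators
    this.consume(TokenType.Punctuation); // Consume ',' and loop again
  }
  this.consume(TokenType.Punctuation); // Consume ';'
  return { type: "VariableDeclaration", declarations, kind: "let" };
}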
You can extend the lexical analyzer and parser to support more syntax features, such as functions and conditional statements. For example, try parsing the following code:
```js
if (x > 0) {
  return x;
}
```
I won’t provide the code here, but you can try implementing it yourself.
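That said, here is a nudge to get you started: the extra AST node types you'd likely need, loosely following ESTree naming. These interfaces are a suggestion for the exercise, not part of the code above:

```ts
// Hypothetical node types for the exercise (names follow the ESTree
// convention; the exact shapes are up to you).
interface BlockStatement {
  type: "BlockStatement";
  body: Statement[];
}

interface ReturnStatement {
  type: "ReturnStatement";
  argument: Expression | null;
}

interface IfStatement {
  type: "IfStatement";
  test: Expression; // e.g., a BinaryExpression for `x > 0`
  consequent: BlockStatement; // the `{ ... }` branch
  alternate: Statement | null; // an optional `else` branch
}
```

You'd also need to teach the lexer about the `if` and `return` keywords, parentheses, braces, and the `>` operator.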
Existing tools like Babel and Acorn can generate and manipulate ASTs. For example, using Babel to parse JavaScript code:
```js
const parser = require("@babel/parser");

const code = `let x = 42;`;
const ast = parser.parse(code, { sourceType: "module" });
console.log(ast);
```
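Parsing is only half the story; the payoff is transforming the tree and printing it back out. A minimal sketch continuing the snippet above, using `@babel/traverse` and `@babel/generator` (the rename below is just an illustration):

```js
const traverse = require("@babel/traverse").default;
const generate = require("@babel/generator").default;

// Rename every identifier `x` to `answer`, then regenerate source code
// from the modified AST.
traverse(ast, {
  Identifier(path) {
    if (path.node.name === "x") {
      path.node.name = "answer";
    }
  },
});

console.log(generate(ast).code); // let answer = 42;
```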
ASTs have wide-ranging applications in real-world development, such as:
- Code Formatting: Reprinting code in a consistent style by walking its AST (e.g., Prettier).
- Code Transformation: Converting one syntax to another, like transforming ES6 Class syntax to ES5 (e.g., Babel).
- Static Analysis: Analyzing code structure and dependencies to catch problems without running the code (e.g., ESLint); the toy rule below shows the idea.
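To see how static analysis falls out of everything we built, here is a toy "lint rule" over the `Program` returned by our own `Parser` (purely illustrative, nothing like ESLint's real architecture): it flags single-character variable names.

```ts
// Toy static analysis over our Program AST: warn on one-letter variable names.
function lintShortNames(program: Program): string[] {
  const warnings: string[] = [];
  for (const stmt of program.body) {
    if (stmt.type === "VariableDeclaration") {
      for (const decl of stmt.declarations) {
        if (decl.id.name.length === 1) {
          warnings.push(`Variable "${decl.id.name}" has a very short name`);
        }
      }
    }
  }
  return warnings;
}

// Running it on the `ast` produced by our Parser test earlier:
console.log(lintShortNames(ast));
// [ 'Variable "x" has a very short name', 'Variable "y" has a very short name' ]
```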