Demystifying Compiler Principles: From Tokens to AST
Have you ever wondered how your editor "understands" your code? Why do different words appear in different colors? How can tools like ESLint precisely identify issues in your code? Or even, how does Babel transform your JSX code into plain JavaScript and then downgrade it to older versions of JS? If you're curious about the answers to these questions, keep reading!
Compiler principles are a crucial field in computer science, focusing on how high-level programming languages are transformed into machine-executable code. In this process, lexical analysis (Tokenization) and syntactic analysis (AST generation) are two key steps.
- Lexical Analysis: Breaking down the source code string into meaningful Tokens (e.g., keywords, identifiers, operators, etc.).
- Syntactic Analysis: Converting the sequence of Tokens into an Abstract Syntax Tree (AST), a tree-like structure representing the syntactic structure of the code.
This article will guide you through implementing a simple JS lexical analyzer and parser using TypeScript, capable of parsing variable declaration statements, and help you understand how they work.
A Token is the smallest meaningful unit of source code, representing elements like keywords, identifiers, operators, and numbers in a language. For example, in JavaScript, the following code:
```js
let x = 42;
```
Can be broken down into the following Tokens:
- `let`: Keyword
- `x`: Identifier
- `=`: Operator
- `42`: Number
- `;`: Punctuation
The lexical analyzer (Lexer) breaks down the source code string into a sequence of Tokens. Here's how it works:
- Read Characters One by One: The Lexer starts from the first character of the source code, reading characters sequentially.
- Identify Tokens: Based on the character type (letter, digit, symbol, etc.), the Lexer identifies the Token type.
- Generate Tokens: Combines the identified Token type and its corresponding value into a Token object.
Here’s a simple lexical analyzer implemented in TypeScript:
```ts
// Define Token types
enum TokenType {
  Keyword = "Keyword", // Keywords like let
  Identifier = "Identifier", // Identifiers like x
  Number = "Number", // Numbers like 42
  Operator = "Operator", // Operators like =, +
  Punctuation = "Punctuation", // Punctuation like ;
}

// Define Token structure
type Token = {
  type: TokenType;
  value: string;
};

// Simple Lexical Analyzer
function tokenize(code: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;

  // Regex matching rules
  const regex = {
    whitespace: /\s/,
    number: /\d/,
    identifier: /[a-zA-Z_]/, // simplified: digits aren't allowed in names, so `x1` would split into two tokens
    operator: /[=+]/,
    punctuation: /[;]/,
  };

  while (pos < code.length) {
    let char = code[pos];

    // Skip whitespace
    if (regex.whitespace.test(char)) {
      pos++;
      continue;
    }

    // Match numbers
    if (regex.number.test(char)) {
      let num = "";
      while (pos < code.length && regex.number.test(code[pos])) {
        num += code[pos];
        pos++;
      }
      tokens.push({ type: TokenType.Number, value: num });
      continue;
    }

    // Match identifiers or keywords
    if (regex.identifier.test(char)) {
      let id = "";
      // The pos < code.length guard matters: without it, test(code[pos])
      // coerces undefined to the string "undefined" (which matches the
      // regex) and the loop never terminates at the end of input
      while (pos < code.length && regex.identifier.test(code[pos])) {
        id += code[pos];
        pos++;
      }
      // Check if it's a keyword
      if (id === "let") {
        tokens.push({ type: TokenType.Keyword, value: id });
      } else {
        tokens.push({ type: TokenType.Identifier, value: id });
      }
      continue;
    }

    // Match operators
    if (regex.operator.test(char)) {
      tokens.push({ type: TokenType.Operator, value: char });
      pos++;
      continue;
    }

    // Match punctuation
    if (regex.punctuation.test(char)) {
      tokens.push({ type: TokenType.Punctuation, value: char });
      pos++;
      continue;
    }

    // Throw error for unknown characters
    throw new Error(`Unknown character: ${char}`);
  }

  return tokens;
}

// Test the Lexical Analyzer
const code = `
let x = 42;
let y = x + 10;
`;
const tokens = tokenize(code);
console.log(tokens);
```
The code might seem lengthy, but the Token parsing logic is linear and doesn’t involve complex calls. It reads characters one by one and generates corresponding Tokens based on their type. You can copy the code into your editor, run it, and debug it while observing how the data changes. This hands-on approach will help you quickly grasp the flow of data and understand how lexical analysis works.
While debugging, focus on:
- Character Reading and Matching: Observe how `pos` advances and what `char` holds at each step (the quick trace below helps).
- Token Generation: See how the `tokens` array is populated step by step.
- Regex Matching: Verify that each regex correctly identifies its character class.
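If you'd rather watch these values change without a step debugger, a throwaway trace works too. A minimal sketch (paste it as the first line inside `tokenize`'s `while` loop, and delete it when you're done):

```ts
// Temporary debugging aid for the loop in tokenize (not part of the final lexer)
console.log(`pos=${pos}, char=${JSON.stringify(code[pos])}, tokens so far: ${tokens.length}`);
```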
Hands-on practice is the best way to learn programming, so don’t hesitate to try it out!
Running the above code will produce the following `tokens` array:
```js
[
  { type: "Keyword", value: "let" },
  { type: "Identifier", value: "x" },
  { type: "Operator", value: "=" },
  { type: "Number", value: "42" },
  { type: "Punctuation", value: ";" },
  { type: "Keyword", value: "let" },
  { type: "Identifier", value: "y" },
  { type: "Operator", value: "=" },
  { type: "Identifier", value: "x" },
  { type: "Operator", value: "+" },
  { type: "Number", value: "10" },
  { type: "Punctuation", value: ";" }
]
```
An Abstract Syntax Tree (AST) is a tree representation of source code, where each node represents a syntactic structure (e.g., expressions, statements). For example, the following code:
```js
let x = 42;
```
Could have an AST like this:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "x" },
          "init": { "type": "Literal", "value": 42 }
        }
      ],
      "kind": "let"
    }
  ]
}
```
The parser converts a sequence of Tokens into an AST. Here's how it works:
- Recursive Parsing: Reads Tokens one by one and recursively builds the AST based on syntax rules.
- Syntax Rules: Generates corresponding AST nodes based on language syntax rules (e.g., variable declarations, expressions).
Here’s a simple parser implemented in TypeScript:
```ts
// Define AST node types
interface Program {
  type: "Program";
  body: Statement[];
}

interface VariableDeclaration {
  type: "VariableDeclaration";
  declarations: VariableDeclarator[];
  kind: "let" | "const" | "var";
}

interface VariableDeclarator {
  type: "VariableDeclarator";
  id: Identifier;
  init: Expression | null;
}

interface Identifier {
  type: "Identifier";
  name: string;
}

interface Literal {
  type: "Literal";
  value: number | string;
}

interface BinaryExpression {
  type: "BinaryExpression";
  operator: string;
  left: Expression;
  right: Expression;
}

// Define expression types
type Expression = Identifier | Literal | BinaryExpression;

// Define statement types
type Statement = VariableDeclaration;

// Parser
class Parser {
  private tokens: Token[];
  private pos: number = 0;

  constructor(tokens: Token[]) {
    this.tokens = tokens;
  }

  // Get current Token
  private currentToken(): Token {
    return this.tokens[this.pos];
  }

  // Consume current Token
  private consume(type: TokenType): Token {
    const token = this.currentToken();
    if (token.type === type) {
      this.pos++;
      return token;
    } else {
      throw new Error(`Expected token type: ${type}, but got ${token.type}`);
    }
  }

  // Parse the entire program
  public parse(): Program {
    const body: Statement[] = [];
    while (this.pos < this.tokens.length) {
      body.push(this.parseStatement());
    }
    return { type: "Program", body };
  }

  // Parse a statement
  private parseStatement(): Statement {
    if (
      this.currentToken().type === TokenType.Keyword &&
      this.currentToken().value === "let"
    ) {
      return this.parseVariableDeclaration();
    }
    throw new Error(`Unknown statement starting with ${this.currentToken().value}`);
  }

  // Parse a variable declaration
  private parseVariableDeclaration(): VariableDeclaration {
    this.consume(TokenType.Keyword); // Consume 'let'
    const id = this.parseIdentifier();
    this.consume(TokenType.Operator); // Consume '=' (we only check the token type here, for simplicity)
    const init = this.parseExpression();
    this.consume(TokenType.Punctuation); // Consume ';'
    return {
      type: "VariableDeclaration",
      declarations: [
        {
          type: "VariableDeclarator",
          id,
          init,
        },
      ],
      kind: "let",
    };
  }

  // Parse an identifier
  private parseIdentifier(): Identifier {
    const token = this.consume(TokenType.Identifier);
    return { type: "Identifier", name: token.value };
  }

  // Parse an expression (a left-associative chain of binary operations)
  private parseExpression(): Expression {
    let left = this.parsePrimaryExpression();
    // The bounds check guards against running past the last token
    // on malformed input (e.g., a missing ';')
    while (
      this.pos < this.tokens.length &&
      this.currentToken().type === TokenType.Operator &&
      ["+", "-", "*", "/"].includes(this.currentToken().value)
    ) {
      const operator = this.consume(TokenType.Operator).value;
      const right = this.parsePrimaryExpression();
      left = {
        type: "BinaryExpression",
        operator,
        left,
        right,
      };
    }
    return left;
  }

  // Parse a primary expression (identifier or literal)
  private parsePrimaryExpression(): Expression {
    const token = this.currentToken();
    if (token.type === TokenType.Identifier) {
      return this.parseIdentifier();
    } else if (token.type === TokenType.Number) {
      const value = Number(this.consume(TokenType.Number).value);
      return { type: "Literal", value };
    }
    throw new Error(`Unknown primary expression: ${token.value}`);
  }
}

// Test the Parser
const parser = new Parser(tokens);
const ast = parser.parse();
console.log(JSON.stringify(ast, null, 2));
```
This code is relatively simple, handling only variable declarations. However, a declaration's initializer can take multiple forms: a direct literal (e.g., `let x = 42;`) or a more complex expression (e.g., `let y = x + 10;`). Our implementation handles both cases.
While learning this code, focus on:
- AST Node Types: We define various AST node types like `Program` (the root node), `VariableDeclaration`, and `Identifier`. These node types form the basic structure of the AST and are key to understanding syntax trees.
- Recursive Parsing: We recursively parse the Token sequence and convert it into an AST. Recursive parsing is the core idea of syntactic analysis, elegantly handling nested syntactic structures (e.g., operations within expressions); see the associativity example below.
- Syntax Rules: The code strictly follows JavaScript syntax rules, such as the format of variable declarations (`let variable = expression;`) and the expression parsing logic (e.g., binary operations like `x + 10`). Understanding these rules helps you grasp how Tokens are mapped to AST nodes.
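To make the recursive-parsing point concrete: because `parseExpression` folds each newly seen operator into `left`, operators associate to the left. Feeding `let z = 1 + 2 + 3;` through the pipeline gives `z` an `init` shaped like this:

```json
{
  "type": "BinaryExpression",
  "operator": "+",
  "left": {
    "type": "BinaryExpression",
    "operator": "+",
    "left": { "type": "Literal", "value": 1 },
    "right": { "type": "Literal", "value": 2 }
  },
  "right": { "type": "Literal", "value": 3 }
}
```

That is `(1 + 2) + 3`. Two caveats worth noticing: the parser checks for `+`, `-`, `*`, and `/`, but our lexer's operator regex only covers `=` and `+`, so you'd widen it to something like `/[=+\-*/]/` before testing other operators; and every operator gets the same precedence here, so `1 + 2 * 3` would also parse as `(1 + 2) * 3`. Real parsers layer precedence rules on top of this loop.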
By running this code, you can see how variable declaration statements are transformed from Token sequences into ASTs. Try running the code and debugging it to observe how the data changes at each step. This will deepen your understanding of syntactic analysis.
Running the above code will produce the following `ast`:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "x" },
          "init": { "type": "Literal", "value": 42 }
        }
      ],
      "kind": "let"
    },
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "y" },
          "init": {
            "type": "BinaryExpression",
            "operator": "+",
            "left": { "type": "Identifier", "name": "x" },
            "right": { "type": "Literal", "value": 10 }
          }
        }
      ],
      "kind": "let"
    }
  ]
}
```
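One detail worth noticing in this output: `declarations` is an array, yet our parser always emits exactly one element. As a sketch of a first extension, here is a hypothetical drop-in replacement for the `parseVariableDeclaration` method that handles comma-separated declarators like `let a = 1, b = 2;` (it assumes you also widen the lexer's punctuation regex to `/[;,]/` so commas become tokens):

```ts
// Sketch: parseVariableDeclaration with support for `let a = 1, b = 2;`.
// Assumes the lexer now emits ',' as a Punctuation token.
private parseVariableDeclaration(): VariableDeclaration {
  this.consume(TokenType.Keyword); // Consume 'let'
  const declarations: VariableDeclarator[] = [];
  while (true) {
    const id = this.parseIdentifier();
    this.consume(TokenType.Operator); // Consume '='
    const init = this.parseExpression();
    declarations.push({ type: "VariableDeclarator", id, init });
    if (this.currentToken().value !== ",") break; // no more declarators
    this.consume(TokenType.Punctuation); // Consume ',' and loop again
  }
  this.consume(TokenType.Punctuation); // Consume ';'
  return { type: "VariableDeclaration", declarations, kind: "let" };
}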
You can extend the lexical analyzer and parser to support more syntax features, such as functions and conditional statements. For example, try parsing the following code:
```js
if (x > 0) {
  return x;
}
```
I won’t provide the code here, but you can try implementing it yourself.
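That said, here is a nudge to get you started: the extra AST node types you'd likely need, loosely following ESTree naming. These interfaces are a suggestion for the exercise, not part of the code above:

```ts
// Hypothetical node types for the exercise (names follow the ESTree
// convention; the exact shapes are up to you).
interface BlockStatement {
  type: "BlockStatement";
  body: Statement[];
}

interface ReturnStatement {
  type: "ReturnStatement";
  argument: Expression | null;
}

interface IfStatement {
  type: "IfStatement";
  test: Expression; // e.g., a BinaryExpression for `x > 0`
  consequent: BlockStatement; // the `{ ... }` branch
  alternate: Statement | null; // an optional `else` branch
}
```

You'd also need to teach the lexer about the `if` and `return` keywords, parentheses, braces, and the `>` operator.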
Existing tools like Babel and Acorn can generate and manipulate ASTs. For example, using Babel to parse JavaScript code:
```js
const parser = require("@babel/parser");

const code = `let x = 42;`;
const ast = parser.parse(code, { sourceType: "module" });
console.log(ast);
```
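Parsing is only half the story; the payoff is transforming the tree and printing it back out. A minimal sketch continuing the snippet above, using `@babel/traverse` and `@babel/generator` (the rename below is just an illustration):

```js
const traverse = require("@babel/traverse").default;
const generate = require("@babel/generator").default;

// Rename every identifier `x` to `answer`, then regenerate source code
// from the modified AST.
traverse(ast, {
  Identifier(path) {
    if (path.node.name === "x") {
      path.node.name = "answer";
    }
  },
});

console.log(generate(ast).code); // let answer = 42;
```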
ASTs have wide-ranging applications in real-world development, such as:
- Code Formatting: Reprinting code in a consistent style by walking its AST (e.g., Prettier).
- Code Transformation: Converting one syntax to another, like transforming ES6 Class syntax to ES5 (e.g., Babel).
- Static Analysis: Analyzing code structure and dependencies to catch problems without running the code (e.g., ESLint); the toy rule below shows the idea.
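To see how static analysis falls out of everything we built, here is a toy "lint rule" over the `Program` returned by our own `Parser` (purely illustrative, nothing like ESLint's real architecture): it flags single-character variable names.

```ts
// Toy static analysis over our Program AST: warn on one-letter variable names.
function lintShortNames(program: Program): string[] {
  const warnings: string[] = [];
  for (const stmt of program.body) {
    if (stmt.type === "VariableDeclaration") {
      for (const decl of stmt.declarations) {
        if (decl.id.name.length === 1) {
          warnings.push(`Variable "${decl.id.name}" has a very short name`);
        }
      }
    }
  }
  return warnings;
}

// Running it on the `ast` produced by our Parser test earlier:
console.log(lintShortNames(ast));
// [ 'Variable "x" has a very short name', 'Variable "y" has a very short name' ]
```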