
Demystifying Compiler Principles: From Tokens to AST
Have you ever wondered how your editor "understands" your code? Why do different words appear in different colors? How can tools like ESLint precisely identify issues in your code? Or even, how does Babel transform your JSX code into plain JavaScript and then downgrade it to older versions of JS? If you're curious about the answers to these questions, keep reading!
Compiler principles are a crucial field in computer science, focusing on how high-level programming languages are transformed into machine-executable code. In this process, lexical analysis (Tokenization) and syntactic analysis (AST generation) are two key steps.
- Lexical Analysis: Breaking down the source code string into meaningful Tokens (e.g., keywords, identifiers, operators, etc.).
- Syntactic Analysis: Converting the sequence of Tokens into an Abstract Syntax Tree (AST), a tree-like structure representing the syntactic structure of the code.
This article will guide you through implementing a simple JS lexical analyzer and parser using TypeScript, capable of parsing variable declaration statements, and help you understand how they work.
A Token is the smallest unit of source code, representing elements like keywords, identifiers, operators, and numbers in a language. For example, in JavaScript, the following code:
```js
let x = 42;
```
can be broken down into the following Tokens:
- `let`: Keyword
- `x`: Identifier
- `=`: Operator
- `42`: Number
- `;`: Punctuation
The lexical analyzer (Lexer) breaks down the source code string into a sequence of Tokens. Here's how it works:
- Read Characters One by One: The Lexer starts from the first character of the source code, reading characters sequentially.
- Identify Tokens: Based on the character type (letter, digit, symbol, etc.), the Lexer identifies the Token type.
- Generate Tokens: Combines the identified Token type and its corresponding value into a Token object.
Here’s a simple lexical analyzer implemented in TypeScript:
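A minimal lexer sufficient for the `let x = 42;` example can be sketched as follows. The regex-based character tests, the `KEYWORDS` set, and the exact `TokenType` names are illustrative choices, not the only way to structure it:

```typescript
// Define Token types
type TokenType = "Keyword" | "Identifier" | "Operator" | "Number" | "Punctuation";

interface Token {
  type: TokenType;
  value: string;
}

// Words treated as keywords rather than identifiers (illustrative subset)
const KEYWORDS = new Set(["let", "const", "var"]);

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;

  while (pos < source.length) {
    const char = source[pos];

    // Skip whitespace between tokens
    if (/\s/.test(char)) {
      pos++;
      continue;
    }

    // Numbers: consume consecutive digits
    if (/[0-9]/.test(char)) {
      let value = "";
      while (pos < source.length && /[0-9]/.test(source[pos])) {
        value += source[pos++];
      }
      tokens.push({ type: "Number", value });
      continue;
    }

    // Identifiers and keywords: letters, digits, _ and $
    if (/[A-Za-z_$]/.test(char)) {
      let value = "";
      while (pos < source.length && /[A-Za-z0-9_$]/.test(source[pos])) {
        value += source[pos++];
      }
      tokens.push({ type: KEYWORDS.has(value) ? "Keyword" : "Identifier", value });
      continue;
    }

    // Single-character operators (small illustrative set)
    if (char === "=" || char === "+" || char === "-") {
      tokens.push({ type: "Operator", value: char });
      pos++;
      continue;
    }

    // Punctuation
    if (char === ";") {
      tokens.push({ type: "Punctuation", value: char });
      pos++;
      continue;
    }

    throw new Error(`Unexpected character "${char}" at position ${pos}`);
  }

  return tokens;
}

console.log(tokenize("let x = 42;"));
```

Note how `pos` only ever moves forward: each branch consumes exactly the characters belonging to one Token, which is why the whole process stays linear.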
The code might seem lengthy, but the Token parsing logic is linear and doesn’t involve complex calls. It reads characters one by one and generates corresponding Tokens based on their type. You can copy the code into your editor, run it, and debug it while observing how the data changes. This hands-on approach will help you quickly grasp the flow of data and understand how lexical analysis works.
While debugging, focus on:
- Character Reading and Matching: Observe how `pos` changes and what `char` holds at each step.
- Token Generation: See how the `tokens` array is populated step by step.
- Regex Matching: Verify that the regexes correctly identify different character types.
Hands-on practice is the best way to learn programming, so don’t hesitate to try it out!
Running the above code will produce the following tokens array:
```js
[
  { type: "Keyword", value: "let" },
  { type: "Identifier", value: "x" },
  { type: "Operator", value: "=" },
  { type: "Number", value: "42" },
  { type: "Punctuation", value: ";" }
]
```
An Abstract Syntax Tree (AST) is a tree representation of source code, where each node represents a syntactic structure (e.g., expressions, statements). For example, the following code:
```js
let x = 42;
```
could have an AST like this:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "kind": "let",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "x" },
          "init": { "type": "Literal", "value": 42 }
        }
      ]
    }
  ]
}
```
The parser converts a sequence of Tokens into an AST. Here's how it works:
- Recursive Parsing: Reads Tokens one by one and recursively builds the AST based on syntax rules.
- Syntax Rules: Generates corresponding AST nodes based on language syntax rules (e.g., variable declarations, expressions).
Here’s a simple parser implemented in TypeScript:
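A recursive-descent parser for variable declarations can be sketched as follows. The node shapes (`VariableDeclaration`, `VariableDeclarator`, `Literal`) mirror the AST example above, and the tiny `parseExpression` that only handles numbers and identifiers is a simplifying assumption:

```typescript
// Token shape produced by the lexer
interface Token {
  type: string;
  value: string;
}

// Generic AST node: a type tag plus node-specific fields
interface Node {
  type: string;
  [key: string]: unknown;
}

function parse(tokens: Token[]): Node {
  let pos = 0;

  const next = (): Token => tokens[pos++];

  // Consume a token of the expected type (and optionally value), or fail
  function expect(type: string, value?: string): Token {
    const token = next();
    if (!token || token.type !== type || (value !== undefined && token.value !== value)) {
      throw new Error(`Expected ${value ?? type}, got ${token ? token.value : "end of input"}`);
    }
    return token;
  }

  // Expression := Number | Identifier (kept deliberately tiny)
  function parseExpression(): Node {
    const token = next();
    if (!token) throw new Error("Unexpected end of input");
    if (token.type === "Number") return { type: "Literal", value: Number(token.value) };
    if (token.type === "Identifier") return { type: "Identifier", name: token.value };
    throw new Error(`Unexpected token "${token.value}"`);
  }

  // VariableDeclaration := ("let" | "const" | "var") Identifier "=" Expression ";"
  function parseVariableDeclaration(): Node {
    const kind = expect("Keyword").value;
    const id = { type: "Identifier", name: expect("Identifier").value };
    expect("Operator", "=");
    const init = parseExpression();
    expect("Punctuation", ";");
    return {
      type: "VariableDeclaration",
      kind,
      declarations: [{ type: "VariableDeclarator", id, init }],
    };
  }

  // Program := VariableDeclaration*
  const body: Node[] = [];
  while (pos < tokens.length) {
    body.push(parseVariableDeclaration());
  }
  return { type: "Program", body };
}

// Tokens for `let x = 42;`, as produced by the lexer
const tokens: Token[] = [
  { type: "Keyword", value: "let" },
  { type: "Identifier", value: "x" },
  { type: "Operator", value: "=" },
  { type: "Number", value: "42" },
  { type: "Punctuation", value: ";" },
];

console.log(JSON.stringify(parse(tokens), null, 2));
```

Each `parse*` function handles one syntax rule and calls the functions for the rules it contains, which is exactly the "recursive parsing driven by syntax rules" described above.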