Abstract Syntax Tree (AST) Deep Dive: From Theory to Practical Compiler Implementation

Introduction In programming, parsing and transforming source code is a fundamental process. Whether it's syntax checking, formatting, or advanced operations like code minification and transpilation, a deep understanding of code structure is essential. At the heart of this process lies the Abstract Syntax Tree (AST)—a powerful tool for representing code in a structured, tree-like format. This article will guide you through the core concepts of AST and demonstrate its practical application by building a mini-compiler from scratch. By the end, you’ll learn: What is an AST? How does it represent code syntax? How to implement a compiler step-by-step, from parsing source code to generating output. Real-world applications of AST, including code transformation and syntax validation. 1. What is an AST? An Abstract Syntax Tree (AST) is a tree representation of source code where each node corresponds to a syntactic unit (e.g., variables, functions, operators). Its primary role is to convert raw text-based code into structured data, enabling programmatic analysis and manipulation. 2. Building a Mini-Compiler We’ll implement a compiler that transforms LISP-style function calls into C-style syntax: LISP: (add 2 (subtract 4 2)) C: add(2, subtract(4, 2)) 2.1. Compiler Workflow Parsing: Lexical Analysis: Convert source code into tokens. Syntax Analysis: Convert tokens into an AST. Transformation: Modify the AST (e.g., for language conversion). Code Generation: Generate new code from the transformed AST. 2.2. Lexical Analysis (Tokenizer) Convert source code into tokens: function tokenizer(input) { let current = 0; let tokens = []; while (current {};`; const ast = parser.parse(code); traverse.default(ast, { Identifier(path) { if (path.node.name === "hello") path.node.name = "world"; }, }); const result = generator.default(ast, {}, code); console.log(result.code); // Output: const world = () => {}; Conclusion ASTs are foundational to modern programming tools. By understanding them, you gain the ability to: Build custom compilers/transpilers. Analyze and refactor code programmatically. Create domain-specific languages (DSLs). For further reading, explore: Babel Handbook ESTree Specification AST Explorer Original article: https://webfem.com/post/ast. Translated to English.

Apr 29, 2025 - 05:18
 0
Abstract Syntax Tree (AST) Deep Dive: From Theory to Practical Compiler Implementation

Introduction

In programming, parsing and transforming source code is a fundamental process. Whether it's syntax checking, formatting, or advanced operations like code minification and transpilation, a deep understanding of code structure is essential. At the heart of this process lies the Abstract Syntax Tree (AST)—a powerful tool for representing code in a structured, tree-like format.

This article will guide you through the core concepts of AST and demonstrate its practical application by building a mini-compiler from scratch. By the end, you’ll learn:

  1. What is an AST? How does it represent code syntax?
  2. How to implement a compiler step-by-step, from parsing source code to generating output.
  3. Real-world applications of AST, including code transformation and syntax validation.

1. What is an AST?

An Abstract Syntax Tree (AST) is a tree representation of source code where each node corresponds to a syntactic unit (e.g., variables, functions, operators). Its primary role is to convert raw text-based code into structured data, enabling programmatic analysis and manipulation.

2. Building a Mini-Compiler

We’ll implement a compiler that transforms LISP-style function calls into C-style syntax:

LISP: (add 2 (subtract 4 2))  
C:    add(2, subtract(4, 2))  

2.1. Compiler Workflow

  1. Parsing:
    • Lexical Analysis: Convert source code into tokens.
    • Syntax Analysis: Convert tokens into an AST.
  2. Transformation: Modify the AST (e.g., for language conversion).
  3. Code Generation: Generate new code from the transformed AST.

2.2. Lexical Analysis (Tokenizer)

Convert source code into tokens:

function tokenizer(input) {
  let current = 0;
  let tokens = [];
  while (current < input.length) {
    let char = input[current];
    if (char === '(') tokens.push({ type: 'paren', value: '(' });
    else if (char === ')') tokens.push({ type: 'paren', value: ')' });
    else if (/\s/.test(char)) current++; // Skip whitespace
    else if (/[0-9]/.test(char)) {
      let value = '';
      while (/[0-9]/.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'number', value });
    }
    else if (/[a-z]/i.test(char)) {
      let value = '';
      while (/[a-z]/i.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'name', value });
    }
    else throw new TypeError(`Unknown character: ${char}`);
  }
  return tokens;
}

Output:

[
  { type: 'paren', value: '(' },
  { type: 'name', value: 'add' },
  { type: 'number', value: '2' },
  { type: 'paren', value: '(' },
  { type: 'name', value: 'subtract' },
  { type: 'number', value: '4' },
  { type: 'number', value: '2' },
  { type: 'paren', value: ')' },
  { type: 'paren', value: ')' },
]

2.3. Syntax Analysis (Parser)

Convert tokens into an AST:

function parser(tokens) {
  let current = 0;
  function walk() {
    let token = tokens[current];
    if (token.type === 'number') {
      current++;
      return { type: 'NumberLiteral', value: token.value };
    }
    if (token.type === 'paren' && token.value === '(') {
      token = tokens[++current];
      let node = { type: 'CallExpression', name: token.value, params: [] };
      token = tokens[++current];
      while (token.type !== 'paren' || token.value !== ')') {
        node.params.push(walk());
        token = tokens[current];
      }
      current++;
      return node;
    }
    throw new TypeError(`Unknown token: ${token.type}`);
  }
  let ast = { type: 'Program', body: [] };
  while (current < tokens.length) ast.body.push(walk());
  return ast;
}

AST Output:

{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      {
        type: 'CallExpression',
        name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ],
      },
    ],
  }],
}

2.4. Transformation

Modify the AST to match the target output structure:

function transformer(ast) {
  let newAst = { type: 'Program', body: [] };
  ast._context = newAst.body;
  traverser(ast, {
    NumberLiteral: {
      enter(node, parent) {
        parent._context.push({ type: 'NumberLiteral', value: node.value });
      },
    },
    CallExpression: {
      enter(node, parent) {
        let expression = {
          type: 'CallExpression',
          callee: { type: 'Identifier', name: node.name },
          arguments: [],
        };
        node._context = expression.arguments;
        if (parent.type !== 'CallExpression') {
          expression = {
            type: 'ExpressionStatement',
            expression: expression,
          };
        }
        parent._context.push(expression);
      },
    },
  });
  return newAst;
}

Transformed AST:

{
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'CallExpression',
      callee: { type: 'Identifier', name: 'add' },
      arguments: [
        { type: 'NumberLiteral', value: '2' },
        {
          type: 'CallExpression',
          callee: { type: 'Identifier', name: 'subtract' },
          arguments: [
            { type: 'NumberLiteral', value: '4' },
            { type: 'NumberLiteral', value: '2' },
          ],
        },
      ],
    },
  }],
}

2.5. Code Generation

Convert the transformed AST into C-style code:

function codeGenerator(node) {
  switch (node.type) {
    case 'Program': return node.body.map(codeGenerator).join('\n');
    case 'ExpressionStatement': return `${codeGenerator(node.expression)};`;
    case 'CallExpression':
      return `${codeGenerator(node.callee)}(${node.arguments.map(codeGenerator).join(', ')})`;
    case 'Identifier': return node.name;
    case 'NumberLiteral': return node.value;
    default: throw new TypeError(`Unknown node type: ${node.type}`);
  }
}

Final Output:

add(2, subtract(4, 2));

3. Real-World Applications with Babel

Babel, a popular JavaScript compiler, uses AST for:

  • Transpiling modern JS (e.g., ES6+ → ES5).
  • Code optimization (e.g., dead code elimination).
  • Custom transformations (e.g., JSX → JS).

Example: Renaming a Function with Babel

const parser = require("@babel/parser");
const traverse = require("@babel/traverse");
const generator = require("@babel/generator");

const code = `const hello = () => {};`;
const ast = parser.parse(code);

traverse.default(ast, {
  Identifier(path) {
    if (path.node.name === "hello") path.node.name = "world";
  },
});

const result = generator.default(ast, {}, code);
console.log(result.code); // Output: const world = () => {};

Conclusion

ASTs are foundational to modern programming tools. By understanding them, you gain the ability to:

  • Build custom compilers/transpilers.
  • Analyze and refactor code programmatically.
  • Create domain-specific languages (DSLs).

For further reading, explore:

Original article: https://webfem.com/post/ast. Translated to English.