Making Sense of tree-sitter's C API


May 5, 2025 - 18:53

Hi there! I'm Shrijith Venkatrama, founder of Hexmos. Right now, I’m building LiveAPI, a first-of-its-kind tool that automatically indexes API endpoints across all your repositories. LiveAPI helps you discover, understand, and use APIs in large tech infrastructures with ease.

Tree-sitter is a powerful parsing library that generates syntax trees for code, making it a go-to for tools like code editors and linters. Its C API is the backbone for integrating Tree-sitter into projects, but it can feel daunting with its many types and functions. This guide breaks down the Tree-sitter C API, focusing on practical usage with clear examples. We'll explore how to set up a parser, parse code, navigate syntax trees, and query them, all while keeping things developer-friendly.

Why Tree-sitter's C API Matters

The C API is the core interface for Tree-sitter, offering fine-grained control over parsing and syntax tree manipulation. It's used by language bindings (like Rust or Python) and directly in C/C++ projects for performance-critical applications. Understanding it helps you:

  • Integrate Tree-sitter into custom tools.
  • Optimize parsing for specific use cases.
  • Debug issues when higher-level bindings fall short.

The API is defined in tree_sitter/api.h (available on GitHub). It revolves around a few key concepts: parsers, trees, nodes, and queries. Let’s dive into the essentials.

Setting Up a Parser

To use Tree-sitter, you first need a parser. The TSParser struct is your entry point, and setting it up involves creating it and assigning a language.

Key Functions

Function | Description
ts_parser_new | Creates a new parser.
ts_parser_set_language | Assigns a language to the parser.
ts_parser_delete | Frees the parser.

Example: Initializing a Parser

Here’s how to set up a parser for JavaScript. The tree_sitter_javascript function is exported by the tree-sitter-javascript grammar, so you link it in from the compiled language module.

#include <stdio.h>
#include <tree_sitter/api.h>

// Provided by the compiled tree-sitter-javascript grammar at link time
extern const TSLanguage *tree_sitter_javascript(void);

int main() {
    // Create parser
    TSParser *parser = ts_parser_new();

    // Set language
    const TSLanguage *lang = tree_sitter_javascript();
    if (!ts_parser_set_language(parser, lang)) {
        fprintf(stderr, "Language version mismatch\n");
        ts_parser_delete(parser);
        return 1;
    }

    // Clean up
    ts_parser_delete(parser);
    return 0;
}

Output: No output if successful; prints an error if the language version is incompatible.

Notes:

  • Language versioning is critical. The API supports languages with ABI versions between TREE_SITTER_MIN_COMPATIBLE_LANGUAGE_VERSION (13) and TREE_SITTER_LANGUAGE_VERSION (15); both values come from api.h and change between Tree-sitter releases.
  • Always check the return value of ts_parser_set_language to catch version mismatches.
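
To make the version range concrete, here is a minimal sketch of the check that ts_parser_set_language performs for you. The two constants are local stand-ins that copy the numbers quoted above (an assumption; real code should use the macros from api.h), and abi_is_compatible is a hypothetical helper name, not part of Tree-sitter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, copied from the note above. In real code, use
   TREE_SITTER_MIN_COMPATIBLE_LANGUAGE_VERSION and
   TREE_SITTER_LANGUAGE_VERSION from tree_sitter/api.h instead. */
static const uint32_t SKETCH_MIN_COMPATIBLE_ABI = 13;
static const uint32_t SKETCH_CURRENT_ABI = 15;

/* Hypothetical helper: true when a grammar's ABI version is in the
   range the library accepts. */
bool abi_is_compatible(uint32_t abi_version) {
    return abi_version >= SKETCH_MIN_COMPATIBLE_ABI &&
           abi_version <= SKETCH_CURRENT_ABI;
}
```

A grammar generated with an older or newer tree-sitter CLI falls outside this range, which is the usual reason ts_parser_set_language returns false.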

Parsing Code into a Syntax Tree

Once you have a parser, you can parse code to create a TSTree. The tree represents the code’s structure, with nodes for each syntactic element (e.g., functions, variables).

Key Functions

Function | Description
ts_parser_parse_string | Parses a string into a syntax tree.
ts_tree_root_node | Gets the root node of the tree.
ts_tree_delete | Frees the tree.

Example: Parsing JavaScript Code

This example parses a simple JavaScript function and prints the root node’s type.

#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>

extern const TSLanguage *tree_sitter_javascript(void);

int main() {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_javascript());

    const char *code = "function hello() { return 'world'; }";
    TSTree *tree = ts_parser_parse_string(
        parser,
        NULL,  // No old tree for first parse
        code,
        strlen(code)
    );

    if (tree == NULL) {
        fprintf(stderr, "Parsing failed\n");
        ts_parser_delete(parser);
        return 1;
    }

    TSNode root = ts_tree_root_node(tree);
    printf("Root node type: %s\n", ts_node_type(root));

    ts_tree_delete(tree);
    ts_parser_delete(parser);
    return 0;
}

Output:

Root node type: program

Notes:

  • The NULL old_tree parameter is used for initial parses. For incremental parsing (e.g., after code edits), pass the previous tree after updating it with ts_tree_edit.
  • The root node’s type (program) is language-specific, defined in the language’s grammar.

Navigating the Syntax Tree

The syntax tree is a hierarchy of TSNode objects, each representing a syntactic construct. You can traverse the tree to inspect nodes, their types, and their positions.

Key Functions

Function | Description
ts_node_child | Gets a child node by index.
ts_node_named_child | Gets a named child by index (skips anonymous nodes such as punctuation and keywords).
ts_node_named_child_count | Returns the number of named children.
ts_node_type | Returns the node’s type as a string.
ts_node_start_point | Gets the node’s start position (row, column).

Example: Traversing a Tree

This code parses a JavaScript function and prints its named children.

#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>

extern const TSLanguage *tree_sitter_javascript(void);

int main() {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_javascript());

    const char *code = "function hello() { return 'world'; }";
    TSTree *tree = ts_parser_parse_string(parser, NULL, code, strlen(code));

    TSNode root = ts_tree_root_node(tree);
    uint32_t child_count = ts_node_named_child_count(root);

    printf("Named children of root (%s):\n", ts_node_type(root));
    for (uint32_t i = 0; i < child_count; i++) {
        TSNode child = ts_node_named_child(root, i);
        TSPoint start = ts_node_start_point(child);
        printf("  %u: %s at (%u, %u)\n", i, ts_node_type(child), start.row, start.column);
    }

    ts_tree_delete(tree);
    ts_parser_delete(parser);
    return 0;
}

Output:

Named children of root (program):
  0: function_declaration at (0, 0)

Notes:

  • Named vs. anonymous nodes: Named nodes (e.g., function_declaration) correspond to named rules in the grammar, while anonymous nodes (e.g., "(" or the function keyword) correspond to string literals that appear in the grammar.
  • Use ts_node_start_point and ts_node_end_point for precise code positions.
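
One detail worth knowing when you show these positions to users: as far as api.h documents it, the column in a TSPoint counts bytes, not characters. The sketch below converts a byte column into a character column for UTF-8 text. The function name is hypothetical and it assumes the input is valid UTF-8.

```c
#include <stdint.h>

/* Hypothetical helper: converts a byte-based column into a count of
   UTF-8 code points. `line` points at the first byte of the row and
   `byte_column` is the column as reported in a point. Assumes valid
   UTF-8 input. */
uint32_t char_column_from_byte_column(const char *line, uint32_t byte_column) {
    uint32_t chars = 0;
    for (uint32_t i = 0; i < byte_column && line[i] != '\0'; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
           starts a new code point. */
        if (((unsigned char)line[i] & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars;
}
```

For pure ASCII source the two columns are identical, so this only matters once identifiers, strings, or comments contain multi-byte characters.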

Using Tree Cursors for Efficient Traversal

For large trees, iterating with ts_node_child can be slow. The TSTreeCursor provides a more efficient way to traverse trees by maintaining state.

Key Functions

Function | Description
ts_tree_cursor_new | Creates a cursor starting at a node.
ts_tree_cursor_goto_first_child | Moves to the first child.
ts_tree_cursor_goto_next_sibling | Moves to the next sibling.
ts_tree_cursor_current_node | Gets the current node.

Example: Using a Tree Cursor

This example traverses the tree to find all named nodes.

#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>

extern const TSLanguage *tree_sitter_javascript(void);

void traverse(TSNode node) {
    TSTreeCursor cursor = ts_tree_cursor_new(node);
    if (ts_tree_cursor_goto_first_child(&cursor)) {
        do {
            TSNode current = ts_tree_cursor_current_node(&cursor);
            if (ts_node_is_named(current)) {
                printf("Node: %s\n", ts_node_type(current));
                traverse(current);  // Recurse
            }
        } while (ts_tree_cursor_goto_next_sibling(&cursor));
    }
    ts_tree_cursor_delete(&cursor);
}

int main() {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_javascript());

    const char *code = "function hello() { return 'world'; }";
    TSTree *tree = ts_parser_parse_string(parser, NULL, code, strlen(code));

    TSNode root = ts_tree_root_node(tree);
    printf("Starting traversal:\n");
    traverse(root);

    ts_tree_delete(tree);
    ts_parser_delete(parser);
    return 0;
}

Output:

Starting traversal:
Node: function_declaration
Node: identifier
Node: formal_parameters
Node: statement_block
Node: return_statement
Node: string

Notes:

  • Cursors are faster than repeated ts_node_child calls because they cache traversal state.
  • Always call ts_tree_cursor_delete to avoid memory leaks.
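
A cursor pays off most when one cursor walks the whole tree, moving down, sideways, and back up, instead of a new cursor being created per node. The sketch below shows that loop on a toy tree so it compiles without the Tree-sitter headers. ToyNode, ToyCursor, and the toy_* helpers are hypothetical stand-ins; in real code the corresponding calls would be ts_tree_cursor_goto_first_child, ts_tree_cursor_goto_next_sibling, and ts_tree_cursor_goto_parent.

```c
#include <stddef.h>

/* Toy stand-ins for a syntax node and a cursor (not Tree-sitter types). */
typedef struct ToyNode {
    const char *type;
    struct ToyNode *parent;
    struct ToyNode *first_child;
    struct ToyNode *next_sibling;
} ToyNode;

typedef struct { ToyNode *node; } ToyCursor;

static int toy_goto_first_child(ToyCursor *c) {
    if (!c->node->first_child) return 0;
    c->node = c->node->first_child;
    return 1;
}

static int toy_goto_next_sibling(ToyCursor *c) {
    if (!c->node->next_sibling) return 0;
    c->node = c->node->next_sibling;
    return 1;
}

static int toy_goto_parent(ToyCursor *c) {
    if (!c->node->parent) return 0;
    c->node = c->node->parent;
    return 1;
}

/* Visits root and everything below it in pre-order with one cursor and
   no recursion. Stores up to `cap` node types in `out` and returns the
   total number of nodes visited. */
size_t walk_preorder(ToyNode *root, const char **out, size_t cap) {
    ToyCursor c = { root };
    size_t visited = 0;
    for (;;) {
        if (visited < cap) out[visited] = c.node->type;
        visited++;
        if (toy_goto_first_child(&c)) continue;   /* go down first */
        for (;;) {
            if (c.node == root) return visited;   /* back at the start: done */
            if (toy_goto_next_sibling(&c)) break; /* then sideways */
            toy_goto_parent(&c);                  /* otherwise climb */
        }
    }
}
```

The same three-step pattern (child, else sibling, else parent) carries over directly to a TSTreeCursor, with a single ts_tree_cursor_delete at the end.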

Querying the Syntax Tree

Queries let you search for patterns in the syntax tree, like finding all function declarations. The TSQuery API uses S-expressions to define patterns.

Key Functions

Function | Description
ts_query_new | Creates a query from an S-expression.
ts_query_cursor_new | Creates a cursor for executing queries.
ts_query_cursor_exec | Runs the query on a node.
ts_query_cursor_next_match | Gets the next match.

Example: Finding Function Declarations

This code queries for function declarations and prints their names.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tree_sitter/api.h>

extern const TSLanguage *tree_sitter_javascript(void);

int main() {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_javascript());

    const char *code = "function hello() { return 'world'; }\nfunction bye() {}";
    TSTree *tree = ts_parser_parse_string(parser, NULL, code, strlen(code));

    // Create query
    const char *query_str = "(function_declaration name: (identifier) @func-name)";
    uint32_t error_offset;
    TSQueryError error_type;
    TSQuery *query = ts_query_new(
        tree_sitter_javascript(),
        query_str,
        strlen(query_str),
        &error_offset,
        &error_type
    );
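    if (!query) {
        fprintf(stderr, "Query error %d at offset %u\n", (int)error_type, error_offset);
        // Release what was already allocated before bailing out
        ts_tree_delete(tree);
        ts_parser_delete(parser);
        return 1;
    }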

    // Execute query
    TSQueryCursor *cursor = ts_query_cursor_new();
    ts_query_cursor_exec(cursor, query, ts_tree_root_node(tree));

    TSQueryMatch match;
    while (ts_query_cursor_next_match(cursor, &match)) {
        for (uint16_t i = 0; i < match.capture_count; i++) {
            TSNode node = match.captures[i].node;
            // ts_node_string() would give the S-expression "(identifier)",
            // not the name, so slice the source text by byte offsets instead.
            uint32_t start = ts_node_start_byte(node);
            uint32_t end = ts_node_end_byte(node);
            printf("Found function: %.*s\n", (int)(end - start), code + start);
        }
    }

    ts_query_cursor_delete(cursor);
    ts_query_delete(query);
    ts_tree_delete(tree);
    ts_parser_delete(parser);
    return 0;
}

Output:

Found function: hello
Found function: bye

Notes:

  • The query (function_declaration name: (identifier) @func-name) captures the identifier node as func-name.
  • Check the result of ts_query_new: it returns NULL when the pattern is invalid, and error_offset and error_type tell you where and why.
  • Learn more about query syntax in the Tree-sitter documentation.
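
A captured node does not carry its own text. It only knows its byte range in the source you parsed, and ts_node_string returns the node's S-expression rather than the identifier itself. If you need the captured text as a standalone string, a small helper like the one below does the job. The helper name is hypothetical, and the byte offsets in the usage note stand in for what ts_node_start_byte and ts_node_end_byte would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies source[start_byte, end_byte) into `out`
   as a NUL-terminated string, truncating to fit `cap`. Returns the
   number of bytes copied, not counting the terminator. */
size_t copy_source_span(const char *source, uint32_t start_byte,
                        uint32_t end_byte, char *out, size_t cap) {
    assert(start_byte <= end_byte);
    assert(cap > 0);
    size_t len = (size_t)(end_byte - start_byte);
    if (len > cap - 1) {
        len = cap - 1;  /* leave room for the terminator */
    }
    memcpy(out, source + start_byte, len);
    out[len] = '\0';
    return len;
}
```

For the source "function hello() { return 'world'; }", the name identifier spans bytes 9 to 14, so copy_source_span(code, 9, 14, buf, sizeof buf) fills buf with "hello".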

Handling Code Edits

Tree-sitter supports incremental parsing, which is crucial for real-time applications like editors. You edit the tree to reflect code changes and reparse only the affected parts.

Key Functions

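Function | Description
ts_tree_edit | Updates the tree for an edit.
ts_parser_parse_string / ts_parser_parse | Reparse with the edited old tree so unchanged parts are reused (ts_parser_parse reads the source through a TSInput callback).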

Example: Updating a Tree

This code edits a JavaScript function and re-parses it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tree_sitter/api.h>

extern const TSLanguage *tree_sitter_javascript(void);

int main() {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_javascript());

    const char *old_code = "function hello() { return 'world'; }";
    TSTree *tree = ts_parser_parse_string(parser, NULL, old_code, strlen(old_code));

    // Simulate edit: change "hello" to "greet"
    TSInputEdit edit = {
        .start_byte = 9,  // Start of "hello"
        .old_end_byte = 14,  // End of "hello"
        .new_end_byte = 14,  // End of "greet"
        .start_point = {0, 9},
        .old_end_point = {0, 14},
        .new_end_point = {0, 14}
    };
    ts_tree_edit(tree, &edit);

    const char *new_code = "function greet() { return 'world'; }";
    TSTree *new_tree = ts_parser_parse_string(parser, tree, new_code, strlen(new_code));

    TSNode root = ts_tree_root_node(new_tree);
    TSNode func = ts_node_named_child(root, 0);
    TSNode name = ts_node_named_child(func, 0);
    // ts_node_string() would print "(identifier)", so slice the new source instead
    uint32_t start = ts_node_start_byte(name);
    uint32_t end = ts_node_end_byte(name);
    printf("Updated function name: %.*s\n", (int)(end - start), new_code + start);

    ts_tree_delete(tree);
    ts_tree_delete(new_tree);
    ts_parser_delete(parser);
    return 0;
}

Output:

Updated function name: greet

Notes:

  • The TSInputEdit struct requires precise byte and point offsets, which you’d typically compute from an editor’s change events.
  • Incremental parsing is much faster than re-parsing from scratch because the parser reuses the parts of the old tree that the edit did not touch.
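
Filling in the six offsets by hand gets error-prone once the replacement text has a different length or contains newlines. Here is a minimal sketch of deriving them from the old text, the start byte, the number of bytes removed, and the inserted text. SketchPoint and SketchEdit are hypothetical stand-ins that mirror the field names of TSPoint and TSInputEdit, so the block compiles without the Tree-sitter headers. Columns are counted in bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins mirroring the TSPoint / TSInputEdit field names. */
typedef struct { uint32_t row, column; } SketchPoint;
typedef struct {
    uint32_t start_byte, old_end_byte, new_end_byte;
    SketchPoint start_point, old_end_point, new_end_point;
} SketchEdit;

/* Advances a point over `len` bytes of text; each '\n' starts a new row. */
static SketchPoint advance_point(SketchPoint p, const char *text, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        if (text[i] == '\n') {
            p.row++;
            p.column = 0;
        } else {
            p.column++;
        }
    }
    return p;
}

/* Describes replacing `old_len` bytes at `start_byte` of `old_text`
   with the string `inserted`. */
SketchEdit describe_edit(const char *old_text, uint32_t start_byte,
                         uint32_t old_len, const char *inserted) {
    assert(start_byte + old_len <= strlen(old_text));
    uint32_t new_len = (uint32_t)strlen(inserted);
    SketchPoint origin = {0, 0};
    SketchEdit e;
    e.start_byte = start_byte;
    e.old_end_byte = start_byte + old_len;
    e.new_end_byte = start_byte + new_len;
    e.start_point = advance_point(origin, old_text, start_byte);
    e.old_end_point = advance_point(e.start_point, old_text + start_byte, old_len);
    e.new_end_point = advance_point(e.start_point, inserted, new_len);
    return e;
}
```

Replacing "hello" with the longer "greeting" in the example source gives start_byte 9, old_end_byte 14, and new_end_byte 17. Scanning from the start of the buffer is fine for a sketch; an editor would normally already know the row and column of the change.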

Debugging and Logging

Tree-sitter provides tools to debug parsing, like logging and generating DOT graphs for visualization.

Key Functions

Function | Description
ts_parser_set_logger | Sets a callback for parse/lex logs.
ts_parser_print_dot_graphs | Outputs DOT graphs to a file descriptor.

Example: Adding a Logger

This code logs parsing events to stderr.

#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>

extern const TSLanguage *tree_sitter_javascript(void);

void log_callback(void *payload, TSLogType type, const char *msg) {
    fprintf(stderr, "[%s] %s\n", type == TSLogTypeParse ? "PARSE" : "LEX", msg);
}

int main() {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_javascript());

    TSLogger logger = { .payload = NULL, .log = log_callback };
    ts_parser_set_logger(parser, logger);

    const char *code = "function hello() { return 'world'; }";
    TSTree *tree = ts_parser_parse_string(parser, NULL, code, strlen(code));

    ts_tree_delete(tree);
    ts_parser_delete(parser);
    return 0;
}

Output (illustrative only; the exact messages depend on the Tree-sitter version and the grammar):

[PARSE] parsing rule: program
[LEX] token: function
...

Notes:

  • Use ts_parser_print_dot_graphs with a file descriptor to visualize trees (pipe to dot -Tsvg for SVG output).
  • Logging is verbose but invaluable for debugging grammar issues.
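
The payload pointer is how a logger gets at its own state without globals. As a rough sketch, the callback below tallies messages by type instead of printing them. SketchLogType is a local stand-in for TSLogType so the block compiles without the Tree-sitter headers, and LogCounts and counting_logger are hypothetical names.

```c
#include <stdint.h>

/* Stand-in for TSLogType (TSLogTypeParse / TSLogTypeLex). */
typedef enum { SKETCH_LOG_PARSE, SKETCH_LOG_LEX } SketchLogType;

/* State handed to the logger through its payload pointer. */
typedef struct {
    uint32_t parse_messages;
    uint32_t lex_messages;
} LogCounts;

/* Same parameter shape as the log callback: payload, type, message. */
void counting_logger(void *payload, SketchLogType type, const char *message) {
    LogCounts *counts = payload;
    (void)message;  /* the text is ignored; only the type is tallied */
    if (type == SKETCH_LOG_PARSE) {
        counts->parse_messages++;
    } else {
        counts->lex_messages++;
    }
}
```

With the real types, you would set .payload to the address of a LogCounts and .log to the callback in the TSLogger struct, then read the counts after parsing.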

Practical Tips for Using the C API

To wrap up, here are actionable tips for working with Tree-sitter’s C API:

  • Start small: Begin with simple parsing and traversal before tackling queries or incremental parsing.
  • Check return values: Functions like ts_parser_set_language and ts_query_new report failure only through their return values, so an unchecked call fails silently.
  • Use cursors for traversal: They’re faster and cleaner than manual node iteration for large trees.
  • Leverage incremental parsing: For real-time applications, always edit and reuse trees to save time.
  • Debug with logs and graphs: Enable logging or DOT output to understand parsing issues.
  • Read the source: The api.h file is well-documented and the ultimate reference.

The C API is low-level but gives you total control over Tree-sitter’s capabilities. Whether you’re building a code editor, linter, or custom tool, mastering it unlocks powerful parsing features. Experiment with the examples, tweak them for your language, and you’ll be parsing like a pro in no time.