Turn any git repo into LLM-readable format
Recently I came across a repo that lets you talk to any GitHub repo, called Talk to Github. After looking at their codebase, I found the interesting project that makes the site possible: gitingest. In this post I will go into detail on how it works and how I made a JS implementation with image/PDF processing.
Gitingest turns any publicly available GitHub repo into an LLM-friendly format. Here is an example of turning the gitingest repo itself into that format:
First, it clones the repo into a temporary directory, then traverses every folder and file, converting each file name and its content into an LLM-readable format. It also filters out unnecessary folders and files like the .git folder and package-lock.json. One downside is that it also filters out documents like images and PDFs. The program additionally lets you ingest the codebase at any commit: it clones the repo and checks out that commit before converting it into the LLM format.
Like any typical JS dev, I decided to write my own implementation in JS and add image/PDF processing with Gemini and the latest mistral-ocr model.
Here's what I need to do:
- Clone the input repo
- Check out the commit/branch if given
- Loop through every file, filter out the ignored patterns, and process the rest
- Convert them into an LLM-readable format like gitingest
- Delete the cloned repo from disk
1. Clone the repo && 2. Check out the commit/branch
Since this requires shell commands, I decided to use Bun for the entire project. Bun has a built-in shell command $, which is super convenient for running shell commands and getting their output (the output is not used in this case). My approach is to clone the repo into a randomly generated folder name under a tmp folder.
This is how I clone the git repo. I decided to make it accept any git provider, since that shouldn't add much complexity:
import { $ } from 'bun'
import { nanoid } from 'nanoid'

const repo = '' // the git url to clone
const commit = '' // your commit id (if given)
const branch = '' // your branch name (if given)

const id = nanoid()
const dir = `tmp/${id}`

const cloneArgs = []
if (!commit) {
  cloneArgs.push('--depth=1') // save disk space
}
if (branch && !['main', 'master'].includes(branch)) {
  cloneArgs.push('--branch', branch)
}

// pass the array directly: Bun's shell expands it into separate
// escaped arguments (a joined string would be escaped as one argument)
await $`git clone ${cloneArgs} ${repo} ${dir}`
if (commit) {
  await $`cd ${dir} && git checkout ${commit}`
}
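The argument-building logic can also be pulled into a pure helper, which makes it easy to unit-test without touching the network. This is a sketch; buildCloneArgs is a hypothetical name, not part of the original code:

```typescript
// Sketch: build the `git clone` argument list from an optional commit/branch.
function buildCloneArgs(opts: { commit?: string; branch?: string }): string[] {
  const args: string[] = []
  // a full history is only needed when we must check out an older commit
  if (!opts.commit) args.push('--depth=1')
  // the default branch is cloned anyway, so only pass --branch for others
  if (opts.branch && !['main', 'master'].includes(opts.branch)) {
    args.push('--branch', opts.branch)
  }
  return args
}
```

Keeping this pure also means the same logic can be reused if the shell runner is ever swapped out.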
3. Loop through every file, filter out the ignored patterns, and process them
The original gitingest implementation uses a node system and a recursive method to process each node (folder/file) and its children (if it is a folder). When the node is a folder, it calls the processing function on each of its children; when the node is a file, it returns the file's name and content. Things change a bit when the file is an image or PDF: we have to use Gemini to describe the image and mistral-ocr to accurately process the PDF. Here is the JS implementation:
ignore-patterns.ts
import ignore from 'ignore'
// from gitingest (removed images (images are readable now) and some dotfiles (more context on the project))
export const patterns = [
// Python
'*.pyc',
'*.pyo',
'*.pyd',
'__pycache__',
'.pytest_cache',
'.coverage',
'.tox',
'.nox',
'.mypy_cache',
'.ruff_cache',
'.hypothesis',
'poetry.lock',
'Pipfile.lock',
// JavaScript/Node
'node_modules',
'bower_components',
'package-lock.json',
'yarn.lock',
'.npm',
'.yarn',
'.pnpm-store',
'bun.lock',
'bun.lockb',
// Java
'*.class',
'*.jar',
'*.war',
'*.ear',
'*.nar',
'.gradle/',
'build/',
'.settings/',
'.classpath',
'gradle-app.setting',
'*.gradle',
// IDEs and editors / Java
'.project',
// C/C++
'*.o',
'*.obj',
'*.dll',
'*.dylib',
'*.exe',
'*.lib',
'*.out',
'*.a',
'*.pdb',
// Swift/Xcode
'.build/',
'*.xcodeproj/',
'*.xcworkspace/',
'*.pbxuser',
'*.mode1v3',
'*.mode2v3',
'*.perspectivev3',
'*.xcuserstate',
'xcuserdata/',
'.swiftpm/',
// Ruby
'*.gem',
'.bundle/',
'vendor/bundle',
'Gemfile.lock',
'.ruby-version',
'.ruby-gemset',
'.rvmrc',
// Rust
'Cargo.lock',
'**/*.rs.bk',
// Java / Rust
'target/',
// Go
'pkg/',
// .NET/C#
'obj/',
'*.suo',
'*.user',
'*.userosscache',
'*.sln.docstates',
'packages/',
'*.nupkg',
// Go / .NET / C#
'bin/',
// Version control
'.git',
'.svn',
'.hg',
// Virtual environments
'venv',
'.venv',
'env',
'virtualenv',
// Temporary and cache files
'*.log',
'*.bak',
'*.swp',
'*.tmp',
'*.temp',
'.cache',
'.sass-cache',
'.eslintcache',
'.DS_Store',
'Thumbs.db',
'desktop.ini',
// Build directories and artifacts
'build',
'dist',
'target',
'out',
'*.egg-info',
'*.egg',
'*.whl',
'*.so',
// Documentation
'site-packages',
'.docusaurus',
'.next',
'.nuxt',
// Other common patterns
// Minified files
'*.min.js',
'*.min.css',
// Source maps
'*.map',
// Terraform
'.terraform',
'*.tfstate*',
// Dependencies in various languages
'vendor/',
]
const ig = ignore().add(patterns)

export const isIgnored = (file: string) => {
  return ig.ignores(file)
}
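To make the matching semantics concrete, here is a tiny, simplified matcher covering only two pattern shapes from the list above (bare names and `*.ext` globs). The real `ignore` package implements full .gitignore semantics, including anchoring and negation, so treat this strictly as an illustration:

```typescript
// Simplified sketch of gitignore-style matching for two pattern shapes:
//   'name'   -> matches any path segment equal to name
//   '*.ext'  -> matches any file ending in .ext
// The real `ignore` npm package handles the full .gitignore spec.
const simplePatterns = ['node_modules', '.git', '*.pyc', '*.log']

function isIgnoredSimple(file: string): boolean {
  const segments = file.split('/')
  return simplePatterns.some((pattern) => {
    if (pattern.startsWith('*.')) {
      return file.endsWith(pattern.slice(1)) // '*.pyc' -> '.pyc'
    }
    return segments.includes(pattern)
  })
}
```

Note that matching against path segments is what lets a pattern like node_modules exclude the whole directory at any depth.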
import { createGoogleGenerativeAI } from '@ai-sdk/google'
import { Mistral } from '@mistralai/mistralai'
import { OCRResponse } from '@mistralai/mistralai/models/components'
import { generateText } from 'ai'
import * as fs from 'node:fs/promises' // needed for fs.readdir below
import * as path from 'path'
import { isIgnored } from './ignore-patterns'

const google = Bun.env.GEMINI_API_KEY
  ? createGoogleGenerativeAI({
      apiKey: Bun.env.GEMINI_API_KEY,
    })
  : null

const mistral = Bun.env.MISTRAL_API_KEY
  ? new Mistral({
      apiKey: Bun.env.MISTRAL_API_KEY,
    })
  : null
async function getAllFilesStats(rootPath: string, dirPath: string) {
  const files = await fs.readdir(dirPath)
  const arrayOfFiles: {
    path: string
    type: string
    content: string
    pdfParsed?: OCRResponse
    imageDescription?: string
  }[] = []

  for (const file of files) {
    const filePath = path.join(dirPath, file)
    const bunFile = Bun.file(filePath)
    const fileStat = await bunFile.stat()

    if (isIgnored(path.relative(rootPath, filePath))) {
      continue
    }

    if (fileStat.isDirectory()) {
      arrayOfFiles.push(...(await getAllFilesStats(rootPath, filePath)))
    } else {
      if (bunFile.type.startsWith('application/pdf') && mistral) {
        const base64 = (await bunFile.bytes()).toBase64()
        arrayOfFiles.push({
          path: path.relative(rootPath, filePath),
          type: bunFile.type,
          content: await bunFile.text(),
          pdfParsed: await mistral.ocr.process({
            model: 'mistral-ocr-latest',
            document: {
              type: 'document_url',
              documentUrl: 'data:application/pdf;base64,' + base64,
            },
            includeImageBase64: true,
          }),
        })
      } else if (bunFile.type.startsWith('image/') && google) {
        const arrayBuffer = await bunFile.arrayBuffer()
        const { text } = await generateText({
          model: google('gemini-2.0-flash'),
          messages: [
            {
              role: 'user',
              content: [
                {
                  type: 'text',
                  text: `
                    Describe this image in as much detail as possible
                    Don't make any unnecessary comments like "Here's a detailed description of the image"
                    The description will most likely be used to improve another llm's understanding of the image, so give as much detail as possible
                    Only generate the description of the image, no chatting
                  `,
                },
                { type: 'image', image: arrayBuffer },
              ],
            },
          ],
        })
        arrayOfFiles.push({
          path: path.relative(rootPath, filePath),
          type: bunFile.type,
          content: await bunFile.text(),
          imageDescription: text,
        })
      } else {
        arrayOfFiles.push({
          path: path.relative(rootPath, filePath),
          type: bunFile.type,
          content: await bunFile.text(),
        })
      }
    }
  }
  return arrayOfFiles
}
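The branching above keys off bunFile.type, which Bun infers from the file extension. For a runtime without that convenience, the same decision can be sketched as a small extension-to-MIME lookup (this mapping is my own assumption, covering only the types the post cares about):

```typescript
import * as path from 'node:path'

// Sketch: map a file extension to the MIME type the traversal branches on.
// Bun.file(path).type does this lookup internally; this portable fallback
// covers only a hypothetical subset of types.
const mimeByExt: Record<string, string> = {
  '.pdf': 'application/pdf',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.gif': 'image/gif',
  '.webp': 'image/webp',
}

function guessType(filePath: string): string {
  // default to text/plain so unknown files fall through to the text branch
  return mimeByExt[path.extname(filePath).toLowerCase()] ?? 'text/plain'
}
```

Defaulting unknown extensions to text/plain mirrors the traversal's behavior, where anything that is not a PDF or image is read as text.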
4. Convert them into an LLM-readable format like gitingest
Since our data structure contains the image description and the parsed PDF information, we have to take that into account when converting each file's information into the LLM format:
import { OCRResponse } from '@mistralai/mistralai/models/components'

const formatFiles = (
  files: {
    path: string
    type: string
    content: string
    pdfParsed?: OCRResponse
    imageDescription?: string
  }[],
) => {
  const text = files
    .map((file) => {
      let output = '='.repeat(48)
      output += '\n'
      output += 'FILE: ' + file.path.split('/').pop()
      output += '\n'
      output += '='.repeat(48)
      output += '\n'
      output +=
        file.type.split(';')[0] === 'application/pdf'
          ? JSON.stringify(file.pdfParsed)
          : file.type.split(';')[0].startsWith('image/')
            ? file.imageDescription
            : file.content
      return output
    })
    .join('\n\n')
  return text
}
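For a plain text file, the formatter produces gitingest-style separator blocks. A minimal standalone version (OCR and image fields omitted, so this is an illustration rather than the full function) shows the shape of the output:

```typescript
// Minimal standalone sketch of the formatter for plain text files only,
// showing the separator layout gitingest uses.
function formatTextFile(filePath: string, content: string): string {
  const bar = '='.repeat(48)
  return `${bar}\nFILE: ${filePath.split('/').pop()}\n${bar}\n${content}`
}
```

Each file therefore becomes a clearly delimited block, which makes it easy for an LLM to tell where one file ends and the next begins.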
5. Delete the cloned repo from disk
import * as fs from 'node:fs/promises'
await fs.rm(dir, { recursive: true, force: true })
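Since cloning or processing can throw halfway through, it is safer to wrap the whole ingest in try/finally so the temporary directory is always removed. A sketch of that pattern (using mkdtemp instead of nanoid purely to stay stdlib-only; withTmpDir is a hypothetical helper):

```typescript
import * as fs from 'node:fs/promises'
import * as os from 'node:os'
import * as path from 'node:path'

// Sketch: guarantee cleanup of a temporary clone directory even when the
// work inside throws. mkdtemp stands in for the nanoid-based folder name.
async function withTmpDir<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'ingest-'))
  try {
    return await fn(dir)
  } finally {
    // runs on success and on error alike
    await fs.rm(dir, { recursive: true, force: true })
  }
}
```

Without this, a failed clone or an API error mid-traversal would leave orphaned clones accumulating under tmp.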
There are some improvements that can be made to this. The obvious one is to also process the PDF data that mistral generates. We can also improve performance by storing the result in a db and returning it to the user on repeated requests. I already have the db implementation in the github repo below.
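The caching idea can be sketched with an in-memory map keyed by repo and ref (a real implementation would persist to the db; cachedIngest and its callback are hypothetical names):

```typescript
// Sketch: memoize ingest results per (repo, ref) so repeated requests
// skip the clone entirely. A real version would use a database.
const cache = new Map<string, string>()

async function cachedIngest(
  repo: string,
  ref: string,
  ingest: (repo: string, ref: string) => Promise<string>,
): Promise<string> {
  const key = `${repo}@${ref}`
  const hit = cache.get(key)
  if (hit !== undefined) return hit
  const result = await ingest(repo, ref)
  cache.set(key, result)
  return result
}
```

One caveat: keying on a branch name can serve stale content once the branch moves, so a commit hash is the safer cache key.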
JS Implementation: github
You can also deploy it directly on Railway:
Thank you for reading! Check out my github profile https://github.com/TZGyn for my other open source projects. I love writing my own implementations of other projects to improve my knowledge and skills.