OSD700 - RAG on DuckDB
Table of Contents
- Preface
- Introduction
- How it works?
- Final Decision
Preface
In the last post dedicated to OSD700, I wrote that I had chosen a direction of development for the rest of the term. If you don't remember, read it. If you are busy, here is the short version: I decided to implement a local prototype of RAG and, based on the results, decide whether to integrate it into ChatCraft.org.
A month later, I am coming back with a group of posts about the results, and, as a spoiler, I can tell you they are promising!
Introduction
Going back to the previous post, I am going to attach the prototype issue here:
Prototype RAG on DuckDB and File Attachments
#803, opened by humphd on Jan 27, 2025
ChatCraft has been expanded to include File Attachments and DuckDB, which supports querying files. The two features have been connected, so you can attach files, run SQL queries on them, get back results, download them, etc.
Now that we have this foundation, I think we have most of what we need for building a RAG solution, when file attachments are too large to put into the chat context.
I think the process would work like this:
- user attaches some files with text we can extract (PDF, source code, Word Doc, etc)
- somehow (UI? automatically based on file size) we decide when to use these file attachments for RAG vs. embedding directly in the chat messages
- we take the set of RAG-attachment-files and "index" them in DuckDB. Maybe we use full-text search or maybe we use vector search (see part 1, part 2)
- when the user asks a question, we use their prompt to create a query, get back results from the indexed docs, and include relevant text context along with the original prompt
The initial version of this can be crude, without proper UI, optimal indexing, etc. We need to play a bit to get this right.
Likely, the best way to begin this work is to prototype it outside of ChatCraft using DuckDB and text files locally.
Hopefully, it helped refresh your memory.
The research took about two weeks, and it really helped me understand what's happening and how, at least locally.
After those two weeks, I had to try to implement the feature locally and present it in class. After the first failed attempt, I didn't give up and made it work.
Here's the repository where you can find my prototype solution:
mulla028 / duckdb-rag-prototype - CLI RAG prototype on DuckDB implemented for ChatCraft using vector search
Getting Started
- Clone the repository
- Install DuckDB on your local machine (optional, used for testing with duckdb -ui)
- Install the dependencies: npm i
- Create a .env file and add your OPENAI_API_KEY
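For context, that key ends up in the process environment when the CLI runs. Here is a minimal sketch of loading it in a Node CLI, assuming the dotenv package (the actual prototype may load it differently):

```typescript
// Hypothetical sketch: load .env and fail fast if the key is missing.
// Assumes the dotenv package; the prototype may load the key another way.
import "dotenv/config";

const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) {
  throw new Error("OPENAI_API_KEY is not set - add it to your .env file");
}
```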
How to use it?
Once you have cloned the repo, you will find data already populated inside the target folder called documents. Of course, you may add any text file to process.
- First of all, you will need to process all the files using the command:
npm run rag -- process
This command will process all the files, segmenting them into chunks of sentences (the default) or paragraphs. Eventually, it will generate 1536-dimensional vector embeddings using the text-embedding-3-small model.
To use the paragraphs option:
npm run rag -- process -c "paragraphs"
npm run rag -- process --chunking "paragraphs"
…
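Under the hood, that process command boils down to two things: splitting each document into chunks and requesting an embedding per chunk. Here is a simplified sketch of that step in TypeScript; chunkText and embedChunks are illustrative names of my own, not necessarily what the repository actually uses, but the chunking modes and the model match the README above:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Naive chunking: split on sentence boundaries or on blank lines.
// The real prototype's rules may differ; this is only an illustration.
function chunkText(text: string, mode: "sentences" | "paragraphs" = "sentences"): string[] {
  const parts =
    mode === "paragraphs"
      ? text.split(/\n\s*\n/) // blank line = new paragraph
      : text.split(/(?<=[.!?])\s+/); // whitespace after sentence-ending punctuation
  return parts.map((p) => p.trim()).filter((p) => p.length > 0);
}

// Embed a batch of chunks with the same model the prototype uses.
async function embedChunks(chunks: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small", // 1536-dimensional embeddings
    input: chunks,
  });
  return response.data.map((d) => d.embedding);
}
```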
If you are willing to try it, I have written a README.md that explains how to use this prototype.
How it works?
Essentially, a process that sounds really complex consists of several simple stages and one complex one. Here they are (with code sketches for the trickier steps after the list):
- Receive text from the file. (We learnt this during the first semester.)
- Chunk the text and store it in DuckDB. (Write logic that chunks text, e.g., by paragraph, by sentence, etc.)
- Generate the vector embeddings and store them in DuckDB. (Using the text-embedding-3-small OpenAI model.)
- Vector search (the hard one):
  - Import the DuckDB VSS extension.
  - Apply HNSW indexing to increase the speed of the search. (HNSW indexing is provided by the VSS extension.)
- Generate embeddings for the user query and generate the answer.
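To make the vector-search part more concrete, here is a minimal sketch built on the duckdb Node bindings and the VSS extension. The table and column names are illustrative (the prototype's actual schema may differ), but the extension, the HNSW index, and the array_distance ordering are the DuckDB features the steps above rely on:

```typescript
import duckdb from "duckdb";

// File-backed database so the chunks and index survive between runs.
const db = new duckdb.Database("rag.duckdb");
const con = db.connect();

// Small promise wrappers around the callback-style API of the duckdb package.
function run(sql: string, ...params: unknown[]): Promise<void> {
  return new Promise((resolve, reject) =>
    con.run(sql, ...params, (err: Error | null) => (err ? reject(err) : resolve()))
  );
}

function all(sql: string): Promise<any[]> {
  return new Promise((resolve, reject) =>
    con.all(sql, (err: Error | null, rows: any[]) => (err ? reject(err) : resolve(rows)))
  );
}

async function setup(): Promise<void> {
  // The VSS extension provides HNSW indexes; persisting such an index in a
  // file-backed database is still experimental, hence the setting below.
  await run("INSTALL vss;");
  await run("LOAD vss;");
  await run("SET hnsw_enable_experimental_persistence = true;");

  await run(`
    CREATE TABLE IF NOT EXISTS chunks (
      id INTEGER,
      content TEXT,
      embedding FLOAT[1536]  -- text-embedding-3-small dimensionality
    );
  `);

  // The HNSW index is what speeds up the nearest-neighbour search.
  await run("CREATE INDEX chunks_hnsw ON chunks USING HNSW (embedding);");
}

// Store one chunk together with its embedding.
async function insertChunk(id: number, content: string, embedding: number[]): Promise<void> {
  // The embedding is inlined as an array literal (numbers only, so this is safe);
  // id and content are bound as ordinary parameters.
  await run(
    `INSERT INTO chunks VALUES (?, ?, [${embedding.join(", ")}]::FLOAT[1536]);`,
    id,
    content
  );
}

// Retrieve the k chunks closest to the query embedding.
async function search(queryEmbedding: number[], k = 5): Promise<string[]> {
  const rows = await all(`
    SELECT content
    FROM chunks
    ORDER BY array_distance(embedding, [${queryEmbedding.join(", ")}]::FLOAT[1536])
    LIMIT ${k};
  `);
  return rows.map((r) => r.content);
}
```

The last step ties everything together: the user's question is embedded with the same model (as in the embedChunks sketch earlier), passed to search, and the retrieved chunks are injected into the prompt. The chat model name below is just a placeholder, not necessarily what the prototype calls:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Answer the question using only the retrieved chunks as context.
async function answer(question: string, contextChunks: string[]): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; any chat model would do here
    messages: [
      {
        role: "system",
        content: `Answer the question using only the following context:\n\n${contextChunks.join("\n---\n")}`,
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```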
Final Decision
Countless hours of RAG research and prototyping eventually paid off: my professor liked the way I implemented and understood this problem, and we decided to try integrating it into ChatCraft.org.
However, first I needed to write an implementation proposal issue that would clearly identify all the steps required to successfully implement the feature.