Build and Query Knowledge Graphs with LLMs

Going from document ingestion to smart queries — all with open tools and guided setup.


Knowledge Graphs are relevant

A Knowledge Graph could be defined as a structured representation of information that connects concepts, entities, and their relationships in a way that mimics human understanding. It is often used to organize and integrate data from various sources, enabling machines to reason, infer, and retrieve relevant information more effectively.

In a previous post on Medium, I made the point that this kind of structured representation can be used to enhance the performance of LLMs in Retrieval Augmented Generation applications. We can speak of GraphRAG as an ensemble of techniques and strategies that employ a graph-based representation of knowledge to serve information to LLMs more effectively than the standard approaches typically taken for “Chat with your documents” use cases.

The “vanilla” RAG approach relies on vector similarity (and, sometimes, hybrid search), with the goal of retrieving from a vector database pieces of information (chunks of documents) that are similar to the user’s input according to some similarity measure, such as cosine similarity or Euclidean distance. These pieces of information are then passed to a Large Language Model, which is prompted to use them as context to generate a relevant answer to the user’s query.
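To make this concrete, here is a minimal sketch of that retrieval step, assuming the chunks were embedded ahead of time and using a hypothetical `embed()` function as a stand-in for whatever embedding model you choose:

```python
# Minimal sketch of "vanilla" RAG retrieval over pre-embedded chunks.
# embed() is a hypothetical placeholder for any sentence-embedding model.
import numpy as np

def cosine_sim(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a matrix of chunk vectors.
    return (chunk_vecs @ query_vec) / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 3) -> list[str]:
    q = embed(query)                      # hypothetical embedding call
    scores = cosine_sim(q, chunk_vecs)    # one similarity score per chunk
    top = np.argsort(scores)[::-1][:k]    # indices of the k most similar chunks
    return [chunks[i] for i in top]

# The retrieved chunks are then pasted into the LLM prompt as context, e.g.:
# prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```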

My argument is that the biggest point of failure in these kinds of applications is that similarity search relies on explicit mentions in the knowledge base (at the intra-document level), leaving the LLM blind to cross-references between documents, and even to implied (implicit) and contextual references. In brief, the LLM is limited because it cannot reason at an inter-document level.

This can be addressed by moving away from pure vector representations and vector stores to a more comprehensive way of organizing the knowledge base: extracting concepts from each piece of text and storing them while keeping track of the relationships between pieces of information.
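As a sketch of what this extraction step can look like, the snippet below prompts an LLM to emit (subject, RELATION, object) triples from a text chunk. Here `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt format is just one reasonable choice:

```python
# Sketch of LLM-based concept/relationship extraction.
# call_llm(prompt) -> str is a hypothetical helper; any chat model works.
import json

EXTRACTION_PROMPT = """Extract entities and relationships from the text below.
Return only a JSON list of [subject, RELATION, object] triples.

Text:
{chunk}
"""

def extract_triples(chunk: str) -> list[tuple[str, str, str]]:
    raw = call_llm(EXTRACTION_PROMPT.format(chunk=chunk))  # hypothetical LLM call
    # Assumes the model returned valid JSON; production code should validate.
    return [tuple(t) for t in json.loads(raw)]

# e.g. "Bill works at Microsoft" -> [("Bill", "WORKS_AT", "Microsoft")]
```

The triples extracted from every chunk can then be merged into a single graph, which is where the inter-document links come from.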

A graph structure is, in my opinion, the best way to organize a knowledge base of documents that contain cross-references and implicit mentions of each other, as always happens inside organizations and enterprises. A graph’s main features are in fact the following (a short code sketch follows the list):

  • Entities (Nodes): they represent real-world objects like people, places, organizations, or abstract concepts;
  • Relationships (Edges): they define how entities are connected to one another (e.g., “Bill → WORKS_AT → Microsoft”);
  • Attributes (Properties): they provide additional details about entities (e.g., Microsoft’s founding year, revenue, or location) or about relationships (e.g., “Bill → FRIENDS_WITH {since: 2021} → Mark”).
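
Here is a minimal in-memory sketch of these three building blocks using networkx (one option among many; the same ideas map directly onto property-graph databases):

```python
# Entities, relationships, and properties as a networkx multigraph.
import networkx as nx

G = nx.MultiDiGraph()

# Entities (nodes) with attributes (properties)
G.add_node("Bill", label="Person")
G.add_node("Microsoft", label="Organization", founded=1975)
G.add_node("Mark", label="Person")

# Relationships (edges), optionally carrying their own properties
G.add_edge("Bill", "Microsoft", key="WORKS_AT")
G.add_edge("Bill", "Mark", key="FRIENDS_WITH", since=2021)

print(G["Bill"]["Mark"])  # {'FRIENDS_WITH': {'since': 2021}}
```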

A Knowledge Graph can then be defined as the graph representation of a corpus of documents coming from a coherent domain. But how exactly do we move from vector representations and vector databases to a Knowledge Graph?

Further, how do we even extract the key information to build a Knowledge Graph?

In this article, I will present my point of view on the subject, with code examples from a repository I developed while learning and experimenting with Knowledge Graphs. This repository is publicly available on my GitHub and contains:

  • the source code of the project
  • example notebooks written while building the repo
  • a Streamlit app to showcase work done until this point
  • a Dockerfile to build the image for this project without having to go through the manual installation of all the software needed to run it.

The article will present the repo in order to cover the following topics:

✅ Tech Stack Breakdown of the tools available, with a brief presentation of each of the components used to build the project.

✅ How to get the Demo up and running in your own local environment.

✅ How to perform the Ingestion Process of documents, including extracting concepts from them and assembling them into a Knowledge Graph.

✅ How to query the Graph, with a focus on the variety of possible strategies that can be employed to perform semantic search, graph query language generation and hybrid search.

If you are a Data Scientist, an ML/AI Engineer, or just someone curious about how to build smarter search systems, this guide will walk you through the full workflow with code, context, and clarity.


Tech Stack Breakdown

As a Data Scientist who started learning programming in 2019/20, my main language is of course Python. Here, I am using Python 3.12.

This project is built with a focus on open-source tools and free-tier accessibility, both on the storage side and in the choice of Large Language Models. This makes it a good starting point for newcomers or for those who do not want to pay for cloud infrastructure or OpenAI API keys.

The source code is, however, written with production use cases in mind — focusing not just on quick demos, but on how to transition a project to real-world deployment. The code is therefore designed to be easily customizable, modular, and extendable, so it could be adapted to your own data sources, LLMs, and workflows with minimal friction.

Below is a breakdown of the key components and how they work together. You can also read the repo’s README.md for further information on how to get up and running with the demo app.
