I Built an Open-Source Framework to Make LLM Data Extraction Dead Simple

May 2, 2025 - 19:24

After getting tired of writing endless boilerplate to extract structured data from documents with LLMs, I built ContextGem, a free, open-source framework that makes this radically easier.

What makes it different?

✅ Automated dynamic prompts and data modeling
✅ Precise reference mapping to source content
✅ Built-in justifications for extractions
✅ Nested context extraction
✅ Works with any LLM provider
It also comes with more built-in abstractions that save developer time.

Simple LLM extraction in just a few lines:

from contextgem import Aspect, Document, DocumentLLM

# Define what to extract
doc = Document(raw_text="Your document text here...")
doc.aspects = [
    Aspect(
        name="Intellectual property",
        description="Clauses on intellectual property rights",
    )
]

# Extract with any LLM
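# Replace the placeholders with your own provider/model identifier and API key
llm = DocumentLLM(model="<provider>/<model>", api_key="<your_api_key>")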
doc = llm.extract_all(doc)

# Get results
print(doc.aspects[0].extracted_items)
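
To make "reference mapping" and "justifications" more concrete, here is a minimal, framework-independent sketch of the idea. It is not ContextGem's implementation or API: `ExtractedItem`, `map_references`, and the sample text are hypothetical names made up for illustration, and the matching is a naive substring search rather than anything LLM-driven.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedItem:
    """Hypothetical container for one extraction result (not a ContextGem class)."""

    value: str
    justification: str
    reference_paragraphs: list[int] = field(default_factory=list)


def map_references(item: ExtractedItem, paragraphs: list[str]) -> ExtractedItem:
    """Record the index of every paragraph that contains the item's text."""
    needle = item.value.casefold()
    item.reference_paragraphs = [
        i for i, p in enumerate(paragraphs) if needle in p.casefold()
    ]
    return item


# Sample document, split into paragraphs on blank lines
document_text = (
    "This Agreement is made between Acme and Beta.\n\n"
    "All intellectual property created under this Agreement belongs to Acme.\n\n"
    "Either party may terminate with 30 days notice."
)
paragraphs = [p.strip() for p in document_text.split("\n\n") if p.strip()]

# An extracted item, as an LLM might return it, traced back to its source
item = map_references(
    ExtractedItem(
        value="intellectual property created under this Agreement belongs to Acme",
        justification="The clause assigns ownership of IP to one party.",
    ),
    paragraphs,
)
print(item.reference_paragraphs)  # [1]
```

Each result carries its value, the reasoning behind it, and a pointer back to the source paragraph, which is the kind of bookkeeping that otherwise has to be written by hand for every extraction task.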

ContextGem also features a native DOCX converter, support for multiple LLMs, and full serialization, all under the permissive Apache 2.0 license.

View project on GitHub: https://github.com/shcherbak-ai/contextgem

Try it out and let me know your thoughts!