Gemma 3 + MistralOCR + RAG Just Revolutionized Agent OCR Forever

Not a month ago, I made a video about Ollama-OCR, and many of you liked it.
One of my followers had a problem with an OCR chatbot and asked if I could help, and I thought this video might help many developers.
Good news! Mistral AI has released Mistral OCR, a new product billed as “the best OCR model in the world.”
Mistral OCR is an optical character recognition API that sets a new standard for document understanding. Unlike other models, Mistral OCR understands every document element (media, text, tables, formulas) with unprecedented accuracy. It takes images and PDFs as input and extracts their content as ordered, interleaved text and images.
Mistral OCR is, therefore, an ideal model to be used in conjunction with RAG systems that take multimodal documents such as slides or complex PDFs as input.
“But Gao, Mistral OCR alone is not enough to create a powerful OCR agent.”
I know, you are right. Google’s Gemma series of open-source models has just been updated: Gemma 3, optimized for multimodality and long context, has been released, and the performance of the 27B version is comparable to Gemini-1.5-Pro.
Google claims that Gemma 3 is the “world’s best single-accelerator model,” outperforming competitors such as Meta, DeepSeek, and OpenAI when running on a single-GPU host. The new model’s vision encoder has also been upgraded to support high-resolution and non-square images.
So, let me give you a quick demo of a live chatbot to show you what I mean.
Check my video on YouTube
I went to the Streamlit app and entered the API keys for the Mistral and Google APIs in the sidebar. If the keys are valid, the Mistral client is initialized and the Google API connection is checked.
Once the APIs are connected, I upload a PDF containing a table, an invoice, text, and charts. In this video I also demo an image input. After uploading the PDF, I click the Process PDF button, and the document is displayed in the sidebar. The app creates a temporary directory to manage files.
If anything goes wrong during the upload, it catches the exception and raises a ValueError with a clear message. If you upload an image instead, the app converts the image’s OCR output into Markdown and loops through each key-value pair in images_dict. After replacing all image placeholders, it returns the modified Markdown string with the embedded base64 images.
Then it processes multiple pages of OCR-extracted Markdown together with their respective images. It creates an empty markdowns list to store the processed Markdown from each page, iterates over page.images, mapping each image’s ID to its base64-encoded string, and appends the updated Markdown to the markdowns list. It then combines all processed Markdown sections, ensuring a clean separation between pages. Finally, it inspects the document source type to determine how to process the document.
By the end of this video, you will understand what makes Mistral OCR and Gemma 3 unique, how Gemma 3 is trained, and how we can use Gemma 3, Mistral OCR, and RAG to create a powerful OCR agent.
Mistral OCR is naturally multilingual and multimodal, and its lightweight design makes it much faster than similar models: a single node can process up to 2,000 pages of documents per minute, and self-hosted deployment options keep sensitive data on your own infrastructure.
What’s more, it can convert the extracted content into Markdown. This is a big deal because AI models parse Markdown easily, so they can understand the document’s structure and data much better.
What Makes Gemma 3 Unique?
Gemma 3 is not just about size; it also ships with many practical features. Gemma 3 delivers state-of-the-art performance for its size: in preliminary evaluations it supports over 35 languages out of the box and is pre-trained on over 140 languages. It can seamlessly analyze images, text, and short videos, and its massive 128K-token context window allows your applications to process and understand large amounts of data at once.
How Gemma 3 Is Trained
Gemma 3 uses distillation techniques and is optimized through reinforcement learning and model merging during pre-training and post-training.
This approach improves performance in mathematics, coding, and instruction following.
Moreover, Gemma 3 uses a brand-new tokenizer, supports more than 140 languages, and is trained on Google TPUs using the JAX framework: 2T tokens for the 1B model, 4T for the 4B model, 12T for the 12B model, and 14T for the 27B model.
In the post-training stage, Gemma 3 mainly uses four components:
Distillation from a larger instruction-tuned model into the Gemma 3 pre-trained checkpoint.
Reinforcement learning from human feedback (RLHF) to align model predictions with human preferences.
Reinforcement learning from machine feedback (RLMF) to enhance mathematical reasoning.
Reinforcement learning from execution feedback (RLEF) to improve coding ability.
These updates significantly improved the model’s math, programming, and instruction-following capabilities, allowing Gemma 3 to score 1338 on LMArena.
The instruction-tuned version of Gemma 3 uses the same dialogue format as Gemma 2, so developers do not need to update their tooling and can feed it plain-text input directly.
Let’s start coding
Let us now explore, step by step, how to build this powerful OCR agent. First, we install the libraries that support the models; for this, we run pip install on the requirements file:
pip install -r requirements.txt
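The exact contents of requirements.txt are not shown here, but based on the imports used below, a minimal version would contain roughly the following packages (the names are my assumption; pin versions as needed):
streamlit
mistralai
google-generativeai
pillow
python-dotenv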
Once installed, we import the important dependencies, such as mistralai, google-generativeai (Gemini), and streamlit.
import streamlit as st
import base64
import tempfile
import os
from mistralai import Mistral
from PIL import Image
import io
from mistralai import DocumentURLChunk, ImageURLChunk
from mistralai.models import OCRResponse
from dotenv import find_dotenv, load_dotenv
import google.generativeai as genai
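The snippets that follow also reference a few module-level names and helpers that are defined elsewhere in the full script: api_key, google_api_key, initialize_mistral_client, and test_google_api. As a rough idea of what they look like, here is a minimal sketch that relies on the imports above and assumes the keys are stored in a .env file as MISTRAL_API_KEY and GOOGLE_API_KEY; the exact bodies in the original app may differ.

# Load API keys from a .env file if present (assumption: keys live in
# MISTRAL_API_KEY and GOOGLE_API_KEY environment variables)
load_dotenv(find_dotenv())
api_key = os.getenv("MISTRAL_API_KEY", "")
google_api_key = os.getenv("GOOGLE_API_KEY", "")

def initialize_mistral_client(key: str):
    """Create a Mistral client and verify the key with a lightweight API call."""
    try:
        client = Mistral(api_key=key)
        client.models.list()  # raises if the key is invalid
        return client
    except Exception as e:
        st.sidebar.error(f"Failed to initialize Mistral client: {e}")
        return None

def test_google_api(key: str):
    """Configure the Gemini SDK and return (is_valid, message)."""
    try:
        genai.configure(api_key=key)
        list(genai.list_models())  # forces a real request; raises if the key is invalid
        return True, "connected successfully"
    except Exception as e:
        return False, str(e)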
Next, I designed the upload_pdf function to securely upload a PDF to Mistral’s OCR API and get a signed URL for further processing. I first check whether the client object is provided; if it’s None, I raise an error, since the function requires a properly initialized Mistral API client.
I create a temporary directory, define a file path, and write the PDF content to it. I then open the file in "rb" mode and upload it to the Mistral API using the client, specifying the filename, the content, and "ocr" as the purpose.
After a successful upload, I retrieve a signed URL with client.files.get_signed_url(), enabling access to the file. If an error occurs, I catch the exception and raise a ValueError with a clear message. Finally, I ensure cleanup by deleting the temporary file if it still exists.
OCR Processing Functions
def upload_pdf(client, content, filename):
    """Uploads a PDF to Mistral's API and retrieves a signed URL for processing."""
    if client is None:
        raise ValueError("Mistral client is not initialized")

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = os.path.join(temp_dir, filename)
        with open(temp_path, "wb") as tmp:
            tmp.write(content)

        try:
            with open(temp_path, "rb") as file_obj:
                file_upload = client.files.upload(
                    file={"file_name": filename, "content": file_obj},
                    purpose="ocr"
                )
            signed_url = client.files.get_signed_url(file_id=file_upload.id)
            return signed_url.url
        except Exception as e:
            raise ValueError(f"Error uploading PDF: {str(e)}")
        finally:
            if os.path.exists(temp_path):
                os.remove(temp_path)
Then I created the replace_images_in_markdown function, which takes a Markdown string and a dictionary mapping image names to base64-encoded images. I iterate through the dictionary, where each key is the name of an image placeholder and each value contains the corresponding base64 string.
I use .replace() to find each occurrence of ![img_name](img_name) in the Markdown and replace it with ![img_name](base64_str). This ensures that placeholders are converted into embedded base64 images. Finally, I return the updated Markdown string with the images replaced.
def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """Replace image placeholders with base64 encoded images in markdown."""
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
        )
    return markdown_str
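As a quick illustration (the image ID img-0.jpeg and the base64 payload below are made up for the example), the placeholder emitted by the OCR response is rewritten into an inline data URI that Markdown viewers can render:

md = "Here is the chart: ![img-0.jpeg](img-0.jpeg)"
images = {"img-0.jpeg": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}
print(replace_images_in_markdown(md, images))
# Here is the chart: ![img-0.jpeg](data:image/jpeg;base64,/9j/4AAQSkZJRg...)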
Let’s define the get_combined_markdown function to process multiple pages of OCR-extracted Markdown and their images. I create an empty markdowns list to store the processed Markdown from each page, then iterate through ocr_response.pages, collecting image data by mapping image IDs to their base64-encoded representations.
I replace image placeholders in each page’s Markdown using replace_images_in_markdown and append the modified content to markdowns. Finally, I join all processed Markdown sections with "\n\n".join(markdowns), ensuring a clear separation between pages.
def get_combined_markdown(ocr_response: OCRResponse) -> str:
    """Combine markdown from all pages with their respective images."""
    markdowns: list[str] = []
    for page in ocr_response.pages:
        image_data = {}
        for img in page.images:
            image_data[img.id] = img.image_base64
        markdowns.append(replace_images_in_markdown(page.markdown, image_data))
    return "\n\n".join(markdowns)
Then, I create the process_ocr function. It checks whether the client is provided; if not, I raise an error, because an initialized Mistral client is required. I inspect document_source to determine whether to process a document URL or an image URL: if the type is "document_url", I call client.ocr.process() with a DocumentURLChunk, and if it is "image_url", I use an ImageURLChunk.
I specify "mistral-ocr-latest" as the model and enable include_image_base64=True to include base64-encoded images in the response. If the source type is unrecognized, I raise a ValueError with a clear error message.
def process_ocr(client, document_source):
    """Process document with OCR API based on source type"""
    if client is None:
        raise ValueError("Mistral client is not initialized")

    if document_source["type"] == "document_url":
        return client.ocr.process(
            document=DocumentURLChunk(document_url=document_source["document_url"]),
            model="mistral-ocr-latest",
            include_image_base64=True
        )
    elif document_source["type"] == "image_url":
        return client.ocr.process(
            document=ImageURLChunk(image_url=document_source["image_url"]),
            model="mistral-ocr-latest",
            include_image_base64=True
        )
    else:
        raise ValueError(f"Unsupported document source type: {document_source['type']}")
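For the image branch, the image_url value is a base64 data URI built from the uploaded file, exactly as the Streamlit app does later. As a minimal stand-alone sketch (the file name local_chart.png is just a placeholder):

# Encode a local image as a base64 data URI and run OCR on it
mistral_client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))
with open("local_chart.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

image_source = {"type": "image_url", "image_url": f"data:image/png;base64,{img_b64}"}
image_ocr = process_ocr(mistral_client, image_source)
print(image_ocr.pages[0].markdown)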
Let’s now bring in the Google Gemini API, configuring it with an API key inside the generate_response function. I check whether the context is empty or too short (fewer than 10 characters) and return an error message if so. I then build a prompt that includes the document’s content and the user’s query to guide the model’s response.
I configure the model with parameters such as temperature, top_p, and safety settings, and generate the response with model.generate_content(). If an error occurs, I catch it, print the error details, and return an error message.
def generate_response(context, query):
    """Generate a response using Google Gemini API"""
    try:
        # Initialize the Google Gemini API
        genai.configure(api_key=google_api_key)

        # Check for empty context
        if not context or len(context) < 10:
            return "Error: No document content available to answer your question."

        # Create a prompt with the document content and query
        prompt = f"""I have a document with the following content:
{context}
Based on this document, please answer the following question:
{query}
If you can find information related to the query in the document, please answer based on that information.
If the document doesn't specifically mention the exact information asked, please try to infer from related content or clearly state that the specific information isn't available in the document.
"""

        # Print for debugging
        print(f"Sending prompt with {len(context)} characters of context")
        print(f"First 500 chars of context: {context[:500]}...")

        # Generate response
        model = genai.GenerativeModel('gemma-3-27b-it')
        generation_config = {
            "temperature": 0.4,
            "top_p": 0.8,
            "top_k": 40,
            "max_output_tokens": 2048,
        }
        safety_settings = [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"},
        ]
        response = model.generate_content(
            prompt,
            generation_config=generation_config,
            safety_settings=safety_settings
        )
        return response.text
    except Exception as e:
        print(f"Error generating response: {str(e)}")
        import traceback
        print(traceback.format_exc())
        return f"Error generating response: {str(e)}"
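Before wiring everything into Streamlit, here is how the pieces compose for a quick end-to-end test. This is a minimal sketch, assuming a valid MISTRAL_API_KEY and GOOGLE_API_KEY in the environment; sample.pdf and the question are placeholders:

# End-to-end smoke test: upload a PDF, run OCR, then ask Gemma 3 a question
mistral_client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))
google_api_key = os.getenv("GOOGLE_API_KEY")

with open("sample.pdf", "rb") as f:
    pdf_bytes = f.read()

signed_url = upload_pdf(mistral_client, pdf_bytes, "sample.pdf")
ocr_response = process_ocr(mistral_client, {"type": "document_url", "document_url": signed_url})

# Use the page-level Markdown as the chat context
context = "\n\n".join(page.markdown for page in ocr_response.pages)
print(generate_response(context, "What is the total amount on the invoice?"))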
Then we create the Streamlit app so users can upload documents or images for OCR processing. Users provide API keys for the Mistral and Google APIs via the sidebar; if the keys are valid, the Mistral client is initialized and the Google API is checked. Documents can be supplied as a PDF upload, an image upload, or a URL. The app processes the content with OCR, extracting the text from each page and storing it for later use.
Once a document is loaded, users can ask questions about its content, and the app generates responses using the Google Gemini API. All chat messages are stored in the session state. Streamlit also handles errors, such as missing API keys or processing failures, and shows warnings when setup is incomplete.
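The main() function below also calls a display_pdf helper to preview the uploaded PDF, which is not shown in this section. A common way to do this in Streamlit is to embed the file as a base64 data URI in an iframe; here is a minimal sketch (the implementation details are my assumption):

def display_pdf(pdf_path):
    """Render a local PDF inside the Streamlit page via an embedded iframe."""
    with open(pdf_path, "rb") as f:
        base64_pdf = base64.b64encode(f.read()).decode("utf-8")
    pdf_display = (
        f'<iframe src="data:application/pdf;base64,{base64_pdf}" '
        f'width="100%" height="600" type="application/pdf"></iframe>'
    )
    st.markdown(pdf_display, unsafe_allow_html=True)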
def main():
    global api_key, google_api_key
    st.set_page_config(page_title="Document OCR & Chat", layout="wide")

    # Sidebar: Authentication for API keys
    with st.sidebar:
        st.header("Settings")

        # API key inputs
        api_key_tab1, api_key_tab2 = st.tabs(["Mistral API", "Google API"])
        with api_key_tab1:
            # Get Mistral API key from environment or user input
            user_api_key = st.text_input("Mistral API Key", value=api_key if api_key else "", type="password")
            if user_api_key:
                api_key = user_api_key
                os.environ["MISTRAL_API_KEY"] = api_key
        with api_key_tab2:
            # Get Google API key
            user_google_api_key = st.text_input(
                "Google API Key",
                value=google_api_key if google_api_key else "",
                type="password",
                help="API key for Google Gemini to use for response generation"
            )
            if user_google_api_key:
                google_api_key = user_google_api_key
                os.environ["GOOGLE_API_KEY"] = google_api_key

        # Initialize Mistral client with the API key
        mistral_client = None
        if api_key:
            mistral_client = initialize_mistral_client(api_key)
            if mistral_client:
                st.sidebar.success("✅ Mistral API connected successfully")

        # Google API key validation
        if google_api_key:
            is_valid, message = test_google_api(google_api_key)
            if is_valid:
                st.sidebar.success(f"✅ Google API {message}")
            else:
                st.sidebar.error(f"❌ Google API: {message}")
                google_api_key = None

        # Display warnings for missing API keys
        if not api_key or mistral_client is None:
            st.sidebar.warning("⚠️ Valid Mistral API key required for document processing")
        if not google_api_key:
            st.sidebar.warning("⚠️ Google API key required for chat functionality")

        # Initialize session state
        if "messages" not in st.session_state:
            st.session_state.messages = []
        if "document_content" not in st.session_state:
            st.session_state.document_content = ""
        if "document_loaded" not in st.session_state:
            st.session_state.document_loaded = False

        # Document upload section
        st.subheader("Document Upload")

        # Only show document upload if Mistral client is initialized
        if mistral_client:
            input_method = st.radio("Select Input Type:", ["PDF Upload", "Image Upload", "URL"])
            document_source = None

            if input_method == "URL":
                url = st.text_input("Document URL:")
                if url and st.button("Load Document from URL"):
                    document_source = {
                        "type": "document_url",
                        "document_url": url
                    }
            elif input_method == "PDF Upload":
                uploaded_file = st.file_uploader("Choose PDF file", type=["pdf"])
                if uploaded_file and st.button("Process PDF"):
                    content = uploaded_file.read()
                    # Save the uploaded PDF temporarily for display purposes
                    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                        tmp.write(content)
                        pdf_path = tmp.name
                    try:
                        # Prepare document source for OCR processing
                        document_source = {
                            "type": "document_url",
                            "document_url": upload_pdf(mistral_client, content, uploaded_file.name)
                        }
                        # Display the uploaded PDF
                        st.header("Uploaded PDF")
                        display_pdf(pdf_path)
                    except Exception as e:
                        st.error(f"Error processing PDF: {str(e)}")
                    # Clean up the temporary file
                    if os.path.exists(pdf_path):
                        os.unlink(pdf_path)
            elif input_method == "Image Upload":
                uploaded_image = st.file_uploader("Choose Image file", type=["png", "jpg", "jpeg"])
                if uploaded_image and st.button("Process Image"):
                    try:
                        # Display the uploaded image
                        image = Image.open(uploaded_image)
                        st.image(image, caption="Uploaded Image", use_column_width=True)
                        # Convert image to base64
                        buffered = io.BytesIO()
                        image.save(buffered, format="PNG")
                        img_str = base64.b64encode(buffered.getvalue()).decode()
                        # Prepare document source for OCR processing
                        document_source = {
                            "type": "image_url",
                            "image_url": f"data:image/png;base64,{img_str}"
                        }
                    except Exception as e:
                        st.error(f"Error processing image: {str(e)}")

            # Process document if source is provided
            if document_source:
                with st.spinner("Processing document..."):
                    try:
                        ocr_response = process_ocr(mistral_client, document_source)
                        if ocr_response and ocr_response.pages:
                            # Extract all text without page markers for clean content
                            raw_content = []
                            for page in ocr_response.pages:
                                page_content = page.markdown.strip()
                                if page_content:  # Only add non-empty pages
                                    raw_content.append(page_content)
                            # Join all content into one clean string for the model
                            final_content = "\n\n".join(raw_content)
                            # Also create a display version with page numbers for the UI
                            display_content = []
                            for i, page in enumerate(ocr_response.pages):
                                page_content = page.markdown.strip()
                                if page_content:
                                    display_content.append(f"Page {i+1}:\n{page_content}")
                            display_formatted = "\n\n----------\n\n".join(display_content)
                            # Store both versions
                            st.session_state.document_content = final_content  # Clean version for the model
                            st.session_state.display_content = display_formatted  # Formatted version for display
                            st.session_state.document_loaded = True
                            # Show success information about extracted content
                            st.success(f"Document processed successfully! Extracted {len(final_content)} characters from {len(raw_content)} pages.")
                        else:
                            st.warning("No content extracted from document.")
                    except Exception as e:
                        st.error(f"Processing error: {str(e)}")

    # Main area: Display chat interface
    st.title("Document OCR & Chat")

    # Document preview area
    if "document_loaded" in st.session_state and st.session_state.document_loaded:
        with st.expander("Document Content", expanded=False):
            # Show the display version with page numbers
            if "display_content" in st.session_state:
                st.markdown(st.session_state.display_content)
            else:
                st.markdown(st.session_state.document_content)

        # Chat interface
        st.subheader("Chat with your document")

        # Display chat messages
        for message in st.session_state.messages:
            with st.chat_message(message["role"]):
                st.markdown(message["content"])

        # Input for user query
        if prompt := st.chat_input("Ask a question about your document..."):
            # Check if Google API key is available
            if not google_api_key:
                st.error("Google API key is required for generating responses. Please add it in the sidebar settings.")
            else:
                # Add user message to chat history
                st.session_state.messages.append({"role": "user", "content": prompt})
                # Display user message
                with st.chat_message("user"):
                    st.markdown(prompt)
                # Show thinking spinner
                with st.chat_message("assistant"):
                    with st.spinner("Thinking..."):
                        # Get document content from session state
                        document_content = st.session_state.document_content
                        # Generate response directly
                        response = generate_response(document_content, prompt)
                        # Display response
                        st.markdown(response)
                # Add assistant message to chat history
                st.session_state.messages.append({"role": "assistant", "content": response})
    else:
        # Show a welcome message if no document is loaded
        st.info("