How to Run a Local Model for Text Recognition in Images

Want to extract text from images without relying on cloud services?

You can run a powerful optical character recognition (OCR) model right on your own computer. This local approach gives you full control over the process and keeps your data private. In this article, we'll walk you through setting up and using a popular open-weight vision model as a local OCR engine. You'll learn how to install the necessary tools, pull a pre-trained model, and process images to recognize text in various languages. Whether you're working on a personal project or developing an application, this guide will help you get started with local text recognition quickly and easily.

This guide uses Windows 11, the Ollama model runner, the Llama 3.2 Vision model, and Python. Let's get started!

1. Install Ollama

First, head to https://ollama.com/download. Download the installer (it's about 768 MB) and run it to install Ollama.
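
Once the installation finishes, you can confirm that Ollama is available by opening a command prompt and checking its version:

ollama --version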

2. Pull the Llama 3.2 Vision Model

Open your command prompt or terminal. We'll download the Llama 3.2 Vision model using Ollama. You have two size options.

The 11B and 90B labels refer to the size of the two Llama 3.2 Vision models, i.e., the number of parameters in each:

  • 11B model: This is the smaller version with 11 billion parameters.
  • 90B model: This is the larger version with 90 billion parameters.

Both models are designed for multimodal tasks, capable of processing both text and images. They excel in various applications such as:

  • Document-level understanding
  • Chart and graph analysis
  • Image captioning
  • Visual grounding
  • Visual question answering

The choice between the 11B and 90B models depends on the specific use case, available computational resources, and the desired level of performance for complex visual reasoning tasks.

For the smaller model (11B), which needs at least 8 GB of VRAM (video memory):

ollama pull llama3.2-vision:11b

For the larger model (90B), which needs a whopping 64 GB of VRAM:

ollama pull llama3.2-vision:90b

For home use, running the 90B model locally is extremely challenging due to its massive hardware requirements.
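
After the download completes, you can confirm that the model is available locally by listing your installed models:

ollama list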

3. Run the Model

Once the model is downloaded, run it locally with:

ollama run llama3.2-vision
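
You can type /bye to leave the interactive session; the Ollama service keeps running in the background and exposes a local HTTP API (by default at http://localhost:11434), which the Python code below relies on. A quick way to confirm the service is up:

curl http://localhost:11434

It should respond with "Ollama is running".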

4. Install ollama-ocr

To easily process images, we'll use the ollama-ocr Python library. Install it using pip:

pip install ollama-ocr

5. Python Code for OCR

Here's the Python code to recognize text in an image:

from ollama_ocr import OCRProcessor

# Point the processor at the locally pulled Llama 3.2 Vision model
ocr = OCRProcessor(model_name='llama3.2-vision:11b')

# Recognize text in the image and return it as plain text
result = ocr.process_image(
    image_path="./your_image.jpg",
    format_type="text"
)
print(result)

6. Run the Code

Replace "./your_image.jpg" with the actual path to your image file. Save the code as a .py file (e.g., ocr_script.py). Run the script from your command prompt:

python ocr_script.py

The script will send the image to your locally running Llama 3.2 Vision model, and the recognized text will be printed in your terminal.
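
If you'd rather not add an extra dependency, here is a minimal alternative sketch that talks to the model through the official ollama Python package (installed with pip install ollama) instead of ollama-ocr. The prompt text and image path are placeholders to adapt to your own files:

import ollama

# Ask the locally running Llama 3.2 Vision model to read the image
response = ollama.chat(
    model='llama3.2-vision:11b',
    messages=[{
        'role': 'user',
        'content': 'Recognize and return all text visible in this image.',
        'images': ['./your_image.jpg'],  # path to your local image file
    }]
)

print(response['message']['content'])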

To complement our guide on using Llama 3.2 Vision locally, we conducted performance tests on a home desktop computer. Here are the results:

Performance Test Results

We ran the Llama 3.2 Vision 11B model on a home desktop with the following specifications:

  • Processor: 13th Gen Intel(R) Core(TM) i7-13700K
  • Graphics Card: Gigabyte RTX 3060 Gaming OC 12G
  • RAM: 64.0 GB DDR4
  • Operating System: Windows 11 Pro 24H2

Image for Testing

For testing, we chose this amusing image.

A meme image used to test the locally run Llama 3.2 Vision 11B model

Test Output

Using our Python script, we tasked the model with recognizing text in an image using the standard system prompt. After running the script multiple times on a single test image, we observed processing times ranging from 16.78 to 47.23 seconds. It's worth noting that these results were achieved with the graphics card running at default settings, without any additional tuning or optimizations.
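
If you want to reproduce these measurements, a simple approach is to wrap the OCR call with Python's time.perf_counter(). Here is a minimal sketch based on the script above:

import time
from ollama_ocr import OCRProcessor

ocr = OCRProcessor(model_name='llama3.2-vision:11b')

# Measure the wall-clock time of a single recognition pass
start = time.perf_counter()
result = ocr.process_image(image_path="./your_image.jpg", format_type="text")
elapsed = time.perf_counter() - start

print(result)
print(f"Processing time: {elapsed:.2f} seconds")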

The model successfully analyzed the black-and-white meme, which has two panels with stick figures and speech bubbles. Here is the model's full output:

The image is a black-and-white meme featuring two panels with stick figures and speech bubbles.

**Panel 1:**
In the first panel, a stick figure on the left side of the image has its arms outstretched towards another stick figure in the center. The central figure holds a large circle labeled "WEEKEND" in bold white letters. The stick figure on the right side of the image is partially cut off by the edge of the frame.

**Panel 2:**
In the second panel, the same two stick figures are depicted again. However, this time, the central figure now holds a smaller circle labeled "MONDAY" instead of "WEEKEND." The stick figure on the left side of the image has its arms outstretched towards the central figure once more.

**Text and Labels:**
The text in both panels is presented in white letters with bold outlines. In the first panel, the labels read:

* "ME" (on the stick figure's chest)
* "WEEKEND" (inside the large circle)

In the second panel, the labels are:

* "MONDAY" (inside the smaller circle)
* "ME" (on the stick figure's chest)

**Overall:**
The meme humorously portrays the anticipation and excitement of approaching the weekend, as well as the disappointment that follows when it arrives. The use of simple yet expressive stick figures and speech bubbles effectively conveys this sentiment in a relatable and entertaining manner.

Conclusion

That's it! You're now running a local image text recognition system using Ollama and Python. Remember to experiment with different images and adjust your approach as needed for best results.

You can find the scripts referenced in this article in the repository at https://github.com/karavanjo/dev-content/tree/main/llama-local-run.

A demonstration of the model running is available in the video at the link.