Visual Grounding from Docling!

My experience testing Docling's “Visual Grounding” feature on an Intel CPU for a demonstration.


Introduction

In discussions with our business partners, we always try to propose tools and solutions that answer specific business requirements. In a recent talk about a specific use case, I discovered “visual grounding” among the other capabilities of the Docling set of tools.

“Visual grounding”, in general terms, is a task that aims to locate objects in an image based on a natural language query. Along with image captioning, visual question answering, and content-based image retrieval, this task links image data with the text modality. Docling offers this feature, and a sample notebook is provided as an example.

Implementation

In order to test the visual grounding use case, I tried to run the sample notebook, but as a plain Python app.
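Before any retrieval happens, the example first converts the source document while keeping the rasterized page images in memory, since those images are what the grounding bounding boxes are later drawn on. Below is a minimal sketch of that conversion step; the pipeline options mirror what the documentation shows, while the source URL is just a placeholder:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Keep rasterized page images in memory: visual grounding later
# draws bounding boxes directly on these images.
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Placeholder source: any PDF URL or local path works here.
result = converter.convert("https://arxiv.org/pdf/2408.09869")
doc = result.document
```

This step runs entirely on CPU; no GPU is required for the conversion itself.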

Excerpt from the official documentation:

This example showcases Docling’s visual grounding capabilities, which can be combined with any agentic AI / RAG framework.
In this instance, we illustrate these capabilities leveraging the LangChain Docling integration, along with a Milvus vector store, as well as sentence-transformers embeddings.

The documentation also mentions that…
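For context, here is a condensed sketch of how those three pieces plug together. The package names are the PyPI ones as I understand the example stack (langchain-docling, langchain-milvus, langchain-huggingface), and the embedding model ID, file path, and Milvus URI are illustrative assumptions rather than values prescribed by the documentation:

```python
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus

# sentence-transformers embeddings; the model ID is illustrative.
embedding = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# DOC_CHUNKS export keeps per-chunk metadata ("dl_meta"), which
# carries the doc-item provenance (page number and bounding box)
# that the visual grounding step maps back onto the page images.
loader = DoclingLoader(
    file_path="https://arxiv.org/pdf/2408.09869",  # illustrative source
    export_type=ExportType.DOC_CHUNKS,
)
docs = loader.load()

# Index the chunks in a local Milvus Lite database file.
vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embedding,
    connection_args={"uri": "./milvus_demo.db"},
    drop_old=True,
)

# A natural-language query returns the matching chunks; their
# "dl_meta" metadata is what grounds each answer visually.
for hit in vectorstore.similarity_search("Which AI models does Docling use?", k=3):
    print(hit.page_content[:80], "->", list(hit.metadata.keys()))
```

On a CPU-only machine like mine, the embedding model simply runs on CPU, which is slower than on a GPU but perfectly workable for a demo of this size.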