
Apr 13, 2025 - 08:12
A beginner's guide to the Grounding-Dino model by Adirik on Replicate

This is a simplified guide to an AI model called Grounding-Dino maintained by Adirik. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.

Model overview

grounding-dino is an AI model that can detect arbitrary objects in images using human text inputs such as category names or referring expressions. It combines a Transformer-based detector called DINO with grounded pre-training to achieve open-vocabulary and text-guided object detection. The model was developed by IDEA Research and is available as a Cog model on Replicate.

Similar models include GroundingDINO, which uses the same Grounding DINO approach, as well as other vision models on Replicate like stable-diffusion and text-extract-ocr.

Model inputs and outputs

grounding-dino takes an image and a comma-separated list of text queries describing the objects you want to detect. It then outputs the detected objects with bounding boxes and predicted labels. The model also allows you to adjust the confidence thresholds for the box and text predictions.

Inputs

  • image: The input image to query
  • query: Comma-separated text queries describing the objects to detect
  • box_threshold: Confidence level threshold for object detection
  • text_threshold: Confidence level threshold for predicted labels
  • show_visualisation: Option to draw and visualize the bounding boxes on the image

Outputs

  • Detected objects with bounding boxes and predicted labels
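To make the inputs above concrete, here is a minimal sketch of calling the model with Replicate's official Python client. The model slug `adirik/grounding-dino`, the example image URL, and the default threshold values are assumptions for illustration; check the model page on Replicate for the exact identifier, defaults, and output schema:

```python
# Sketch: querying grounding-dino on Replicate (model slug assumed).
# Requires `pip install replicate` and REPLICATE_API_TOKEN set in the environment.

def build_input(image_url, queries, box_threshold=0.35, text_threshold=0.25,
                show_visualisation=True):
    """Assemble the input payload described in this guide.

    queries: a list of object descriptions, joined into the single
    comma-separated string the model expects.
    """
    return {
        "image": image_url,
        "query": ", ".join(queries),
        "box_threshold": box_threshold,
        "text_threshold": text_threshold,
        "show_visualisation": show_visualisation,
    }

if __name__ == "__main__":
    import replicate  # official Replicate Python client

    payload = build_input(
        "https://example.com/street.jpg",  # hypothetical image URL
        ["person", "bicycle", "traffic light"],
    )
    # Returns the detected objects with bounding boxes and predicted
    # labels; the exact response shape depends on the model version.
    output = replicate.run("adirik/grounding-dino", input=payload)
    print(output)
```

Raising `box_threshold` trades recall for precision on the boxes themselves, while `text_threshold` controls how strictly a box must match one of your text queries to receive that label.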

Capabilities

grounding-dino can detect a wide variety of objects in images, guided only by the natural-language text queries you provide.

Click here to read the full guide to Grounding-Dino