AI still can’t count. I built a dataset to prove it: VisQuant

I’ve been experimenting with GPT-4V, Claude, and Gemini and realized something strange:
They can describe art. Solve riddles. Explain GPTs.
But ask: “How many pencils are on the table?”
Or “Which object is left of the cup?”
And they fall apart.
So I built a benchmark to test that specifically:
What is VisQuant?
- 100 synthetic images
- 40+ everyday object types
- Labeled object counts and spatial layout
- 2 reasoning Q&A pairs per image
- Grounded annotations in JSON and CSV (see the record sketch after this list)
- Baseline tested on GPT-4V
- Entirely open-source
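
To make the structure concrete, here's a minimal sketch of what one annotation record could look like. The field names (`image`, `counts`, `objects`, `qa`) and the bbox convention are illustrative assumptions, not the dataset's actual schema; check the repo's JSON/CSV files for the real keys.

```python
import json

# Hypothetical single-image annotation record.
# Field names are illustrative assumptions, not the actual VisQuant schema.
record = {
    "image": "scene_0042.png",
    "counts": {"pencil": 3, "cup": 1, "book": 2},        # labeled object counts
    "objects": [                                          # spatial layout, one entry per object
        {"type": "cup",    "bbox": [120, 80, 60, 70]},    # bbox assumed as [x, y, w, h]
        {"type": "pencil", "bbox": [20, 95, 80, 12]},
    ],
    "qa": [                                               # 2 reasoning Q&A pairs per image
        {"question": "How many pencils are on the table?", "answer": "3"},
        {"question": "Which object is left of the cup?",   "answer": "pencil"},
    ],
}

print(json.dumps(record, indent=2))
```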
What It Tests
VisQuant isolates the visual intelligence primitives that models often skip over (a scoring sketch follows this list):
- Counting
- Spatial relationships
- Left/right/stacked inference
- Multi-hop VQA from structured scenes
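
Because every scene is grounded, these primitives can be checked programmatically. Below is a hedged sketch of how one might score a model's count answers and derive a left-of relation from bounding boxes; it reuses the hypothetical record format above and is not the benchmark's official evaluation code.

```python
def is_left_of(obj_a, obj_b):
    """Assuming bbox = [x, y, w, h]: a is 'left of' b if a's right edge ends before b's left edge."""
    ax, _, aw, _ = obj_a["bbox"]
    bx = obj_b["bbox"][0]
    return ax + aw <= bx

def score_count_answer(record, obj_type, model_answer):
    """Exact-match scoring of a counting answer against the grounded label."""
    gold = record["counts"].get(obj_type, 0)
    try:
        return int(str(model_answer).strip()) == gold
    except ValueError:
        return False

# Usage, continuing the hypothetical record sketched above:
# cup, pencil = record["objects"][0], record["objects"][1]
# is_left_of(pencil, cup)                    # True: the pencil's bbox sits left of the cup's
# score_count_answer(record, "pencil", "3")  # True: matches the labeled count
```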
Why?
Because existing benchmarks like VQAv2 and GQA are large and noisy, and these specific weaknesses get lost in their aggregate scores.
VisQuant is small, clean, focused — and it exposes real gaps in model reasoning.
Get It: