AI still can’t count. I built a dataset to prove it: VisQuant

I’ve been experimenting with GPT-4V, Claude, and Gemini and realized something strange:

They can describe art. Solve riddles. Explain GPTs.
But ask: “How many pencils are on the table?”
Or “Which object is left of the cup?”
And they fall apart.

So I built a benchmark to test exactly that:

What is VisQuant?

  • 100 synthetic images
  • 40+ everyday object types
  • Labeled object counts and spatial layout
  • 2 reasoning Q&A pairs per image
  • Grounded annotations in JSON and CSV (a simplified sample record is sketched below)
  • Baseline tested on GPT-4V
  • Entirely open-source
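
Each image ships with a grounded annotation record: per-image object counts, the spatial layout, and two reasoning Q&A pairs. As a simplified sketch (field names here are illustrative, not the canonical schema; the real one lives in the JSON/CSV files), a single record looks roughly like this:

```python
import json

# Simplified sketch of one annotation record (illustrative field names):
# per-image object counts, a grounded spatial layout, and two Q&A pairs.
example_record = {
    "image": "scene_017.png",
    "counts": {"pencil": 3, "cup": 1, "book": 2},
    "layout": [
        {"object": "cup",    "x": 0.62, "y": 0.48},
        {"object": "pencil", "x": 0.21, "y": 0.50},
        {"object": "pencil", "x": 0.33, "y": 0.47},
        {"object": "pencil", "x": 0.70, "y": 0.52},
        {"object": "book",   "x": 0.15, "y": 0.55},
        {"object": "book",   "x": 0.85, "y": 0.44},
    ],
    "qa": [
        {"question": "How many pencils are on the table?",  "answer": "3"},
        {"question": "How many pencils are left of the cup?", "answer": "2"},
    ],
}

# The full annotation file loads the usual way (file name illustrative):
# with open("annotations.json") as f:
#     records = json.load(f)
```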

What It Tests
VisQuant isolates the visual intelligence primitives models often skip over:

  • Counting
  • Spatial relationships
  • Left/right/stacked inference
  • Multi-hop VQA from structured scenes (see the sketch after this list)
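
To make the multi-hop part concrete: every spatial answer can be derived mechanically from the grounded layout, so there is no label ambiguity to hide behind. A minimal sketch, reusing the illustrative record above and assuming x grows to the right, of how a "left of" count resolves:

```python
def count_left_of(layout, target, anchor):
    """How many `target` objects sit left of the (single) `anchor` object?

    Multi-hop: first locate the anchor, then filter the scene by relative
    position, then count. Assumes x increases to the right, as in the
    illustrative layout above.
    """
    anchor_x = next(o["x"] for o in layout if o["object"] == anchor)
    return sum(1 for o in layout if o["object"] == target and o["x"] < anchor_x)

# With the illustrative record above:
# count_left_of(example_record["layout"], "pencil", "cup")  -> 2
```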

Why?
Because current benchmarks like VQAv2 and GQA are messy and noisy, and they hide exactly these weaknesses.
VisQuant is small, clean, and focused: it exposes real gaps in model reasoning.
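
The GPT-4V baseline boils down to asking each question against its image and comparing the reply to the grounded answer. A stripped-down sketch of that loop using the OpenAI Python SDK (prompt wording, scoring, and file names are simplified here, not the exact harness):

```python
import base64
import json
import re

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(image_path: str, question: str, model: str = "gpt-4-vision-preview") -> str:
    """Send one image plus one question to a vision-capable chat model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,  # GPT-4V; swap in any other vision-capable model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question + " Answer with a single word or number."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

def is_correct(prediction: str, answer: str) -> bool:
    """Crude scoring: lowercase, strip punctuation, exact match on the first token."""
    def norm(s: str) -> str:
        tokens = s.lower().split()
        return re.sub(r"[^a-z0-9]", "", tokens[0]) if tokens else ""
    return norm(prediction) == norm(answer)

# with open("annotations.json") as f:          # file name illustrative
#     records = json.load(f)
# accuracy = sum(is_correct(ask(r["image"], qa["question"]), qa["answer"])
#                for r in records for qa in r["qa"]) / (2 * len(records))
```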

Get It: