Proof or Bluff? Why Today's AI Still Fails the Math Olympiad Test

Can today’s most advanced AI models really solve math like a human genius? Recent math benchmarks have shown impressive results on problems like those in the AIME or HMMT competitions. But these tasks mostly need final answers — not full, rigorous proofs.
That’s where the new study, “Proof or Bluff?” from ETH Zurich and INSAIT, steps in. Researchers challenged top-tier language models, including Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3, with the 2025 USAMO (USA Mathematical Olympiad), a competition famous for demanding deep thinking and bulletproof logic.
The Verdict? AI Still Flops on Hard Math
Even the best model, Gemini 2.5 Pro, scored only 10.1 out of 42 points, roughly 24% of the maximum. Every other model averaged below 5%. That is nowhere near human Olympiad-level performance.
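For reference, the USAMO awards up to 7 points on each of its 6 problems, so 42 is the maximum total. A minimal sketch of the conversion (my own illustration, not code from the paper):

```python
# USAMO scoring: 6 problems, each graded on a 0-7 scale, for a 42-point maximum.
MAX_SCORE = 6 * 7  # 42

def percent_of_max(total_points: float) -> float:
    """Convert a (possibly averaged) total score into a percentage of the maximum."""
    return 100.0 * total_points / MAX_SCORE

print(f"{percent_of_max(10.1):.1f}%")  # -> 24.0%, the best result reported above
```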
Why They Failed
Human judges (all former IMO finalists) identified four common failure patterns (a toy annotation sketch follows the list):
Flawed logic: Skipping reasoning steps or drawing false conclusions.
Wrong assumptions: Using unsupported ideas to bridge gaps.
Low creativity: Sticking to one (wrong) strategy across multiple runs.
Hallucinations: Making up citations or boxing trivial answers due to training biases.
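To make the taxonomy concrete, here is a toy annotation schema of the kind a grading pipeline might use. The enum values mirror the four categories above, but the class and field names are hypothetical, not the paper's.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureMode(Enum):
    """The four recurring failure patterns reported by the human judges."""
    FLAWED_LOGIC = auto()       # skipped steps or false conclusions
    WRONG_ASSUMPTION = auto()   # unsupported claims used to bridge gaps
    LOW_CREATIVITY = auto()     # same flawed strategy repeated across runs
    HALLUCINATION = auto()      # invented citations or needlessly boxed answers

@dataclass
class GradedSolution:
    model: str
    problem: int
    score: int                      # 0-7, USAMO-style rubric
    failures: list[FailureMode]     # annotations attached by the grader

example = GradedSolution(
    model="some-llm", problem=3, score=1,
    failures=[FailureMode.WRONG_ASSUMPTION, FailureMode.HALLUCINATION],
)
```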
More striking still, many models confidently claimed they had solved the problem even when their reasoning was clearly broken.
The Hidden Bias of Optimization
Training methods based on reinforcement learning (RLHF, GRPO) push models to "box the final answer" even when a box isn't appropriate, such as in a proof problem that has no single numeric answer. Worse, models like QwQ and Gemini fabricated academic-sounding theorems that don't exist, apparently just to sound convincing.
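To see why that incentive exists, here is an illustrative toy reward of the answer-matching kind often used in RL fine-tuning for final-answer benchmarks. It is a sketch of the bias, not the actual training objective of any model named above.

```python
import re

def toy_answer_reward(response: str, reference_answer: str) -> float:
    """Reward 1.0 only if the response contains a \\boxed{...} matching the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no boxed answer, no reward
    return 1.0 if match.group(1).strip() == reference_answer else 0.0

# A valid proof with no single numeric answer earns nothing under this reward,
# while any response that boxes the right string gets full credit:
print(toy_answer_reward("The claim follows by induction on n.", "42"))   # 0.0
print(toy_answer_reward("... hence the answer is \\boxed{42}.", "42"))   # 1.0
```

A model optimized against a signal like this learns that a boxed string is what gets paid, which fits the paper's observation of boxed answers showing up even in pure proof problems.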
Automated Grading? Not Yet.
The team also tried using LLMs to grade the solutions automatically. It's an appealing idea, but the automated scores were inflated by up to 20x: the judge models couldn't tell a shallow bluff from real insight.
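For illustration, a minimal sketch of the LLM-as-judge setup, where `call_model` is a hypothetical stand-in for whatever chat-completion client you use:

```python
def grade_with_llm(call_model, problem: str, solution: str) -> int:
    """Ask a judge model for a 0-7 USAMO-style score and trust it verbatim."""
    prompt = (
        "You are grading a USAMO solution on a 0-7 scale.\n"
        f"Problem:\n{problem}\n\n"
        f"Candidate solution:\n{solution}\n\n"
        "Reply with a single integer from 0 to 7."
    )
    reply = call_model(prompt)          # hypothetical helper; any LLM API works here
    return max(0, min(7, int(reply.strip())))
```

Nothing in that loop checks whether cited lemmas exist or whether logical gaps are real, which is how confidently written bluffs end up with inflated scores.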
What This Means for AI + Math
This paper sends a clear signal: today's LLMs aren't ready for mathematical reasoning that demands rigorous proof, creativity, and logical precision. We're seeing polished performance on answer-only benchmarks, but in-depth proof-level reasoning remains out of reach.
The Road Ahead
To build truly trustworthy AI mathematicians, we need a genuine leap beyond pattern matching into provable, verifiable reasoning. Whether that comes from better alignment, curriculum learning, or symbolic tools, the future of math + AI is still wide open.
Resources: MathArena and the accompanying GitHub repository.
TL;DR: AI models can bluff their way through simple math, but when it comes to real, Olympiad-level proofs — they break down. We’re not at the age of automated mathematicians yet — but this research is a solid step toward that future.