
May 4, 2025 - 14:58

Originally shared here:

Phi 4 Reasoning Benchmarks, Model Specs, and Comparisons

Phi 4 Reasoning and Phi 4 Mini Reasoning benchmarks and comparisons

bestcodes.dev

Meet the Phi 4 Reasoning series by Microsoft

Released on April 30th, 2025, the Phi 4 Reasoning model series is the latest generation of LLMs from Microsoft. It's the first release in the Phi series designed specifically for reasoning tasks.
TL;DR: The benchmarks look great, but my experience didn't match.

Let's check it out!

Model Sizes and Details

Here's a quick look at what you can choose from:

Model                 Params   Max Context
phi4-reasoning-plus   14.7B    32K
phi4-reasoning        14.7B    32K
phi4-mini-reasoning   3.8B     32K

Some key model details:

  • All models are licensed under MIT. This makes them quite accessible for various applications.
  • Notably, the context window isn't that great, especially for a reasoning model, which spends many of its tokens thinking before it answers. Even the 8B variant of the recently released Qwen3 model has a larger context window. Still, 32K should work for most local purposes (a quick sketch of running one of these models locally follows this list).
  • The training data cutoff is March 2025, but the model itself claims a cutoff date of October 2023.
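
If you want to try one of these yourself, here's a minimal sketch using the Ollama Python client. It assumes the model names in the table above are valid Ollama tags and that you've already pulled the model:

```python
# Minimal sketch: chatting with phi4-mini-reasoning through Ollama.
# Assumes `ollama pull phi4-mini-reasoning` has been run and the
# Python client is installed (`pip install ollama`).
import ollama

response = ollama.chat(
    model="phi4-mini-reasoning",
    messages=[{"role": "user", "content": "Is 9.11 or 9.9 larger?"}],
    options={
        # Ask for the full 32K context window; lower this to save memory.
        "num_ctx": 32768,
    },
)
print(response["message"]["content"])
```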

Benchmarks and Comparisons

Well, at least the benchmarks look good. My experience with general use was not so wonderful. Maybe this excerpt from the model description on HuggingFace helps explain why:

This model is designed and tested for math reasoning only.

Regardless, let's take a look.

Here, you can see that Phi 4 Reasoning performs quite well for its size. It significantly outperforms the Phi 4 base model, as well as DeepSeek R1 Distill Llama 70B and o1-mini in certain scenarios:

[Image: Phi 4 Reasoning benchmarks]

Phi 4 Mini Reasoning also demonstrates strong performance, outperforming its base model and most others (with o1 being an exception in a couple of benchmarks) specifically on math reasoning tasks:

[Image: Phi 4 Mini Reasoning benchmarks]

My Personal Experience

This section is biased. As some have pointed out, the model isn't designed to be good at chatting; it's supposed to reason and do math. Keep that in mind while reading this section.

My experience with the model was okay, but it's really a math reasoning model, not a general model.

For example, let's tell Phi 4 Reasoning "Hello!". That should be easy, right?

[Image: "Hello" chat with Phi 4 Reasoning]

Apparently, "We need to produce a greeting response" is hard to stop thinking about, because I hit a token limit before it responded. I did tweak the model temperature and a few other parameters and got it working better, but I was not impressed with the initial results.
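
For reference, these are the kinds of knobs I mean. Here's a sketch with the Ollama Python client; the values are illustrative, not recommended settings:

```python
import ollama

response = ollama.chat(
    model="phi4-reasoning",
    messages=[{"role": "user", "content": "Hello!"}],
    options={
        "temperature": 0.8,   # illustrative value; tune for your own workload
        "num_predict": 2048,  # cap on generated tokens; too low cuts off mid-reasoning
    },
)
print(response["message"]["content"])
```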

With typical benchmark tests, however, the model does a great job! Here's the "Is 9.11 or 9.9 larger" test:

[Image: "Is 9.11 or 9.9 larger" test]
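
(If you haven't seen this test before: read as decimals, 9.9 is larger; read as software version numbers, 9.11 is. A two-line Python check, using the third-party packaging library for the version comparison:)

```python
# As decimal numbers, 9.9 (i.e., 9.90) is larger than 9.11...
print(9.9 > 9.11)  # True

# ...but as version strings, 9.11 comes after 9.9.
from packaging.version import Version
print(Version("9.11") > Version("9.9"))  # True
```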

Sometimes, as with many reasoning models, the answer is okay, but the reasoning to get there is a bit weird:

[Image: Weird reasoning example]

Then I asked: "What time is it?"

Where did that come from? LOL. At least it answered correctly in the end.

Conclusion

Phi 4 Reasoning is definitely interesting and performs well for its size, but it doesn't quite take the lead in the general LLM landscape. I wouldn't recommend it for broad, general-purpose applications at this point.