Supercharging Language Models: What I Learned Testing LLMs with Tools

LLMs are great at creative writing and language tasks, but they often stumble on basic knowledge retrieval and math. Popular tests like counting r's in "strawberry" or doing simple arithmetic often trip them up. This is where tools come into the picture.
Why use tools with LLMs?
Simply put, we're giving LLMs capabilities they don't naturally have, helping them deliver better answers.
Four Surprising Things I Found While Testing
I ran some tests with local models on Ollama and noticed some interesting patterns:
1) Even the Best Models Get Math Wrong Sometimes
I tested various models with this straightforward financial question: "I have initially 100 USD in an account that gives 3.42% interest/year for the first 2 years then switches to a 3% interest/year. How much will I have after 5 years?"
The correct answer is 100 * 1.0342^2 * 1.03^3 = 116.8747624.
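You can sanity-check that number in a couple of lines of plain Node/TypeScript (this snippet is just for verification, not part of my test setup):

```typescript
// Sanity check: 3.42% for the first 2 years, then 3% for the remaining 3 years.
const finalBalance = 100 * 1.0342 ** 2 * 1.03 ** 3;
console.log(finalBalance.toFixed(7)); // 116.8747624
```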
What surprised me was that top models like Gemini 2.5 Pro and GPT-4o "understood" the approach but messed up the actual calculations. Gemini calculated 106.954164 * 1.092727 = 116.881 - close, but not quite right.
This is a good reminder to double-check LLM calculations, especially for important decisions like financial planning.
Interestingly, even a small local model like Qwen3 4B could nail this when given a calculator tool - showing that the right tools can make a huge difference.
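Giving a model a calculator is less work than it sounds. Here's a simplified sketch of what a calculator tool definition and a tool-enabled chat call can look like with the ollama npm package - the tool name, model tag, and prompt wording are illustrative choices, not necessarily what my CLI uses:

```typescript
import ollama from 'ollama';

// A calculator tool described with the OpenAI-style function schema
// that Ollama's chat API accepts. Name and description are my own choices.
const calculatorTool = {
  type: 'function',
  function: {
    name: 'calculator',
    description: 'Evaluate a basic arithmetic expression and return the numeric result.',
    parameters: {
      type: 'object',
      properties: {
        expression: {
          type: 'string',
          description: 'An arithmetic expression, e.g. "100 * 1.0342^2 * 1.03^3"',
        },
      },
      required: ['expression'],
    },
  },
};

// Ask the question with the tool available (model tag and prompt are illustrative).
const response = await ollama.chat({
  model: 'qwen3:4b',
  messages: [
    {
      role: 'user',
      content:
        'I have 100 USD at 3.42% interest/year for 2 years, then 3%/year. How much after 5 years?',
    },
  ],
  tools: [calculatorTool],
});

// If the model decides to use the calculator, the request shows up here.
console.log(response.message.tool_calls);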
2) Tools Can Supercharge Performance
The difference tools make is pretty dramatic:
- Qwen3 4B without tools: Spent 12 minutes thinking only to get the wrong answer (somehow turning 100 USD into 1000 USD)
- Same model with a calculator: Got the right answer in just over 2 minutes
I saw similar improvements across other local models like Llama 3.2 3B and 3.3 70B. I assume we'd reach the same conclusions with cloud-based LLMs.
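The loop that lets the model actually use a tool result is also small. A simplified sketch, again assuming the ollama package's chat API with `tools` and `role: 'tool'` messages; typing and error handling are kept loose for brevity:

```typescript
import ollama from 'ollama';

// Generic tool loop (sketch): keep chatting until the model answers without
// requesting a tool. `executors` maps tool names to local functions.
async function runWithTools(
  model: string,
  messages: any[],
  tools: any[],
  executors: Record<string, (args: any) => string>,
): Promise<string> {
  for (let turn = 0; turn < 5; turn++) { // cap the number of tool round trips
    const response = await ollama.chat({ model, messages, tools });
    const toolCalls = response.message.tool_calls ?? [];

    // No tool requested: the message content is the final answer.
    if (toolCalls.length === 0) return response.message.content;

    // Keep the assistant's tool request in context, then run each requested tool
    // and feed its output back as a `tool` message.
    messages.push(response.message);
    for (const call of toolCalls) {
      const execute = executors[call.function.name];
      const result = execute ? execute(call.function.arguments) : 'Error: unknown tool';
      messages.push({ role: 'tool', content: result });
    }
  }
  return 'Gave up after too many tool round trips';
}
```

Plugging in the calculator from the earlier sketch is then one call: `runWithTools('qwen3:4b', [{ role: 'user', content: question }], [calculatorTool], { calculator: ({ expression }) => String(evaluateExpression(expression)) })`, where `evaluateExpression` is whatever expression evaluator you trust - don't eval model output directly.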
3) LLMs Can Fill in the Gaps When Tools Fall Short
What's fascinating is how LLMs handle imperfect tool results. I experimented by simulating tool calls and controlling what they returned - often deliberately giving back information that wasn't quite what the LLM requested.
For example, when I gave an LLM a weather tool that only showed temperatures in Fahrenheit but asked for Celsius, it just did the conversion itself without missing a beat.
In another experiment, I simulated returning interest calculations for the wrong time period (e.g., 3 years instead of 5). The LLM recognised the mismatch and tried to adapt the information to solve the original problem. Sometimes it would request additional calculations, and other times it would attempt to extrapolate from what it received.
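Mechanically, these simulations are just stub tools that deliberately return something other than what was asked for. Roughly like this (names and numbers are illustrative):

```typescript
// Stub "tools" that return deliberately mismatched results. Plugged into a
// loop like the one above, they test how the model copes with imperfect output.

// The model asks for Celsius; the stub answers in Fahrenheit anyway.
export function fakeWeatherTool(args: { city: string; unit?: string }): string {
  return JSON.stringify({ city: args.city, temperature: 68, unit: 'fahrenheit' });
}

// The model asks for the balance after 5 years; the stub returns 3 years instead
// (110.17 is roughly 100 * 1.0342^2 * 1.03: two years at 3.42%, one at 3%).
export function fakeInterestTool(_args: { principal: number; years: number }): string {
  return JSON.stringify({ years: 3, balance: 110.17 });
}
```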
These experiments show that LLMs don't just blindly use tool outputs - they evaluate the results, determine if they're helpful, and find ways to work with what they have. This adaptability makes tools even more powerful, as they don't need to be perfectly aligned with every request.
4) Smaller Models Don't Always Use Tools When They Should
You might expect smaller models with limited knowledge to eagerly embrace tools as a crutch, but my testing revealed something quite different and fascinating.
The tiniest model I tested, Qwen 0.6B, was surprisingly stubborn about using its own capabilities. Even when explicitly told about available tools that could help solve a problem, it consistently tried to work things out on its own - often with poor results. It's almost as if it lacked the self-awareness to recognise its own limitations.
Llama 3.2 3B showed a different pattern. It attempted to use tools, showing it recognised the need for external help, but applied them incorrectly. For instance, when trying to solve our compound interest problem, it would call the calculator tool but input the wrong formula or misinterpret the results.
Larger models seem more confident in their own calculations - sometimes rightfully so, but at other times that confidence was misplaced and they still made errors. It still makes sense to use tools so the answer is grounded in a deterministic calculation.
This pattern suggests that effective tool use doesn't emerge naturally in smaller models - it may require specific fine-tuning to teach them when and how to leverage external tools. Perhaps smaller models need explicit training to recognise their own limitations and develop the "humility" to rely on tools when appropriate?
What's Next to Explore
I'm particularly interested in understanding tool selection strategies (how models choose between multiple viable tools), tool chaining for complex problems, and whether smaller models can be specifically fine-tuned to better recognise when they need external help.
The sweet spot in tool design is another critical area - finding the right balance between verbose outputs that include explanations and minimal outputs that are easier to parse could dramatically improve how effectively LLMs leverage external capabilities.
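To make that concrete, here are two possible shapes for the same calculator result - I haven't measured yet which one models handle better:

```typescript
// Minimal: easy to parse, but gives the model nothing to sanity-check against.
const minimalResult = JSON.stringify({ result: 116.8747624 });

// Verbose: restates the inputs and intermediate steps, which might help the model
// notice mismatches (like the wrong-period experiment above) at the cost of more tokens.
const verboseResult = JSON.stringify({
  expression: '100 * 1.0342^2 * 1.03^3',
  steps: ['100 * 1.0342^2 = 106.956964', '106.956964 * 1.03^3 = 116.8747624'],
  result: 116.8747624,
});

console.log(minimalResult.length, verboseResult.length); // rough sense of the size difference
```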
Want to play around with this yourself? Check out my Node CLI app:
GitHub Repository