Qwen 3 vs DeepSeek R1 - Evaluation Notes

Introduction
Alibaba's team recently released the Qwen 3 series, including two standout models: a 235B-parameter MoE model (with 22B active parameters) and a lightweight 30B version (3B active).
As per the official docs, the Qwen3-235B-A22B model takes on giants like DeepSeek R1, Grok-3, and Gemini 2.5 Pro, and it does so with fewer active parameters, faster inference, and open-source accessibility.
On the lighter end, Qwen3-30B-A3B outcompetes the previous QwQ-32B, which uses ten times as many activated parameters, and even a small model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct.
Best part: it takes a fraction of the cost per million input and output tokens compared to SOTA models.
Impressive, isn't it?
Let's look at some of the key details of the new Qwen 3 and then put it to the test!
TL;DR
- Alibaba released Qwen 3, featuring efficient MoE models (235B & 30B) claimed to outperform giants like DeepSeek R1 and GPT-4o with fewer active parameters.
- Qwen 3 consistently produced better, more functional, and more user-friendly code across the coding tasks.
- Both Qwen 3 and DeepSeek R1 correctly solved the logic puzzles, but Qwen 3 had better output structure.
- While both performed well on simpler math, DeepSeek R1 more accurately solved a complex multi-step calculation problem that Qwen 3 slightly missed, suggesting results vary by problem type.
- Qwen 3 excelled at researching, summarizing, and structuring text.
- Overall, Qwen3 is a highly capable, efficient open-source choice, strong in coding/writing, while DeepSeek R1 holds an edge in complex math and reasoning speed.
Qwen3: Key Details
Here are the key details you need to know about Qwen3:
Feature | Details |
---|---|
Motto | Think Deeper, Act Faster |
Variants (XXB = total parameters, AXXB = active parameters) | Qwen3-235B-A22B (MoE); Qwen3-30B-A3B (lightweight MoE); 0.6B, 1.7B, 4B, 8B, 14B, and 32B (dense, fully activated) |
Benchmarks beaten (Qwen3-235B-A22B & 30B-A3B) | HumanEval (coding); MATH (mathematical problem solving); GSM8K (grade-school math word problems); Big-Bench Hard (general reasoning); BoolQ (boolean QA / reading comprehension); ARC-Challenge (scientific multiple-choice reasoning) |
Mixture of Experts (MoE) | ~10% active parameters (22B of 235B ≈ 9.4%) = huge inference savings |
Modes | Hybrid Thinking Mode to switch between instant answers and step-by-step reasoning |
Language Support | Supports 119 languages |
MCP Support | Improved MCP support |
Pretraining | Pretrained on 36T tokens, twice the size of Qwen 2.5's corpus |
Open Source | Apache 2.0 license with local deployment option |
Local Support | Available for all except the 235B variant |
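Since local deployment and the hybrid thinking mode are headline features, here is a minimal local-inference sketch using Hugging Face Transformers, following the pattern from the Qwen3 model card. The checkpoint id `Qwen/Qwen3-30B-A3B`, the example message, and the generation settings are illustrative assumptions; the `enable_thinking` flag is what toggles the hybrid mode.

```python
# Minimal sketch of local Qwen3 inference (assumes a recent transformers
# release with Qwen3 support and the Qwen/Qwen3-30B-A3B checkpoint id).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain mixture-of-experts routing."}]

# enable_thinking=True asks for step-by-step reasoning; False gives instant answers.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```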
Additionally, Qwen3-235B-A22B outperformed others on tasks like:
- Coding – beat models like DeepSeek R1, Gemini 2.5 Pro, OpenAI o3-mini, and o1
- Mathematics – surpassed all major competitors in problem-solving and step-by-step accuracy
- General Reasoning – outcompeted top-tier models in logical and multi-step reasoning tasks
- Reading Comprehension – excelled in integrating multi-section ideas and inferencing
- Logical Deduction (Puzzles) – solved complex deduction tasks better than many others
And Qwen3-30B-A3B performed competitively on tasks like:
- Coding & Math – matched or exceeded lighter models like GPT-4o, Gemma 3, and DeepSeek-V3
- Efficiency Benchmarks – achieved strong performance with only 3B active params
- General Use Benchmarks – handled a wide variety of tasks comparably to larger models
With so many impressive results, benchmarks, and supported tasks, it's easy to get lost.
So, let's evaluate the model's performance on real-world use cases across different domains, and compare the results against a SOTA model like DeepSeek R1.
By the end of the article, you will have a clear understanding of whether Qwen3 is right for your use case or whether you should go for DeepSeek R1.
So, let's begin!
CODING
The first task we will check the models on is coding. Nothing fancy here, let's get straight to testing.
1. Functional Sticky Note App
(Skill Tested: Front-end UI development, DOM manipulation, interactivity)
Sticky notes are a great productivity tool, and I use them quite often, so let's see how both models perform at generating a functional sticky note web app.
Prompt
Create a frontend for a modern note taking app.
Make it so that you can add sticky notes.
Add snip / notes drag functionality too.
I just added drag-and-drop functionality to the requirements. Let's check the results.
Qwen 3
Code: note_taker_qwen.html
Output: (From artifacts window)
Qwen3 - Notes Taker
DeepSeek R1
Code: note_taker_deepseek.html
Output
deepseekr1_notes_taker
Comparing the two, both models did a fine job, but Qwen3's output was faster, more consistent, and more user friendly.
So, I would definitely go with Qwen 3 for simple tasks, as it understands nuances and requirements well.
Now on to the next test.
2. Conway Game of Life (Code)
(Skill Tested: Matrix logic, algorithm implementation, terminal rendering)
Conway's Game of Life is a cellular automaton with the following rules:
- Any live cell with fewer than two live neighbors dies (underpopulation).
- Any live cell with more than three live neighbors dies (overpopulation).
- Any live cell with two or three live neighbors continues to live.
- Any dead cell with exactly three live neighbors comes to life.
Credits: Pinterest
How the Game Works
The starting layout is called the seed of the system.
The first generation is created by applying the rules to every cell in the seed at the same time, whether the cell is alive or dead.
All changes—cells being born or dying—happen at once in a step called a tick.
Each new generation depends only on the one before it.
The same rules are then applied again and again to create each new generation until the specified number of iterations is reached.
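To make the rules concrete, here is a minimal sketch of a single generation step in Python. The function and variable names are illustrative, not taken from either model's code.

```python
# Minimal sketch of one Game of Life "tick" on a grid of 0s (dead) and 1s (alive).

def count_live_neighbors(grid, r, c):
    """Count live cells among the 8 neighbors of cell (r, c)."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue  # skip the cell itself
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                total += grid[nr][nc]
    return total

def tick(grid):
    """Apply all four rules to every cell at once and return the next generation."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = count_live_neighbors(grid, r, c)
            if grid[r][c] == 1:
                nxt[r][c] = 1 if n in (2, 3) else 0  # survival vs under/overpopulation
            else:
                nxt[r][c] = 1 if n == 3 else 0       # reproduction
    return nxt

# A "blinker" seed: three live cells in a row oscillate between horizontal and vertical.
seed = [[0, 0, 0],
        [1, 1, 1],
        [0, 0, 0]]
print(tick(seed))  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```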
Given the complexity and decision making involved in the game, it seemed like a good question to test the models' coding ability and in-code decision making.
Prompt
Create a Python implementation of Conway's Game of Life that runs in the terminal.
The program should accept an initial state, run for a specified number of generations,
and display each generation in the terminal.
There is no specific mention of testing the code; let's check the results.
Qwen3
Code: game_of_live_qwen.py
Output
game_of_life_qwen
DeepSeek R1
Code: game_of_life_deepseek.py, test file: blinker.txt, terminal command: terminal_command.bash
Output
game_of_life_deepseekr1
One thing I liked straight away about Qwen3 is that it provided a sample test case, making the code easier to test, and a simple single-script implementation. Good for prototyping things out.
This was not the case with DeepSeek R1: it provided two files, one main and one test (after I asked how to run it, since I was assuming a vibe-coding workflow). Its implementation also differed from Qwen3's, and there were performance issues.
So again, I would go with Qwen3 for coding tasks, as it produces fast, reliable, more accurate, and better optimized code.
Now let's move on to the next test.
3. Butterfly SVG Generation
(Skill Tested: SVG generation, geometric reasoning, visual symmetry)
SVG stands for Scalable Vector Graphics and is quite prominent in graphic design and front-end development. So, let's code a butterfly of our own.
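For context, the usual trick for symmetrical wings is to draw one wing and mirror it across the body's vertical axis with an SVG transform. Below is a minimal hand-written sketch (not either model's output) that generates such a file from Python; the shapes, colors, and filename are illustrative.

```python
# Minimal sketch: build a symmetric butterfly SVG by mirroring one wing
# with transform="scale(-1,1)" across the vertical axis at x=0.
wing = '<ellipse cx="60" cy="-10" rx="45" ry="30" fill="orchid"/>'

svg = f'''<svg xmlns="http://www.w3.org/2000/svg" viewBox="-120 -80 240 160">
  <g>{wing}</g>                                         <!-- right wing -->
  <g transform="scale(-1,1)">{wing}</g>                 <!-- left wing: exact mirror -->
  <circle cx="0" cy="-45" r="8" fill="#333"/>           <!-- head -->
  <ellipse cx="0" cy="-25" rx="7" ry="14" fill="#333"/> <!-- thorax -->
  <ellipse cx="0" cy="5" rx="6" ry="25" fill="#333"/>   <!-- abdomen -->
</svg>'''

with open("butterfly_sketch.svg", "w") as f:
    f.write(svg)
```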
Prompt
Generate SVG code for a simple butterfly shape.
The butterfly should have symmetrical wings and basic body parts (head, thorax, abdomen).
Use basic SVG shapes and keep the design relatively simple but recognizable.
The prompt specifically mentions symmetry in the body parts to check visual symmetry; let's check the results.
Qwen 3
Code: **svg_butterfly_qwen3.xml** (rename to svg_butterfly_qwen3.svg when running locally)
Output
DeepSeek R1
Code: **svg_butterfly_deepseek.xml** (rename to svg_butterfly_deepseek.svg when running locally)
Both models produced symmetric results with the specified body parts.
However, Qwen3 produced better output than DeepSeek R1 and was more nuanced in following the instructions, as is visible in the code.
DeepSeek R1's output looks more cartoonish in style, and its code shows the model skipping parts of the instructions.
Honestly, I never expected Qwen3 to generate better output than Grok-3, but it really did beat it. Proof: Grok-3's results were even worse than DeepSeek R1's.