QwQ-32B vs DeepSeek-R1-671B

Qwen is a series of LLMs released and maintained by Alibaba Cloud, and QwQ is the reasoning-focused model in the Qwen series. A while ago, the team released a preview version of this model (QwQ-32B-Preview), and now they have released the full QwQ-32B model. It is available on Hugging Face and in the Ollama model library.

Image generated by ChatGPT

Links

https://huggingface.co/Qwen/QwQ-32B
https://ollama.com/library/qwq
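
If you want to try the model locally, you can also call it through Ollama. Below is a minimal sketch using the ollama Python client; it assumes you have installed the client (pip install ollama) and already pulled the model with ollama pull qwq.

# Minimal sketch: querying a locally pulled QwQ model via the ollama Python client.
# Assumes `pip install ollama` and `ollama pull qwq` have already been run.
from ollama import chat

response = chat(
    model="qwq",
    messages=[{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}],
)
print(response["message"]["content"])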

They used a reinforcement learning (RL) scaling approach driven by outcome-based rewards. As mentioned in their blog post, instead of a traditional reward model, an accuracy verifier is used for math and coding tasks, and the model is also trained with rewards from a general reward model and some rule-based verifiers. You can use QwQ-32B via Hugging Face Transformers and the Alibaba Cloud DashScope API.
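
To make the idea of an outcome-based reward concrete, here is an illustrative sketch of a rule-based accuracy verifier for math answers. This is not Qwen's actual training code; it only shows the principle that the reward depends on whether the final answer is correct, not on how the reasoning reads.

# Illustrative sketch of an outcome-based, rule-based accuracy verifier.
# Not Qwen's implementation; just the principle: reward 1.0 for a correct final answer, else 0.0.
import re

def extract_final_answer(completion: str):
    """Return the contents of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def accuracy_reward(completion: str, reference: str) -> float:
    """Outcome-based reward: 1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

print(accuracy_reward("... so the answer is \\boxed{42}", "42"))  # 1.0
print(accuracy_reward("... therefore \\boxed{41}", "42"))         # 0.0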

Example Code with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the model; torch_dtype="auto" and device_map="auto" pick the dtype and devices automatically
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate up to 32768 new tokens (QwQ can emit a long reasoning trace before the answer)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
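
QwQ-32B typically emits its reasoning before the final answer, separated by a closing </think> tag. If you want the reasoning and the answer as separate strings, a small sketch like this works (assuming that tag is present in the decoded output; adjust it if your chat template differs):

# Sketch: split the decoded response into reasoning and final answer.
# Assumes a </think> marker separates them; adjust if your template differs.
if "</think>" in response:
    thinking, _, answer = response.partition("</think>")
    thinking = thinking.replace("<think>", "").strip()
    answer = answer.strip()
else:
    thinking, answer = "", response.strip()

print("Reasoning:\n" + thinking)
print("\nAnswer:\n" + answer)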

Example Code with DashScope API

from openai import OpenAI
import os

# Initialize OpenAI client
client = OpenAI(
    # If the environment variable is not configured, replace with your API Key: api_key="sk-xxx"
    # How to get an API key: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""
content = ""

is_answering = False

completion = client.chat.completions.create(
    model="qwq-32b",
    messages=[
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
    ],
    stream=True,
    # Uncomment the following lines to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "reasoning content" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print reasoning content
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "content" + "=" * 20 + "\n")
                is_answering = True
            # Print content
            print(delta.content, end='', flush=True)
            content += delta.content
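
For multi-turn conversations, the usual pattern with reasoning models is to append only the final content to the message history and leave the reasoning_content out. A sketch (check the DashScope documentation for the exact recommendation) could look like this:

# Sketch: continue the conversation. Only the final answer (`content`) goes back
# into the history; the reasoning_content is not sent back to the model.
messages = [
    {"role": "user", "content": "Which is larger, 9.9 or 9.11?"},
    {"role": "assistant", "content": content},
    {"role": "user", "content": "Now compare 9.9 and 9.90."},
]

follow_up = client.chat.completions.create(
    model="qwq-32b",
    messages=messages,
    stream=True,
)
for chunk in follow_up:
    # Print only the answer tokens; reasoning chunks arrive with empty content
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)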

Performance Evaluation

Below is the evaluation chart showing how this 32B model competes against other reasoning models, especially DeepSeek-R1-671B.

Image from blog post by Qwen

It competes closely with the much larger DeepSeek-R1-671B model across all five benchmarks (AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL) and outperforms OpenAI-o1-mini on all of them except IFEval.

I was wondering what OpenAI’s ChatGPT would think about it.