Alibaba Qwen 3 is the fastest LLM ever, Microsoft's byte-sized open source model, DeepSeek Prover is GREAT at maths, and more


May 4, 2025 - 06:02

Hello AI Enthusiasts!

Welcome to the seventeenth edition of "This Week in AI Engineering"!

Alibaba's Qwen3 sets new benchmark records with dual-mode thinking, Microsoft's BitNet runs AI with just 1-bit weights using 96% less energy, Adobe Firefly and GPT-4o produce nearly identical images, DeepSeek Prover V2 solves mathematical proofs with unprecedented accuracy, and OpenAI integrates shopping recommendations into ChatGPT search.

We'll also be covering some must-know tools that make developing AI agents and apps easier.

Alibaba’s Qwen3 is the Fastest LLM Ever

Alibaba Cloud has unveiled Qwen3, its next-generation language model family that introduces both dense and mixture-of-experts (MoE) architectures. What makes these models special? They've achieved some of the highest scores ever recorded on industry-standard benchmarks while using a revolutionary dual-mode thinking approach.

Breaking Benchmark Records (And Why It Matters)

The flagship Qwen3-235B-A22B is dominating the leaderboards with exceptional results:

  • 95.6 on ArenaHard (complex reasoning challenges) – even higher than GPT-4o's 89.0 and OpenAI's specialized reasoning model o1 at 92.1
  • 85.7 on AIME'24 (American Invitational Mathematics Examination) – a standardized competition math test where even the best human students struggle
  • 70.7 on LiveCodeBench (real-world coding challenges) – matching the performance of tech giants' flagship models like Gemini 2.5 Pro
  • 2056 Elo rating on CodeForces – a competitive programming benchmark where a higher rating reflects stronger problem-solving

Smart Architecture: Two Thinking Modes in One Model

What truly sets Qwen3 apart is its innovative "brain-switching" capability:

Think Like a Mathematician When Needed, Chat Like a Human When Preferred

  • Engage deep, step-by-step reasoning for complex problems
  • Switch to an efficient conversation mode for everyday interactions
  • All without changing models or configurations
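As a concrete illustration: Qwen3 exposes this toggle through lightweight "soft switches" in the prompt itself (`/think` and `/no_think`), alongside an `enable_thinking` flag in its chat template. A minimal sketch of the prompt-side switch (the helper name here is ours, not part of any official API):

```python
def with_thinking_mode(user_message: str, think: bool) -> str:
    """Append Qwen3's soft switch to a user message.

    Qwen3 honours `/think` and `/no_think` tags at the end of a user
    turn, letting callers flip modes per message without reloading
    the model or changing any configuration.
    """
    switch = "/think" if think else "/no_think"
    return f"{user_message} {switch}"

# Deep analysis for a hard problem:
prompt = with_thinking_mode("Prove that sqrt(2) is irrational.", think=True)

# Fast conversational reply for small talk:
chat = with_thinking_mode("Suggest a good name for a cat.", think=False)
```

In a real application the same messages would then be passed through the model's chat template, where the `enable_thinking` flag sets the default mode.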

Do More With Less Through MoE Technology

  • The smaller Qwen3-30B-A3B activates only 3 billion parameters at a time
  • Yet it scores 91.0 on ArenaHard, outperforming many larger models
  • This means faster responses and lower computing costs
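The mechanism behind those savings is sparse routing: a small "router" network scores all experts per token, but only the top-k actually run. A toy sketch of the idea (tiny dimensions and random weights for illustration, not Qwen3's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2

# Toy "experts": each is just one weight matrix here; in a real MoE
# each expert is a full feed-forward block.
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))

def moe_forward(x):
    """Route a token through only the top-k experts (sparse activation)."""
    logits = x @ router_w                 # router score per expert
    top = np.argsort(logits)[-k:]         # pick the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                  # softmax over the chosen k only
    # Only k of n_experts matrices are touched per token, which is why
    # a model like Qwen3-30B-A3B can hold 30B parameters yet activate ~3B.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top))

y = moe_forward(rng.normal(size=d))       # y has shape (d,)
```

Because the untouched experts cost nothing at inference time, compute scales with activated parameters, not total parameters.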

Beyond the Benchmarks: Practical Power

Qwen3 brings improvements that make it immediately useful in diverse scenarios:

  • Speaks Your Language – Fluent in 100+ languages with natural translation abilities
  • Works Well With Others – Seamlessly controls external tools and follows complex instructions
  • Understands Human Preferences – Excels at creative writing and maintains character consistency

Microsoft’s Byte-Sized Open Source Model

Microsoft Research has released BitNet b1.58-2B-4T, the first open-source language model of its scale trained natively with extreme low-bit weights: each weight is ternary, roughly 1.58 bits, instead of the standard 16 or 32 bits. This breakthrough dramatically reduces the resources needed to run AI.

The Numbers That Matter

Let's break down what BitNet achieves:

  • Memory: Just 0.4GB needed—about a fifth of what comparable models require. This means AI can run on devices with limited RAM.

  • Speed: Generates text in 29ms per token—faster than LLaMA 3 (48ms) and MiniCPM (124ms). This creates smoother, more responsive experiences.

  • Energy: Consumes only 0.028J per token—about 4% of what comparable models use. Lower energy draw means longer battery life and reduced costs.

  • Training: Built on 4 trillion tokens of data, giving it a solid foundation of knowledge.

Performance That Competes With Bigger Models

Despite its efficiency, BitNet performs surprisingly well:

  • 49.91 on ARC-Challenge—higher than comparably sized LLaMA 3 and Gemma 3 models
  • 80.18 on BoolQ—nearly matching the top score of 80.67
  • 77.09 on PIQA—leading all compared models on physical reasoning

These scores show BitNet can handle complex reasoning and comprehension while using far fewer resources.

How It Works

BitNet uses a clever approach that limits each weight in the neural network to just three values: -1, 0, or +1. This radical simplification, combined with specialized "BitLinear" layers, creates a model that's both efficient and capable.
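A minimal sketch of that idea, in the style of the "absmean" quantizer described for BitNet b1.58 (a simplification for illustration, not Microsoft's training code):

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Absmean quantization in the style of BitNet b1.58.

    Scale each weight by the mean absolute weight, then round and clip
    so every value lands in {-1, 0, +1}. The scale gamma is kept so
    outputs can be rescaled later.
    """
    gamma = np.abs(w).mean()
    q = np.clip(np.rint(w / (gamma + eps)), -1, 1)
    return q, gamma

w = np.array([0.9, -0.05, 0.4, -1.2])
q, gamma = ternarize(w)
# q now holds only values from {-1, 0, 1}.
# Storing three values per weight takes log2(3) ~ 1.58 bits, so ~2B
# weights fit in roughly 2e9 * 1.58 / 8 ~ 0.4 GB, matching the
# memory figure quoted above.
```

Beyond memory, the ternary trick also removes most multiplications: a matrix product against {-1, 0, +1} weights reduces to additions and subtractions, which is where the energy savings come from.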

This innovation makes advanced AI accessible in scenarios where memory and power are limited—from mobile devices to large-scale deployments where efficiency translates to significant cost savings.

Adobe Firefly vs GPT-4o: The AI Art Twins That Are Hard to Tell Apart

Adobe's new Firefly Image Model 4 and OpenAI's GPT-4o image generator produce results so similar they could be created by students from the same art school. While both create impressive images, their striking similarities reveal how AI image generators are converging in style and capabilities.

Head-to-Head Comparisons Show Remarkable Overlap

Testing both models with identical prompts reveals just how close they've become:

  • Portrait Photography: When asked to create "a red-haired woman with freckles in a sunflower field," both AIs produced nearly identical faces, hair textures, and hat shapes—suggesting they trained on overlapping photo datasets.

  • Sci-Fi Laboratory: For a chaotic lab scene, the models diverged slightly in focus—Firefly emphasized malfunctioning robots while GPT-4o highlighted escaping alien specimens—but both successfully created busy, detailed environments.

  • Food Photography: Both models handled a breakfast spread prompt by placing excessive berries alongside pancakes and featuring nearly identical latte art—both created fern patterns with hearts at the top.

  • Fantasy Illustration: The "majestic dragon" prompt resulted in creatures with remarkably similar facial structures and dinosaur-like tails, though GPT-4o handled the "fiery text" requirement more effectively.

What This Means for Users

The convergence of these leading image generators suggests we're entering a new phase in AI art where:

  • Model selection may soon depend more on pricing and integration with other tools than image quality
  • Different AI systems are developing similar "default aesthetics" for common subjects
  • Technical capabilities are becoming more standardized across competing platforms

For creators and businesses, this means focusing less on which AI to use and more on how to craft prompts that achieve your specific vision—as the underlying technologies grow increasingly similar.

DeepSeek Prover V2: Breaking Barriers in Automated Mathematical Proof

DeepSeek has released Prover V2, an open-source model that represents a breakthrough in formal theorem proving—the ability to automatically verify complex mathematical statements with perfect rigor. This specialized model is setting new standards in a field long considered one of AI's most difficult challenges.

Record-Breaking Mathematical Verification

DeepSeek Prover V2-671B achieves unprecedented results on formal proof benchmarks:

  • 88.9% pass rate on MiniF2F-test – substantially outperforming all competitors including Kimina-Prover (80.7%) and BFS-Prover (73.0%)

  • 49 solved problems on PutnamBench – more than double the next best model's 23 solved problems

  • 8 solved problems on AIME competitions – handling complex high-school competition math that requires multiple steps of sophisticated reasoning

These numbers represent the model's ability to construct complete, formally verified proofs that would satisfy the most rigorous mathematical standards.

Innovative Two-Stage Training Approach

What makes Prover V2 special is its unique development process:

  1. Recursive Proof Search – The model breaks down complex theorems into smaller subgoals, solving them individually before integrating the solutions

  2. Synthetic Cold-Start Data – Successful subgoal proofs are combined with natural language reasoning to create training examples that connect informal thinking to formal verification

This approach mimics how human mathematicians work—first reasoning informally about the general approach, then carefully constructing a rigorous proof.
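To give a flavor of what formal verification in Lean 4 looks like, here is a toy proof that decomposes a statement into subgoals and lets the checker assemble them, the same pattern Prover V2 automates at vastly greater scale. (This is an illustrative example assuming the Mathlib library, not output from Prover V2.)

```lean
import Mathlib

-- Decompose the goal into two intermediate facts with `have`,
-- prove each, then combine them. Every step is machine-checked.
theorem sum_of_squares_nonneg (a b : ℤ) : a ^ 2 + b ^ 2 ≥ 0 := by
  have h1 : a ^ 2 ≥ 0 := sq_nonneg a   -- subgoal 1
  have h2 : b ^ 2 ≥ 0 := sq_nonneg b   -- subgoal 2
  linarith                             -- assemble the full proof
```

Unlike an informal argument, a proof in this form either compiles, and is therefore correct with certainty, or it does not.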

A New Benchmark for Mathematical AI

Alongside the model, DeepSeek has introduced ProverBench—a collection of 325 formalized problems including:

  • 15 problems from recent AIME competitions (American Invitational Mathematics Examination)
  • 310 problems from undergraduate mathematics textbooks and tutorials

This benchmark provides a standardized way to evaluate how well AI systems can bridge the gap between human-style mathematical reasoning and formal verification in the Lean 4 proof assistant.

With these advances, DeepSeek Prover V2 represents a significant step toward AI systems that can not only solve complex mathematical problems but also provide absolute certainty in their correctness through formal verification.

OpenAI Upgrades ChatGPT Search with Shopping Recommendations

OpenAI has expanded ChatGPT's search functionality to include product recommendations when users express shopping intent. This new feature marks a significant evolution in how ChatGPT interfaces with commercial content online.

Shopping Integration in Natural Language Search

When users ask questions like "gifts for someone who loves cooking" or "best noise-cancelling headphones under $200," ChatGPT now directly surfaces relevant products within its responses. Importantly, OpenAI emphasizes these are not paid placements:

  • Products are selected algorithmically based on relevance
  • Recommendations are not advertisements
  • Any website or merchant can appear in results

How Websites Can Optimize for ChatGPT Discovery

OpenAI has provided clear guidelines for merchants and publishers who want their products discovered:

  1. Allow the OAI-SearchBot crawler – Check your robots.txt file to ensure you're not blocking OpenAI's web crawler

  2. Track ChatGPT traffic – The system automatically adds "utm_source=chatgpt.com" to referral URLs, making it easy to identify visitors coming from ChatGPT in analytics platforms

  3. Product feed submission coming soon – OpenAI is developing a system for merchants to directly submit product information, ensuring more accurate and current listings
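Steps 1 and 2 above can be verified programmatically with the Python standard library. A hedged sketch (the robots.txt body and URL below are made up for illustration):

```python
from urllib import robotparser
from urllib.parse import urlparse, parse_qs

def allows_oai_searchbot(robots_txt: str, path: str = "/") -> bool:
    """Check whether a robots.txt body permits OpenAI's OAI-SearchBot."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("OAI-SearchBot", path)

def is_chatgpt_referral(url: str) -> bool:
    """Detect the utm_source=chatgpt.com tag ChatGPT adds to referral URLs."""
    qs = parse_qs(urlparse(url).query)
    return qs.get("utm_source") == ["chatgpt.com"]

robots = "User-agent: *\nDisallow: /private/\n"
ok = allows_oai_searchbot(robots, "/products/espresso-maker")   # crawler allowed
ref = is_chatgpt_referral("https://shop.example/item?utm_source=chatgpt.com")
```

The same `utm_source` check is typically unnecessary in practice, since analytics platforms parse UTM parameters automatically, but it shows exactly what tag to filter on.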

What This Means for the Search Ecosystem

This update represents OpenAI's growing position as a potential alternative to traditional search engines for commercial queries. For consumers, it streamlines the shopping research process by combining ChatGPT's natural language understanding with direct product discovery.

For merchants and publishers, it creates a new channel to reach consumers through optimizing content for AI discovery rather than traditional SEO. As this feature expands, businesses will need to consider both traditional search optimization and AI-specific content strategies.

Tools & Releases YOU Should Know About

AI Code Playground: This platform offers a live coding editor where users can write, test, and visualize code in real time. It features an extensive Python library and AI-powered code generation, making it ideal for both learning and rapid prototyping. Users can add comments and type annotations and get suggested fixes, promoting collaborative and efficient coding.

AutoRegex: AutoRegex uses AI to convert plain English descriptions into regular expressions (Regex). This simplifies the creation of complex text patterns, making Regex accessible even for those unfamiliar with its syntax. It’s user-friendly and supports instant output, though users should verify results for accuracy.

Trelent: Trelent leverages deep learning to automatically generate docstrings for your code, focusing on explaining the “why” behind functions. It supports multiple languages, enhances documentation clarity, and boosts developer efficiency by saving time on manual documentation.

SeaGOAT: SeaGOAT is a local-first, semantic code search engine. It uses vector embeddings to understand code meaning, enabling powerful, AI-driven searches within your codebase. All processing is done locally, ensuring privacy and fast results, and it supports both semantic and Regex-based queries.
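The core idea behind semantic search tools like this can be sketched in a few lines. Here a toy bag-of-words "embedding" stands in for the learned vector embeddings such tools actually use; the ranking mechanics (cosine similarity over vectors) are the same:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector. Real semantic
    search uses learned dense embeddings, but the search step below
    works identically."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

snippets = [
    "def read_config(path): open json file and parse settings",
    "def train_model(data): fit the classifier on examples",
]
query = "parse a json settings file"
# Rank snippets by similarity to the query and keep the best match.
best = max(snippets, key=lambda s: cosine(embed(query), embed(s)))
```

With real embeddings, "load configuration" would also match `read_config` despite sharing no words, which is exactly what makes semantic search more useful than plain Regex.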

And that wraps up this issue of "This Week in AI Engineering", brought to you by jam.dev — your flight recorder for AI apps! Non-deterministic AI issues are hard to reproduce, unless you have Jam! Instantly replay the session, prompts, and logs to debug ⚡️

Thank you for tuning in! Be sure to share this with your fellow AI enthusiasts and follow for more weekly updates.

Until next time, happy building!