Alibaba's Claude 3.7 killer, Anthropic's FULL Large Language Model guide, and more

Hello AI Enthusiasts!
Welcome to the thirteenth edition of "This Week in AI Engineering"!
Alibaba releases QVQ-Max visual reasoning with extended thinking, Anthropic reveals how LLMs think through circuit tracing, OpenAI improves GPT-4o's technical problem-solving, UCLA releases groundbreaking OpenVLThinker-7B, and Google launches TxGemma to accelerate drug discovery.
With this, we'll also be talking about some must-know tools to make developing AI agents and apps easier.
Alibaba QVQ-Max: Advanced Visual Reasoning Model with Extended Thinking
Alibaba has officially released QVQ-Max, their first production version of a visual reasoning model following the experimental QVQ-72B-Preview introduced last December. The model combines sophisticated visual understanding with reasoning capabilities, allowing it to process and analyze information from images and videos to solve complex problems.
Core Capabilities
- Detailed Observation: Parses complex charts, images, and videos to identify key elements including objects, textual labels, and subtle details
- Deep Reasoning: Analyzes visual information combined with background knowledge to solve problems requiring logical thinking
- Flexible Application: Performs tasks ranging from problem-solving to creative applications like illustration design and video script generation
Technical Implementation
- Test-Time Scaling: Shows consistent improvement with increased thinking length, reaching 48.1% accuracy on MathVision at 24k tokens
- Progressive Scaling: Performance increases from 43.5% (4k tokens) to 45.6% (8k), 46.7% (16k), and 48.1% (24k tokens)
- Grounding Techniques: Uses validation processes to enhance recognition accuracy of visual content
Application Domains
- Workplace Tool: Data analysis, information organization, and code writing capabilities
- Learning Assistant: Helps solve complex mathematical and physics problems with diagrammatic reasoning
- Life Helper: Offers practical advice for everyday scenarios including fashion recommendations and cooking guidance
QVQ-Max is positioned as a visual agent that possesses both "vision" and "intellect," with Alibaba stating that the current release is just the first iteration with several key development areas planned, including more accurate observations through grounding techniques, enhanced visual agent capabilities for multi-step tasks, and expanded interactive modalities beyond text.
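To make the test-time scaling figures above concrete, here is a quick back-of-the-envelope calculation (illustrative arithmetic only, using the reported MathVision numbers, not any API output). It shows that the gains from longer thinking are real but diminishing per extra token:

```python
# Reported (thinking-token budget, MathVision accuracy) pairs from the QVQ-Max release.
points = [(4_000, 43.5), (8_000, 45.6), (16_000, 46.7), (24_000, 48.1)]

# Marginal accuracy gain per additional 1k thinking tokens between consecutive budgets.
for (prev_tok, prev_acc), (tok, acc) in zip(points, points[1:]):
    gain_per_k = (acc - prev_acc) / ((tok - prev_tok) / 1_000)
    print(f"{prev_tok // 1000}k -> {tok // 1000}k tokens: "
          f"+{acc - prev_acc:.1f} pts ({gain_per_k:.2f} pts per extra 1k tokens)")
```

The step from 4k to 8k tokens buys about 0.5 points per extra 1k tokens, while the later steps buy well under 0.2, which is why Alibaba frames longer thinking as useful but not a free lunch.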
How LLMs Think: Anthropic's Method for Peering Inside Large Language Models
Anthropic has released "On the Biology of a Large Language Model", introducing a powerful methodology for reverse-engineering how models like Claude work internally. The approach uses circuit tracing to map the connections between interpretable features in the model, revealing the hidden mechanisms driving model behavior.
Attribution Graphs: LLM Microscopy
- Cross-Layer Transcoders: Replace model neurons with more interpretable "features" that activate for specific concepts
- Feature Visualization: Shows dataset examples where features activate most strongly, revealing their meaning
- Attribution Graphs: Map causal connections between features, showing how information flows from input to output
- Intervention Validation: Test hypotheses by inhibiting or activating specific features to verify causal relationships
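To give a feel for the intervention-validation step, here is a minimal toy sketch: it treats "features" as activations in a tiny two-layer linear model and measures how inhibiting one upstream feature shifts a downstream logit. This is only an illustration of the idea, not Anthropic's cross-layer transcoder pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(8, 4))   # maps input to 4 hypothetical "features"
W2 = rng.normal(size=(4, 3))   # maps features to 3 output logits
x = rng.normal(size=8)

features = x @ W1              # upstream feature activations
logits = features @ W2

# Intervention: inhibit (zero out) feature 2, then re-run the downstream pass.
ablated = features.copy()
ablated[2] = 0.0
logits_ablated = ablated @ W2

# The logit shift attributable to feature 2 on this input; in circuit tracing,
# a large shift supports the hypothesized causal role of the feature.
print("effect of inhibiting feature 2:", logits - logits_ablated)
```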
Key Mechanisms Discovered
- Multi-Step Reasoning: The model performs genuine multi-hop reasoning by activating intermediate concepts (e.g., "Dallas" → "Texas" → "Austin")
- Planning in Poetry: Features representing potential rhyming words activate before writing lines, showing the model plans ahead
- Multilingual Circuits: The model uses both language-specific and language-agnostic features, with English functioning as a "default" output language
- Hallucination Detection: Identified "known entity" features that inhibit refusal responses for familiar topics, with hallucinations occurring when these features misfire
Circuit Analysis Methods
- Feature Networks: 30 million interpretable features traced across all model layers
- Visualization Interface: Interactive tools to explore attribution paths inside the model
- Pruning Techniques: Methods to simplify complex computational graphs while preserving key mechanisms
- Combined Approach: Integration of automated circuit discovery with human interpretation of mechanisms
Anthropic's researchers note their methods are still limited, working well for about 25% of prompts they've tried, with complex reasoning chains being more difficult to fully trace. The approach represents a significant step toward understanding the emergent capabilities and safety properties of large language models by methodically examining their internal mechanics rather than treating them as black boxes.
ChatGPT 4o: Significant Enhancements to Problem-Solving and Instruction Following
OpenAI has released a small update to GPT-4o, focusing on improvements to technical problem-solving, instruction following, and overall user experience. The March 27th release introduces several targeted enhancements to the model's capabilities.
Technical Improvements
- Code Generation: Produces cleaner, more functional frontend code that consistently compiles and runs
- Code Analysis: More accurately identifies necessary changes when examining existing code bases
- STEM Problem-Solving: Enhanced capabilities for tackling complex technical challenges
- Classification Accuracy: Higher precision when categorizing or labeling content
User Experience Refinements
- Instruction Adherence: Better follows detailed instructions, especially with multiple or complex requests
- Format Compliance: More precise generation of outputs according to requested formats
- Communication Style: More concise responses with fewer markdown hierarchies and emojis
- Intent Understanding: Improved ability to grasp the implied meaning behind user prompts
The updated model is now available in both ChatGPT and the API as the newest snapshot of chatgpt-4o-latest, with plans to bring these improvements to a dated model snapshot in the API in the coming weeks. These enhancements particularly benefit developers and technical users who rely on accurate code generation and complex problem-solving capabilities.
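If you want to try the new snapshot from code, here is a minimal sketch using the official openai Python SDK (it assumes OPENAI_API_KEY is set in your environment; the prompt is just an example):

```python
from openai import OpenAI

client = OpenAI()

# "chatgpt-4o-latest" is the rolling snapshot alias mentioned above;
# a dated snapshot for the API is still to come.
response = client.chat.completions.create(
    model="chatgpt-4o-latest",
    messages=[
        {"role": "system", "content": "Return only a single fenced code block."},
        {"role": "user", "content": "Write a React component for a debounced search box."},
    ],
)

print(response.choices[0].message.content)
```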
OpenVLThinker-7B: UCLA's Breakthrough in Visual Reasoning
UCLA researchers have released OpenVLThinker-7B, a vision-language model that significantly advances multimodal reasoning capabilities. The model addresses a critical limitation in current vision-language systems: their inability to perform multi-step reasoning when interpreting images alongside text.
Technical Architecture
- Base Model: Built on Qwen2.5-VL-7B foundation with specialized training pipeline
- Training Approach: Iterative combination of supervised fine-tuning (SFT) and reinforcement learning
- Data Processing: Initial captions generated with Qwen2.5-VL-3B, then processed by distilled DeepSeek-R1 for structured reasoning chains
- Optimization Method: Group Relative Policy Optimization (GRPO) used for reinforcement learning phases
Performance Metrics
- MathVista: 70.2% accuracy (versus 50.2% in base Qwen2.5-VL-7B)
- MathVerse: 68.5% accuracy (up from 46.8%)
- MathVision Full Test: 29.6% accuracy (improved from 24.0%)
- MathVision TestMini: 30.4% accuracy (up from 25.3%)
Training Methodology
- First SFT Phase: 25,000 examples from datasets including FigureQA, Geometry3K, TabMWP, and VizWiz
- First GRPO Phase: 5,000 harder samples, boosting accuracy from 62.5% to 65.6% on MathVista
- Second SFT Phase: Additional 5,000 high-quality examples, raising accuracy to 66.1%
- Second GRPO Phase: Final reinforcement learning round, pushing performance to 69.4%
The model generates clear reasoning traces that are both logically consistent and interpretable, demonstrating significant progress in bringing R1-style multi-step reasoning capabilities to multimodal systems. This advance has important applications in educational technology, visual analytics, and assistive technologies requiring complex visual reasoning.
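For readers unfamiliar with GRPO, the core idea in the reinforcement learning phases above is a group-relative advantage: sample several answers per image+question pair, then score each answer against the group. Here is a minimal sketch of that normalization (the standard GRPO formulation, not the UCLA training code):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    # Advantage = (reward - group mean) / group std; answers that beat their
    # own group get positive advantage, weaker ones get negative advantage.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Rewards for a group of sampled answers to one problem,
# e.g. 1.0 for a correct final answer and 0.0 otherwise.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
```

Because the baseline comes from the group itself, no separate value model is needed, which keeps the RL phases relatively cheap on top of the SFT checkpoints.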
Google TxGemma: Open Models for Accelerating Drug Discovery and Development
Google DeepMind has released TxGemma, a collection of open language models specifically designed to improve therapeutic development efficiency. Built on the Gemma 2 foundation models, TxGemma aims to accelerate the traditionally slow, costly, and risky process of drug discovery and development.
Model Architecture
- Base Foundation: Fine-tuned from Gemma 2 using 7 million therapeutic training examples
- Model Sizes: Available in three parameter scales - 2B, 9B, and 27B parameters
- Specialized Versions: Each size includes a dedicated "predict" version for narrow therapeutic tasks
- Conversational Models: 9B and 27B "chat" versions support reasoning explanations and multi-turn discussions
Technical Capabilities
- Classification Tasks: Predicts properties like blood-brain barrier penetration and toxicity
- Regression Tasks: Estimates binding affinity and other quantitative drug properties
- Generation Tasks: Produces reactants for given products and other synthetic chemistry tasks
- Fine-Tuning Framework: Includes example notebooks for adapting models to proprietary datasets
Performance Metrics
- Benchmark Results: The 27B predict version outperforms or matches the previous Tx-LLM on 64 of 66 tasks
- Task-Specific Comparisons: Equals or exceeds specialized single-task models on 50 out of 66 tasks
- Trade-Offs: Chat versions sacrifice some raw performance for explanation capabilities
Agentic Integration
- Agentic-Tx System: TxGemma integrates with a therapeutics-focused agent powered by Gemini 2.0 Pro
- Tool Ecosystem: The agent leverages 18 specialized tools including PubMed search, molecular tools, and gene/protein tools
- Reasoning Performance: Achieves state-of-the-art results on chemistry and biology tasks from Humanity's Last Exam and ChemBench
TxGemma models are now available through both Vertex AI Model Garden and Hugging Face, accompanied by notebooks demonstrating inference, fine-tuning, and agent integration. This release represents a significant step toward democratizing advanced AI tools for therapeutic research, potentially chipping away at the roughly 90% failure rate of drug candidates that make it past phase 1 trials.
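Here is a minimal inference sketch with Hugging Face transformers. The repo id "google/txgemma-2b-predict" and the prompt phrasing are assumptions based on the release notes above; check the model card and the example notebooks for the exact task prompts:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/txgemma-2b-predict"  # assumed repo id for the 2B predict variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Example classification-style query: blood-brain barrier penetration for a
# SMILES string (aspirin), phrased as a yes/no question for the predict model.
prompt = ("Does the drug with SMILES CC(=O)OC1=CC=CC=C1C(=O)O cross the "
          "blood-brain barrier? Answer Yes or No.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)

# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```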
Tools & Releases YOU Should Know About
- Pieces is an on-device copilot that helps developers capture, enrich, and reuse code by providing contextual understanding of their workflow. The AI tool maintains long-term memory of your entire workstream, collecting live context from browsers, IDEs, and collaboration tools. It processes data locally for enhanced security while allowing you to organize and share code snippets, reference previous errors, and avoid cold starts. Unlike other solutions, Pieces keeps your code on your device while still integrating with multiple LLMs to provide sophisticated assistance.
- Quack AI is a VS Code extension designed to help developers adhere to project coding guidelines. The tool automatically scans code to identify violations of project-specific standards and provides suggestions for bringing code into compliance. By enforcing consistent coding practices across teams, Quack AI helps maintain code quality and reduces the time spent on code reviews. The extension can be customized to match specific project requirements and integrates seamlessly with existing development workflows.
- Supermaven offers a VS Code extension for autocomplete with an impressive 300,000-token context window. This dramatically exceeds the context limitations of most coding assistants, allowing the AI to understand much larger portions of your codebase when generating suggestions. The extension can analyze entire projects to provide more contextually relevant completions, understand complex dependencies, and generate code that fits seamlessly with existing architecture. Supermaven's large context window helps developers maintain consistency across extensive codebases and reduces the need to manually refresh the AI's understanding of project structure.
- Amazon Q Developer (formerly Amazon CodeWhisperer) is an AI coding assistant with comprehensive IDE and CLI integrations. The tool extends beyond basic coding assistance by offering support for VS Code, IntelliJ IDEA, AWS Cloud9, MacOS Terminal, iTerm2, and the built-in VS Code Terminal. Q Developer not only generates code but also scans existing code to identify and define security issues, helping developers address vulnerabilities early in the development process. With its broad integration capabilities and security-focused features, Q Developer provides a comprehensive solution for AI-assisted software development across multiple environments.
And that wraps up this issue of "This Week in AI Engineering", brought to you by jam.dev, your flight recorder for AI apps! Non-deterministic AI issues are hard to repro unless you have Jam: instantly replay the session, prompts, and logs to debug ⚡️
Thank you for tuning in! Be sure to share this newsletter with your fellow AI enthusiasts and subscribe to get the latest updates directly in your inbox.
Until next time, happy building!