Which LLMs Are (and Aren't) Ready for Secure Code?

Initial Results from our New LLM Security Leaderboard on Hugging Face
AI models are becoming an integral part of software development, but there’s still no standard way to evaluate the coding security intelligence of the language models developers are using. To help fill that gap, we've launched the LLM Security Leaderboard on Hugging Face: an open, community-driven leaderboard that evaluates large language models across an initial set of four critical security dimensions.
Our early results highlight a sobering reality: even the most popular open-source models struggle with fundamental secure coding tasks. Not a single model scored above 50% across all four security dimensions we tested in our initial benchmark. As AI-generated code finds its way into production, reducing and identifying security weaknesses is more critical than ever for the long-term safety and resilience of our codebases. Read more about our initial methodology and our plans to further expand and refine our approach with input from the community.
Key Findings
- No Model Crosses the 50% Threshold: While average scores clustered in the 40% to 47% range, there’s ample room for improvement. The highest overall performer, Meta's Llama-3.2-3B-Instruct, achieved just 46.6% average across the four security dimensions.
- Llama 3.2-3B Leads in Bad Package Detection: Llama-3.2-3B-Instruct was the best at detecting malicious packages, correctly flagging ~29% of bad NPM and PyPI packages and outperforming the other models by nearly 19%. Even so, the results in this category show significant room for improvement: the majority of models detected fewer than 5% of bad packages, an important gap in securing software supply chains against AI-generated code.
- CVE Knowledge is Alarmingly Low: Awareness of Common Vulnerabilities and Exposures (CVEs) in dependencies is a basic requirement for secure code. Yet most models scored between 8% and 18% accuracy in this category. Qwen2.5-Coder-3B-Instruct was the leader, but still scored low at 18.25%. These results suggest that the depth and consistency of CVE knowledge should be significantly improved.
- Insecure Code Recognition - A Mixed Bag: Top models like Qwen2.5-Coder-32B-Instruct and microsoft/phi-4 successfully identified vulnerabilities in roughly half of the code snippets presented. Lower-performing models recognized vulnerabilities in fewer than a quarter of cases, highlighting significant inconsistency and underscoring the need for more targeted training on secure coding practices.
- SafeTensors Usage - A Bright Spot: Every model evaluated scored 100% for using the SafeTensors format, showing promising community-wide progress against arbitrary code execution risks from unsafe serialization formats (a minimal loading sketch follows this list).
- Model Size != Security: While larger models often perform better on general benchmarks, security-specific performance varied significantly. Smaller models like Llama-3.2-3B-Instruct and IBM's Granite 3.3-2B-Instruct punched above their weight, reinforcing that sheer model size is not decisive and that architecture, training methodologies, and datasets play crucial roles in security capabilities.
- Newer != Better: Surprisingly, newer models like Qwen2.5-Coder-32B-Instruct (knowledge cutoff June 2024) and Granite-3.3-2B-Instruct (knowledge cutoff April 2024) show about the same or lower bad package and CVE detection capabilities as older models like Llama-3.2-3B-Instruct (knowledge cutoff March 2023), suggesting that these newer models were not trained on the latest bad package and CVE knowledge.
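For readers unfamiliar with the serialization risk behind the SafeTensors finding, the sketch below contrasts loading safetensors weights with loading a pickle-based checkpoint. It is only an illustration of the format difference, not part of the benchmark; the file names are placeholders and PyTorch weights are assumed.

```python
# Minimal sketch (placeholder file names): loading weights from safetensors
# instead of a pickle-based checkpoint. A safetensors file contains raw tensor
# data plus metadata, so loading it cannot execute arbitrary code the way
# unpickling can.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # reads tensors only; no code runs

# By contrast, pickle-based checkpoints go through torch.load(), which can run
# attacker-controlled code during unpickling unless weights_only=True is passed
# (available in recent PyTorch releases).
```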
Spotlight: Popular Models
- Llama-3.2-3B-Instruct (46.6%): Leads the pack, but still detects bad packages less than 30% of the time, shows low CVE knowledge (15%), and recognizes only 42% of insecure code examples.
- Llama-3.1-8B-Instruct (42.9%) and Granite 3.3‑2B‑Instruct (42.6%): Tightly clustered in second and third place. They both outperform the leader in recognizing insecure code, but do substantially worse at bad package recognition and fare similarly at CVE knowledge.
- Qwen2.5-Coder-3B-Instruct: Leading in CVE knowledge, but still far from ideal at 18%.
- Granite 3.3-2B-Instruct: Notable for being a smaller model performing on par with larger competitors.
- DeepSeek models (including DeepSeek-Coder-6.7B-Instruct): Not in the top ten, with inconsistent results across categories, leaving clear opportunities to bring secure coding performance in line with their popularity and adoption.
- Google’s Gemma models: Also lower in the rankings. Gemma-3-27b-it demonstrated relatively strong performance in insecure code recognition, but both Gemma models completely failed at bad package detection.
What This Means for Developers and Researchers
These findings should guide how teams approach secure AI adoption for software development:
- Select models thoughtfully, especially when using LLMs in security-sensitive codegen workflows.
- Prioritize secure prompting techniques - careless prompting can exacerbate vulnerabilities (a prompting sketch follows this list).
- Complement LLMs with security-aware tools, like Stacklok's open-source project CodeGate, to reinforce defenses.
- Augment LLMs with Retrieval-Augmented Generation (RAG), using knowledge from leading vulnerability datasets such as NVD, OSV, and Stacklok Insight (a minimal dependency-check sketch follows this list).
- Push for better fine-tuning and training on security datasets across the community.
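As one example of secure prompting, the sketch below wraps a code-generation request in security-focused system instructions. It uses the OpenAI Python client purely as a stand-in for whatever OpenAI-compatible endpoint you run; the model name, prompt wording, and API-key environment variable are illustrative assumptions, and prompting alone is no substitute for scanning and review.

```python
# Sketch: security-focused system instructions around a codegen request.
# The guidance text is illustrative, not a guaranteed mitigation.
from openai import OpenAI  # any OpenAI-compatible client works similarly

SECURE_CODING_SYSTEM_PROMPT = """You are a coding assistant. When writing code:
- Only import packages you are certain exist on PyPI/NPM; never guess names.
- Prefer maintained, widely used libraries and pin their versions.
- Use parameterized queries, validate external input, and avoid eval/exec.
- Call out any generated code that handles secrets, auth, or deserialization."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute the model you are evaluating
    messages=[
        {"role": "system", "content": SECURE_CODING_SYSTEM_PROMPT},
        {"role": "user", "content": "Write a function that stores a user record in Postgres."},
    ],
)
print(response.choices[0].message.content)
```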
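And as a small step toward the RAG-style augmentation above, the sketch below checks an LLM-suggested dependency against the public OSV.dev query API before it is installed. The package name and version are placeholders; a fuller pipeline would feed the returned advisories back into the model's context.

```python
# Sketch: vetting an LLM-suggested dependency against OSV.dev before adding it.
import requests

def known_vulnerabilities(name: str, version: str, ecosystem: str = "PyPI") -> list:
    """Return OSV advisories recorded for this exact package version."""
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"version": version, "package": {"name": name, "ecosystem": ecosystem}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("vulns", [])  # empty response means no known advisories

# Example usage with a placeholder package/version proposed by a model.
for adv in known_vulnerabilities("requests", "2.19.1"):
    print(adv["id"], "-", adv.get("summary", "no summary available"))
```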
Get Involved
This is just the beginning. The LLM Security Leaderboard is live at Hugging Face, and we're inviting the community to submit models, suggest new evaluation methods, and contribute to a stronger, safer AI ecosystem.
Explore the leaderboard. Submit your models. Join the conversation.
Let's build a future where AI coding is safe and secure.