# SandboxAQ Releases Massive Synthetic Molecule Dataset to Accelerate AI Drug Discovery

Jun 21, 2025 - 02:10

SandboxAQ, an AI and quantum technology startup spun out of Alphabet and backed by Nvidia, has released a large-scale dataset of synthetic 3D molecular structures intended to speed up small-molecule drug discovery. The dataset contains more than 5.2 million conformers spanning diverse chemical scaffolds and is designed to help researchers train and validate machine learning models that predict drug-target interactions and pharmacological properties.

Alongside optimized 3D geometries, the dataset provides labels such as protein binding affinity and chemical class, plus structural metadata, all derived from physics-based simulations and available experimental records. The stated goal is a gold-standard training resource for geometric deep learning models in drug design.

What’s Inside the Dataset

Each entry in the dataset includes:

  • 3D atomic coordinates in SDF format
  • SMILES strings and InChIKeys
  • Protein target annotations (when available)
  • Simulated and experimental binding scores
  • Molecular descriptors and class labels
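As a rough illustration of how the fields listed above sit together in an SDF record, here is a minimal standard-library sketch that reads atom coordinates and data tags from SDF text. It assumes V2000 molblocks, and the tag names it would encounter (for example `SMILES` or `binding_score`) are hypothetical, since the article does not specify the dataset's actual field names. In practice a toolkit such as RDKit would normally do this parsing.

```python
from dataclasses import dataclass, field


@dataclass
class SdfRecord:
    """One molecule from an SDF file: title, atoms, and data tags."""
    name: str
    atoms: list = field(default_factory=list)  # (element, x, y, z) tuples
    tags: dict = field(default_factory=dict)   # e.g. {"SMILES": "CO"}


def _parse_record(lines):
    """Parse the lines of a single record (without the "$$$$" terminator)."""
    name = lines[0].strip()
    # Counts line: the atom count occupies the first three columns (V2000).
    n_atoms = int(lines[3][0:3])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        # V2000 atom lines use fixed-width columns for x, y, z and the symbol.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    tags = {}
    i = 4 + n_atoms
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            start = line.index("<")
            key = line[start + 1:line.index(">", start)]
            # The tag value runs until the next blank line.
            value_lines = []
            i += 1
            while i < len(lines) and lines[i].strip():
                value_lines.append(lines[i])
                i += 1
            tags[key] = "\n".join(value_lines)
        i += 1
    return SdfRecord(name, atoms, tags)


def parse_sdf(text):
    """Yield an SdfRecord for every "$$$$"-terminated record in `text`."""
    record_lines = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if len(record_lines) >= 4:
                yield _parse_record(record_lines)
            record_lines = []
        else:
            record_lines.append(line)
```

Calling `list(parse_sdf(text))` on the contents of an SDF file returns one record per molecule. Tag values are kept as strings, so a numeric label such as a binding score would need an explicit `float(...)` conversion.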

SandboxAQ collaborated with pharma partners and used advanced quantum and AI simulations to generate accurate 3D conformers. These molecules were further filtered for drug-likeness, diversity, and relevance to common therapeutic targets in oncology, neurology, and immunology.
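The article does not say which drug-likeness criteria SandboxAQ applied. As one common example of such a filter, the sketch below counts violations of Lipinski's rule of five on precomputed descriptors. The `Descriptors` class and its field names are hypothetical, and allowing one violation is a widely used convention rather than anything stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical field names)."""
    mol_weight: float       # molecular weight in daltons
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d):
    """Count how many of Lipinski's four criteria the molecule breaks."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def is_drug_like(d, max_violations=1):
    """Treat a molecule as drug-like if it breaks at most `max_violations` criteria."""
    return rule_of_five_violations(d) <= max_violations
```

A filter like this is cheap enough to run over millions of entries, which is why simple descriptor thresholds are often applied before more expensive simulation or docking steps.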

Developer Integration

Developers and researchers can integrate the dataset with graph neural networks, transformer-based models, and molecular docking software. The files are structured for compatibility with frameworks like PyTorch Geometric, DeepChem, and RDKit.
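To make the graph neural network route concrete, the following sketch turns a molecule's bond list and element symbols into the two arrays most graph libraries expect: a bidirectional edge list and a node feature matrix. It uses plain Python lists so it runs without any framework. The function names and the element vocabulary are illustrative assumptions, not part of the dataset or of any library.

```python
def bonds_to_edge_index(bonds):
    """Convert undirected bonds [(i, j), ...] into a [sources, targets] edge list.

    Each bond is emitted in both directions, the convention that
    message-passing libraries such as PyTorch Geometric use for
    undirected graphs.
    """
    sources, targets = [], []
    for i, j in bonds:
        sources += [i, j]
        targets += [j, i]
    return [sources, targets]


def one_hot_elements(elements, vocabulary=("C", "N", "O", "S", "F", "Cl")):
    """Build one-hot node features; the last slot marks elements outside `vocabulary`."""
    features = []
    for element in elements:
        row = [0] * (len(vocabulary) + 1)
        if element in vocabulary:
            row[vocabulary.index(element)] = 1
        else:
            row[-1] = 1
        features.append(row)
    return features
```

For a framework such as PyTorch Geometric, these lists would then be wrapped in tensors (node features as floats, the edge list as integer indices) before being passed to a model.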

Sample Usage

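```python
# Loader module and attribute names are as given in the announcement.
from sandboxaq_loader import MoleculeDataset

# Open the SDF release file and iterate over its molecules.
dataset = MoleculeDataset("sandboxaq_5m.sdf")
for mol in dataset:
    # Each entry exposes its SMILES string and binding score.
    print(mol.smiles, mol.binding_score)
```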

Modelers can train regression or classification models to predict bioactivity, solubility, or ADMET properties. By using the structural information embedded in 3D coordinates, AI systems can learn features that generalize across diverse chemical families.
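One reason 3D coordinates can yield features that transfer across molecules is that interatomic distances do not change when a molecule is rotated or moved. The sketch below, which is an illustration and not SandboxAQ's method, computes a distance matrix and a sorted distance fingerprint from a list of `(x, y, z)` tuples. Both function names are hypothetical.

```python
import math


def pairwise_distances(coords):
    """Return the full matrix of Euclidean distances between atoms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def distance_fingerprint(coords, ndigits=4):
    """Return the sorted list of unique-pair distances, rounded to `ndigits`.

    The result is unchanged by rotating or translating the molecule,
    and by reordering its atoms.
    """
    n = len(coords)
    dists = [
        math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sorted(round(d, ndigits) for d in dists)
```

Geometric deep learning models typically build on the same idea, consuming distances or other quantities that behave predictably under rotation instead of raw coordinates.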

Why This Matters

The release is one of the largest publicly available synthetic datasets for AI-powered drug discovery. In an industry where early-stage research is slowed by high failure rates and expensive experiments, synthetic data is becoming increasingly important. Models trained on SandboxAQ’s dataset could reduce false positives and help prioritize promising leads before any physical assays are performed.

By combining AI-generated structures with high-quality annotations, SandboxAQ offers a blueprint for how simulation and data-driven modeling can work together. The intended result is a faster, cheaper, and more scalable approach to identifying new drug candidates.

Next Steps and Community Access

The dataset is available now for academic and non-commercial research, with plans to expand access to industry users via subscription. SandboxAQ has also announced upcoming benchmarks for protein-ligand prediction tasks based on this dataset.

As the community begins building and testing models using this resource, SandboxAQ invites contributions, feedback, and proposals for collaborative validation studies.

Sources

https://www.reuters.com/business/healthcare-pharmaceuticals/nvidia-backed-ai-startup-sandboxaq-creates-new-data-speed-up-drug-discovery-2025-06-18/

https://www.sandboxaq.com/press/2025-drug-discovery-dataset-release