Massive Google Cloud Outage Linked to API Management System


Jun 16, 2025 - 12:40

Google Cloud experienced one of its most significant outages in recent history on June 12, 2025, when a critical failure in its API management system brought down dozens of services worldwide for up to seven hours. 

The incident affected millions of users across Google Cloud Platform (GCP) and Google Workspace products due to a null pointer exception in the Service Control binary that manages API authorization and quota policies.

Binary Crashes Cause Global Outage

The outage originated from Google’s Service Control system, a regional service responsible for authorizing API requests and enforcing quota policies across the company’s infrastructure. 

On May 29, 2025, engineers had deployed a new feature for additional quota policy checks, but the code lacked proper error handling and feature flag protection.

The crisis began when a policy change containing unintended blank fields was inserted into regional Spanner tables that Service Control relies on for policy data. Due to the global nature of quota management, this corrupted metadata replicated worldwide within seconds. 

When Service Control attempted to process these blank fields, it triggered the unprotected code path, causing a null pointer exception that sent the binaries into a crash loop across all regions simultaneously.

“The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash,” Google explained in its incident report. 
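To illustrate the failure mode Google describes, the following Go sketch contrasts an unguarded code path that dereferences a blank (nil) policy field with a version protected by error handling and a feature flag. All type and function names here are illustrative assumptions, not Google's actual Service Control code.

```go
// Hypothetical sketch of the failure mode described in the report: a policy
// record with a blank (nil) field is dereferenced on an unguarded code path.
package main

import (
	"errors"
	"fmt"
)

type LimitSpec struct {
	RequestsPerMinute int
}

type QuotaPolicy struct {
	Limits *LimitSpec // may be nil if the policy row has blank fields
}

// unguardedCheck mirrors the unprotected path: it assumes Limits is always
// set, so a blank policy triggers a nil-pointer panic and crashes the process.
func unguardedCheck(p *QuotaPolicy) int {
	return p.Limits.RequestsPerMinute // panics when Limits is nil
}

// guardedCheck adds the two missing safeguards: a feature flag gating the new
// behavior, and explicit error handling for malformed policy data.
func guardedCheck(p *QuotaPolicy, newQuotaChecksEnabled bool) (int, error) {
	if !newQuotaChecksEnabled {
		return 0, nil // new check disabled; fall back to existing behavior
	}
	if p == nil || p.Limits == nil {
		return 0, errors.New("policy has blank quota fields; skipping new check")
	}
	return p.Limits.RequestsPerMinute, nil
}

func main() {
	blank := &QuotaPolicy{} // simulates a corrupted row with blank fields
	if limit, err := guardedCheck(blank, true); err != nil {
		fmt.Println("handled gracefully:", err)
	} else {
		fmt.Println("limit:", limit)
	}
}
```

With either safeguard in place, a malformed policy row would have been rejected or ignored instead of crashing every Service Control binary that read it.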

The company’s Site Reliability Engineering team identified the root cause within 10 minutes and deployed a “red-button” kill switch within 40 minutes to disable the problematic serving path.

While most regions recovered within two hours, the us-central1 region experienced prolonged difficulties. 

As Service Control tasks restarted in this major region, they created a “herd effect” on the underlying Spanner infrastructure, overwhelming the database with simultaneous requests.

Google engineers discovered that Service Control lacked proper randomized exponential backoff mechanisms to prevent this cascading failure. The company had to throttle task creation and route traffic to multi-regional databases to reduce the load on the overloaded infrastructure. 
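The missing mechanism, randomized exponential backoff, is sketched below. The retry loop and parameters are assumptions for illustration; the idea is that each restarting task waits a doubling, jittered interval before retrying, so thousands of tasks do not hammer the backing database at the same instant.

```go
// Minimal sketch of randomized exponential backoff with jitter, the kind of
// mechanism the incident report says Service Control lacked during the
// us-central1 restart.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op, doubling the base delay each attempt and adding
// random jitter so that many restarting tasks spread out their requests
// instead of overwhelming the backend simultaneously.
func retryWithBackoff(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, base * 2^attempt).
		sleep := time.Duration(rand.Int63n(int64(base) << attempt))
		time.Sleep(sleep)
	}
	return errors.New("operation failed after max attempts")
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("backend overloaded") // simulate herd-induced failures
		}
		return nil
	}, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```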

This extended recovery process affected critical services including Google Compute Engine, BigQuery, Cloud Storage, and numerous other products that form the backbone of many enterprises’ digital operations.

Mitigations

In response to the widespread disruption, Google has outlined extensive remediation measures to prevent similar incidents. 

The company immediately froze all changes to the Service Control stack and manual policy pushes pending complete system remediation.

Key improvements include modularizing Service Control’s architecture to fail open rather than closed, ensuring that API requests can still be served even when individual checks fail. 

Google also committed to auditing all systems consuming globally replicated data and enforcing feature flag protection for all critical binary changes.
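A hedged sketch of the fail-open behavior described above: if a policy check itself errors, the request is still served rather than rejected, so a broken check cannot take down API serving. The function and type names are assumptions, not Google's actual Service Control API.

```go
// Illustrative fail-open authorization: only an explicit, successful denial
// blocks a request; errors in the check itself do not.
package main

import (
	"errors"
	"fmt"
)

// checkQuota stands in for a policy lookup that may fail, for example because
// the policy data is malformed or the backend is unavailable.
func checkQuota(project string) (allowed bool, err error) {
	return false, errors.New("quota policy data unavailable")
}

// authorizeFailOpen serves the request when the check fails, logging-style
// handling omitted for brevity.
func authorizeFailOpen(project string) bool {
	allowed, err := checkQuota(project)
	if err != nil {
		// Fail open: a broken check should not block API serving.
		return true
	}
	return allowed
}

func main() {
	fmt.Println("request served:", authorizeFailOpen("example-project"))
}
```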

The incident affected over 60 Google Cloud and Workspace products, including Gmail, Google Drive, Google Meet, App Engine, Cloud Functions, and Vertex AI services. 

Google emphasized that existing streaming and Infrastructure-as-a-Service resources remained operational, though customers experienced intermittent API and user interface access issues throughout the outage.

