On July 18, 2025, between 12:40 PM and 2:37 PM CT, customers in the US region experienced degraded Platform performance, including slow-loading pages and intermittent HTTP failures. The issue was confined to the US region, with services in other regions operating normally.
The degradation was caused by CPU starvation in the shared cluster, which slowed several Platform services and produced the observed latency and availability issues.
A background service responsible for processing audit session data unexpectedly scaled up and consumed a disproportionate share of cluster resources. This led to resource contention within the cluster, affecting the performance of other services that rely on shared capacity. As a result, some Platform services experienced slow responses and intermittent failures.
The issue was mitigated by scaling down the audit workload's CPU allocation, which released significant capacity back to the cluster. Service performance improved shortly afterward and returned to normal by 2:37 PM CT.
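The mitigation above — capping how much CPU the audit workload can consume so it cannot starve co-located services — can be sketched as a resource limit. This is a minimal illustrative fragment assuming a Kubernetes-style orchestrator; the deployment name `audit-session-processor` and the specific values are hypothetical, not taken from the incident:

```yaml
# Hypothetical manifest fragment: bound the audit processor's CPU so a
# scale-up cannot monopolize shared cluster capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: audit-session-processor   # illustrative name
spec:
  template:
    spec:
      containers:
        - name: processor
          resources:
            requests:
              cpu: "500m"   # guaranteed baseline share
            limits:
              cpu: "1"      # hard ceiling; usage beyond this is throttled
```

Setting an explicit CPU limit trades peak throughput for the background job against predictable headroom for latency-sensitive Platform services sharing the same nodes.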
To prevent recurrence, the following actions are in progress: