Incident Overview
On Monday, July 7, 2025, from 8:20 AM to 11:00 AM CT, a subset of customers experienced elevated latency and intermittent service failures while interacting with platform services hosted in the US region. The impact was limited to workloads running in a specific cluster that encountered DNS resolution failures. Affected services were unable to reliably communicate with other internal services or reach external dependencies, resulting in delayed responses, timeouts, and occasional service unavailability.
Root Cause
The incident was caused by a DNS resolution failure within one of the system nodes in a Kubernetes cluster. During automated platform maintenance performed by the cloud provider, the underlying system node lost connectivity to the internal DNS resolver. As a result, the DNS service pod running on that node was unable to forward DNS queries to the upstream resolver. This caused internal service communication failures and disrupted connections to external services such as messaging and database endpoints.
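A failure of this shape can typically be confirmed by testing cluster DNS from a pod pinned to the suspect node. The sketch below assumes a standard CoreDNS/kube-dns setup; the node name `node-x` is a placeholder, not taken from the incident.

```shell
# Run a throwaway pod on the suspect node and query cluster DNS.
# ("node-x" is a hypothetical node name for illustration.)
kubectl run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 \
  --overrides='{"spec":{"nodeName":"node-x"}}' \
  -- nslookup kubernetes.default.svc.cluster.local

# Verify the cluster DNS Service still has healthy endpoints.
kubectl -n kube-system get endpoints kube-dns
```

A lookup that times out from the affected node but succeeds from a healthy one points at node-level connectivity to the resolver rather than a cluster-wide DNS outage.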
According to the cloud provider’s engineering team, live migration of virtual machines during maintenance can intermittently fail to restore network connectivity on the destination host, causing DNS resolution failures for pods scheduled on the affected nodes.
Draining and replacing the affected system nodes restored full DNS resolution and mitigated the issue.
Preventive Actions
To reduce the likelihood and impact of similar incidents, the following steps have been taken: