Platform: US - users are experiencing slow page loads

Incident Report for Delinea

Postmortem

Incident Overview

On Monday, July 7, 2025, from 8:20 AM to 11:00 AM CT, a subset of customers experienced elevated latency and intermittent service failures while interacting with platform services hosted in the US region. The impact was limited to workloads running in a specific cluster that encountered DNS resolution failures. Affected services were unable to reliably communicate with other internal services or reach external dependencies, resulting in delayed responses, timeouts, and occasional service unavailability.

Root Cause

The incident was caused by a DNS resolution failure within one of the system nodes in a Kubernetes cluster. During automated platform maintenance performed by the cloud provider, the underlying system node lost connectivity to the internal DNS resolver. As a result, the DNS service pod running on that node was unable to forward DNS queries to the upstream resolver. This caused internal service communication failures and disrupted connections to external services such as messaging and database endpoints.

According to the cloud provider’s engineering team, live migration of virtual machines during maintenance is intermittently failing to restore proper network connectivity on the new host. This leads to DNS resolution issues for pods scheduled on the affected nodes.

Draining and replacing the affected system nodes allowed the DNS services to restore full resolution capabilities, which mitigated the issue.

Preventive Actions

To reduce the impact going forward, the following steps have been taken:

  • New monitors have been implemented to detect DNS resolution failures proactively within clusters.
  • An auto-remediation workflow has been deployed to detect DNS failures and automatically mitigate them.
  • Platform architecture leverages multiple clusters per region, which helped limit the overall blast radius during this incident.
  • Delinea’s engineering team continues to work closely with the cloud provider to drive a permanent fix for the underlying issue within the managed Kubernetes service.
Posted Jul 14, 2025 - 14:33 EDT

Resolved

A subset of Platform customers in US region experienced elevated latency and service disruptions. Customers in other regions were not impacted.
Access has been fully restored and the issue is now resolved. We are continuing to monitor the platform to ensure stability. If you have any questions or require further assistance, please contact our support team at https://support.delinea.com.
Posted Jul 07, 2025 - 10:30 EDT