Incident Overview
On April 7, 2025, at 22:41 UTC, we began experiencing issues affecting Secret Server Cloud instances in the Australia region. The impact included disruptions to core services such as heartbeats, password changes, and proxy session initiation. These issues were traced to failures within the Azure Service Bus infrastructure used by our platform. The incident was mitigated, and service was fully restored by April 8, 2025, at 17:20 UTC.
Root Cause
The root cause is still under investigation by Microsoft’s Azure product group. However, the following contributing factors have been identified:
Service Bus Namespace and Queue Failures
- Some of our Azure Service Bus namespaces, which Azure hosts as a PaaS (Platform as a Service) offering, became impaired, causing message publishing to fail. This impacted essential operations such as heartbeats, password changes, and session launches.
- The password change response message queue became non-functional and could not be deleted or re-created, preventing the platform from processing completed password changes.
Mitigation and Resolution
- Attempts to recover the faulty queue through deletion and re-creation failed because Azure Service Bus was unresponsive. Because Service Bus is a PaaS offering, the underlying infrastructure is outside our control, so we escalated a high-priority support case to the Microsoft Azure team.
- At 17:20 UTC on April 8, we initiated a failover to the secondary Azure region in Australia, which allowed normal processing of RPCs.
- Additional compute resources were provisioned and worker services were scaled up to handle the backlog of messages. This allowed message queues to drain normally.
- The Azure product team remediated the underlying issue with the Service Bus infrastructure on April 9, 2025.
- Subsequently, traffic was redirected back to the primary AU region on April 9, 2025, where services are now stable.
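To illustrate the general idea behind falling back to a secondary region when publishing to the primary fails, here is a minimal sketch in Python. It is not the platform's actual failover mechanism, which was an operational regional failover; the region names, the `publish_with_failover` function, and the stand-in publisher functions are all hypothetical and do not use any real Azure SDK calls.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region accepted the message."""


def publish_with_failover(
    message: bytes,
    publishers: Sequence[Tuple[str, Callable[[bytes], None]]],
) -> str:
    """Try each region in order; return the name of the region that accepted the message."""
    errors: List[str] = []
    for region, publish in publishers:
        try:
            publish(message)
            return region
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical stand-ins for real message senders (assumptions for illustration only).
def primary_publish(message: bytes) -> None:
    # Simulates an impaired namespace that rejects every publish attempt.
    raise ConnectionError("namespace impaired")


secondary_received: List[bytes] = []


def secondary_publish(message: bytes) -> None:
    # Simulates a healthy secondary region by recording the message.
    secondary_received.append(message)


used_region = publish_with_failover(
    b"heartbeat",
    [("au-primary", primary_publish), ("au-secondary", secondary_publish)],
)
```

In this sketch the ordering of the `publishers` list encodes region priority, so returning traffic to the primary region amounts to the primary publisher succeeding again.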
Preventative Actions
- We continue to work with Microsoft to determine the root cause and obtain a permanent fix for the Service Bus instability in the AU region.
- We are improving our failover documentation to enable quicker regional transitions in similar scenarios.
- A follow-up update will be posted once we receive further analysis from the Azure product team.