Secret Server Cloud: Australia Region - RPC and Heartbeat issues

Incident Report for Delinea

Postmortem

Incident Overview

On April 7, 2025, at 22:41 UTC, we began experiencing issues affecting Secret Server Cloud instances in the Australia region. The impact included disruptions to core services such as heartbeats, password changes, and proxy session initiation. These issues were traced to failures within the Azure Service Bus infrastructure used by our platform. The incident was mitigated, and service was fully restored by April 8, 2025, at 17:20 UTC.

Root Cause

The root cause is still under investigation by Microsoft’s Azure product group. However, the following contributing factors have been identified:

Service Bus Namespace and Queue Failures

  1. Some of our Azure-hosted Service Bus namespaces (a Platform as a Service, or PaaS, offering) became impaired, causing message publishing to fail. This affected essential operations such as heartbeats, password changes, and session launches.
  2. The password change response message queue became non-functional and could not be deleted or re-created, preventing the platform from processing completed password changes.

Mitigation and Resolution

  • Attempts to recover the faulty queue through deletion and re-creation failed because Azure Service Bus was unresponsive. Because Service Bus is a PaaS offering, the underlying issue was outside our control, so we escalated a high-priority support case to the Microsoft Azure team.
  • At 17:20 UTC on April 8, we initiated a failover to the secondary Azure region in Australia, which allowed normal processing of RPCs.
  • Additional compute resources were provisioned and worker services were scaled up to handle the backlog of messages. This allowed message queues to drain normally.
  • The Azure product team remediated the underlying issue with the Service Bus infrastructure on April 9, 2025.
  • Subsequently, traffic was redirected to the primary AU region on April 9, 2025, where services are now stable.
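
For readers who want a concrete picture of the failover step, the following is a minimal sketch of a publisher that switches to a secondary endpoint after repeated publish failures. It is illustrative only and does not represent our production mechanism: the failover described above was initiated by our team, not triggered automatically. The names FailoverPublisher, PublishError, and max_failures are hypothetical, and the send functions stand in for whatever message bus client is in use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class PublishError(Exception):
    """Raised by a (hypothetical) transport when a message cannot be published."""


@dataclass
class FailoverPublisher:
    # Ordered (region name, send function) pairs; the first entry is the primary.
    endpoints: List[Tuple[str, Callable[[str], None]]]
    max_failures: int = 3  # assumed threshold, chosen for illustration
    active: int = 0        # index of the endpoint currently in use
    failures: int = 0      # consecutive failures on the active endpoint

    def publish(self, message: str) -> str:
        """Publish a message and return the name of the region that accepted it."""
        while True:
            name, send = self.endpoints[self.active]
            try:
                send(message)
                self.failures = 0
                return name
            except PublishError:
                self.failures += 1
                if self.failures >= self.max_failures:
                    if self.active + 1 >= len(self.endpoints):
                        raise  # no further region to fail over to
                    # Switch to the next region and reset the failure counter.
                    self.active += 1
                    self.failures = 0
```

In this sketch the publisher stays on the secondary endpoint once it has switched. Returning to the primary is left as a deliberate, separate step, which mirrors the redirection back to the primary AU region only after the underlying issue was remediated.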

Preventative Actions

  • We continue to work with Microsoft to determine the root cause and obtain a permanent fix for the Service Bus instability in the AU region.
  • We are improving our failover documentation to enable quicker regional transitions in similar scenarios.
  • A follow-up update will be posted once we receive further analysis from the Azure product team.
Posted Apr 17, 2025 - 18:28 EDT

Resolved

This incident has been resolved. Services have been stable since the failover to the secondary region in Australia.

We are continuing to work with Microsoft to determine the root cause of the faulty message bus in the primary region. Once identified and addressed, we will initiate a seamless failback to the primary region with no expected downtime.

A detailed postmortem will be shared once our investigation is complete.
Posted Apr 08, 2025 - 18:15 EDT

Monitoring

We are actively monitoring the AU region. We will provide further updates as more information becomes available.
Posted Apr 08, 2025 - 15:28 EDT

Update

As part of our mitigation efforts, we have failed over to our secondary region in Australia. Remote Password Changing (RPC) and heartbeats are now processing, and customers should see these activities reflected in their logs.

We continue to work with Microsoft to investigate the underlying issue affecting our primary region. We will provide further updates as more information becomes available.
Posted Apr 08, 2025 - 13:48 EDT

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 08, 2025 - 13:05 EDT

Update

We are actively investigating the ongoing issue. We appreciate your patience as we work to resolve it.
Posted Apr 08, 2025 - 11:08 EDT

Investigating

We are currently investigating reports from customers in the AU region who are experiencing issues with their Secret Server Cloud instances. The reported issues include:

- RPC heartbeat failures
- Heartbeats and password changes becoming unresponsive
- Connection Manager timeout errors
- Session recording malfunctions

Our team is actively working to identify the root cause and will provide updates as more information becomes available.
Posted Apr 08, 2025 - 04:05 EDT
This incident affected: AU (Secret Server Cloud).