On May 28, 2025, between 7:30 PM and 7:55 PM Central, a subset of customers in the US region experienced intermittent login issues when accessing their tenants. Affected users may have encountered slow responses or HTTP 504 errors during login attempts. The issue was isolated to the US region; customers in other regions were not affected. Service was fully restored by 7:55 PM Central.
The incident was triggered by a recent change in how Active Directory (AD) user change events are processed. A fix in the latest release corrected a bug that had previously routed AD change events directly through the API. With the fix in place, these events were correctly routed through the message queueing system to be processed by worker services.
However, the production environment generates a significantly higher volume of AD user changes than our testing environments. Once these events began flowing through the queue, the resulting surge in messages overwhelmed the message queueing instance, which ran out of disk space. Other platform services could then no longer read from or write to the shared message bus, leading to request timeouts and intermittent login failures for impacted users.
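As a rough illustration of this failure mode, the sketch below estimates how long a disk-backed queue can absorb a surge when messages are published faster than workers consume them. The function name, parameters, and all figures are hypothetical and chosen only for illustration; they are not measurements from this incident or a description of the actual queueing system.

```python
from typing import Optional


def seconds_until_disk_full(
    publish_rate: float,       # messages published per second (hypothetical)
    consume_rate: float,       # messages consumed per second (hypothetical)
    avg_message_bytes: float,  # average on-disk size of one message (hypothetical)
    free_disk_bytes: float,    # free disk space available to the queue (hypothetical)
) -> Optional[float]:
    """Estimate the seconds until a growing backlog exhausts free disk space.

    Returns None when consumers keep up with publishers, because the
    backlog does not grow in that case.
    """
    net_rate = publish_rate - consume_rate
    if net_rate <= 0:
        return None
    backlog_bytes_per_second = net_rate * avg_message_bytes
    return free_disk_bytes / backlog_bytes_per_second


# Illustrative values only: 5,000 msg/s in, 1,000 msg/s out,
# 2 KiB per message, 10 GiB of free disk.
estimate = seconds_until_disk_full(5000, 1000, 2048, 10 * 1024**3)
if estimate is not None:
    print(f"Disk would fill in roughly {estimate / 60:.1f} minutes")
```

The point of the sketch is that the time to exhaustion depends on the gap between publish and consume rates, which is why a change that behaves well under test volumes can fail quickly under production volumes.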
Our operations team resolved the issue by clearing a backlog of system messages on the message queueing instance, which restored normal login behavior. No customer data was lost during this process, as the cleared messages were temporary and used only for background processing.
To prevent recurrence and improve resilience, we are taking the following actions:
We sincerely apologize for the disruption and are committed to continuing to strengthen the reliability of our platform.