Impact:
Beginning on 2/23/24 at approx. 21:10 UTC, engines that received the new version and had proxying turned on failed to start and were unavailable. No messages were processed by these engines and they were not reachable for any capability.
Issue:
An update to the engine SSH proxy implementation generated an assembly version conflict. This caused engines to fail loading the SSH proxy library.
Resolution:
The issue was resolved by reverting the change and releasing a new version of the Distributed Engine. Since the failure prevented the engine update check, SSH proxying was temporarily disabled and re-enabled for affected customers to allow engines to auto-upgrade. By 00:30 UTC on 2/24 the large majority of customer engines were updated to version 8.4.25.0. A very small percentage of engines had upgrade issues that could not be resolved from the back end. This list was given to support for direct resolution with the customers.
Action Items:
To address this and prevent future occurrences:
Add automation tests to cover additional scenarios with engine upgrade and proxying.
Improve Engine upgrade process so that an assembly load failure does not block upgrades.
Investigate options for better back-end tracking of engine issues.
Review and improve customer communication process.
Incident Start Time: 2/23/24 21:10 UTC
Incident End Time: 2/24/24 00:30 UTC