Dear Customers,
A spontaneous node reboot led to a split-brain situation on some internal gateway clusters. This overloaded the upstream gateways and halted request processing.
Timeline:
09:35 CET – Node reboot
09:36 CET – Investigation started
10:25 CET – Most services back online, but infrastructure not yet in an operational state
11:00 CET – Stalled balancing nodes identified and rebooted
11:15 CET – Majority of systems operational
12:15 CET – GDS channels recovered. Systems fully operational.
Next Steps:
Detailed analysis and instructions for the Operations Team – DONE.
Redesign balancer architecture to allow faster failover/recovery in case of a node failure or reboot – IN PROGRESS.
Decommission the suspect node – IN PROGRESS.
Kind regards,
Worldticket Team