Some parts of the system are down

Incident Report for WorldTicket Operational Status

Postmortem

Dear Customers,

A spontaneous node reboot led to a split-brain situation on some internal gateway clusters. This overloaded the upstream gateways and halted request processing.
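
For readers unfamiliar with the term: in a split-brain situation, the nodes of a cluster lose contact with each other and more than one side keeps acting as the active one. A common safeguard is a majority (quorum) check, in which a node keeps serving only while it, together with the peers it can still reach, forms a strict majority of the cluster. The sketch below is purely illustrative and does not describe our gateway software; the function name `has_quorum` and the cluster sizes are hypothetical.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    Hypothetical helper for illustration only; not taken from any real gateway product.
    """
    # Count this node itself (+1) and require more than half of the cluster.
    return (reachable_peers + 1) > cluster_size // 2


# Three-node cluster split into a pair and a single isolated node:
pair_side_may_serve = has_quorum(reachable_peers=1, cluster_size=3)      # majority side
isolated_side_may_serve = has_quorum(reachable_peers=0, cluster_size=3)  # minority side
```

With such a check, only the majority side of a partition continues to process requests; the isolated node steps down, so two sides are never active at once.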

Timeline:

09:35 CET - Node reboot
09:36 CET - Investigation started
10:25 CET - Most services back online, but infrastructure not yet in an operational state
11:00 CET - Stalled balancing nodes identified and rebooted
11:15 CET - Majority of systems operational
12:15 CET - GDS channels recovered. Systems fully operational.

Next Steps:
Detailed analysis and instructions for the Operations Team - DONE
Redesign the balancer architecture to allow faster failover and recovery in case of a node failure or reboot - IN PROGRESS
Decommission the suspect node - IN PROGRESS

Kind regards
WorldTicket Team

Posted Jul 30, 2021 - 15:16 CEST

Resolved

Systems fully operational since 12:15 CET
Posted Jul 30, 2021 - 15:15 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 30, 2021 - 11:16 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 30, 2021 - 09:42 CEST
This incident affected: SMS Modules (Reservations, Queues, Pricing, Payment & Ticketing, Inventory & Schedules, Security (Roles & Users), Reporting, Localization).