One of our virtual machines, which hosted the primary mail server (MX) for outgoing email, was lost during an internal migration performed by our hosting vendor (it would not boot, could not be accessed via KVM, etc.).
The failure was not detected promptly because our monitoring system used this same primary MX to send its alerts.
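The blind spot described above, where an alert channel depends on the component it is supposed to report on, can be checked for mechanically. The following is a minimal sketch, not our actual monitoring configuration: the `AlertChannel` type, the component names, and the channel names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertChannel:
    """A way of delivering alerts (hypothetical model for illustration)."""
    name: str
    depends_on: frozenset  # components that must be up for delivery to work


def usable_channels(channels, failed_components):
    """Return the channels that can still deliver when the given components are down."""
    failed = set(failed_components)
    return [c for c in channels if not (c.depends_on & failed)]


def has_blind_spot(channels, component):
    """True if a failure of `component` would leave no working alert channel."""
    return not usable_channels(channels, [component])


# Assumed setup resembling the incident: alerts are sent only by email
# through the primary MX, so a primary MX failure cannot be reported.
email_only = [AlertChannel("email", frozenset({"primary-mx"}))]

# Assumed improvement: add a channel that does not depend on the mail server.
with_fallback = email_only + [AlertChannel("sms", frozenset({"sms-gateway"}))]

print(has_blind_spot(email_only, "primary-mx"))
print(has_blind_spot(with_fallback, "primary-mx"))
```

Running a check like this over every monitored component would flag any component whose failure silences all of its own alerts.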
Timeline:
2021.11.04 08:41:00 UTC
To mitigate the issue, we switched to our backup MX server.
2021.11.04 08:41:00 UTC
After the switch to the secondary server, new emails were sent normally, but some email providers deferred acceptance, and those messages were queued on our side. The queue reached up to 10% of total volume, mostly mail to Outlook-hosted addresses. This happened because the backup server was a cold standby: its IP address had no sending reputation, so some receiving servers greylisted it immediately.
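For readers unfamiliar with deferrals: in SMTP, a reply code in the 4xx range is a temporary failure, which tells the sender to keep the message queued and retry later, while 5xx is a permanent rejection. Greylisting is typically signalled with a 4xx reply. The sketch below shows how the deferred share of a batch could be computed from reply codes. The function names and the sample codes are illustrative assumptions, not data from our mail logs.

```python
def classify_smtp_reply(code: int) -> str:
    """Map an SMTP reply code to a coarse delivery outcome by its first digit."""
    if 200 <= code < 300:
        return "delivered"  # positive completion
    if 400 <= code < 500:
        return "deferred"   # temporary failure: keep queued and retry
    if 500 <= code < 600:
        return "rejected"   # permanent failure: do not retry
    return "other"


def deferred_share(codes) -> float:
    """Fraction of delivery attempts that ended in a temporary failure."""
    codes = list(codes)
    if not codes:
        return 0.0
    deferred = sum(1 for c in codes if classify_smtp_reply(c) == "deferred")
    return deferred / len(codes)


# Made-up sample: nine accepted messages and one greylisted attempt.
sample_codes = [250] * 9 + [451]
print(deferred_share(sample_codes))
```

Tracking this ratio per receiving domain would make a reputation problem on a new sending IP visible shortly after a failover.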
2021.11.04 10:06 UTC
We assigned the IP of the primary email server to the secondary email server and all queued emails were sent out.
2021.11.04 15:00 UTC
All ticket confirmation emails were sent. We are working on reactivating the SendGrid email transport for customers who were using it.
We are now working on improving the reliability of our email transport subsystem.