Incident start: 15:39 CET
Incident end: 16:31 CET
Duration: 52 minutes
A misconfiguration caused database connection pool exhaustion in user-management-service, which resulted in a very high failure rate. As a consequence, all HTTP threads for customer instances were blocked waiting for responses from user-management-service.
Initially, this was identified as a DDoS attack or a large spike in load. Further investigation showed that the service had exhausted its database connection pool and, after failing to obtain a connection from the pool for 20 seconds, was returning an error response.
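The failure mode described above can be illustrated with a minimal sketch. It is not the service's actual code: the class and function names are hypothetical, strings stand in for real database connections, and a 0.05-second timeout is used in place of the 20-second wait so the example runs quickly.

```python
import queue


class PoolExhaustedError(Exception):
    """Raised when no connection becomes free within the acquire timeout."""


class BoundedPool:
    """Hypothetical fixed-size pool; strings stand in for real connections."""

    def __init__(self, size, acquire_timeout_s):
        self._acquire_timeout_s = acquire_timeout_s
        self._free = queue.Queue(maxsize=size)
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self):
        try:
            # Block until a connection is released or the timeout expires.
            return self._free.get(timeout=self._acquire_timeout_s)
        except queue.Empty:
            raise PoolExhaustedError(
                f"no free connection after {self._acquire_timeout_s}s"
            ) from None

    def release(self, conn):
        self._free.put(conn)


def demo():
    # A pool that is too small for the load: both connections are held.
    pool = BoundedPool(size=2, acquire_timeout_s=0.05)
    held = [pool.acquire(), pool.acquire()]
    try:
        pool.acquire()  # the caller waits for the full timeout, then fails
        outcome = "acquired"
    except PoolExhaustedError:
        outcome = "exhausted"
    # Once a connection is released, later callers succeed again.
    pool.release(held.pop())
    recovered = pool.acquire()
    return outcome, recovered
```

While a caller waits inside acquire(), the thread serving its request stays occupied, which is how an undersized pool in one service can tie up HTTP threads in the services that call it.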
Root cause: due to a misconfiguration, the service was using the default value for the database connection pool size instead of the explicitly configured size.
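One possible safeguard against this class of problem is to fail at startup when the pool size is not explicitly set, rather than silently falling back to a default. The sketch below only illustrates that idea; the function name and the setting key db.pool.max_size are assumptions and do not reflect the service's real configuration.

```python
def resolve_pool_size(settings, key="db.pool.max_size"):
    """Return the explicitly configured pool size, or raise if it is absent.

    `settings` is any mapping of configuration keys to values; the key
    name is a hypothetical placeholder.
    """
    raw = settings.get(key)
    if raw is None:
        # Refuse to start instead of silently using a library default.
        raise ValueError(f"{key} is not set; refusing to use a default")
    size = int(raw)
    if size < 1:
        raise ValueError(f"{key} must be at least 1, got {size}")
    return size
```

With this check, a configuration that omits or misspells the key is reported when the service starts, not when the pool runs out under load.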
Best Regards,
Worldticket Team