Summary: Beginning at 0458 CDT July 27th, 2015 api.tokenex.com suffered a service disruption lasting 56 minutes. All other services were unaffected and fully operational during this time.
0458 - first external monitoring alert received
0520 - incident confirmed
0520 - team dispatched
0547 - issue identified with Database
0554 - issue resolved and service fully restored
Root Cause: A database maintenance task was in a "hung" state which resulted in a degradation of service. The database was active and responding during this time, but performance was severely degraded which caused the API to become unresponsive. The tool which monitors database maintenance tasks had its alerting threshold set too high for this job which caused no internal alert to be generated. Had the alert been generated initially, the job would have been terminated before impacting performance .
Resolution: The DBAs manually terminated the failing job.
Prevention: The DBA team will continue to monitor this job very closely and has revised the threshold value for this monitor so alerts are received more timely. Also, as a precautionary measure the team has performed an exhaustive review of all database maintenance tasks to ensure proper alert thresholds are in place.