Summary of impact:
· Client authentication and connectivity issues for some customers (Outlook, Mobile and OWA)
· Time-outs in Hosted Exchange customer management via the Customer Portal (cp.appriver.com)
· Delay in message delivery (internal and external)
Root cause and mitigation:
On 9-13-2018, we ran our regular purge and delete cycle on EXG7. Deleting a domain causes the mail-flow routing tables to update across all Exchange servers. During this update, the routing tables became corrupted and began generating a higher-than-normal volume of queries to our domain controllers. However, because traffic on Thursday and Friday was lower than normal, the extra queries did not push CPU usage to a level of concern.
At around 8:25 AM CST on 9-17-2018, the extra queries from the corrupt routing tables caused CPU usage on all the domain controllers to begin climbing, creating issues for the Customer Portal. Support began receiving calls about Customer Portal time-outs for Hosted Exchange shortly after. This prompted our Engineering team to investigate a possible correlation between the Customer Portal and Hosted Exchange. By 9:25 AM, CPU usage had spiked to 100% and Engineering began working to disable anything outside of Exchange itself that queries Hosted Exchange (Customer Portal/SecureTide). By 9:30 AM, all external connections (Customer Portal/SecureTide) had been disabled and CPU usage started to drop. Unfortunately, it began to rise again and had returned to 100% by 9:50 AM, because the queries generated by the corrupt routing tables consumed the freed-up CPU capacity.
By 2:00 PM, Engineering had worked through a process of elimination to determine what was causing the high CPU usage and discovered that the excessive routing table lookups were unable to complete. At that point, Engineering restarted all domain controllers in EXG7 and reset all routing table queries.
Around 2:00 PM, client connectivity was restored and Engineering began to bring all routing services back online. By 2:50 PM, all queues on the delivery servers had caught up and any delayed messages had been successfully delivered to EXG7 mailboxes. Routing table queries had also returned to normal by this time.
At 4:00 PM, we were able to confirm that all internal queues had returned to 0 and that the client connectivity issues had been resolved.
Next steps:
We understand that email is mission-critical to business and sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Hosted Exchange platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Enhancing validation at the time of maintenance and automatic detection of configuration anomalies post-maintenance
2. Attempting to optimize throttling of queries to prevent system overload
3. Fine-tuning alerts to provide quicker detection and remediation
4. Continuing to analyze detailed event data to determine if additional system modifications are required
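
As an illustration of the query throttling mentioned in item 2, the sketch below shows a token bucket, one common way to cap the rate at which a client may issue queries to a shared backend. It is a generic example, not the mechanism used on the Hosted Exchange platform; the class name, parameters, and rates are all hypothetical.

```python
import time


class TokenBucket:
    """Allow roughly `rate` queries per second, with bursts up to `capacity`.

    Hypothetical illustration only; the names and numbers are assumptions,
    not part of any real platform configuration.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable for testing
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a query may proceed now, False if it should be rejected or delayed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `allow()` before each directory query and back off when it returns False. Under this scheme a runaway query source is limited to its own budget rather than consuming all available capacity on the servers it queries.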