We sincerely apologize for any inconvenience that yesterday’s Exchange outage caused our customers. Here’s the root cause analysis, along with our plan to prevent a recurrence:
Incident timeline: 11:15 AM CT to 2:30 PM CT
Symptoms/Impacts: During the incident, mail flow was delayed by 15-20 minutes on three occasions and by nearly an hour on one occasion. Some customers would also have experienced a few login prompts from Outlook and forced reconnects in OWA. However, these symptoms did not affect a wide range of customers and should have mostly been confined to about one hour of the outage window.
Root cause: Bulk disabling of Split Domain Routing caused the Microsoft Exchange message routing table to become corrupt.
Troubleshooting/Corrective Actions: Our monitoring system detected the performance problem and alerted us properly. Finding and fixing the root cause took most of the outage time because the routing table had been replicated to most of the servers in our Hosting Exchange environment.
Follow Up (Actions and Changes): The bulk disabling has run daily for the past several years without incident, which is another reason it took so long to track down the root cause. To ensure something like this doesn’t impact us in the future, we have modified that process to run only on weekends.