Hosted Exchange issue

Incident Report for Zix | AppRiver

Postmortem

Summary of impact:

· Client authentication and connectivity issues for some customers (Outlook, Mobile and OWA)

· Hosted Exchange customer management via the Customer Portal (cp.appriver.com)

· Delay in message delivery (internal and external)

Root cause and mitigation:

On 9-13-2018, we ran our regular purge and delete cycle on EXG7. When a domain is deleted, the mail-flow routing tables update across all Exchange servers. During this update, the routing tables became corrupted and began generating a higher-than-normal volume of queries to our domain controllers. However, because traffic on Thursday and Friday was lower than normal, the extra queries did not push CPU usage to a level of concern.

At around 8:25 AM CDT on 9-17-2018, the extra queries from the corrupted routing tables caused CPU usage on all the domain controllers to begin to climb, creating issues for the Customer Portal. Support began receiving calls about Customer Portal time-outs for Hosted Exchange shortly after. This prompted our Engineering team to investigate a possible correlation between the Customer Portal and Hosted Exchange. By 9:25 AM, CPU usage had spiked to 100% and Engineering began working to disable everything outside of Exchange itself that queries Hosted Exchange (Customer Portal/SecureTide). By 9:30 AM, all external connections (Customer Portal/SecureTide) had been disabled and CPU usage started to drop. Unfortunately, it began to rise again and had returned to 100% by 9:50 AM, as queries from the corrupted routing tables consumed the freed-up CPU capacity.

By 2:00 PM, Engineering had worked through a process of elimination to determine what was causing the high CPU usage. They discovered that the excessive routing table lookups were unable to complete. At this time, Engineering restarted all domain controllers in EXG7 and reset all routing table queries.

Around 2:00 PM, client connectivity was restored and Engineering began to bring all routing services back online. By 2:50 PM, all queues on the delivery servers had caught up and any delayed messages had been successfully delivered to EXG7 mailboxes. Routing table queries had also returned to normal by this time.

At 4:00 PM, we were able to confirm that all internal queues had returned to 0 and that the client connectivity issues had been resolved.

Next steps:

We understand that email is mission-critical to business and sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Hosted Exchange platform and our processes to help ensure such incidents do not occur in the future. In this case, these steps include (but are not limited to):

1. Enhancing validation at the time of maintenance and automatic detection of configuration anomalies post-maintenance

2. Attempting to optimize throttling of queries to prevent system overload

3. Fine-tuning alerts to provide quicker detection and remediation

4. Continuing to analyze detailed event data to determine if additional systems modifications are required
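
As an illustration of item 2 only, query throttling is often implemented with a token bucket: each query spends a token, tokens refill at a fixed rate, and queries are refused once the bucket is empty. The sketch below is a generic example and is not the mechanism used on the Hosted Exchange platform; the class name, the parameters and the injectable clock are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Hypothetical throttle: about `rate` queries per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable so the logic can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()              # time of the most recent refill

    def allow(self):
        """Return True if a query may proceed now, False if it should be refused."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `allow()` before sending each query and delay or drop the query when it returns `False`, so that a runaway source of lookups cannot consume all of the CPU on a downstream server.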

Posted Sep 17, 2018 - 17:37 CDT

Resolved

This issue has been resolved and mail flow has returned to normal. Our team is currently gathering postmortem information. We understand that email is mission-critical and sincerely apologize for the inconvenience this outage may have caused your business today.

Posted Sep 17, 2018 - 15:34 CDT

Monitoring

We are still monitoring the situation; however, email flow is returning to normal. No mail or data loss is expected. We will continue to update this page as more information is received.

Posted Sep 17, 2018 - 14:55 CDT

Update

As we work through troubleshooting and reboots of domain controllers, we're seeing steadily improving connectivity. OWA (Outlook Web Access) is currently more responsive but is still experiencing slow response times.

Posted Sep 17, 2018 - 14:31 CDT

Update

At this time, we are continuing to troubleshoot the domain controllers. We will update this page once we have more information.

Posted Sep 17, 2018 - 13:24 CDT

Update

We are currently experiencing network and authentication issues across our Hosted Exchange platform. We are seeing high CPU usage across our domain controllers in our Exchange environment. This is causing slow load times across OWA (Outlook Web Access) and Outlook, as well as no Exchange access via the Customer Portal. We have narrowed down the cause of the CPU usage to a process within Exchange itself. There are no data integrity issues, but there may be short instances of delayed mail as we work through the issue. We will continue to update this page as we receive more information.

Posted Sep 17, 2018 - 12:44 CDT

Update

Our team is continuing to investigate this issue. We will update this page as soon as we have more information.

Posted Sep 17, 2018 - 12:10 CDT

Update

Our team is still investigating this issue and will update the status page as soon as we have more information.

Posted Sep 17, 2018 - 10:36 CDT

Update

This issue is also affecting Exchange connections. We're still investigating and will update this page as soon as we have more information.

Posted Sep 17, 2018 - 09:22 CDT

Investigating

Our team is currently looking into an issue with the Customer Portal not displaying Exchange information for customers. We will update this page once we have more information.

Posted Sep 17, 2018 - 09:15 CDT

This incident affected: Customer Portal Interface (Customer Portal) and Secure Hosted Exchange (Exchange 2013/2016+ (EXG7)).