Service instability due to network incident
Resolved - Due to an undocumented change in an operating system update shipped by one of our network equipment vendors, network devices in our Frankfurt datacenter experienced an unexpected partial failure.
This incident primarily impacted Proton Mail: approximately 50% of users who were routed to the impacted datacenter experienced intermittent downtime for approximately 1 hour. Thanks to redundant systems, no data or emails were lost, but some email delivery may have been delayed.
Incident report:
Because the failure was partial, it was not sufficient to trigger an automatic failover. The unusual circumstances of this failure caused significant confusion, which led to a longer-than-usual delay before the infrastructure engineers on shift made the call to fail over to an alternative site.
Failing over restored services, with approximately 30 minutes of lingering low-level instability while load was rebalanced. An investigation conducted in parallel uncovered the undocumented operating system change in the network device update that was rolled out earlier this month. The impacted network devices were updated, and the Frankfurt datacenter was brought back into production with no user impact.
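To illustrate why a partial failure may not trip an automatic failover, here is a minimal sketch of a threshold-based health check. It is not a description of Proton's actual failover logic: the names `ProbeWindow` and `should_fail_over`, the 0.8 threshold, and the probe counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: fail over only when at least 80% of probes fail.
FAILOVER_THRESHOLD = 0.8


@dataclass
class ProbeWindow:
    """Health-probe results collected over one observation window."""
    total: int
    failed: int

    @property
    def failure_ratio(self) -> float:
        # Treat an empty window as healthy to avoid dividing by zero.
        return self.failed / self.total if self.total else 0.0


def should_fail_over(window: ProbeWindow,
                     threshold: float = FAILOVER_THRESHOLD) -> bool:
    """Return True when the failure ratio reaches the failover threshold."""
    return window.failure_ratio >= threshold


# Illustrative numbers only: a partial failure versus a complete one.
partial = ProbeWindow(total=100, failed=50)
complete = ProbeWindow(total=100, failed=100)

print(should_fail_over(partial))
print(should_fail_over(complete))
```

Under these assumptions, a site failing half of its probes stays below the trigger even though many users are affected, so the decision falls to the engineers on shift.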
Proton routinely tests software patches for our network equipment before deployment and rolls them out gradually.
Unfortunately, this undocumented change was not caught because it only caused issues under specific load conditions (indeed, the new software had been running for weeks without issues).
We apologize for the longer-than-usual incident response time. In the coming days, we will be analyzing our response to this incident to reduce future reaction times.