Around 1AM PST on Thursday, April 21st, 2011, one of the four availability zones in the AWS US East region experienced a network fault that caused connectivity failures between EC2 instances and their EBS volumes.
This event triggered a failover sequence in which EC2 automatically swapped the EBS volumes that had lost connectivity for their backup copies. At the same time, EC2 attempted to create new backup copies of all of the affected EBS volumes (AWS refers to this as "re-mirroring").
While this procedure works fine for a few isolated EBS failures, this event was far more widespread, which created very high load on the EBS infrastructure and on the network connecting it to EC2. To make matters worse, some AWS users likely noticed problems and began attempting to restore their failed or poorly performing EBS volumes on their own, adding further load.
All of this activity appears to have caused a meltdown of the network connecting EC2 to EBS and to have exhausted the available physical EBS storage in that availability zone. Because EBS performance depends on network latency and throughput to EC2, and because those networks were saturated, EBS performance became severely degraded or, in many cases, failed completely. These issues likely bled into other availability zones in the region as users attempted to recover their services by launching new EBS volumes and EC2 instances there. Overall, a very bad day for AWS and EC2.
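To make that recovery pattern concrete, the sketch below shows how a user might move a volume out of a degraded availability zone: snapshot it, create a fresh volume from the snapshot in a healthy zone, and attach that volume to an instance running there. This is illustrative only, not a reconstruction of what affected users actually ran; it uses boto3 (a modern SDK that postdates this incident), and the volume, instance, and zone identifiers are placeholders.

```python
# Illustrative sketch (boto3 postdates this 2011 incident): migrate an EBS
# volume out of a degraded availability zone by snapshotting it and
# re-creating it in a healthy zone. All IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

DEGRADED_VOLUME_ID = "vol-0123456789abcdef0"   # volume stuck in the degraded zone
TARGET_AZ = "us-east-1b"                       # a healthy availability zone
TARGET_INSTANCE_ID = "i-0123456789abcdef0"     # instance already running in TARGET_AZ

# 1. Snapshot the degraded volume (snapshots are stored in S3, not tied to one AZ).
snapshot = ec2.create_snapshot(
    VolumeId=DEGRADED_VOLUME_ID,
    Description="Emergency copy before moving out of the degraded zone",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# 2. Create a replacement volume from that snapshot in the healthy zone.
new_volume = ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone=TARGET_AZ,
)
ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume["VolumeId"]])

# 3. Attach the replacement volume to an instance in the healthy zone.
ec2.attach_volume(
    VolumeId=new_volume["VolumeId"],
    InstanceId=TARGET_INSTANCE_ID,
    Device="/dev/sdf",
)
print("Replacement volume attached:", new_volume["VolumeId"])
```

Of course, during the outage itself the snapshot and volume APIs were under heavy load, so attempts like this would only have added to the strain described above.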
Article courtesy of CloudHarmony