While I love life in the Internet Cloud, it was a gray, rainy few days at the end of last week.
It's easy to point fingers at Amazon Web Services, but I'm focused on how Acquia can do better. Our goal is to deliver fantastic end-to-end service and support for our customers' web sites irrespective of problems in the underlying infrastructure. Those problems are ours to worry about, mitigate, and repair, not yours.
The View from Acquia
We partnered with Amazon as the leading provider and innovator of Cloud infrastructure. But more importantly, we designed our high-availability architecture to recover quickly and seamlessly from AWS infrastructure problems. Single server failures cause no site downtime, and we've successfully recovered from hundreds of such failures. We Acquians pride ourselves on second-to-none engineering and operations.
However, a little before 4 AM EDT on Thursday, a major incident at one AWS data center rendered most storage inaccessible, which in turn made hundreds of our servers unusable. Acquia had planned for this contingency by backing up all data to multiple data centers. Unfortunately, a second AWS failure made it impossible to access those backup volumes from any data center. Aargh! The impact was felt most keenly by our Drupal Gardens customers, with thousands of sites unavailable. While Dev Cloud was unscathed, the outage affected 1% of our Managed Cloud customers. Our team worked around the clock to restore service: migrating servers to other regions, finding crafty ways to restore backups, and keeping in constant contact with customers. By the end of Friday (midnight!), we'd recovered all services. Thanks to the redundancy built into our architecture, we lost virtually no customer data.
We're pressing Amazon to do better. For many months, they've promised us EBS storage improvements, and we look forward to seeing those. They must also improve their transparency. AWS is too secretive, both in a crisis and on sunnier days. AWS is not a bookseller whose back-office operations have little impact on its customers.
But I don't think that's enough. We're taking action now to redistribute Drupal Gardens' servers among more data centers to minimize the impact of a similar outage, and we're beginning to extend our backup infrastructure to distribute the data to multiple geographic regions. And Acquia will continue to make significant investments in people, technology, and processes to ensure the most worry-free web site hosting available.
The Cloud View
None of this has dampened my enthusiasm for the Cloud. I've managed many data centers over the years, from my basement server rack, to class A facilities with redundant everything, to colo, VPS, and managed hosting. In this era, it simply doesn't make sense, economically or technically, for most organizations to build their own data center and hire and train expert sysadmin staff. The economies of scale in both hardware and people will drive most businesses and organizations to the cloud over the next few years. The important lesson we can never learn too well is that "everything breaks". And nowhere is that more true than on the rapidly evolving Internet. It's our job at Acquia to build resilient architectures that can prevent downtime due to failures, even major ones.
It can be challenging to ensure seamless service and mitigate the Cloud's risks for our customers. I think what keeps all of us going through the long days and nights are the incredible web sites, both big and small, that our customers create. We are working 24x7 to meet and exceed the high standards you're setting by using Drupal to create incredible web experiences.