Very few of us in the web hosting industry will ever have the need to scale to the level that services like Dropbox do. With that said, when a service the size of Dropbox makes a misstep that leads to an outage, it is worth paying attention to the causes and impact to see if there are any potential lessons to be learned.
On January 10, Dropbox went offline. Users weren’t able to sync their folders, and thus they couldn’t access their files on many devices. The service was down for much of Friday evening, and users had trouble accessing their files throughout the weekend.
Of course, the media was full of speculation about potential causes for the outage, with many focusing on a possible DDoS attack. On the following Monday, Dropbox released a detailed statement on the causes of the outage. It dismissed the idea of an attack by hackers and instead blamed a faulty update process.
On the day of the outage, Dropbox was running a scheduled OS upgrade. As you can imagine, upgrading the thousands of servers that Dropbox uses is no easy task. Much of the process is automated with scripts, and Dropbox traced the downtime to a bug in one of them.
The key lesson here, as detailed by Head of Infrastructure at Dropbox, Akhil Gupta, is that if you are going to do an upgrade, you need to be absolutely certain what state the server is in. To prevent the same mistake from happening again, Dropbox implemented an extra level of checks, so that the server will verify its own state before carrying out commands, rather than blindly executing incoming instructions regardless of what it is doing when it receives them.
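To make the idea of a server verifying its own state more concrete, here is a minimal sketch in Python. It is not Dropbox's actual implementation, which the post-mortem does not publish. The names `Server`, `ServerState`, `DESTRUCTIVE_COMMANDS`, `UnsafeCommandError` and the `reinstall_os` command are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical set of commands that must never run on a live machine.
DESTRUCTIVE_COMMANDS = frozenset({"reinstall_os"})


class ServerState(Enum):
    ACTIVE = "active"    # currently serving production traffic
    DRAINED = "drained"  # taken out of rotation, safe to modify


class UnsafeCommandError(Exception):
    """Raised when a server refuses a command because of its own state."""


@dataclass
class Server:
    name: str
    state: ServerState

    def handle(self, command: str) -> str:
        # The server checks its own state first instead of trusting
        # that the upgrade script targeted the right machines.
        if command in DESTRUCTIVE_COMMANDS and self.state is not ServerState.DRAINED:
            raise UnsafeCommandError(
                f"{self.name} is {self.state.value}; refusing {command}"
            )
        return f"{self.name}: executed {command}"


def run_on_fleet(fleet, command):
    """Send a command to every server and collect results and refusals."""
    results, refusals = [], []
    for server in fleet:
        try:
            results.append(server.handle(command))
        except UnsafeCommandError as err:
            refusals.append(str(err))
    return results, refusals
```

With a check like this, a buggy script that sends `reinstall_os` to an active machine gets a refusal it can log and alert on, rather than a wiped production server. The trade-off is that every destructive command has to be listed up front, so the list itself needs review whenever new commands are added.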
It is not mentioned in the post-mortem of the incident, but the outage could probably have been avoided with more rigorous testing. The Dropbox outage is a reminder of what can happen when a business is rapidly scaling its infrastructure. Scaling becomes the primary goal, and testing falls by the wayside to some degree.
A more rigorous approach to testing and verifying automation scripts might have caught the “subtle bug” before it wreaked havoc.
About Graeme Caldwell — Graeme works as an inbound marketer for InterWorx, a revolutionary web hosting control panel for hosts who need scalability and reliability. Follow InterWorx on Twitter at @interworx, Like them on Facebook and check out their blog, http://www.interworx.com/community.