Lessons to be Learned From the Recent Dropbox Outage

Very few of us in the web hosting industry will ever have the need to scale to the level that services like Dropbox do. With that said, when a service the size of Dropbox makes a misstep that leads to an outage, it is worth paying attention to the causes and impact to see if there are any potential lessons to be learned.

On January 10, Dropbox went offline. Users weren’t able to sync their folders, and thus they couldn’t access their files on many devices. The service was down for much of Friday evening, and users had trouble accessing their files throughout the weekend.

Of course, the media was full of speculation about potential causes for the outage, with many focusing on a possible DDoS attack. On the following Monday, Dropbox released a statement that went into detail about the causes of the outage, which dismissed the idea of an attack by hackers and instead blamed a faulty update process.

On the day of the outage, Dropbox was running a scheduled OS upgrade. As you can imagine, upgrading the thousands of servers that Dropbox uses is no easy task. Much of the process is automated with scripts, and according to Dropbox's statement, a subtle bug in one of those scripts caused the downtime by reinstalling a small number of machines that were still in active use.

The key lesson here, as detailed by Head of Infrastructure at Dropbox, Akhil Gupta, is that if you are going to do an upgrade, you need to be absolutely certain what state the server is in. To prevent the same mistake from happening again, Dropbox implemented an extra level of checks, so that the server will verify its own state before carrying out commands, rather than blindly executing incoming instructions regardless of what it is doing when it receives them.
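The idea of a server checking its own state before acting can be sketched in a few lines of Python. This is only an illustration of the general principle, not Dropbox's actual implementation: the names `Server`, `ServerState`, `handle_reinstall`, `upgrade_fleet` and `UnsafeCommandError` are all hypothetical, and a real fleet would determine "active" from far richer signals than a single flag.

```python
from dataclasses import dataclass
from enum import Enum


class ServerState(Enum):
    """Hypothetical states; a real system would track many more."""
    IDLE = "idle"      # holds no active data, safe to reinstall
    ACTIVE = "active"  # serving production data


class UnsafeCommandError(Exception):
    """Raised when a server refuses a command because of its own state."""


@dataclass
class Server:
    name: str
    state: ServerState
    os_version: str

    def handle_reinstall(self, new_version: str) -> None:
        # The server checks its own state first instead of trusting
        # that the sender targeted the right machine.
        if self.state is not ServerState.IDLE:
            raise UnsafeCommandError(
                f"{self.name} is {self.state.value}; refusing reinstall"
            )
        self.os_version = new_version


def upgrade_fleet(servers, new_version):
    """Send the reinstall command to every server.

    Returns the names of servers that refused, so an operator can
    investigate why the script targeted them in the first place.
    """
    refused = []
    for server in servers:
        try:
            server.handle_reinstall(new_version)
        except UnsafeCommandError:
            refused.append(server.name)
    return refused
```

With this arrangement, a buggy script that targets an active machine produces a refusal that can be logged and reviewed, rather than a wiped server. The check lives on the receiving side, so it still applies even when the sending script is wrong.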

It is not mentioned in the post-mortem of the incident, but the outage could probably have been avoided with more rigorous testing. The Dropbox outage is a reminder of what can happen when a business is rapidly scaling its infrastructure: scaling becomes the primary goal, and testing falls by the wayside to some degree.

A more rigorous approach to testing and verifying automation scripts might have caught the “subtle bug” before it wreaked havoc.

About Graeme Caldwell: Graeme works as an inbound marketer for InterWorx, a revolutionary web hosting control panel for hosts who need scalability and reliability. Follow InterWorx on Twitter at @interworx, like them on Facebook, and check out their blog.

