Lessons to be Learned From the Recent Dropbox Outage

Very few of us in the web hosting industry will ever have the need to scale to the level that services like Dropbox do. With that said, when a service the size of Dropbox makes a misstep that leads to an outage, it is worth paying attention to the causes and impact to see if there are any potential lessons to be learned.

On January 10, Dropbox went offline. Users weren’t able to sync their folders, and thus they couldn’t access their files on many devices. The service was down for much of Friday evening, and users had trouble accessing their files throughout the weekend.

Of course, the media was full of speculation about potential causes for the outage, with many focusing on a possible DDoS attack. On the following Monday, Dropbox released a statement detailing the causes of the outage. It dismissed the idea of an attack by hackers and instead blamed a faulty update process.

On the day of the outage, Dropbox was running a scheduled OS upgrade. As you can imagine, upgrading the thousands of servers that Dropbox uses is in no way an easy task. Much of the process is automated with scripts, and according to Dropbox a bug in one of those scripts was the cause of the downtime.

The key lesson here, as detailed by Head of Infrastructure at Dropbox, Akhil Gupta, is that if you are going to do an upgrade, you need to be absolutely certain of the state the server is in. To prevent the same mistake from happening again, Dropbox implemented an extra level of checks, so that each server verifies its own state before carrying out commands, rather than blindly executing incoming instructions regardless of what it is doing when it receives them.
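To make the idea of a server verifying its own state concrete, here is a minimal sketch in Python. It is not Dropbox's implementation; the names `Server`, `ServerState`, `UnsafeCommandError` and the `reinstall_os` command are all hypothetical, chosen only to illustrate a machine refusing a destructive instruction while it is still in use.

```python
from dataclasses import dataclass
from enum import Enum


class ServerState(Enum):
    IDLE = "idle"      # hypothetical state: no active data, safe to reinstall
    ACTIVE = "active"  # hypothetical state: currently serving live data


class UnsafeCommandError(Exception):
    """Raised when a command is refused because of the server's current state."""


@dataclass
class Server:
    name: str
    state: ServerState

    def handle(self, command: str) -> str:
        # The server checks its own state instead of trusting the sender.
        # Only the destructive command is guarded in this sketch.
        if command == "reinstall_os" and self.state is not ServerState.IDLE:
            raise UnsafeCommandError(
                f"{self.name} is {self.state.value}; refusing {command}"
            )
        return f"{self.name}: {command} accepted"
```

With this guard in place, a call such as `Server("db-primary", ServerState.ACTIVE).handle("reinstall_os")` raises `UnsafeCommandError` instead of proceeding, so a faulty upstream script cannot take an active machine down on its own.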

It is not mentioned in the post-mortem of the incident, but the outage could probably have been avoided with more rigorous testing. The Dropbox outage is a reminder of what may happen when a business is rapidly scaling its infrastructure. Scaling becomes the primary goal, and testing falls by the wayside to some degree.

A more rigorous approach to testing and verification of automation scripts may have caught the “subtle bug” before it wreaked havoc.
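As an illustration of what testing an automation script can look like, the sketch below uses Python's standard `unittest` module to check a target-selection step. The function `select_upgrade_targets` and the inventory format (machine name mapped to whether it holds active data) are assumptions made for this example, not details from the Dropbox post-mortem.

```python
import unittest


def select_upgrade_targets(inventory):
    """Return the names of machines that hold no active data.

    `inventory` is a hypothetical mapping of machine name -> has_active_data.
    """
    return [name for name, has_active_data in inventory.items()
            if not has_active_data]


class SelectUpgradeTargetsTest(unittest.TestCase):
    def test_active_machines_are_never_selected(self):
        inventory = {"spare-1": False, "primary-1": True, "spare-2": False}
        self.assertEqual(select_upgrade_targets(inventory),
                         ["spare-1", "spare-2"])

    def test_nothing_selected_when_everything_is_active(self):
        self.assertEqual(select_upgrade_targets({"primary-1": True}), [])
```

Saved to a file, these tests can be run with `python -m unittest` before the script is ever pointed at production machines.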

About Graeme Caldwell — Graeme works as an inbound marketer for InterWorx, a revolutionary web hosting control panel for hosts who need scalability and reliability. Follow InterWorx on Twitter at @interworx, Like them on Facebook and check out their blog, http://www.interworx.com/community.

 

