Under-qualified sysadmin crashed Amazon.com for 3 hours

Who, Me? Welcome again to “Who, Me?” – The Register’s Monday column in which readers admit to making mistakes and explain how they managed to keep their careers going afterwards.

This week, meet a reader we’ll Regomize as “Ken” who told us that over 20 years ago he scored a job at Amazon.com as a Linux sysadmin, a role for which he admitted he was “completely unqualified.”

He previously worked as a Solaris admin, experience that earned him an interview at Amazon. He quickly studied Linux, got the job, and soon found that the Red Hat Enterprise Linux environment in place at the time was very different to Solaris!

Despite his inexperience, Amazon gave Ken the job of upgrading the e-tail giant’s tape backup application.

“I spent months planning and testing because with this upgrade configuration files changed and we were required to make new ones and push them out with the update,” Ken told Who, Me? “I created those files and did all the necessary tests. Everything appeared to be fine, and the day came when we pushed the button.”

For several hours, everything worked as intended. “We sat and watched for several hours after the update, everything worked great, so we patted ourselves on the back, called it a job well done, and went home.”

And then at about 7PM, Ken’s pager “started going crazy.”

Within minutes, Ken joined a conference call in which very, very senior people – including then-CEO Jeff Bezos – wanted to know why all of Amazon.com was down.

“This, many considered, was bad,” Ken told Who, Me?

Ken and his colleagues eventually noticed that the primary database for Amazon’s bookstore had stopped doing anything, even though the enormous cluster of computers it ran on was operating normally.

Ken knew the backup app he built would copy the database’s logs to tape, then delete the logs on the servers that hosted the database. Ken checked the backup process, and found it was working just fine.

He kept digging and eventually checked the configuration files he had so carefully created… and found a typo that meant the system didn’t delete logs after backup.

“This wasn’t an issue for many hours, but eventually the partition holding the logs filled up and the database just gave up and started complaining that nobody loved it anymore,” he told Who, Me?

Once he was satisfied that no log files had been lost, Ken and a database administrator deleted the logs on the cluster and watched as the database came back to life – and so did Amazon.com.

Ken fixed the typo in the configuration file, then went home and spent a restless night pondering the need to find a new job.

“I drove into the office the next morning to see my manager standing outside in the parking lot where I normally parked, which did not seem like a good omen,” Ken said.

“I got out of my car and shuffled over to him. He stood in silence for about 15 seconds just giving me a hard look. Suddenly, he got a huge grin on his face, shook my hand, and said, ‘Congratulations, you’re no longer a virgin.’ We walked inside, where everyone razzed me for a long time.”

“And that’s how I brought down Amazon.”

What have you broken with a typo? To share your story, click here to send email to Who, Me? We’d love to tell your tale on some future Monday. ®
