A few weeks ago, we celebrated Inc. Magazine’s great cover story about us, including an internal poll of our favorite item from the photo shoot (results: a tie between the purple dragon and the giant giraffe).  Unbeknownst to us, however, the story was later picked up by Yahoo!’s news feed on April 10th and posted to the scroller on their homepage.  This is where our story begins…

Around 10 a.m., we started to receive alerts from our internal and external monitoring indicating that something out of the ordinary was up with the site.  Response times were spiking and, within 15 minutes, we were returning error pages apparently caused by timeouts from our back-end.  Everyone scrambled to determine what was causing the site outage:

  • Did we just push bad code?
  • Did something break?
  • Was this a DoS attack?

Our analytics team determined that referral traffic from Yahoo! had spiked by over 500% and quickly realized that the Inc. story had been picked up by Yahoo! – leading us to conclude that the resulting interest was causing the capacity issues we were seeing.

Our site is designed to handle large traffic loads, and we successfully weathered the past holiday rush, which was significantly higher than our typical April load.  Within the first 15 minutes, however, we sustained a forty-fold increase in traffic to certain pages, which pushed our total load well beyond what we saw on Cyber Monday, our biggest day to date.  We had hit our capacity limit, which slowed the site down for everyone and, at the peak, caused us to return error pages for about half of all requests.

We had sufficient web server capacity, so that wasn’t the issue.  As we looked closer, we found that we had reached a connection pool limit within our load balancing layer, something that we had not experienced previously, so we quickly increased the cap.  By around 10:45 a.m., the error pages had stopped and the site started returning to normal.
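To make the failure mode concrete, here is a minimal Python sketch of how a capped connection pool behaves, in the spirit of what our load-balancing layer does.  This is an illustrative toy model, not our actual stack: the class name, the cap of 2, and the semaphore-based implementation are all assumptions for the example.

```python
import threading

class ConnectionPool:
    """Toy model of a capped connection pool, as found in many
    load-balancing layers.  Purely illustrative -- not our real code."""

    def __init__(self, max_connections):
        # One semaphore slot per allowed back-end connection.
        self._slots = threading.BoundedSemaphore(max_connections)

    def acquire(self, timeout=0.0):
        # Returns True if a connection slot was obtained, False if the
        # pool is exhausted -- the condition that surfaced for us as
        # back-end timeouts and error pages.
        return self._slots.acquire(timeout=timeout)

    def release(self):
        self._slots.release()

# With a cap of 2, a third concurrent request fails immediately:
pool = ConnectionPool(max_connections=2)
assert pool.acquire()
assert pool.acquire()
assert not pool.acquire()   # pool exhausted -> request errors out
pool.release()              # freeing (or raising) capacity clears the backlog
assert pool.acquire()
```

The fix in our case was the equivalent of the last two lines: raising the cap so requests stopped queueing against an artificial limit.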

Or so we thought…

Traffic continued to grow and we quickly reached another plateau, at twice our normal levels.  We were no longer returning error pages, but load times were still high.  We could see that something was overloading our front-end caching databases, causing application calls to queue and eventually time out.

Over the next hour, we determined that one routine in particular was causing most of our database problems.  That call provides the geo-location information we use, amongst other things, to serve localized content for our Get it Near Me! service.  Fortunately, our application uses internal “feature knobs”, which let us dial up and down the percentage of users who see elements on our site without requiring code changes, so we simply set this feature to appear for zero users.  With the feature turned off, site performance returned to normal, this time at four times our normal traffic volume.  About 90 minutes later, we had new code in production that resolved the issue and allowed us to turn the feature back on.
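A percentage-based feature knob like the one described above can be sketched in a few lines of Python.  Everything here is an assumption for illustration – the knob name, the in-memory `KNOBS` store (in practice the values would live in a config service), and the hashing scheme – but it shows the key property: each user maps to a stable bucket, and turning the dial to zero disables the feature for everyone with no deploy.

```python
import hashlib

# Hypothetical knob store; real knob values would live outside the
# code (e.g. a config service) so they can change without a deploy.
KNOBS = {"get_it_near_me": 100}   # percentage of users who see the feature

def bucket(user_id: str) -> int:
    # Hash the user id to a stable bucket in [0, 100) so each user
    # consistently falls on the same side of the dial.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def feature_enabled(knob: str, user_id: str) -> bool:
    return bucket(user_id) < KNOBS.get(knob, 0)

# Dialing the knob to zero sheds the load of the expensive call
# for every user, instantly and without a code change:
KNOBS["get_it_near_me"] = 0
assert not any(feature_enabled("get_it_near_me", f"user{i}")
               for i in range(1000))
```

Because the bucket is derived from the user id rather than a random draw, a knob set to, say, 25 shows the feature to the same quarter of users on every request.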

Overall, site traffic stayed up for the rest of the day and through 11am the following morning.  At that point, we saw a sharp drop in traffic, with levels returning to what we would expect for a normal April morning.  Our story had cycled off Yahoo!’s news page, ending the 24-hour news cycle and leaving Tuesday as the largest overall traffic day in our history, beating the previous Cyber Monday by over 20%.

We are now officially a member of a somewhat exclusive club of large web companies that have fallen victim to their own success.  The real test is to learn from the past, avoid the same mistakes, and continue to improve.  Yahoo! posted our story again on Sunday as part of their week in review, making that day our fourth-highest traffic count, behind only Tuesday’s spike and the previous two Cyber Mondays.  The site worked fine.

With Sunday’s spike in traffic successfully behind us, we’re encouraged by our team’s ability to adapt quickly and improve, but by no means are we thinking our job is done – instead, we’re now ready and eager to face our next challenge!

(Requests per second over time)