Feb 12, 2009

Server Status – Slow Delivery Feb 12, 2009

Wrap-up: Monday 02/16/09, 1:25 pm ET – Here’s a summary of everything that happened last week, what we learned from it, and what’s happening right now…

Where it all started

Last Wednesday, Feb 11, we saw a huge spike in new signups (double the normal amount) and email campaigns (quadruple the normal amount). Most of this was Valentine’s Day related (plus some Presidents Day promos). It was roughly 80 times the email volume that we had at the same time last year. We anticipated growth in volume this year, as we’ve generally experienced all year long. Just not 80-fold. And not so suddenly.

The Wednesday Reboot

Anyway, the spike in new users, their Valentine’s email campaign volume, plus all their new lists being imported (remember that "lists" part because I’ll bring it up later), brought down one of our app servers for 20 minutes at around 4:17pm on Wednesday.

The other app servers limped along, but they still had a little dependency on that first app server, so they were also affected more than we’d have liked.

Anyway, rebooting that app server was the root of the bigger problem that we experienced on Thursday and Friday. Just bear with me…

First, I want to say that immediately after that incident on Wednesday, we placed an order for another app server, to help reduce the load on the app servers we already have.

Also, we were already planning on moving our website (mailchimp.com) to a new data center to reduce our overall load. In fact, the new site launch was gonna happen on Friday. That obviously had to be postponed.

And, we’ve already been working on removing all "interdependency" between app servers, so that if one goes down, none of the others are affected.

And remember the part about "all the new lists being imported"? We were scheduled to add 2 more redundant database servers this weekend (as described in this blog post from a few days prior to all this), in order to prevent bad stuff from happening. But bad stuff beat us to the punch. The database upgrades are still planned, but we’ll need to wait for the dust to settle from last week first.

Some trivia: by Wednesday, our databases were running at about 4,000 queries per second. Normally, they chug along at 800-1,000 per second. The databases are not what failed, but they experienced some serious load last week. I’m just fascinated by the thought of little hard drive needles moving that fast.


For a completely useless reference, a hummingbird flaps its wings around 50 times per second, and up to a measly 200 times per second while mating. Imagine how many hummingbirds it would take to run our database servers. Sheesh. Lazy hummingbirds.

IP Address Conflict

Anyway, when that app server went down on Wednesday, our data center rebooted it pretty quick, so the downtime was only 20 minutes (still agonizing, but short as far as server outages go). And we still had other app servers running. So that’s nice. But what we didn’t know was that when they rebooted that server, some really, really old network configuration parameters were reinstated.

Basically, a bunch of our servers started to have IP address conflicts.

App servers were receiving traffic meant only for our email servers, and vice versa.

Kajillions of Email Campaigns Didn’t Help The Situation.

Just to illustrate things, let’s say you send a campaign to 250,000 recipients. Under normal circumstances, our email servers can send that in well under an hour. First, we’ve got to chunk that campaign up into lots of smaller pieces, and spread it out across lots of IP addresses, but we can do it fairly quickly.
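
To make that "chunking" idea concrete, here’s a tiny sketch in Python. This is not our actual system: the chunk size, IP pool, and function names are all made up for illustration.

```python
# Illustrative only: split one big recipient list into smaller chunks
# and spread those chunks round-robin across a pool of sending IPs.
# The pool, chunk size, and addresses below are placeholders.
from itertools import cycle

SENDING_IPS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]  # example pool
CHUNK_SIZE = 5000  # hypothetical recipients per chunk

def chunk_campaign(recipients, chunk_size=CHUNK_SIZE):
    """Yield (sending_ip, batch) pairs for one campaign."""
    ips = cycle(SENDING_IPS)
    for start in range(0, len(recipients), chunk_size):
        yield next(ips), recipients[start:start + chunk_size]

# A 250,000-recipient campaign becomes 50 batches of 5,000 each,
# handed off to whichever delivery server owns that IP.
recipients = [f"user{i}@example.com" for i in range(250_000)]
for ip, batch in chunk_campaign(recipients):
    pass  # hand `batch` to the MTA bound to `ip`
```

The point of spreading across IPs is simply that no single address has to absorb the whole campaign at once.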

But that’s only part of our job. Email delivery is not a one-way street. Email campaigns tend to generate bouncebacks. So that’s even more email volume (inbound) that we have to handle. Which means that little campaign to 250,000 recipients might get around 12,500 bounces. Then we have to scan every single bounceback, and decide what to do with each one of them.

To make it even more complicated, there are some ISPs out there that have unrealistically severe throttling measures in place, and who "defer" emails back to us for later sending. They literally send us a bounceback that says, "Um yeah, you’re sending us too much. How ’bout you wait a few hours, and try sending again? Mmm-hmm, thaaaaanks." So we store those "deferred" campaigns, and re-try later. When we retry later, it’s very possible that a few thousand other people have sent campaigns to that ISP too, which in turn would make them tell us to retry again, but even later-er (luckily, we’ve got a lot of IPs to spread things around).

So not only are we pumping out emails at a very fast rate, and processing bouncebacks immediately afterwards, but we’re also putting some of those bounces right back into the queue to automatically retry again and again, until ISPs stop telling us to retry. On top of all this, we’re generating reports and stats for you in real time.
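
For the curious, here’s a heavily simplified sketch of that bounce-handling decision. The 4xx/5xx code ranges are standard SMTP, but the two-hour delay and the function names are just placeholders, not our actual rules.

```python
# Illustrative only: decide what to do with each bounceback. Permanent
# failures (5xx) get flagged so we stop mailing that address; temporary
# deferrals (4xx) go back into the queue to retry later.
import time

retry_queue = []  # deferred sends waiting for another attempt

def handle_bounce(recipient, smtp_code):
    if 500 <= smtp_code < 600:
        mark_hard_bounce(recipient)  # permanent failure: don't retry
    elif 400 <= smtp_code < 500:
        # The ISP said "try again later", so requeue with a delay
        # (two hours here is just an example).
        retry_queue.append({"recipient": recipient,
                            "retry_at": time.time() + 2 * 3600})

def mark_hard_bounce(recipient):
    pass  # placeholder: flag the address so we stop mailing it

def due_retries(now=None):
    """Pull out every deferred send whose retry time has arrived."""
    now = now if now is not None else time.time()
    due = [item for item in retry_queue if item["retry_at"] <= now]
    for item in due:
        retry_queue.remove(item)
    return due
```

Multiply that little loop by every bounceback from every campaign, and you can see where the volume comes from.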

Whew.

See why we kill so many hummingbirds?


Seriously though, MailChimp can normally handle all that stuff I described above without breaking a sweat. It’s just another day’s work for MailChimp.

Unless, of course, we have an 80-fold increase in email volume and the email servers have an IP address conflict.

That kind of exacerbated the problem.

And email is built to be asynchronous and persistent. For example, we can send an email to a server across the globe, and the server will send back a handshake to confirm they got it, or that there was a problem and we should retry. And if one of those servers is busy or down at the moment, the servers will keep trying. This "persistence" would come in really handy in the case of say, nuclear war.

But that same persistence can make a big traffic jam even worse, when the servers’ IP addresses aren’t allocated correctly.
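
If you want a feel for what that persistence looks like in code, here’s a bare-bones sketch using Python’s standard smtplib. The host names and retry limits are made up, and real MTAs do this far more intelligently.

```python
# Illustrative only: keep trying to hand a message to a receiving
# server, waiting a little longer after each failed attempt.
import smtplib
import time

def persistent_send(message, sender, recipient, mx_host, max_attempts=5):
    delay = 60  # start with a one-minute wait between attempts
    for attempt in range(1, max_attempts + 1):
        try:
            with smtplib.SMTP(mx_host, timeout=30) as server:
                server.sendmail(sender, recipient, message)
            return True  # the other side acknowledged the handoff
        except (smtplib.SMTPException, OSError):
            if attempt == max_attempts:
                return False  # park it; a later pass can try again
            time.sleep(delay)
            delay *= 2  # back off instead of hammering a busy server
```

Great behavior when a server across the globe is briefly down. Not so great when the "busy server" is actually your own app server answering on the wrong IP.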

Down But Not Out

Still though. The delivery servers never went down, and we continued to deliver emails throughout the entire ordeal (at a relatively slow rate of about 100,000 emails per hour). This is why some people had campaigns that got delivered, and others didn’t. As I’ve told the rest of our team: This is absolutely horrible, but I’ve actually seen worse. In the early days (circa 2001), we had zero redundancy, our servers ran on hamsters (really weak hamsters, at that), and outages sometimes meant losing campaign data altogether. That’s a real nightmare.

We’ve got some improvements to make, but I’m still relieved that the measures we’ve put in place over the last couple years helped prevent a complete meltdown.

Was This Preventable?

The network configuration settings that were applied after the Wednesday reboot would’ve been correct a year ago. But over the last year, our data center has helped us add lots more servers to our configuration, and moved servers to new racks and new locations. So those network settings are very outdated now. And I suppose that, since the server hadn’t been rebooted for about a year, it’s easy to overlook old settings like that. In all honesty, we expected (perhaps naively) that our data center would track those network settings, or would have systems in place to warn us if there were IP conflicts between servers.

Our fault for expecting someone else to manage that for us. Lesson learned.

We’ve now got scripts in place that automatically check IP settings between the servers if a reboot happens again. We’ve already checked all IPs and servers, and everything is now running normally.
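
For anyone wondering what a check like that might look like, here’s a toy version. The hostnames and addresses are placeholders, and in real life the data would come from the servers themselves (or a config database) rather than a hard-coded dictionary.

```python
# Illustrative only: gather the IPs each server claims for itself and
# flag any address claimed by more than one machine.
from collections import defaultdict

reported_ips = {
    "app1.example.internal":  ["10.0.1.10"],
    "app2.example.internal":  ["10.0.1.11"],
    "mail1.example.internal": ["10.0.1.10"],  # conflict with app1
}

def find_ip_conflicts(reported):
    owners = defaultdict(list)
    for host, ips in reported.items():
        for ip in ips:
            owners[ip].append(host)
    return {ip: hosts for ip, hosts in owners.items() if len(hosts) > 1}

for ip, hosts in find_ip_conflicts(reported_ips).items():
    print(f"ALERT: {ip} is claimed by {', '.join(hosts)}")
```

The important part is that it runs automatically after a reboot, instead of waiting for a human to notice the symptoms.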

What Went Right, What Went Wrong

We didn’t have a postmortem discussion where we all gathered around the table to discuss how to prevent this from ever happening again. Instead, we had constant, ongoing discussions (via Twitter, instant messenger, email, phone, and in-person) throughout the entire ordeal (and throughout the weekend) about what went wrong, and what went right. Here’s a list:

  • During the ordeal, we called our MTA (the email server) vendor for assistance. Their system is the same one that large banks and ISPs use. They put a top-notch team of engineers on the phone, who gave us lots of information about how their system works, what problems we might be having, and how we could optimize. Ultimately, the MTA was not our problem, and their advice wasn’t needed. There’s no way they could’ve known about a silly IP conflict issue. But we learned a *lot* about the underlying architecture behind their MTA, which is going to help us tremendously with future scalability.
  • At one point, we thought one of our hard drives was starting to fail. That made us pretty paranoid, and so we made sure our backup system was fully operational. It was. Always nice to check up on those sorta things.
  • We took too long to post a warning on our login screen. We placed a warning on our Dashboard, but nobody ever reads stuff there. So more and more people logged in to create more and more campaigns. Then, they submitted more and more complaints when they didn’t see their tests in their inbox. Not good. We’ll be launching "roadblock" screens shortly, so that when (not if) we experience server problems again, people will be properly warned while logging in.
  • Our engineers have been working on our own MTA for quite some time. Not to replace our current vendor, but to serve as a "brain" that could give us more control over all of our MTAs. The way things work now is, we hand email off to the MTAs for delivery. After handoff, we have very little control over them, and the queue in general. Ever sent a big document to the office printer, and had a hard time cancelling it, or seeing what page it was working on? Basically, the goal is to be able to plug all kinds of MTAs (regardless of brand) into MailChimp, like lightbulbs. Really, really expensive lightbulbs. Sounds far-fetched, I know. But parts of this were already in place, and it worked brilliantly. It’s what allowed us to give so many updates about the queue. It’s even wired into GoogleTalk, so we were receiving updates every half hour, which I began posting to Twitter. The very same system is actually being used by our abuse team to alert them of suspicious activity on our servers. Anyway. That went well, and will come in handy in the future. (There’s a rough sketch of the queue-monitoring piece right after this list.)
  • When all this first started, we thought our MTA servers ran out of memory. So we jacked up the RAM on them. A lot. It ultimately wasn’t needed, but now they’re loaded. Instead of being able to process email in chunks of 800,000 at any given time, they’ll be able to handle chunks of up to 6,000,000 emails at a time.
  • We’ve been "warming up" servers and IPs at another data center over the last few months. Just so that all our eggs are not in one basket. This weekend, we ran a "doomsday" scenario at the new data center, just to see if the same network config problem would happen there too. Nope. It booted up brilliantly. The way they handle IP address allocation is much more modern than our current data center (who we love, and have been with for 10 years). We’ll still have server issues in that new data center (they’re inevitable) but not this particularly type of issue. Hopefully.
  • Twitter rocked. Users who follow us on Twitter got frequent updates, and they even gave us some words of encouragement (very helpful during such a stressful time). If you run a web app, set up a Twitter account and save it for times like this.
  • Our customer service team was awesome. They informed all our customers about what was going on, and after the dust settled, contacted them over the weekend and gave out credits, refunds, inbox inspections, etc. Dan, a MailChimp co-founder and head of customer service, gave some guidelines and messaging tips. But they had carte blanche to deal with issues on a case-by-case basis. Now’s not the time to get stingy with email credit.
  • When something like this happens, our entire team knows that we’ll be issuing all kinds of refunds, credits, etc. It’s a given, right? And we assumed our customers knew that. But I guess people are so used to horrible customer service elsewhere, that their initial calls and emails to MailChimp were furious. It helps to defuse all that anger by starting every conversation with, "When the dust settles, you’ll get credits (and then some) for all this. Now here’s what we know is going on…" Then, for the most part, everyone got cool.
  • People seemed to like the transparency and frequent updates we posted on the blog, and on Twitter. One guy wrote an article about it: "An Outage Done Right." I’d love to take full credit for being able to handle stuff like this, but the truth is I’m paranoid and always knew that I’d better read up on this topic. If you run a web app yourself, you might also want to read up on it. I’ve always tried to learn from companies that have handled it well, like this example from Intuit. I’ve written about "How to apologize for server outages" in the past. Here’s a nice interview with the founder of Magnolia talking about their outage. Finally, one helpful resource is the book My Bad: 25 Years of Public Apologies and the Appalling Behavior That Inspired Them. If you run a web app, or any online service that people depend on, it’s your job to know how to handle this kind of situation. Because it’s inevitable.
  • Using Twitpic to show our engineers staring at server graphs here and here seemed to help too. I took those pics to show how concerned they were. Actually, I took the pics because I was proud of how they were handling the situation: scientifically. They were very concerned, but were not emotional at all. Just scientific. They had theories, tested those theories, and repeated the scientific process again and again until they figured out the problem. And they were totally calm throughout.
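
About that queue-monitoring "brain" mentioned a few bullets up: the sketch below shows the general shape of it, nothing more. get_queue_depth() and post_update() are stand-ins for whatever your MTAs and chat/Twitter tools actually expose.

```python
# Illustrative only: poll the outbound queue on a schedule and push a
# short status line somewhere humans will actually see it.
import time

def get_queue_depth():
    return 81_092  # placeholder: ask the MTAs how many emails are queued

def post_update(text):
    print(text)  # placeholder: send to GoogleTalk, Twitter, or a log

def monitor(interval_seconds=30 * 60, max_updates=None):
    sent = 0
    while max_updates is None or sent < max_updates:
        post_update(f"{get_queue_depth():,} emails left in the delivery queue")
        sent += 1
        time.sleep(interval_seconds)
```

Even something this simple beats logging into servers by hand every half hour while customers are waiting on answers.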

This Will Never Happen Again, Right?

So this is where, traditionally, I’m supposed to say, "our team has done an exhaustive investigation of what happened, and have put in place measures to prevent this from ever happening again." That’s what companies always say after a server problem, and nobody ever believes them. At least I never do.

The truth is, we investigated, and we put measures in place, and it shouldn’t ever happen again. But servers break. Servers get slow. And servers die.

I mean, they’ve got little needles in them that flap faster than hummingbirds in heat, for pete’s sake. And those little needles flap like that 24/7.


I can’t lie to you. All we can do is put off stuff like this for as long as possible, and add more and more servers with more and more failover. But when do you ever hear people talking about server problems? When their failover mechanisms fail. So even that won’t prevent this from happening again. Outages happen to the best and biggest of us.

The only honest thing I can tell you is that we’re working extremely hard on separating all of our distinct processes from each other, and putting them on their own distinct, redundant server clusters. We’ve already been working on that for the last several months. We’re at 18 servers now, with 3 more already on order. Our efforts are focused on removing interdependencies between servers, so that as we scale up, adding servers (across different data centers) is easy. We’re moving some things to that magical "cloud" and we’re using independent services for load balancing (instead of relying on our data centers). This won’t prevent server problems 100%, but will help us minimize the impact of future server problems. Hopefully.

To our MailChimp customers who were affected by the slowdown, I apologize. And I thank you for your business. We are doing our best to keep this from happening again for as long as humanly possible.

Archived Updates from the server slowdown:

Update: Friday 02/14/09, 1:42 pm ET – Last night, at about 8:22 PM ET, the backlogged mail queue finally cleared. The dev team began manually sending a handful of campaigns that they had to move aside to clear the jam, and the customer svc team began formulating a plan to compensate those affected by the delay. I began writing personal emails and direct tweets to people to let them know their campaigns were delivered (and that’s why I forgot to come back here to the blog to tell everyone the queue was clear). As of this morning, we’ve found 3 campaigns that didn’t send. We manually sent those out. If you see any campaigns in your account that don’t appear to be sent (like if you see no stats), please contact customer service, and they’ll get it sent. After we’re through all this, we’ll post a more detailed update about what happened and what we learned from all this.


Update: Friday 02/13/09, 7:37 pm ET – 81,092 emails left in queue. Matter of minutes before the queue is all clear. We’ll be spending the weekend looking for any dropped campaigns, and re-sending. Our cleanup work is not complete, but the system is running at (faster than) normal speeds now. Check your campaign stats for opens, etc., to see if it was delivered. Give it some time to get through ISPs, too. Thanks to everyone for your patience and understanding. Those who were irate – it’s well-deserved, and we’re sorry. Customer service will catch up on the weekend to issue credits. Please contact them if you were affected by the delays.


Update: Friday 02/13/09, 7:02 pm ET – 370,804 emails are left in the queue now, meaning roughly 30 minutes until we’re all cleared out and back to normal.


Update: Friday 02/13/09, 4:35 pm ET – There are exactly 1,624,202 emails left in the queue. Based on our current sending rate, we’ll be completely through in approx. 3-4 hours. We’ll be working on this all through the night, till we’re done.


Update: Friday 02/13/09, 4:10 pm ET –  The "batch" I mentioned earlier? It’s being sent to the queue now. In small chunks, so we can watch closely. We’re up to 15k emails/hr delivery speeds now. Should have an ETA for completion shortly. If your campaigns page says "Sent" that doesn’t mean "sent to recipients." It means "sent to delivery servers." An unfortunate semantic nuance that only matters on days like this. Check your campaign’s actual stats to see if there are opens, clicks, bounces, etc. That will tell you if yours was sent to recipients. People are starting to see stats show up, slowly but surely.


Update: Friday 02/13/09, 2:50 pm ET –  Most of the backlog has been cleared from the queue, and customers are starting to report that their campaigns are getting delivered. Also, newly created campaigns are sending almost instantaneously now. However, there is still a small batch of campaigns we had to move aside, because we had a suspicion that something in that batch (a corrupt campaign?) might’ve been bogging things down. If your campaign hasn’t been delivered yet, it’s possible that yours is in that batch. We are scanning the batch now, and plan to send to the queue for delivery shortly. Also, we’re still investigating network interface connections at the server level as a possible cause. Beginning to see some light at the end of this tunnel, but we’re far from finished (again, ISP throttling could be the next possible bottleneck).


Update: Friday 02/13/09, 1:26 pm ET –  Just heard that throughput on email delivery servers has increased. We were limping along at 60k emails/hr earlier this morning. It’s now up to 600,000/hr.  Might be through with this in 3-hrs, but we’ve got to keep an eye on it. Even after the queue is cleared of the backlog, we’ve still got to monitor deliverability with ISP throttling. Still going to be a long day of recovery and investigation, but the traffic jam might be clearing up.


Update: Friday 02/13/09, 12:50 pm ET – Asking data center to run a check on some network equipment, just to be sure. They’ve been very responsive in helping us diagnose things.


Update: Friday 02/13/09, 11:59am ET – Not a hardware issue (thanks to Rackspace and RippleIT for your help). We just did some campaign triage that has sped things up dramatically, which might clear out the delivery queue much sooner than expected. But it’s too early to provide an ETA. That’s because we want to watch how ISPs throttle the surge in emails coming from us. Thanks to those who are supportive during all this, and our sincere apologies to everyone whose campaigns are delayed. They’re getting delivered – just slow.


Update: Friday 02/13/09, 9:55am ET – First and foremost, contact customer svc for make goods. We know you’re frustrated, and we’re sorry. Once we’ve solved the technical problem, we will circle back and issue credits to customers affected by the slow-down. Okay, time for some updates…

We’ve been in talks with the vendor who provides our MTA. Brainstormed some fixes, implemented last night, waited, and it didn’t work. At 8am, we worked with the data center to add more memory to the delivery servers. They worked fast, but it didn’t help. MTAs were still trying to use disks for memory, instead of the new RAM. Had a new theory that a spool drive on one of our delivery servers was starting to fail (don’t worry, we back up all data). We initialized a new drive, and are diverting traffic. It’s running much faster now, so we think our theory was correct. 250k delivered in the last 40 minutes. That’s promising. Once initialization of the new drive is complete and it’s running at full speed, it might take 6 hrs to noticeably clear the queue. Probably another 12-18 hrs before full normalcy is restored. To be clear, campaigns have been getting delivered all this time – just slowly. Stay tuned to Twitter for up-to-the-minute updates.

Quick heads-up to all MailChimp customers – not sure what happened yesterday, but we saw a pretty big spike in just about everything. Free trial signups doubled, and we had more than 4 times the normal database load.

In the past hour alone, around 1 million new emails were queued up for delivery. It’s only 10am here on the East Coast, so we’re bracing ourselves for a busy day.

To quote our customer service team, "Looks like everyone is spreading the love for Valentine’s Day specials."

We’re keeping a close watch on our delivery and app servers. Everything is running fine, but slower than normal.  Follow us on Twitter for more updates.

On the subject of our servers, yesterday we saw a brief outage from 4:17pm to 4:37pm ET. It was relatively brief, as far as server downtime goes. But to any of our customers building an email campaign at the exact time of the outage, those 20 minutes were frustrating and agonizing, we know.

Sorry about that. Please contact our customer service team, and they’ll work something out with you.

Now What?

As soon as we got everything back online yesterday, we placed an order for another app server, to help balance things out even more.

And we’ve already been working on adding more database servers and replication (those should go online this weekend).

More information about our server expansion plans in this previous post.