Wrap-up: Monday 02/16/09, 1:25 pm ET – Here’s a summary of everything that happened last week, what we learned from it, and what’s happening right now…
Where it all started
Last Wednesday, Feb 11, we saw a huge spike in new signups (double the normal amount), and email campaigns (quadruple the normal amount). Most of this was Valentines Day related (plus some Presidents Day promos). It was roughly 80 times the email volume that we had at the same time last year. We anticipated a growth in volume this year, as we’ve generally experienced all year long. Just not 80-fold. And not so suddenly.
The Wednesday Reboot
Anyway, the spike in new users, their Valentines email campaign volume, plus all their new lists being imported (remember that “lists” part because I’ll bring it up later), brought down one of our app servers for 20 minutes at around 4:17pm on Wednesday.
The other app servers limped along, but they still had a little dependency on that first app server, so they were also affected more than we’d have liked.
Anyway, rebooting that app server was the root of the bigger problem that we experienced on Thursday and Friday. Just bear with me…
First, I want to say that immediately after that incident on Wednesday, we placed an order for another app server, to help reduce the load on the app servers we already have.
Also, we were already planning on moving our website (mailchimp.com) to a new data center to reduce our overall load. In fact, the new site launch was gonna happen on Friday. That obviously had to be postponed.
And, we’ve already been working on removing all “interdependency” between app servers, so that if one goes down, none of the others are affected.
And remember the part about “all the new lists being imported”? We were scheduled to add 2 more redundant database servers this weekend (as described in this blog post from a few days prior to all this), in order to prevent bad stuff from happening. But bad stuff beat us to the punch. The dB upgrades are still planned, but we’ll need to wait for the dust to settle from last week first.
Some trivia: by Wednesday, our databases were running at about 4,000 queries per second. Normally, they chug along at 800-1,000 per second. The databases are not what failed, but they experienced some serious load last week. I’m just fascinated by the thought of little hard drive needles moving that fast.
For a completely useless reference, a hummingbird flaps its wings around 50 times per second, and up to a measly 200 times per second while mating. Imagine how many hummingbirds it would take to run our database servers. Sheesh. Lazy hummingbirds.
IP Address Conflict
Anyway, when that app server went down on Wednesday, our data center rebooted it pretty quick, so the downtime was only 20 minutes (still agonizing, but short as far as server outages go). And we still had other app servers running. So that’s nice. But what we didn’t know was that when they rebooted that server, some really, really old network configuration parameters were reinstated.
Basically, a bunch of our servers started to have IP address conflicts.
App servers were receiving traffic meant only for our email servers, and vice versa.
Kajillions of Email Campaigns Didn’t Help The Situation.
Just to illustrate things, let’s say you send a campaign to 250,000 recipients. Under normal circumstances, our email servers can send that in well under an hour. First, we’ve got to chunk that campaign up into lots of smaller pieces, and spread it out across lots of IP addresses, but we can do it fairly quickly.
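Here's a rough sketch of what that chunk-and-spread step might look like. It is only an illustration, not our actual sending code: the chunk size of 5,000, the function names, and the four example IP addresses are all assumptions made up for this example.

```python
from itertools import cycle


def chunk_campaign(recipients, chunk_size):
    """Split a recipient list into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [recipients[i:i + chunk_size]
            for i in range(0, len(recipients), chunk_size)]


def assign_chunks(chunks, sending_ips):
    """Hand chunks to sending IPs in round-robin order."""
    if not sending_ips:
        raise ValueError("need at least one sending IP")
    plan = {ip: [] for ip in sending_ips}
    # cycle() repeats the IP list forever; zip() stops when chunks run out.
    for chunk, ip in zip(chunks, cycle(sending_ips)):
        plan[ip].append(chunk)
    return plan


# Hypothetical campaign: 250,000 recipients, chunks of 5,000 (assumed size),
# spread over four made-up addresses from the documentation range.
recipients = [f"user{i}@example.com" for i in range(250_000)]
chunks = chunk_campaign(recipients, 5_000)
plan = assign_chunks(
    chunks, ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
)
```

Because each IP works through its own pile of chunks independently, there's no guarantee about which recipient gets their copy first.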
But that’s only part of our job. Email delivery is not a one-way street. Email campaigns tend to generate bouncebacks. So that’s even more email volume (inbound) that we have to handle. Which means that little campaign to 250,000 recipients might get around 12,500 bounces. Then we have to scan every single bounceback, and decide what to do with each one of them.
To make it even more complicated, there are some ISPs out there that have unrealistically severe throttling measures in place, and who “defer” emails back to us for later sending. They literally send us a bounceback that says, “Um yeah, you’re sending us too much. How ’bout you wait a few hours, and try sending again? Mmm-hmm, thaaaaanks.” So we store those “deferred” campaigns, and re-try later. When we retry later, it’s very possible that a few thousand other people have sent campaigns to that ISP too, which in turn would make them tell us to retry again, but even later-er (luckily, we’ve got a lot of IPs to spread things around).
So not only are we pumping out emails at a very fast rate, and processing bouncebacks immediately afterwards, but we’re also putting some of those bounces right back into the queue to automatically retry again and again, until ISPs stop telling us to retry. On top of all this, we’re generating reports and stats for you in real time.
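To make that send, bounce, defer, retry loop a little more concrete, here's a minimal sketch. It is not how our MTAs actually work internally. The doubling delay, the cap of five attempts, and every name in it (`run_queue`, `fake_isp`, the example domains) are assumptions for illustration. Time is simulated in seconds, so nothing actually waits.

```python
import heapq

HARD_BOUNCE = "hard"       # permanent failure: stop sending to this address
DEFERRED = "deferred"      # ISP asked us to try again later
DELIVERED = "delivered"


def run_queue(messages, attempt_delivery, base_delay=3600, max_attempts=5):
    """Process messages, re-queueing deferred ones with a growing delay.

    attempt_delivery(message, attempt) must return one of the three
    outcomes above. Returns {message: (outcome, attempts, time_finished)}.
    """
    # Heap entries are (time_due, sequence_number, attempt, message).
    # The unique sequence number keeps ordering stable for equal times.
    queue = [(0, seq, 1, msg) for seq, msg in enumerate(messages)]
    heapq.heapify(queue)
    seq = len(queue)
    results = {}
    while queue:
        due, _, attempt, msg = heapq.heappop(queue)
        outcome = attempt_delivery(msg, attempt)
        if outcome == DEFERRED and attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)  # 1h, 2h, 4h, ...
            heapq.heappush(queue, (due + delay, seq, attempt + 1, msg))
            seq += 1
        else:
            # Delivered, hard-bounced, or still deferred after the last try.
            results[msg] = (outcome, attempt, due)
    return results


def fake_isp(msg, attempt):
    """Made-up ISP behaviour: one domain always bounces, one defers twice."""
    if msg.endswith("@bad.example"):
        return HARD_BOUNCE
    if msg.endswith("@slow.example") and attempt < 3:
        return DEFERRED
    return DELIVERED


results = run_queue(
    ["a@ok.example", "b@slow.example", "c@bad.example"], fake_isp
)
```

In this toy run the message to the throttling domain gets through on its third attempt, three simulated hours after the first. Multiply that by a few million messages and you can see how a backlog builds.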
Whew.
See why we kill so many hummingbirds?
Seriously though, MailChimp can normally handle all that stuff I described above without breaking a sweat. It’s just another day’s work for MailChimp.
Unless, of course, we have an 80-fold increase in email volume and the email servers have an IP address conflict.
That kind of exacerbated the problem.
And email is built to be asynchronous and persistent. For example, we can send an email to a server across the globe, and the server will send back a handshake to confirm they got it, or that there was a problem and we should retry. And if one of those servers is busy or down at the moment, the servers will keep trying. This “persistence” would come in really handy in the case of say, nuclear war.
But that same persistence can make a big traffic jam even worse, when the servers’ IP addresses aren’t allocated correctly.
Down But Not Out
Still though. The delivery servers never went down, and we continued to deliver emails throughout the entire ordeal (at a relatively slow rate of about 100,000 emails per hour). This is why some people had campaigns that got delivered, and others didn’t. As I’ve told the rest of our team: This is absolutely horrible, but I’ve actually seen worse. In the early days (circa 2001), we had zero redundancy, our servers ran on hamsters (really weak hamsters, at that), and outages sometimes meant losing campaign data altogether. That’s a real nightmare.
We’ve got some improvements to make, but I’m still relieved that the measures we’ve put in place over the last couple years helped prevent a complete meltdown.
Was This Preventable?
The network configuration settings that were applied after the Wednesday reboot would’ve been correct a year ago. But over the last year, our data center has helped us add lots more servers to our configuration, and moved servers to new racks and new locations. So those network settings are very outdated now. And I suppose that, since the server hadn’t been rebooted for about a year, it’s easy to overlook old settings like that. In all honesty, we expected (perhaps naively) that our data center would track those network settings, or would have systems in place to warn us if there are IP conflicts between servers.
Our fault for expecting someone else to manage that for us. Lesson learned.
We’ve now got scripts in place that automatically check IP settings between the servers if a reboot happens again. We’ve already checked all IPs and servers, and everything is now running normally.
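For the curious, the core of a check like that can be tiny. The sketch below is not our actual script. It assumes you already have an inventory of which addresses each server is configured with (the server names and addresses here are invented), and it simply flags any address claimed by more than one machine.

```python
from collections import defaultdict
import ipaddress


def find_ip_conflicts(server_ips):
    """Return {ip: sorted server names} for every IP claimed by 2+ servers."""
    owners = defaultdict(set)
    for server, ips in server_ips.items():
        for ip in ips:
            # ip_address() validates the string and raises ValueError on junk.
            owners[str(ipaddress.ip_address(ip))].add(server)
    return {ip: sorted(names) for ip, names in owners.items() if len(names) > 1}


# Hypothetical inventory: mail1 has come back up with a stale address
# that now belongs to app1.
inventory = {
    "app1": ["192.0.2.10"],
    "app2": ["192.0.2.11"],
    "mail1": ["192.0.2.20", "192.0.2.10"],
}
conflicts = find_ip_conflicts(inventory)
```

Run something like this right after any reboot, and alert if the result is not empty.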
What Went Right, What Went Wrong
We didn’t have a postmortem discussion where we all gathered around the table to discuss how to prevent this from ever happening again. Instead, we had constant, ongoing discussions (via Twitter, instant messenger, email, phone, and in-person) throughout the entire ordeal (and throughout the weekend) about what went wrong, and what went right. Here’s a list:
- During the ordeal, we called our MTA (the email server) vendor for assistance. It’s the same system that large banks and ISPs use. They put a top-notch team of engineers on the phone, who gave us lots of information about how their system works, what problems we might be having, and how we could optimize. Ultimately, the MTA was not our problem, and their advice wasn’t needed. There’s no way they could’ve known about a silly IP conflict issue. But we learned a *lot* about the underlying architecture behind their MTA, which is going to help us tremendously with future scalability.
- At one point, we thought one of our hard drives was starting to fail. That made us pretty paranoid, and so we made sure our backup system was fully operational. It was. Always nice to check up on those sorta things.
- We took too long to post a warning on our login screen. We placed a warning on our Dashboard, but nobody ever reads stuff there. So more and more people logged in to create more and more campaigns. Then, they submitted more and more complaints when they didn’t see their tests in their inbox. Not good. We’ll be launching “roadblock” screens shortly, so that when (not if) we experience server problems again, people will be properly warned while logging in.
- Our engineers have been working on our own MTA for quite some time. Not to replace our current vendor, but to serve as a “brain” that could give us more control over all our MTAs. The way things work now is, we hand email off to the MTAs for delivery. After handoff, we have very little control over them, and the queue in general. Ever send a big document to the office printer, and have a hard time cancelling it, or seeing what page it was working on? Basically, the goal is to be able to plug in all kinds of MTAs (regardless of brand) into MailChimp, like lightbulbs. Really, really expensive lightbulbs. Sounds far-fetched, I know. But parts of this were already in place, and it worked brilliantly. It’s what allowed us to give so many updates about the queue. It’s even wired into GoogleTalk, so we were receiving updates every half hour, which I began posting to Twitter. The very same system is actually being used by our abuse team to alert them of suspicious activity on our servers. Anyway. That went well, and will come in handy in the future.
- When all this first started, we thought our MTA servers ran out of memory. So we jacked up the RAM on them. A lot. It ultimately wasn’t needed, but now they’re loaded. Instead of being able to process email in chunks of 800,000 at any given time, they’ll be able to handle chunks of up to 6,000,000 emails at a time.
- We’ve been “warming up” servers and IPs at another data center over the last few months. Just so that all our eggs are not in one basket. This weekend, we ran a “doomsday” scenario at the new data center, just to see if the same network config problem would happen there too. Nope. It booted up brilliantly. The way they handle IP address allocation is much more modern than our current data center (who we love, and have been with for 10 years). We’ll still have server issues in that new data center (they’re inevitable) but not this particular type of issue. Hopefully.
- Twitter rocked. Users who follow us on Twitter got frequent updates, and they even gave us some words of encouragement (very helpful during such a stressful time). If you run a web app, set up a Twitter account and save it for times like this.
- Our customer service team was awesome. They informed all our customers about what was going on, and after the dust settled, contacted them over the weekend and gave out credits, refunds, inbox inspections, etc. Dan, a MailChimp co-founder and head of customer service, gave some guidelines and messaging tips. But they had carte blanche to deal with issues on a case-by-case basis. Now’s not the time to get stingy with email credit.
- When something like this happens, our entire team knows that we’ll be issuing all kinds of refunds, credits, etc. It’s a given, right? And we assumed our customers knew that. But I guess people are so used to horrible customer service elsewhere, that their initial calls and emails to MailChimp were furious. It helps to defuse all that anger by starting every conversation with, “When the dust settles, you’ll get credits (and then some) for all this. Now here’s what we know is going on…” Then, for the most part, everyone got cool.
- People seemed to like the transparency and frequent updates we posted on the blog, and on twitter. One guy wrote an article about it: “An Outage Done Right.” I’d love to take full credit for being able to handle stuff like this, but the truth is I’m paranoid and always knew that I’d better read up on this topic. If you run a web app yourself, you might also want to read up on the topic. I’ve always tried to learn from companies that have handled it well, like this example from Intuit. I’ve written about “How to apologize for server outages” in the past. Here’s a nice interview from the founder of Magnolia talking about their outage. Finally, one helpful resource is: My Bad: 25 Years of Public Apologies and the Appalling Behavior That Inspired Them (Hardcover). If you run a web app, or any online service that people depend on, it’s your job to know how to handle this kind of situation. Because it’s inevitable.
- Using Twitpic to show our engineers staring at server graphs here and here seemed to help too. I took those pics to show how concerned they were. Actually, I took the pics because I was proud of how they were handling the situation: scientifically. They were very concerned, but were not emotional at all. Just scientific. They had theories, tested those theories, and repeated the scientific process again and again until they figured out the problem. And they were totally calm throughout.
This Will Never Happen Again, Right?
So this is where, traditionally, I’m supposed to say, “our team has done an exhaustive investigation of what happened, and have put in place measures to prevent this from ever happening again.” That’s what companies always say after a server problem, and nobody ever believes them. At least I never do.
The truth is, we investigated, and we put measures in place, and it shouldn’t ever happen again. But servers break. Servers get slow. And servers die.
I mean, they’ve got little needles in them that flap faster than hummingbirds in heat, for pete’s sake. And those little needles flap like that 24/7.
I can’t lie to you. All we can do is keep stuff like this from happening again for as long as possible, and add more and more servers with more and more failover. But when do you ever hear people talking about server problems? When their failover mechanisms fail. So even that won’t prevent this from happening again. Outages happen to the best and biggest of us.
The only honest thing I can tell you is that we’re working extremely hard on separating all of our distinct processes from each other, and putting them on their own distinct, redundant server clusters. We’ve already been working on that for the last several months. We’re at 18 servers now, with 3 more already on order. Our efforts are focused on removing interdependencies between servers, so that as we scale up, adding servers (across different data centers) is easy. We’re moving some things to that magical “cloud” and we’re using independent services for load balancing (instead of relying on our data centers). This won’t prevent server problems 100%, but will help us minimize the impact of future server problems. Hopefully.
To our MailChimp customers who were affected by the slowdown, I apologize. And I thank you for your business. We are doing our best to keep this from happening again for as long as humanly possible.
Archived Updates from the server slowdown:
Update: Saturday 02/14/09, 1:42 pm ET – Last night, at about 8:22 PM ET, the backlogged mail queue finally cleared. The dev team began manually sending a handful of campaigns that they had to move aside to clear the jam, and the customer svc team began formulating a plan to compensate those affected by the delay. I began writing personal emails and direct tweets to people to let them know their campaigns were delivered (and that’s why I forgot to come back here to the blog to tell everyone the queue was clear). As of this morning, we’ve found 3 campaigns that didn’t send. We manually sent those out. If you see any campaigns in your account that don’t appear to be sent (like if you see no stats), please contact customer service, and they’ll get it sent. After we’re through all this, we’ll post a more detailed update about what happened and what we learned from all this.
Update: Friday 02/13/09, 7:37 pm ET – 81,092 emails left in queue. Matter of minutes before the queue is all clear. We’ll be spending the weekend looking for any dropped campaigns, and re-sending. Our cleanup work is not complete, but the system is running at (faster than) normal speeds now. Check your campaign stats for opens, etc., to see if it was delivered. Give it some time to get through ISPs, too. Thanks to everyone for your patience and understanding. Those who were irate – it’s well-deserved, and we’re sorry. Customer service will catch up on the weekend to issue credits. Please contact them if you were affected by the delays.
Update: Friday 02/13/09, 7:02 pm ET – 370,804 emails are left in the queue now, meaning roughly 30 minutes until we’re all cleared out and back to normal.
Update: Friday 02/13/09, 4:35 pm ET – There are exactly 1,624,202 left in the queue. Based on our current sending rate, we’ll be completely through in approx. 3-4 hours. We’ll be working on this all through the night, till we’re done.
Update: Friday 02/13/09, 4:10 pm ET - The “batch” I mentioned earlier? It’s being sent to the queue now. In small chunks, so we can watch closely. We’re up to 15k emails/hr delivery speeds now. Should have an ETA for completion shortly. If your campaigns page says “Sent” that doesn’t mean “sent to recipients.” It means “sent to delivery servers.” An unfortunate semantic nuance that only matters on days like this. Check your campaign’s actual stats to see if there are opens, clicks, bounces, etc. That will tell you if yours was sent to recipients. People are starting to see stats show up, slowly but surely.
Update: Friday 02/13/09, 2:50 pm ET - Most of the backlog has been cleared from the queue, and customers are starting to report that their campaigns are getting delivered. Also, newly created campaigns are sending almost instantaneously now. However, there is still a small batch of campaigns we had to move aside, because we had a suspicion that something in that batch (a corrupt campaign?) might’ve been bogging things down. If your campaign hasn’t been delivered yet, it’s possible that yours is in that batch. We are scanning the batch now, and plan to send to the queue for delivery shortly. Also, we’re still investigating network interface connections at the server level as a possible cause. Beginning to see some light at the end of this tunnel, but we’re far from finished (again, ISP throttling could be the next possible bottleneck).
Update: Friday 02/13/09, 1:26 pm ET - Just heard that throughput on email delivery servers has increased. We were limping along at 60k emails/hr earlier this morning. It’s now up to 600,000/hr. Might be through with this in 3 hrs, but we’ve got to keep an eye on it. Even after the queue is cleared of the backlog, we’ve still got to monitor deliverability with ISP throttling. Still going to be a long day of recovery and investigation, but the traffic jam might be clearing up.
Update: Friday 02/13/09, 12:50 pm ET – Asking data center to run a check on some network equipment, just to be sure. They’ve been very responsive in helping us diagnose things.
Update: Friday 02/13/09, 11:59am ET – Not a hardware issue (thanks to Rackspace and RippleIT for your help). We just did some campaign triage that has sped things up dramatically, which might clear out the delivery queue much sooner than expected. But it’s too early to provide an ETA. That’s because we want to watch how ISPs throttle the surge in emails coming from us. Thanks to those who are supportive during all this, and our sincere apologies to everyone whose campaigns are delayed. They’re getting delivered – just slow.
Update: Friday 02/13/09, 9:55am ET – First and foremost, contact customer svc for make goods. We know you’re frustrated, and we’re sorry. Once we’ve solved the technical problem, we will circle back and issue credits to customers affected by the slow-down. Okay, time for some updates…
We’ve been in talks with the vendor who provides our MTA. Brainstormed some fixes, implemented last night, waited, and it didn’t work. At 8am, we worked with data center to add more memory to delivery servers. They worked fast, but it didn’t help. MTAs were still trying to use disks for memory, instead of new RAM. Had a new theory that a spool drive on one of our delivery servers is starting to fail (don’t worry, we back up all data). We initialized a new drive, and are diverting traffic. It’s running much faster now, so we think our theory was correct. 250k delivered in last 40 minutes. That’s promising. Once initialization of the new drive is complete and it’s running at full speed, it might take 6 hrs to noticeably clear the queue. Probably another 12-18 hrs before full normalcy is restored. To be clear, campaigns have been getting delivered all this time – just slowly. Stay tuned to Twitter for up-to-the-minute updates.
Quick heads-up to all MailChimp customers – not sure what happened yesterday, but we saw a pretty big spike in just about everything. Free trial signups doubled, and we had more than 4 times the normal dB load.
In the past hour alone, around 1 million new emails were queued up for delivery. It’s only 10am here on the East Coast, so we’re bracing ourselves for a busy day.
To quote our customer service team, “Looks like everyone is spreading the love for Valentines Day specials.”
We’re keeping a close watch on our delivery and app servers. Everything is running fine, but slower than normal. Follow us on Twitter for more updates.
On the subject of our servers, yesterday we saw a brief outage from 4:17pm to 4:37pm ET. It was relatively brief, as far as server downtime goes. But to any of our customers building an email campaign at the exact time of the outage, those 20 minutes were frustrating and agonizing, we know.
Sorry about that. Please contact our customer service team, and they’ll work something out with you.
Now What?
As soon as we got everything back online yesterday, we placed an order for another app server, to help balance things out even more.
And we’ve already been working on adding more database servers and replication (those should go online this weekend).
More information about our server expansion plans in this previous post.

Thanks for the explanation guys, and for being proactive about setting up more servers. Your excellent communication is one of many reasons why MailChimp is growing so quickly.
I’m so confident in your ability to deliver that I’m relaunching our corporate website next month with our ‘news’ page content (newsletter signup and archive) coming exclusively from MailChimp. Keep up the good work!
Thanks for the update, I had scheduled to have several emails sent out today and I hadn’t received one of them. Now I know why, thanks for keeping us updated. Great service so far!!
thanks for the update…. my email campaign has been stuck in limbo for over 4 hours now… hopefully this is all resolved soon!
Spent the night making a newsletter so that it could be sent in the morning local time just to see that 4 hours after planned sending, it has not been sent.
Not getting the newsletter out at the planned time will cut the effect by 30-40%, losing revenue because of this.
MailChimp is an excellent service, but problems such as these are the ones that make in-house e-mail handling worth considering.
thanks for the heads-up guys!
I was sitting staring at my inbox wondering where my emails were! Now I know and I’ll let the client know too…
Well, what dya expect, it is Friday 13th after all?!
Ben, if you need me to pull back on the evangelism, I can. Last thing we need is PETA swooping in with some sort of self righteous “you can’t make chimps work that hard” banter. Say the word.
-.-
@Matt – Yes please. Throttle back a little on the evangelism. And if PETA called, we’d send them to: http://blog.mailchimp.com/chimp-sanctuary-update/ Hopefully it buys us a little chimp karma.
I just saw this notice but I’m concerned that this morning I sent a small campaign with the intention of it hitting inboxes at a certain time.
My worry is that the mail isn’t going to hit until a time that will result in fewer opens.
Will there be any re-crediting because of this problem?
@Pete – Absolutely, there will be credits (plus some) issued to everyone affected. For now though, we’re working on getting through the delivery traffic jam. Also, campaigns are getting delivered as we speak. It’s possible some of your recipients are getting your messages now, but not all of them. And very, very slowly. Sorry. We’ll post updates here throughout the day, but also stay tuned to twitter.com/mailchimp if at all possible.
Just tweeted this but thought I’d put it here too.
Not everyone reads the blog and not everyone follows on Twitter.
I think it’s obvious since there are so few comments here that communication of the issues being experienced is somewhat lacking.
Might I suggest putting a large notice on the homepage / account homepage explaining the problem which might prevent people adding more campaigns while you’re trying to clear the backlog.
You know, we put a notice on our dashboard, but didn’t think to put one on the login page *before* people log in. In retrospect, that would have to be the biggest mistake we made in all this. FWIW, things are speeding up dramatically now. Thanks for your feedback, Pete.
Have been waiting 24 hours for my first email to go out. It’s especially disappointing because I sent an email the day before to clients telling them to expect our inaugural newsletter…they’re still waiting.
Hopefully all will be resolved before workday is over.
my email campaign has now been sitting who knows where for over 24 hours. It is a time sensitive email and I am very concerned about when it is going to be delivered. If it goes out tomorrow I will look like an idiot to my customers.
PLEASE fix this issue ASAP!
I am also delaying buying a Mail Chimp plan until I see how this critical delivery issue has been addressed.
@Alisa – took a quick look at the system, and it looks like one of your campaigns from yesterday has registered an “open.” Campaigns are sort of “round-robin’d” across our delivery IPs, so bits and pieces are going out. It’s just taking a long time, because there are so many to send. We’ve just made some great progress a few minutes ago, and hope to get our system un-back-logged soon. Then, we’ll see how long it takes for ISPs to begin accepting everything we’ve delivered.
I saw the notice on the dashboard earlier today and figured you’d put it there after I’d posted.
I think it might be that the page is so, so full of info that the notice just didn’t jump out and I never saw it.
A proper ‘modal’ appearing front and center over everything else would be a great way to display alerts as important as this one. Either that or make ‘em big and bold so they can’t be missed.
Maybe a regular “service status” meter would fit in nicely too?
re: modal windows, that’s actually already in the works for our next release. On top of that, log-in roadblock pages. Cringe, I know, but in some cases necessary. Thanks again — good feedback.
Well, there’s 15 mins left of the working day here and the campaign I sent at 10:45am still hasn’t gone anywhere which is going to mean we missed the banana boat.
Oh well. On Monday our clients can read the mailer they should have got this morning providing they don’t ignore it during their regular Monday morning inbox clearance.
[...] problems. Please stand by,” I’m provided with a running commentary on Twitter and their blog about how they’re drilling into the issue, along with estimates for when things may be back [...]
I sent a campaign (info redacted by Ben for privacy) to 43 recipients. Mail Chimp shows this was sent out at 01:07 AM on the 13th. No one has received the email. I’m concerned that my clients are not going to receive this information. How can I verify that all is well?
@Tom – We’re processing things a bit faster now, but on a first in, first out basis. Those waiting in line before you (since early Feb 12) will be sent first. We’re progressing better (2 steps FWD, 1 step back it seems). Up-to-the-minute-ish updates on Twitter is the best I can do to help verify all is well. But please do contact customer svc to make sure we can issue credits to your account.
thank you Ben!…. the open that was recorded was for a single email I sent out to a late email newsletter subscriber LOL…. it went out AFTER the main campaign did. Weird that they got their email first is it not? Hopefully my main email campaign being affected by this goes out soon, my sale ends tomorrow
Hi Ailsa, yes, they’re sort of randomized and spread across all our different IP addresses, so there’s no telling which will get sent first (at least, not to any decent degree of accuracy). Looking at your campaign, I was just thinking. It was sent to a relatively small list. Have you considered grabbing its campaign-archive link, which we create and host free of charge – and then emailing that to your list from your email program? Here’s how, if you’re interested: http://tinyurl.com/debz8c
yay!! My campaign went through earlier! Thanks for working so hard to resolve this issue!!