Jan 13, 2012

Planned Server Maintenance, and Followup to Server Outage

Last week, we had some hardware failures at our US1 data center that affected about 400,000 users (here’s the blog post with all the related updates). Today I want to post an announcement about some upcoming server maintenance that’s related to that outage, plus provide a little followup to what happened.

Planned Downtime: January 22, 1am ET

First, we’re doing some server maintenance at our US1 data center on Sunday, January 22nd at 1am ET (see this in your timezone). The maintenance will require downtime, but should only last a few minutes. During those few minutes, MailChimp will not be available for US1 users at all. Their campaign links will not work, nor will new subscribes be tracked. Again, it should only be a few minutes before everything’s back online. This upgrade will basically help us rebound faster should a similar outage occur again (heaven forbid).

So what exactly happened that day?

To recap, last year we invested in super fast SSD-equipped servers to handle our increasing traffic. They helped us handle a TON of load, and sped things up nicely through the holidays. Then on January 2nd, several of those servers just up and died all at once, for no apparent reason. It didn’t make any sense, and we’d never experienced anything like it before. We admittedly didn’t spend much time investigating the cause, because we were busy taking out those SSDs and replacing them with 15k rpm SAS drives (plus a bunch more RAM).

Then a few days later, we saw this news: 64GB Crucial M4s crashing after 5,000 hours, fix coming

Those were the exact drives our data center used, and 5,000 hours is how old they were. We can’t say with 100% certainty that was the cause, but we can say there were other drives at US1 that didn’t fail, and they were different models (if something on our end were the cause, you’d expect all the drives to fail). And as stated above, we’re making some changes that should make events like this faster to recover from.
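
For a sense of scale, 5,000 power-on hours can be converted to calendar time. The sketch below is only an illustration: it assumes the drives ran continuously from the day they went into service, the function names are made up for this example, and midnight on January 2nd is a placeholder for the actual failure time.

```python
from datetime import datetime, timedelta

HOURS_PER_DAY = 24


def power_on_days(hours: float) -> float:
    """Convert power-on hours to days of continuous operation."""
    return hours / HOURS_PER_DAY


def estimated_in_service_time(failure_time: datetime, power_on_hours: float) -> datetime:
    """Work backwards from the failure, assuming uninterrupted uptime."""
    return failure_time - timedelta(hours=power_on_hours)


if __name__ == "__main__":
    # Placeholder failure time: midnight on January 2nd, 2012 (assumption).
    failure = datetime(2012, 1, 2)
    print(round(power_on_days(5000), 1))
    print(estimated_in_service_time(failure, 5000))
```

Under that assumption, 5,000 hours is roughly 208 days, which would place the drives going into service around early June of last year.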

Some people have asked us, "So does this mean you don’t keep frequent backups?" And from our close friends who know better, "Ha, guess you guys don’t know much about redundancy?"

To shed some light on the insanity of the situation: 76 drives died in 6 hours. Fortunately, since we separate our users across 3 different data centers, the majority of them had no idea anything was even wrong.

For anybody who’s curious, there’s also this comment from Joe, one of our DevOps Engineers:

I thought I would briefly mention how we set things up though as we take these sorts of downtime and data loss events very seriously and I don’t want people to think this was a simple “a machine died, we restored from backups” scenario.

Each user shard in US1 is supported by 3 separate machines. Each of these machines is powerful enough to easily support the entire user shard on its own but only one of these is active at any given time. The other two sit mostly idle staying in sync and ready to take over in the event of a failure. We essentially are always running up to the second backups to two different machines for each user shard. In addition to all of this, we run full backups on every shard every day for “disaster” scenarios.

We very, very rarely lose an entire machine and when we do users do not notice because one of the backup machines is activated automatically and takes over.

In this case the nature of the hardware failure was so severe and so catastrophic that it impacted ALL of our machines – causing them to crash and corrupting their entire raid arrays of SSDs as they went down. We were unable to pull the current data off these machines for all shards before they were fully offline and despite spending many, many hours trying to recover the data after the crashes we failed except for that one clump of users that didn’t have any data loss.

Hope that sheds some light on our process. This was a truly unique event that I hope never happens again. We are making changes now to fortify our shards against data loss even further.
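
To make the setup Joe describes a bit more concrete, here is a minimal Python sketch of a shard with one active machine and two standbys. It is an illustration only, not MailChimp’s actual code: the `Machine` and `Shard` classes, the `healthy` flag, and the `failover` method are hypothetical names, and real replication and health checking are left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    name: str
    healthy: bool = True  # hypothetical flag; a real system would probe this


@dataclass
class Shard:
    machines: List[Machine]  # e.g. three machines per shard
    active_index: int = 0    # only one machine serves traffic at a time

    @property
    def active(self) -> Machine:
        return self.machines[self.active_index]

    def failover(self) -> Optional[Machine]:
        """Return the machine that should serve traffic.

        Keeps the current active machine if it is healthy, otherwise
        promotes the first healthy standby. Returns None when every
        machine in the shard is down.
        """
        if self.active.healthy:
            return self.active
        for index, machine in enumerate(self.machines):
            if machine.healthy:
                self.active_index = index
                return machine
        return None


if __name__ == "__main__":
    shard = Shard([Machine("db-a"), Machine("db-b"), Machine("db-c")])
    shard.machines[0].healthy = False
    promoted = shard.failover()
    print(promoted.name if promoted else "restore from daily backup")
```

When `failover` returns `None`, every replica in the shard is gone at once. That is the "disaster" case the daily full backups exist for, and it is what made this outage different from an ordinary single-machine failure.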

Customer feedback from the outage

I wanted to share something else with you. After the outage, we sent an email apology to the 788 users who were affected most severely.

Here’s an archive of the email we sent.

I asked those customers to reply and send me any feedback they wanted. I totally expected to be screamed at and threatened for the next few weeks.

Instead, all I got was positive energy. People told me they loved us anyway, that my health was all that mattered, I’m a good human, etc. Here’s a sample of the replies I’ve received so far:

—-

So I am writing to thank you for you attention to this situation! I still LOVE Mail Chimp.. no worries… It was frustrating to reconstruct my campaign and to resend to my list… but I did it and all is well. Luckily, I had sent the campaign and could reconstruct it from the one I sent to myself.

—-

Ben, I’m so bummed. ARGH. Oh, that totally sucks. I thought that I had only lost a mini add-on campaign to one I had done earlier, but I just discovered this evening that it was the whole kit and caboodle campaign. SHEESH! The good news is that I’ve gotten some email replies from the campaign it looks like it got sent, but I have no way to know that for sure, who opened it, who doesn’t want to get email from me anymore, etc. as it doesn’t show up as a sent campaign on my dashboard.

Flippin’ firecrackers, I’m disappointed! I kinda want to throw a banana at you. BUT you’ve got such an amazingly fantastic, user-obsessed, FREE product, it’s hard to be mad at you.
How can you win back my confidence…? I would really love to know the stats for the campaign I sent out. I would like people from that to unsubscribe without sending me an email. Since those things are probably impossible, there’s not much else I can ask of you. You made a mistake, you totally owned it and are doing everything in your power to make it better. I was a Mailchimp evangelist before this, but I daresay this little f-up and your response to it might have just elevated my opinion of Mailchimp even higher.

Thanks for doing and being your best.

—-

Ben, thanks for your lovely email! I really feel you guys care about what you’re doing :-)
Anyway, it’s no biggie. Thanks for letting me know.

—-

Hi Ben,
I don’t need any kind of compensation. This apology is more than enough for me.
Thank you!

—-

Dear Mr Chestnut,

Greetings from the Czech Republic and I hope you are doing well!

I am regret to hear about your hardware failures that have an impact on us. We have a good experiences with Mailchimp and it is clear to me that these things could happen. Due to the effort, the discount for the next month would be much appreciated! Kindly please let me know if it would be possible.

I am truly looking forward to hearing from you and wish you and your family all the best for a healthy and Prosperous New Year.

—-

Hello, thank you so much for all the explanations. As you say, we couldn’t send our campaign yesterday, but there is no problem with that. We’ve been using your service without any problem until now, so thank you for that and don’t worry about your hardware failure. These things sometimes happen.

—-

Although I lost my campaign and as a result need to send it again because of this failure,
Your truly honest mail and apology is something rare in the business world, and for that you have my full confidence in your service,
and much more important, in your credibility as a human being that doesn’t afraid to admit he made a mistake.

So, Thank you for this wonderful service of yours,
and apology accepted!

—-

I just wanted to say that you guys are awesome. Mailchimp customer service is SO top notch every single time. What a perfect "I’m sorry" email :) 

Love y’all, and no worries on the campaign. I’ll just shoot out another one. 

—-

I just got you email – as a free user I am ok with having a simple outage in fact all I lost was the report for a campaign I sent out on the morning of "the incident". And as much as I enjoy seeing which of the 46 magicians read our email reminder – I’m sure the world isn’t going to end.  I say all that to say I appreciate your integrity and willingness to go the extra mile – it makes me confident that if my website ever gets off the ground – I’ll use mailchimp as my email service – thanks for being AWESOME.

—-

That is ok, Nobody can controle the lectronic world….if your healthy..that is all that count

—-

Hey Ben,

No biggie, it was just one of those automated RSS campaigns. This isn’t my paid account which I use for another project so I appreciate the awesome free level of service you continue to provide. You guys do a great job keeping me informed and providing a reliable, easy to use service with helpful support staff. Keep it up.

—-

Hi Ben,

Thank you for your message, I appreciate your concerns. At this point I am only doing testing so nothing of any significance was lost. This is my first attempt creating an RSS campaign and I must say I am quite impressed. I think MailChimp is great.

—-

Thanks for owning up to the problem. Ironic that it happened on the first business day of the year.  

Mailchimp is a great service. Thanks.

—–

Glad to here all sorted guys. Would prefer a t shirt ;) via they are way cool.

—–

I’m devoted to MailChimp in the same way I’m devoted to Apple, Gmail, WordPress and Saddleback Leather. You’re a brand head-and-shoulders above the rest. When I was unable to access my account yesterday, it was disappointing, but had you not sent this email I’d have thought nothing of it. MailChimp had never disappointed me before, so one day without access or autoresponders hardly bothered me.

I appreciate the gesture, though. Keep up the excellent work, Ben. My businesses couldn’t run without you guys. Hopefully soon I’ll have enough subscribers to use the $50 credit on the premium service. (Although, a free lifetime account would have been an awesome way to "make things right" ;) )

—-

Dear Ben,

Thank you for informing me so openly – i really appreciate it.

You are doing a great job and i am very grateful for your services.

My campaign link was broken, but no problem!

Many thanks for your support.

—-

Crap happens. You’all are still my favorite chimps. Thanks for the heads up.

—-

It happens. Don’t worry. I have full confidence in mailchimp. My campaign went and that’s all that matters. I don’t need a record.

Thank you for the explanation.

—-

Thanks for the email but really it seems that my email went out just fine. I got a copy in my inbox and everything looks Okay to me. We are still loving MailChimp!

—-

There were a couple not-so-happy ones too (and well-deserved) but the overwhelming majority of replies were very positive. I love our customers.