Jan 2, 2012
Hardware Issues at US1 Data Center
To our customers, and to their subscribers:
On January 2nd (yesterday), we started to see multiple hardware failures at one of our data centers. As background, we’ve spread MailChimp across 3 data centers across the country so we don’t have all our “eggs in one basket”. Then, we further divide each data center into different groups or “shards” of users. Some shards house big, high-volume users with large lists and intense server resource requirements, while some shards are for users with relatively smaller lists (less than 25k recipients is considered “small”). This is an attempt to keep issues for one set of users from bringing down our entire base of 1.2 million users.
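As a rough illustration of the grouping described above, here is a minimal Python sketch. It is not MailChimp's actual code: the `User` fields, the `shard_tier` and `group_by_shard` names, and the idea of keying groups on (data center, tier) are all assumptions made for the example. Only the 25k "small" threshold comes from the text.

```python
from collections import defaultdict
from dataclasses import dataclass

# From the post: lists with fewer than 25k recipients count as "small".
SMALL_LIST_LIMIT = 25_000


@dataclass(frozen=True)
class User:
    name: str          # hypothetical account name
    data_center: str   # e.g. "US1", "US2", "US4"
    recipients: int    # size of the user's list (assumed to be one number)


def shard_tier(user: User) -> str:
    """Classify a user as "small" or "large" by list size."""
    return "small" if user.recipients < SMALL_LIST_LIMIT else "large"


def group_by_shard(users):
    """Group user names by (data center, tier) so one group's failure stays contained."""
    groups = defaultdict(list)
    for user in users:
        groups[(user.data_center, shard_tier(user))].append(user.name)
    return dict(groups)


users = [
    User("alice", "US1", 500),
    User("bigco", "US1", 80_000),
    User("carol", "US2", 24_999),
    User("dave", "US2", 25_000),
]
groups = group_by_shard(users)
```

In this sketch, a problem in the `("US1", "small")` group touches only `alice`; the other three groups are unaffected.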
US1, which is our first and oldest data center with the most users, saw 3 of the “small user” database shards failing. Around 2pm, we decided to completely disable access to those 3 database shards in order to prevent those users from logging in and creating new campaigns, which would’ve been lost in the event we had to restore from backup. This affected about 400,000 users.
We then began the long, painstaking process of replacing hardware and then restoring data. For 1/3 of the affected 400k users, we were able to restore things nearly perfectly with no campaign or data loss (note that during the outage, we could not collect any new subscribe or unsubscribe activity though). For the other 2/3, the failing hardware corrupted data so badly that we had to revert to backups from 1am ET that morning (January 2nd).
So for those users, any campaigns created, or campaigns sent (including all tracking links) between 1am ET and roughly 3pm ET were lost (RSS and Autoresponder campaigns, however, will simply pick up where they left off).
We’ve researched the logs, and it looks like 788 users lost that campaign data between 1am and 3pm. Needless to say, an email with apologies and refunds and make goods for these users is already in the works. If you were affected by this outage, do not hesitate to contact our support team. They have full authority to make things right with you.
We’re still investigating the exact cause, but we know that back in September, we began replacing our older hardware at US1 with brand new, super fast, super expensive SSD-equipped servers in order to beef up for the Thanksgiving and Christmas holiday email volume. This was a significant infrastructure investment for us to keep things stable (yes, the irony). The upgrades did manage to sustain delivery of 100+ million emails per day during peak periods, but the RAID controllers for the SSDs weren’t working as reliably as we had hoped. When those controllers fail, they apparently break the SSDs along with them. So our plan was to switch back from the SSDs after the holidays when things are usually more stable and quiet (yes, irony). Obviously, we’ll be accelerating those plans as soon as the dust settles.
We sincerely apologize for this, and will be working extra hard to regain your confidence.
P.S.
If you’re into gory technical details, read the comment from Joe, one of our devops engineers, below.
----- Below is the original post, which I wrote in haste to keep people updated -----
Hi everybody, sorry to report we’re experiencing some hardware issues at our US1 data center, which are affecting a large number (but not all) of the users there. US2 and US4 users are not affected. We’re currently in the process of doing some quick final backups, so that we can replace the hardware. To prevent data loss during the switch, we’ve disabled access to MailChimp for those users. We expect this process to take several hours. We’re very sorry for this inconvenience. If your campaigns are affected, talk to our support team, and they’ll work to make things right with you. We’ll also be posting updates here on the blog and on Twitter at @mailchimpstatus.
Update January 2 (10:28 pm): We’ve replaced the hardware, and are currently reinstalling software and data. This is going to take several more hours to get fully operational. We’ll be working through the night to get it all running smoothly again. Sorry again to all affected users.
Update January 3 (4:50 am): Slight progress. 25% of affected user accounts have been restored and are back online. Still working on restoring backups of the remaining accounts.
Update January 3 (7:47 am): A little more progress. Another 25% of the accounts have been restored and are back online.
Update January 3 (8:24 am): The final batch of users has been brought back online. From our customers’ perspective, US1 is live again. Behind the scenes, we’re mopping up some messes, and soon I’ll post more details here on the blog.
Heavenly Order
Wow, you guys must be ‘scritch, scritch, scritching’ right now! Not enough bananas? You provide a great service, so I will wait patiently by my computer until you’re swinging through the jungle again :)
01.02.2012
Enrique
Good luck with this guys – you do offer a great service, thus will be patient.
01.02.2012
nrhk
Thanks guys. Too bad it happens on the first business day of the year when I need to send my monthly update to my clients. Please continue the good work!
01.02.2012
Scott
Hmmm – Naturally, I have several campaigns that need updating and are scheduled for delivery by 10 am EST on Jan 3.
01.02.2012
mike
Thanks for the update. I am glad to see you posting what is happening.
01.02.2012
Donna Johnson
Glad to see that you’ve started this blog post. I’ve started a thread on my FB stream and will let them know to come here, also.
01.02.2012
Robert Graham
I’d totally drive down and bring you guys some pizza and Red Bull… But I live in Michigan.
01.03.2012
FilipEvans
I’ve just sent my biggest and most important campaign for a client to 2500+ people and none of the links work on the campaign… complete failure… I can’t think of a way to resolve this… sending a 2nd time looks unprofessional.
01.03.2012
Edi
Things won’t be as bad as you think they will be. You will be surprised how many of your clients will understand that those things can happen. All the best, Edi
01.03.2012
facebook_martijnengler
Hey guys / gals,
I’m sure you’re doing everything you can to fix this. I didn’t have any campaigns planned until Thursday (except for autoresponders), so I’m pretty good. Just one question: people trying to subscribe to my list… Will they have any issues?
Another question: like I said, my autoresponders won’t work (I think?)
Is there any way to have them queued up for later or something?
- Martijn
01.03.2012
Ben
If your account is affected by the outage, signups and scheduled autoresponders are also off until we can restore service. Very sorry. Should be a few more hours now.
01.03.2012
Larry Escher
Thanks for your work. Server issues happen, especially with the demand that probably comes from the beginning of the year.
I appreciate all your hard work.
Larry
01.03.2012
webinspire
Impressed: How MailChimp handled their outage and kept customers updated with progress. Good work MailChimp! http://t.co/SDFDtz4y
01.03.2012
MikeL
Glad you managed to get it sorted. Unfortunately the notice saying you have sorted it is preventing OpenID login from Google.
Maybe you could mop this up with the rest of the mess backstage.
Thanks.
01.03.2012
Ben
Our message shouldn’t affect OpenID logins. Try clearing cache/cookies?
01.03.2012
Nrhk
Thanks guys for the quick turnaround. I had time to design my campaign and send it just in time. It’s at times like this that one does realize how much one counts on MailChimp to do the work.
01.03.2012
Barry Durdant-Hollamby
Hi guys – thanks for all your work. I’ve got a campaign that we were just about to send out before the crash, but the version now showing is not the most recently updated version. Is the old version lost or will it be returning????
01.03.2012
Ben
Sorry, I’m writing up some more technical details for this blog post now, but I’ll go ahead and respond here: For 2/3 of the affected users, we had to restore from a 1am ET January 2nd backup. So anything created between then and about 3pm ET is unfortunately lost and will have to be rebuilt. That’s my quick, way-too-terse response for now, but I’ll be posting more very soon.
01.03.2012
Neil
Just found out about this from one of my subscribers who said the links in my newsletter were all broken.
The test email I sent to myself just prior to sending the campaign was all okay so I was none the wiser!
Might be useful to send users an email notice to let them know what’s happening.
Rather frustrating but these things happen and only a slight blip in your otherwise fantastic (and largely free!) service.
01.03.2012
Ben
Definitely. We’re gathering all the user accounts affected by this so that we can email them with an update. Sorry your campaign links broke. I’ll have more details written up on this blog post shortly, so that you can at least point your customers here to show it was our problem, not a mistake that you made.
01.03.2012
Diane
I can again access my account but the numbers on my report for January 1 campaign have changed. Yesterday I had 230 opens with a 65% open rate. Today it says 213 and 59%. Will the missing opens be restored?
01.03.2012
Gwen
Pass along my thanks to all on your MailChimp team. Everyone that lives in this techno jungle has experienced a hardware or software issue. I hope your clients handle the situation in the same way they would want THEIR clients to handle it. Thanks for all that you do, and for keeping us informed. Is there any way to know when one’s account is fully functional – before I send out a new campaign?
01.03.2012
celia
These things happen. We love Mailchimp more than ever. I hope you get some much-deserved sleep judging by the timeline of your updates. :-D
01.03.2012
John Garrick
I can only imagine the stress load trying to deal with this! YUCK! I was pretty frustrated myself at the inconvenience, and we spent a lot of time doing damage control with our donors who could not use the links we sent out in the email. HOWEVER, I am grateful for the service you provide; your model is strong and your care for your customers is evident. SO, although it sucks and creates pain for many… there is nothing perfect and some things (many actually) are just out of human control (or chimps’ for that matter). We will make the best out of a cruddy situation. Thanks for all the effort you put forth on a regular basis to make your products available for both paying and free clients. You are appreciated!
01.03.2012
ConnectionMaven
This is the positive way to manage a challenge. Hardware Issues at US1 Data Center-MailChimp Blog http://t.co/IihJfZ6o via @addthis
01.03.2012
John Meredith
This is a good reminder for everyone in IT that backups are essential, but an often overlooked part of disaster recovery is the time it takes to run those restores.
01.03.2012
Henry-Michel Rozenblum
Hope to see your service working soon.
Good luck from Paris, FRANCE
01.03.2012
aileen
ahhhh.. that’s where my draft went – I knew I’d saved it. Argh… I sent out the wrong newsletter to my list this morning, rectified it quickly but had my dose of humble for the year.
It’s nice to know it wasn’t (just) my own stupidity… but I did help!
01.03.2012
Gwen
Wow. You chimps back up often! Only losing 2 hours is amazing. Good work! Thanks!
01.03.2012
joe
Unfortunately the backups were run at 1am so we ended up losing more than 2 hours for those users. The backups we had to use are a last resort that we’ve never had to use before and hope to never need again (but we will obviously continue running them).
I thought I would briefly mention how we set things up, though, as we take these sorts of downtime and data loss events very seriously and I don’t want people to think this was a simple “a machine died, we restored from backups” scenario.
Each user shard in US1 is supported by 3 separate machines. Each of these machines is powerful enough to easily support the entire user shard on its own, but only one of them is active at any given time. The other two sit mostly idle, staying in sync and ready to take over in the event of a failure. We are essentially always running up-to-the-second backups to two different machines for each user shard. In addition to all of this, we run full backups on every shard every day for “disaster” scenarios.
We very, very rarely lose an entire machine, and when we do, users do not notice because one of the backup machines is activated automatically and takes over.
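For readers who like to see it spelled out, the three-machine setup described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not MailChimp's actual failover code: the `Machine` and `Shard` classes, the `healthy` flag, and the machine names are all made up for the example, and real replication and health checks are far more involved.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str             # hypothetical machine name
    healthy: bool = True  # stand-in for a real health check


@dataclass
class Shard:
    machines: list        # three machines per shard, per the description above
    active_index: int = 0

    @property
    def active(self) -> Machine:
        return self.machines[self.active_index]

    def failover(self):
        """Promote the first healthy machine; return it, or None if all are down."""
        for i, machine in enumerate(self.machines):
            if machine.healthy:
                self.active_index = i
                return machine
        return None

    def serve(self) -> str:
        """Serve from the active machine, failing over automatically if needed."""
        if not self.active.healthy and self.failover() is None:
            # Every replica is gone: only the daily full backup remains.
            return "restore from daily backup"
        return f"served by {self.active.name}"


shard = Shard([Machine("db-1"), Machine("db-2"), Machine("db-3")])
shard.machines[0].healthy = False
after_one_failure = shard.serve()   # a standby takes over transparently

for machine in shard.machines:
    machine.healthy = False
after_total_failure = shard.serve()  # all three machines lost at once
```

The second case, where every machine in the shard is down at the same moment, is the only one in this sketch that falls through to the daily backup.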
In this case the nature of the hardware failure was so severe and so catastrophic that it impacted ALL of our machines, causing them to crash and corrupting their entire RAID arrays of SSDs as they went down. We were unable to pull the current data off these machines for all shards before they were fully offline, and despite spending many, many hours trying to recover the data after the crashes, we failed except for that one clump of users that didn’t have any data loss.
Hope that sheds some light on our process. This was a truly unique event that I hope never happens again. We are making changes now to fortify our shards against data loss even further.
01.04.2012
Edi
I feel for you guys, all the best, you are doing a great job!
01.03.2012
myleftone
MailChimp makes good on weekend server outage (props to MC) http://t.co/lOfJETaJ via @mailchimp
01.03.2012
Jonathan Lackman
I love the way you guys tell the unvarnished truth. I’m an I.T. Director and used to having to “spin” reality for operational users. I’m sure you have some clients flipping out, but in reality bad things happen sometimes. Keep telling the truth, and thank your teams who have undoubtedly been up all night and all day working on this.
JL
01.03.2012
Sharif
Fecal matter happens…
A question: I just created a campaign, hit “send”… and, so far, nothing’s happened. Is this due to the hardware problem, or do I have some other type of problem? First time it’s happened, so I’m thinking hardware.
Any idea how long the backlog is? If it’s going to be additional hours, I do have some backup/contingency plans I can use…
Echoing others, thanks for being forthright. If there is ever a “next time”, consider putting something on your splash page…
Peace,
Sharif
01.03.2012
musicmarketeers
@MailChimpStatus That’s some impressive PR, a lot of companies can learn a lot from you guys! http://t.co/xfxlrmdT #honestygoesalongway
01.04.2012
viridianonline
Mailchimp “service unavailable” on 2 Jan due to hardware issues but handled really well by @MailChimp
http://t.co/OU35Lw9b
01.04.2012
Brad Conquer
**YOUR SERVER NOTICE IS PREVENTING MY GMAIL APP LINK FROM SUCCESSFULLY LOGGING IN** Please restore the original page so I may access the account through Google Apps!
01.04.2012
Ben
Hi Brad, clearing your browser’s cache and cookies has helped other people with this problem. If that doesn’t help you, please contact our support team so they can investigate your account: http://help.mailchimp.com
01.05.2012
George
When can we expect to be back up? Seeing how I am one of the non-paying members of MailChimp, I was probably put on the older hardware that is now shut down. Will paying for MailChimp get me on a different server and up and running? I have an important newsletter that has to get out by today if I want it to have any effect over the weekend…
01.05.2012
Ben
Hi George, everything’s been back online since Tuesday morning. If you’re not able to log in, try clearing your browser’s cache and cookies. If that doesn’t work, contact our support team so we can investigate what’s up w/your account.
01.05.2012
dwradcliffe
Always nice to see a good post-mortem: Hardware Issues at US1 Data Center http://t.co/G2XeJNoz via @mailchimp
01.05.2012
LotosYoga
We admire your Honesty. Your service is excellent, your humor is great…
01.05.2012
Steev
Thanks for the update guys! You rock!
Can you satisfy my curiosity and explain what the hardware fault was? I’m a wannabe techie and I love a good story =) How could something possibly be so severe and so catastrophic? Did lightning strike?
01.06.2012
Ben
Look in the comments for a reply from Joe. He describes what happened in gory detail.
01.06.2012
Steev
Thanks Ben, I did read the gory parts =) Joe said “In this case the nature of the hardware failure was so severe and so catastrophic that it impacted ALL of our machines”
He didn’t go into detail on the root cause, just the symptoms. I was really curious as to what set it all off in the beginning?
01.06.2012
Ben
Hey Steev, the maddening part is that there didn’t really seem to be a trigger. Just a mysteriously sudden death, across the board. So it’s been hard to talk about that in public, and still look sane. However, our engineers did just stumble upon this article:
http://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashing-after-5000-hours-fix-coming/
It’s the same hardware we upgraded to, and the same age when we started to hit the prob. It makes us feel a little better that it’s maybe not quite as inexplicable, but it would take more investigation to confirm all this. We’d rather spend the time moving off this equipment entirely. Sorry for the late reply.
01.09.2012
carmen chan
Hi, I’m still having trouble logging into MailChimp to access my campaign. Is there someone I can speak with to help me check my account?
Thanks,
Carmen
01.06.2012
Amanda
Sure thing, Carmen. The first thing you’ll want to try is clearing your browser cache and cookies or using an alternate browser. If neither of those options works and you’re still having trouble, feel free to contact our support team. They’re happy to assist – http://mailchimp.com/support
01.09.2012
GBGames
Honesty and a sincere apology from @MailChimp. A breath of fresh air. http://t.co/bnXo2Ujf
01.07.2012
Joshua Utley
I still cannot access my profile or account. )=
01.09.2012
Amanda
Joshua, mind getting in touch with our support team? They’ll be able to get some additional information from you and help troubleshoot. http://mailchimp.com/support
01.09.2012