Jan 2, 2012

Hardware Issues at US1 Data Center

To our customers, and to their subscribers:

On January 2nd (yesterday), we started to see multiple hardware failures at one of our data centers. As background, we’ve spread MailChimp across 3 data centers around the country so we don’t have all our “eggs in one basket”. Then, we further divide each data center into different groups or “shards” of users. Some shards house big, high-volume users with large lists and intense server resource requirements, while some shards are for users with relatively smaller lists (fewer than 25k recipients is considered “small”). This is an attempt to keep issues for one set of users from bringing down our entire base of 1.2 million users.
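As a rough illustration of the sharding idea described above, here is a minimal Python sketch. It is not MailChimp's actual code: the `place_user` and `affected_users` helpers, the shard naming, and the hash-based shard choice are all assumptions made for the example. Only the 25k "small list" threshold comes from this post.

```python
import zlib
from dataclasses import dataclass

# From the post: fewer than 25k recipients is considered "small".
SMALL_LIST_LIMIT = 25_000


@dataclass(frozen=True)
class Placement:
    data_center: str
    shard: str


def shard_kind(list_size: int) -> str:
    """Classify a user by list size."""
    return "small" if list_size < SMALL_LIST_LIMIT else "large"


def place_user(user_id: str, data_center: str, list_size: int,
               shards_per_kind: int = 3) -> Placement:
    """Hypothetical placement: a stable hash picks one shard of the right kind."""
    index = zlib.crc32(user_id.encode("utf-8")) % shards_per_kind
    return Placement(data_center, f"{shard_kind(list_size)}-{index}")


def affected_users(placements: dict[str, Placement],
                   failed: set[Placement]) -> list[str]:
    """Return only the users whose shard is in the failed set."""
    return sorted(u for u, p in placements.items() if p in failed)
```

With a layout like this, a failed shard takes down only the users placed on it. Users on other shards, and in other data centers, keep working.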

US1, which is our first and oldest data center with the most users, saw 3 of the “small user” database shards failing. Around 2pm, we decided to completely disable access to those 3 database shards in order to prevent those users from logging in and creating new campaigns, which would’ve been lost in the event we had to restore from backup. This affected about 400,000 users.

We then began the long, painstaking process of replacing hardware and then restoring data. For 1/3 of the affected 400k users, we were able to restore things nearly perfectly with no campaign or data loss (note that during the outage, we could not collect any new subscribe or unsubscribe activity though). For the other 2/3, the failing hardware corrupted data so badly that we had to revert to backups from 1am ET that morning (January 2nd).

So for those users, any campaigns created, or campaigns sent (including all tracking links) between 1am ET and roughly 3pm ET were lost (RSS and Autoresponder campaigns, however, will simply pick up where they left off).
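For readers who want to check whether a particular change falls inside that window, here is a small Python sketch. The function name `lost_in_restore` is made up for this example, and the 3pm cutoff is only the approximate time given above, so treat the result as a guide rather than a definitive answer.

```python
from datetime import datetime, timedelta, timezone

# January falls in Eastern Standard Time (UTC-5).
ET = timezone(timedelta(hours=-5), "EST")

# Times taken from this post; the 3pm cutoff is approximate ("roughly 3pm ET").
BACKUP_AT = datetime(2012, 1, 2, 1, 0, tzinfo=ET)
LOCKOUT_AT = datetime(2012, 1, 2, 15, 0, tzinfo=ET)


def lost_in_restore(changed_at: datetime,
                    backup_at: datetime = BACKUP_AT,
                    lockout_at: datetime = LOCKOUT_AT) -> bool:
    """True if a change was made after the backup but before access was cut off."""
    return backup_at <= changed_at < lockout_at


# A campaign created at 9:30am ET on January 2nd falls inside the window.
print(lost_in_restore(datetime(2012, 1, 2, 9, 30, tzinfo=ET)))
```

Because the datetimes carry a time zone, a timestamp recorded in UTC or another zone can be passed in directly without converting it first.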

We’ve researched the logs, and it looks like 788 users lost that campaign data between 1am and 3pm. Needless to say, an email with apologies and refunds and make goods for these users is already in the works. If you were affected by this outage, do not hesitate to contact our support team. They have full authority to make things right with you.

We’re still investigating the exact cause, but we know that back in September, we began replacing our older hardware at US1 with brand new, super fast, super expensive SSD-equipped servers in order to beef up for the Thanksgiving and Christmas holiday email volume. This was a significant infrastructure investment for us to keep things stable (yes, the irony). The upgrades did manage to sustain delivering 100+ million emails per day during peak periods, but the RAID controllers for the SSDs weren’t working as reliably as we hoped. When those things fail, they apparently break the SSDs along with them. So our plan was to switch back from the SSDs after the holidays when things are usually more stable and quiet (yes, irony). Obviously, we’ll be accelerating those plans as soon as the dust settles.

We sincerely apologize for this, and will be working extra hard to regain your confidence.

P.S.

If you’re into gory technical details, read the comment from Joe, one of our devops engineers, below.

 

----- Below is the original post, which I wrote in haste to keep people updated -----

Hi everybody, sorry to report we’re experiencing some hardware issues at our US1 data center, which is affecting a large number (but not all) of the users there. US2 and US4 users are not affected. We’re currently in the process of doing some quick final backups, so that we can replace the hardware. To prevent data loss during the switch, we’ve disabled access to MailChimp for those users. We expect this process to take several hours. We’re very sorry for this inconvenience. If your campaigns are affected, talk to our support team, and they’ll work to make things right with you. We’ll also be posting updates here on the blog and on Twitter at @mailchimpstatus.

Update January 2 (10:28 pm): We’ve replaced the hardware, and are currently reinstalling software and data. This is going to take several more hours to get fully operational.  We’ll be working through the night to get it all running smoothly again. Sorry again to all affected users.

Update January 3 (4:50 am): Slight progress. 25% of affected user accounts have been restored and are back online. Still working on restoring backups of the remaining accounts.

Update January 3 (7:47 am): A little more progress. Another 25% of the accounts have been restored and are back online.

Update January 3 (8:24 am): The final batch of users has been brought back online. From our customers’ perspective, US1 is live again. Behind the scenes, we’re mopping up some messes, and soon I’ll post more details here on the blog.

Discussion

  • Heavenly Order

    Wow, you guys must be ‘scritch, scritch, scritching’ right now! Not enough bananas? You provide a great service, so I will wait patiently by my computer until you’re swinging through the jungle again :)

  • Enrique

    Good luck with this guys – you do offer a great service, thus will be patient.

  • nrhk

    Thanks guys. Too bad it happens on the first business day of the year when I need to send my monthly update to my clients. Please continue the good work!

  • Scott

    Hmmm – Naturally, I have several campaigns that need updating and are scheduled for delivery by 10 am EST on Jan 3.

  • mike

    Thanks for the update. I am glad to see you posting what is happening.

  • Robert Graham

    I’d totally drive down and bring you guys some pizza and Red Bull… But I live in Michigan.

  • FilipEvans

    I’ve just sent my biggest and most important campaign for a client to 2500+ people and none of the links work on the campaign… complete failure… I can’t think of a way to resolve this… sending a 2nd time looks unprofessional.

    • Edi

      Things won’t be as bad as you think they will be. You will be surprised how many of your clients will understand that those things can happen. All the best, Edi

  • facebook_martijnengler

    Hey guys / gals,

    I’m sure you’re doing everything you can to fix this. I didn’t have any campaigns planned until Thursday (except for autoresponders), so I’m pretty good. Just one question: people trying to subscribe to my list… Will they have any issues?

    Another question: like I said, my autoresponders won’t work (I think?)
    Is there any way to have them queued up for later or something?

    - Martijn

    • Ben

      If your account is affected by the outage, signups and scheduled autoresponders are also off until we can restore service. Very sorry. Should be a few more hours now.

  • Larry Escher

    Thanks for your work. Server issues happen, especially with the demand that probably comes from the beginning of the year.

    I appreciate all your hard work.

    Larry

  • MikeL

    Glad you managed to get it sorted. Unfortunately the notice saying you have sorted it is preventing OpenID login from Google.
    Maybe you could mop this up with the rest of the mess backstage.
    Thanks.

    • Ben

      Our message shouldn’t affect OpenID logins. Try clearing cache/cookies?

  • Nrhk

    Thanks guys for the quick turnaround. I had time to design my campaign and send it just in time. It’s at times like this that one does realize how much one counts on MailChimp to do the work.

  • Barry Durdant-Hollamby

    Hi guys – thanks for all your work. I’ve got a campaign that we were just about to send out before the crash – but the version now showing is not the most recently updated version. Is the old version lost or will it be returning?

    • Ben

      Sorry, I’m writing up some more technical details for this blog post now, but I’ll go ahead and respond here: For 2/3 of the affected users, we had to restore from a 1am ET January 2nd backup. So anything created between then and about 3pm ET is unfortunately lost and will have to be rebuilt. That’s my quick, way-too-terse response for now, but I’ll be posting more very soon.

  • Neil

    Just found out about this from one of my subscribers who said the links in my newsletter were all broken.
    The test email I sent to myself just prior to sending the campaign was all okay so I was none the wiser!
    Might be useful to send users an email notice to let them know what’s happening.
    Rather frustrating but these things happen and only a slight blip in your otherwise fantastic (and largely free!) service.

    • Ben

      Definitely. We’re gathering all the user accounts affected by this so that we can email them with an update. Sorry your campaign links broke. I’ll have more details written up on this blog post shortly, so that you can at least point your customers here to show it was our problem, not a mistake that you made.

  • Diane

    I can again access my account but the numbers on my report for January 1 campaign have changed. Yesterday I had 230 opens with a 65% open rate. Today it says 213 and 59%. Will the missing opens be restored?

  • Gwen

    Pass along my thanks to all on your MailChimp team. Everyone that lives in this techno jungle has experienced a hardware or software issue. I hope your clients handle the situation in the same way they would want THEIR clients to handle it. Thanks for all that you do, and for keeping us informed. Is there any way to know when one’s account is fully functional – before I send out a new campaign?

  • celia

    These things happen. We love Mailchimp more than ever. I hope you get some much-deserved sleep judging by the timeline of your updates. :-D

  • John Garrick

    I can only imagine the stress load trying to deal with this! YUCK! I was pretty frustrated myself at the inconvenience and we spent a lot of time doing damage control with our donors who could not use the links we sent out in the email. HOWEVER, I am grateful for the service you provide and your model is strong and your care for your customers is evident. SO, although it sucks and creates pain for many… there is nothing perfect and some things (many actually) are just out of human control (or chimps for that matter). We will make the best out of a cruddy situation. Thanks for all the effort you put forth on a regular basis to make your products available for both paying and free clients. You are appreciated!

  • John Meredith

    This is a good reminder for everyone in IT that backups are essential, but an often overlooked part of disaster recovery is the time it takes to run those restores.

  • aileen

    ahhhh.. that’s where my draft went – I knew I’d saved it. Argh… I sent out the wrong newsletter to my list this morning, rectified it quickly but had my dose of humble for the year.
    It’s nice to know it wasn’t (just) my own stupidity… but I did help!

  • Gwen

    Wow. You chimps back up often! Only losing 2 hours is amazing. Good work! Thanks!

    • joe

      Unfortunately the backups were run at 1am so we ended up losing more than 2 hours for those users. The backups we had to use are a last resort that we’ve never had to use before and hope to never need again (but we will obviously continue running them).

      I thought I would briefly mention how we set things up though as we take these sorts of downtime and data loss events very seriously and I don’t want people to think this was a simple “a machine died, we restored from backups” scenario.

      Each user shard in US1 is supported by 3 separate machines. Each of these machines is powerful enough to easily support the entire user shard on its own but only one of these is active at any given time. The other two sit mostly idle staying in sync and ready to take over in the event of a failure. We essentially are always running up to the second backups to two different machines for each user shard. In addition to all of this, we run full backups on every shard every day for “disaster” scenarios.

      We very, very rarely lose an entire machine and when we do users do not notice because one of the backup machines is activated automatically and takes over.

      In this case the nature of the hardware failure was so severe and so catastrophic that it impacted ALL of our machines, causing them to crash and corrupting their entire RAID arrays of SSDs as they went down. We were unable to pull the current data off these machines for all shards before they were fully offline. Despite spending many, many hours trying to recover the data after the crashes, we failed, except for that one clump of users that didn’t have any data loss.

      Hope that sheds some light on our process. This was a truly unique event that I hope never happens again. We are making changes now to fortify our shards against data loss even further.

  • Edi

    I feel for you guys, all the best, you are doing a great job!

  • Jonathan Lackman

    I love the way you guys tell the unvarnished truth. I’m an I.T. Director and used to having to “spin” reality for operational users. I’m sure you have some clients flipping out, but in reality bad things happen sometimes. Keep telling the truth, and thank your teams who have undoubtedly been up all night and all day working on this.

    JL

  • Sharif

    Fecal matter happens…

    A question: I just created a campaign, hit “send”… and, so far, nothing’s happened. Is this due to the hardware problem, or do I have some other type of problem? First time it’s happened, so I’m thinking hardware.

    Any idea how long the backlog is? If it’s going to be additional hours, I do have some backup/contingency plans I can use…

    Echoing others, thanks for being forthright. If there is ever a “next time”, consider putting something on your splash page…

    Peace,

    Sharif

  • Brad Conquer

    **YOUR SERVER NOTICE IS PREVENTING MY GMAIL APP LINK FROM SUCCESSFULLY LOGGING IN** Please restore the original page so I may access the account through Google Apps!

    • Ben

      Hi Brad, clearing your browser’s cache and cookies has helped other people with this problem. If that doesn’t help you, please contact our support team so they can investigate your account: http://help.mailchimp.com

  • George

    When can we expect to be back up? Seeing how I am one of the non-paying members of MailChimp, I was probably put on the older hardware that is now shut down. Will paying for MailChimp get me on a different server and up and running? I have an important newsletter that has to get out by today if I want it to have any effect for over the weekend…

    • Ben

      Hi George, everything’s been back online since Tuesday morning. If you’re not able to log in, try clearing your browser’s cache and cookies. If that doesn’t work, contact our support team so we can investigate what’s up w/your account.

  • LotosYoga

    We admire your Honesty. Your service is excellent, your humor is great…

  • Steev

    Thanks for the update guys! You rock!

    Can you satisfy my curiosity and explain what the hardware fault was? I’m a wannabe techie and I love a good story =) How could something possibly be so severe and so catastrophic? Did lightning strike?

    • Ben

      Look in the comments for a reply from Joe. He describes what happened in gory detail.

      • Steev

        Thanks Ben, I did read the gory parts =) Joe said “In this case the nature of the hardware failure was so severe and so catastrophic that it impacted ALL of our machines”

        He didn’t go into detail on the root cause, just the symptoms. I was really curious as to what triggered it all off in the beginning?

      • Ben

        Hey Steev, the maddening part is that there didn’t really seem to be a trigger. Just a mysteriously sudden death, across the board. So it’s been hard to talk about that in public, and still look sane. However, our engineers did just stumble upon this article:

        http://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashing-after-5000-hours-fix-coming/

        It’s the same hardware we upgraded to, and the same age when we started to hit the prob. It makes us feel a little better that it’s maybe not quite as inexplicable, but it would take more investigation to confirm all this. We’d rather spend the time moving off this equipment entirely. Sorry for the late reply.

  • carmen chan

    Hi, I’m still having trouble logging into mail chimp to access my campaign, is there someone I can speak with to help me check my account?

    Thanks,
    Carmen

    • Amanda

      Sure thing, Carmen. The first thing you’ll want to try is clearing your browser cache and cookies or using an alternate browser. If neither of those options works and you’re still having trouble, feel free to contact our support team. They’re happy to assist – http://mailchimp.com/support
