Jan 2, 2012

Hardware Issues at US1 Data Center

To our customers, and to their subscribers:

On January 2nd (yesterday), we started to see multiple hardware failures at one of our data centers. As background, we’ve spread MailChimp across three data centers around the country so we don’t have all our "eggs in one basket". We then further divide each data center into different groups, or "shards", of users. Some shards house big, high-volume users with large lists and intense server resource requirements, while others are for users with relatively smaller lists (less than 25k recipients is considered "small"). This is an attempt to keep issues for one set of users from bringing down our entire base of 1.2 million users.
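
For the curious, here’s a minimal sketch of what size-based shard routing can look like. The shard names, pool sizes, and function names below are hypothetical (only the 25k "small list" threshold comes from this post); this is not our actual code, just an illustration of the idea.

```python
# Minimal sketch of size-based shard routing. Shard names and pool sizes
# are made up; only the 25k "small list" threshold comes from the post.

SMALL_LIST_THRESHOLD = 25_000

# Hypothetical shard pools for one data center.
SHARDS = {
    "us1": {
        "small": ["us1-small-1", "us1-small-2", "us1-small-3"],
        "large": ["us1-large-1", "us1-large-2"],
    },
}

def pick_shard(data_center: str, user_id: int, list_size: int) -> str:
    """Place a user on one shard so a single shard failure only takes
    down a slice of the user base, not all of it."""
    pool = "small" if list_size < SMALL_LIST_THRESHOLD else "large"
    shards = SHARDS[data_center][pool]
    return shards[user_id % len(shards)]  # simple deterministic placement

print(pick_shard("us1", 42, 12_000))   # lands on a "small user" shard
print(pick_shard("us1", 42, 300_000))  # lands on a "large user" shard
```

The point of this kind of layout is isolation: when a few shards fail, only the users placed on them are affected, and everyone else keeps working.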

US1, which is our first and oldest data center with the most users, saw 3 of its "small user" database shards fail. Around 2pm, we decided to completely disable access to those 3 database shards to prevent those users from logging in and creating new campaigns, which would have been lost if we had to restore from backup. This affected about 400,000 users.

We then began the long, painstaking process of replacing the hardware and restoring data. For 1/3 of the affected 400k users, we were able to restore things nearly perfectly, with no campaign or data loss (note, however, that during the outage we could not collect any new subscribe or unsubscribe activity). For the other 2/3, the failing hardware corrupted data so badly that we had to revert to backups from 1am ET that morning (January 2nd).

So for those users, any campaigns created or sent (including all tracking links) between 1am ET and roughly 3pm ET were lost (RSS and Autoresponder campaigns, however, will simply pick up where they left off).

We’ve researched the logs, and it looks like 788 users lost campaign data between 1am and 3pm. Needless to say, an email with apologies, refunds, and make-goods for these users is already in the works. If you were affected by this outage, do not hesitate to contact our support team. They have full authority to make things right with you.

We’re still investigating the exact cause, but we know that back in September, we began replacing our older hardware at US1 with brand new, super fast, super expensive SSD-equipped servers in order to beef up for the Thanksgiving and Christmas holiday email volume. This was a significant infrastructure investment for us to keep things stable (yes, the irony). The upgrades did sustain delivery of 100+ million emails per day during peak periods, but the RAID controllers for the SSDs weren’t working as reliably as we had hoped. When those controllers fail, they apparently take the SSDs down with them. Our plan was to switch back from the SSDs after the holidays, when things are usually more stable and quiet (yes, more irony). Obviously, we’ll be accelerating those plans as soon as the dust settles.

We sincerely apologize for this, and will be working extra hard to regain your confidence.

P.S.

If you’re into gory technical details, read the comment from Joe, one of our devops engineers, below.


----- Below is the original post, which I wrote in haste to keep people updated -----


Hi everybody, sorry to report that we’re experiencing some hardware issues at our US1 data center, which are affecting a large number (but not all) of the users there. US2 and US4 users are not affected. We’re currently in the process of doing some quick final backups so that we can replace the hardware. To prevent data loss during the switch, we’ve disabled access to MailChimp for those users. We expect this process to take several hours. We’re very sorry for the inconvenience. If your campaigns are affected, talk to our support team, and they’ll work to make things right with you. We’ll also be posting updates here on the blog and on Twitter at @mailchimpstatus.


Update January 2 (10:28 pm): We’ve replaced the hardware and are currently reinstalling software and data. It’s going to take several more hours to get everything fully operational. We’ll be working through the night to get it all running smoothly again. Sorry again to all affected users.

Update January 3 (4:50 am): Slight progress. 25% of affected user accounts have been restored and are back online. Still working on restoring backups of the remaining accounts.

Update January 3 (7:47 am): A little more progress. Another 25% of the accounts have been restored and are back online.

Update January 3 (8:24 am): The final batch of users has been brought back online. From our customers’ perspective, US1 is live again. Behind the scenes, we’re mopping up some messes, and I’ll post more details here on the blog soon.