Sep 16, 2010

Update on v5.3 upgrade, delivery issues

[UPDATE: 9-17-10, 8:40am]

As of 6:47am ET, things appear to be back to normal. The RSS-to-email and autoresponder campaigns that we trickled out overnight have all been sent. Our delivery queues are back down to normal levels. If you have a campaign that did not get sent, please contact support, and we’ll look into its status for you. Thanks for your patience throughout all this.

Summary of what happened yesterday…

[UPDATE: 9-16-10, ~9:30pm]

Yesterday at 10pm ET, we launched MailChimp v5.3. The upgrade process went well. It only took about 30 minutes to finish. But then we started to experience extremely heavy loads the next morning around 10am. We quickly had a massive backlog of emails on about 10% of our IP addresses. This doesn’t sound like much, but it’s frustrating if you’re waiting on a test or a campaign, and it’s stuck in one of those particular queues. The rest of the system performed okay. Instead of roughly 3.5 million deliveries per hour, we were averaging 3 million. But every time we got close to clearing out those queues, another surge of traffic would hit. One surge caused an outage, which made one of our MTAs (a big email delivery server) reboot itself.

When these things reboot, they can usually remember where they left off, and resume their job. But sometimes, they get mixed up, and they resend campaigns. Unfortunately, a handful of our customers’ campaigns were sent multiple times. We apologize to you, and to your recipients, for that. Contact our support team, and we’ll make it up to you.

Unfortunately, it didn’t end there. Around 8:30pm, we had another surge in volume (this is also when we normally start running some background processes), which caused yet another brief outage. On the one hand, this was downright miserable. But on the other hand, some of the processes that ran ultimately helped clear things up. So we were finally able to get a grip on things and spread out the volume to be less "spiky."
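
For readers wondering what "spreading out the volume" can mean in practice, here is a minimal, purely illustrative sketch. It is not MailChimp's actual code: the function name spread_sends, the job IDs, and the 60-minute window are all hypothetical. The idea is simply to take a batch of jobs that would otherwise start at the same moment and give each one an evenly spaced start time instead.

```python
from datetime import datetime, timedelta


def spread_sends(job_ids, start, window_minutes):
    """Assign each job an evenly spaced start time within a window.

    Hypothetical helper for illustration only: instead of every job
    starting at `start`, job i starts at start + i * (window / n).
    """
    if not job_ids:
        return {}
    step = timedelta(minutes=window_minutes) / len(job_ids)
    return {job_id: start + i * step for i, job_id in enumerate(job_ids)}


# Four made-up jobs that would all have fired at 8:30pm are
# spread across the following hour, 15 minutes apart.
schedule = spread_sends(["a", "b", "c", "d"], datetime(2010, 9, 16, 20, 30), 60)
for job_id, when in schedule.items():
    print(job_id, when.strftime("%H:%M"))
```

A real delivery system would likely also weigh queue depth and per-IP limits, but even fixed spacing like this turns one big spike into a steadier trickle.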

So. Things seem to finally be calming down in terms of the server-gremlin whack-a-mole game we’ve been playing all day. Emails are sending, albeit a little slower, but the load is a lot better now. And we’re watching things closely to make sure no more moles pop up. Again, very sorry for the inconvenience.