Aug 4, 2011

MailChimp Status Page

MailChimp is constantly growing and adding new customers around the world. Currently, we manage a hundred servers and employ CDNs and Akamai Web Application Accelerator to keep things running smoothly and quickly.

Sometimes, though, we still see things like:

Example Tweet

That can be frustrating for us, too. While our infrastructure is rock-solid and we are rarely down, sometimes users experience issues connecting to one of our many servers (often, the user is on a wifi hotspot that doesn’t like a CDN somewhere). And these days, when people have issues, they tweet something like, "Hey @mailchimpstatus, are you down?" The answer is more complex than "yes" or "no." And we think it needs more explanation than those green or red "this server is alive" icons.

To help troubleshoot these situations, we just launched the MailChimp Status Page:

It’s a quick way to see how fast MailChimp is running from the geographical standpoint of our customers.

To do this, we’re using the app monitoring service from Webmetrics, which has computers strategically placed in cities around the world. Instead of just "pinging" our servers to see if they’re alive, every couple of minutes those computers log in and perform a series of actions inside MailChimp to simulate a human user:

  1. Log in
  2. Visit the MailChimp dashboard
  3. Open campaigns screen
  4. Open reports screen
  5. View a single campaign report
  6. Return to the dashboard
  7. Log out
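To make the idea concrete, here is a minimal sketch of how a scripted cycle like the one listed above could be timed. It is not the Webmetrics implementation: the `time_cycle` helper, the `noop` placeholder, and the `CYCLE` list are all hypothetical names, and a real monitor would replace each placeholder with code that actually drives a browser or HTTP session.

```python
import time
from typing import Callable, Dict, List, Tuple

# A step is a label plus a callable that performs it. The callables used here
# are hypothetical stand-ins, not real MailChimp or Webmetrics calls.
Step = Tuple[str, Callable[[], None]]


def time_cycle(
    steps: List[Step],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Run each step in order; return seconds per step plus a 'total' entry."""
    timings: Dict[str, float] = {}
    cycle_start = clock()
    for name, action in steps:
        step_start = clock()
        action()
        timings[name] = clock() - step_start
    timings["total"] = clock() - cycle_start
    return timings


def noop() -> None:
    """Placeholder action; a real monitor would issue requests here."""


# Step names mirror the list above.
CYCLE: List[Step] = [
    (name, noop)
    for name in (
        "log in",
        "visit dashboard",
        "open campaigns screen",
        "open reports screen",
        "view single campaign report",
        "return to dashboard",
        "log out",
    )
]

if __name__ == "__main__":
    for name, seconds in time_cycle(CYCLE).items():
        print(f"{name}: {seconds:.6f}s")
```

Passing the clock in as a parameter is only a convenience for testing; the default uses `time.perf_counter`, which is suited to measuring short durations.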

This entire "cycle" is timed, and we track the results. For each city, we store the "average cycle time" so that we know what’s "normal" for that region:

This helps tremendously when someone from Argentina tweets something like, "Anybody else feel like MailChimp is slower than usual today?" Using the screenshot above, we can see that Buenos Aires customers generally experience slower connection times than Austin customers (maybe it’s the intertubes), but as of right now, their connection to MailChimp is technically faster than normal.
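The comparison described here, judging a city against its own history instead of against other cities, can be sketched in a few lines. This is an illustration under assumptions, not how the status page is actually built: the `CityBaseline` class is a hypothetical name, and the cycle times below are made-up numbers, not real measurements.

```python
from dataclasses import dataclass


@dataclass
class CityBaseline:
    """Running average of cycle times for one monitoring location."""

    samples: int = 0
    average: float = 0.0

    def record(self, seconds: float) -> None:
        # Incremental mean, so past samples do not need to be stored.
        self.samples += 1
        self.average += (seconds - self.average) / self.samples

    def compare(self, seconds: float) -> str:
        """Describe one measurement relative to this city's own average."""
        if self.samples == 0:
            return "no baseline yet"
        if seconds < self.average:
            return "faster than normal"
        if seconds > self.average:
            return "slower than normal"
        return "normal"


# Made-up cycle times in seconds, for illustration only.
buenos_aires = CityBaseline()
for t in (11.0, 12.0, 13.0):
    buenos_aires.record(t)

austin = CityBaseline()
for t in (4.0, 5.0, 6.0):
    austin.record(t)

latest = 10.5  # hypothetical latest Buenos Aires cycle time
print(buenos_aires.compare(latest))
print(latest > austin.average)
```

With these invented numbers, the latest Buenos Aires cycle is slower than Austin's average yet still faster than Buenos Aires's own average, which is the distinction the paragraph above draws.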

In this specific case, we might ask the user to test their connection to MailChimp by using our connection speed tools on the status page:

We built this tool because it was so hard to walk our customers through the steps of running a traceroute to pin down where in their connection the problem was.

Instead, you click a button, and we check the following:

This tells us if you’re having problems with our app servers or a CDN, which in turn helps us know which vendor to call (or tweet) for help.
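As a rough illustration of that last step, the sketch below groups failed checks by the kind of endpoint they hit. The exact checks the tool runs are not spelled out here, so everything in this example is an assumption: the `failing_groups` function, the "app" and "cdn" labels, and the example hostnames are all hypothetical.

```python
from typing import Dict, List, Tuple

# Each result is (kind, target, ok). The kinds and targets are hypothetical
# labels chosen for this sketch, not the tool's real checks.
CheckResult = Tuple[str, str, bool]


def failing_groups(results: List[CheckResult]) -> Dict[str, List[str]]:
    """Return the targets that failed, grouped by endpoint kind."""
    failures: Dict[str, List[str]] = {}
    for kind, target, ok in results:
        if not ok:
            failures.setdefault(kind, []).append(target)
    return failures


# Invented results for illustration only.
example = [
    ("app", "app.example.com", True),
    ("cdn", "cdn-a.example.net", False),
    ("cdn", "cdn-b.example.net", True),
]
print(failing_groups(example))
```

An empty result means every check passed; a result containing only a "cdn" entry would point toward the CDN vendor, not the app servers.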

To see how our servers are responding near you, visit our new status page.

Related Quick Tips:

  • Follow @mailchimpstatus on Twitter to be notified of server issues we’re experiencing.
  • Bookmark login.mailchimp.com instead of visiting MailChimp’s home page and clicking the "login" button. It will save you some time, and it will also help in those cases where our public website is down but the application is not.