Feb 9, 2011

MailChimp’s Email Genome Project

Every once in a while, we ask some random questions about email here at MailChimp. Questions like:

  • Remember that blog network that just got hacked, and how all their user data was posted to the public? Wonder if any bad guys are importing that email list into MailChimp anywhere. Would be nice to shut them down, and maybe even report them to the FBI.
  • Hey, what if we purchased some spam lists ourselves, and just used them to scan all users’ imported lists for high levels of correlation?
  • Across all the emails we’ve ever sent, what’s a realistic "average shelf life" for a subscriber’s engagement?
  • Is there a *real* "best time" and "worst time" to send email? Of course people will always say "it depends" but what if we actually crunched (all) the numbers anyway? Would we find interesting patterns?
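To make the second question above a bit more concrete, here is a minimal sketch of what "scanning an imported list for correlation" with a known-bad list could look like. This is purely illustrative and not how MailChimp or Omnivore actually does it: the function names, the lowercase-and-trim normalization, and the 0.5 threshold are all assumptions made up for the example.

```python
from typing import Iterable


def normalize(address: str) -> str:
    """Trim whitespace and lowercase, so trivially different spellings match."""
    return address.strip().lower()


def overlap_ratio(imported: Iterable[str], known_bad: Iterable[str]) -> float:
    """Fraction of the imported list that also appears in the known-bad list.

    Hypothetical helper: returns 0.0 for an empty import.
    """
    imported_set = {normalize(a) for a in imported if a.strip()}
    if not imported_set:
        return 0.0
    bad_set = {normalize(a) for a in known_bad}
    return len(imported_set & bad_set) / len(imported_set)


def looks_suspicious(imported: Iterable[str],
                     known_bad: Iterable[str],
                     threshold: float = 0.5) -> bool:
    """Flag an import when its overlap meets an (assumed) threshold."""
    return overlap_ratio(imported, known_bad) >= threshold


if __name__ == "__main__":
    imported = ["A@example.com", "b@example.com ", "c@example.com", "d@example.com"]
    known_bad = ["a@example.com", "b@example.com", "z@example.com"]
    print(overlap_ratio(imported, known_bad))
    print(looks_suspicious(imported, known_bad))
```

Sets keep the comparison cheap even for big lists, but a real system would need far more care (hashing addresses rather than storing them raw, for one) than this toy shows.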

And some questions can be real dilemmas, like:

  • If user X imports a list, and we find a bunch of hard bounces, why don’t we prevent those bad email addresses from being imported into our system by user Y? (after all, lots of bounces can lead to delivery problems at some of the big ISPs)
  • If we know a particular subscriber is a habitual (false) complainer, should we keep allowing them to subscribe to lists that we host? Even if there’s double opt-in proof?

MailChimp Engineers: "Shut up, already. Go look it up yourself."

I guess all these questions finally annoyed our engineers enough to make them set up The Email Genome Project, which scans MailChimp’s 600,000 users, the hundreds of millions of subscribers they manage, and the 40 million (and growing) messages they send every day for nuggets of information that we can use to improve our deliverability and train our Omnivore abuse prevention algorithms.
The fun part of all this? The nerds get to play with cool toys…

First, they set up a server that’s used for some occasional pre-test "heavy lifting." To be honest with you, I don’t think they really needed this one. I’m pretty sure they got it for fun. Whatever the case, here are the specs:

  • 4 x Xeon X7550 CPUs, each 8 cores @ 2.0 GHz with HT
  • 128 GB of DDR3 RAM
  • Hardware BBU-backed RAID 10 of Intel X25-E SLC SSDs

And then they set up another server that is not quite as impressive (with "only" 2 x 6-core Xeons for a total of 24 threads, and 36 GB of RAM). This one was configured more for storage, with a 12-disk RAID 10 of 15k SAS drives giving ~4 TB of usable space.

I pretty much have no idea what I just typed there. Sounds impressive, though. The monthly bill certainly made an impression on me.

But hey, all in the name of R&D. If they wanna use the toys to play Doom (people still play that game, right?) or test their password-cracking skills, it’s all good.

Anyway, the high-level goal of the Email Genome Project is to help improve the email ecosystem. Specifically, we want to provide answers, and fast. The more we learn about email, the better we can help prevent the abuse of it.

We’ll talk more about our findings here on the MailChimp blog soon.

For now, to get a feel for what kind of data our Email Genome Project can produce, you should sign up for Dan Zarrella’s "Science of Email Marketing" webinar.

He asked us a few questions about email marketing. We scanned 10 billion emails, and gave him some answers: