Nov 13, 2012

Project Omnivore: Three Years of Gorging on Data

Let’s take a trip back to the heady days of 2009. Balloon Boy, Battlestar, Ms. Boyle. And, of course, MailChimp freemium. Three years ago, we allowed users to sign up for free accounts that they could keep forever. The effort has been a huge success, and these days we’re up to 2.5 million users. But when we first went free, the prospect really freaked some of us doomsdayers out. Wouldn’t we be opening the doors to the riffraff? How could we keep an eye on millions of customers to make sure no one was sending spam?

If even a few people got in the system and started abusing it, that could mean trouble for our reputation with major ISPs like Gmail, Hotmail, and Yahoo. No, we had to find a way to stop bad actors before they even sent a drop of email.

So we released Omnivore into the wild. Just as Taylor Swift was genetically engineered to handle both pop and country with aplomb, so was Omnivore engineered to predict the good and the bad senders. Omnivore eats all our data and uses it to see into the very soul of users. Not really. It’s just an AI model. But with over 10 years’ worth of MailChimp data generated at a rate of over 4 billion email address interactions a month, Omnivore has gotten pretty stinking intelligent. And fat. Think Churchill or Taft. Or an aging Elvis.

Let’s detail what Omnivore looks like these days and how it’s being used.

Getting Technical for a Moment

Omnivore sits on top of the Email Genome Project infrastructure. In its brain are internal profiles on nearly 2 billion email addresses: when they’ve been sent to, when they’ve bounced, when they’ve clicked and opened, whether they have "asdf" in their prefix or use funky symbols, whether their domain is Gmail or some graphic design firm in Ljubljana, whether the address is stolen, for sale, or public, etc. We keep all this nutty data in RAM (!), so right now we’re pushing one terabyte of utilized RAM in a key-value store called Redis that sits on top of a massive sharded Postgres implementation. What this means to you is that Omnivore makes judgements about users and their lists ultra-fast.

In terms of models, Omnivore right now uses an ensemble of boosted trees and random forest models to provide the application with data about new users and new lists. One of the main predictors for these models is something I’ve termed the "Badness Cumulative Distribution Function" or BCDF. Sexy, I know.

Let me break down what the BCDF is for you. Let’s assume we’ve got a heretofore unknown user. Each email address on that user’s list is a piece of evidence supporting whether the user is good or bad. So if you import 80,000 addresses into MailChimp, that’s 80,000 pieces of evidence telling us just how trustworthy you are. We score each email address from the user from 0 (good) to 1 (evil). There’s some secret sauce in here as to how this works so I’ll just replace our copyrighted genius with some Star Wars characters so that you’ll get the idea:

Now, once we’ve scored every address on a list, we combine those scores into a single cumulative curve. This curve is the user’s Badness CDF, and it’s one of many things that flows into our AI models. Here are two real-world examples of these curves:

In the graph above, the bad user has all sorts of Jar Jars and Billy D’s on their list. The good user is majority Yoda. That’s partially how the model kicks spammers out the airlock.
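If you’d like to see the shape of the idea in code, here is a minimal sketch of turning per-address scores into a cumulative curve. It is not our actual scoring or modeling code: the function name `badness_cdf`, the `grid_points` parameter, and the example scores are all made up for illustration, and the real secret sauce (how each address gets its score) stays secret.

```python
from bisect import bisect_right


def badness_cdf(scores, grid_points=11):
    """Build an empirical cumulative curve from per-address badness scores.

    Each score is assumed to lie between 0 (good) and 1 (evil). The result
    is a list of (threshold, fraction) pairs, where fraction is the share
    of addresses scoring at or below that threshold.
    """
    if not scores:
        raise ValueError("need at least one scored address")
    if grid_points < 2:
        raise ValueError("need at least two grid points")

    ordered = sorted(scores)
    total = len(ordered)
    curve = []
    for i in range(grid_points):
        threshold = i / (grid_points - 1)
        # bisect_right counts how many sorted scores are <= threshold.
        curve.append((threshold, bisect_right(ordered, threshold) / total))
    return curve


# Hypothetical scores, invented for this example only.
mostly_yoda = [0.05, 0.1, 0.1, 0.2, 0.9]
mostly_jar_jar = [0.3, 0.7, 0.8, 0.9, 0.95]

good_curve = badness_cdf(mostly_yoda, grid_points=3)
bad_curve = badness_cdf(mostly_jar_jar, grid_points=3)
```

A list full of low scores produces a curve that shoots up early, while a list full of high scores stays flat until the far right. Sampling the curve at fixed thresholds gives every list, big or small, a feature vector of the same length, which is one plausible way a curve like this could be fed to a model.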

Too Long; Didn’t Read

In short, Omnivore lets us know things like:

  • How many email addresses on a list are likely to be dead (Hard Bounce)? How does a list get oodles of dead addresses on it? If these people legitimately signed up for a newsletter recently, especially if they double opted in, then why would their addresses be dead? The list must be old or collected with questionable methods.
  • How many email addresses on the list have been stolen, sold, or scraped? Good people don’t sign up for a MailChimp account with a list that’s identical to the ones stolen from Sony or Ticketmaster. Just doesn’t happen.

How does a user experience Omnivore?

Hopefully, you don’t! At MailChimp, we wish we could just open the doors and let people flow through our system with as little friction as possible. Now that we have Omnivore humming along, we’re getting close to making that a reality.

When you upload a list into the system, Omnivore predicts whether you’re naughty or nice. And the prediction has to be lightning fast. Right now we run a user’s list data through 5 million naive classifiers in just 2 seconds. Just like your cranky grandma, Omnivore’s a quick judge of character. These predictions flow through a live dashboard that alerts us any time an evildoer just got punted. Here’s a snapshot of the dashboard:

If you’re nice, you shouldn’t see any hurdles as you attempt to send newsletters or buy a larger account. Your predictions just bounce along the bottom of that graph, and Omnivore lets you slide through. If Omnivore thinks you’re in a middle ground and may have some problems, you might be advised to clean up your list or may be allowed to send on a provisional basis. If Omnivore thinks you’re undoubtedly evil, you get its mechanical foot in your back as you’re shown the door.
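The three outcomes described above boil down to a pair of cutoffs on a predicted badness score. Here is a rough sketch of that kind of triage. The function name `triage`, the outcome labels, and especially the cutoff values `review_at` and `block_at` are hypothetical placeholders, not the numbers Omnivore actually uses.

```python
def triage(badness, review_at=0.3, block_at=0.9):
    """Map a predicted badness score in [0, 1] to one of three outcomes.

    The cutoffs are illustrative assumptions, not production values.
    """
    if not 0.0 <= badness <= 1.0:
        raise ValueError("badness must be between 0 and 1")
    if badness >= block_at:
        return "block"        # undoubtedly evil: shown the door
    if badness >= review_at:
        return "provisional"  # middle ground: clean up or send provisionally
    return "allow"            # nice: no hurdles


outcomes = [triage(score) for score in (0.05, 0.5, 0.95)]
```

Where the cutoffs sit is a judgment call: a lower `block_at` catches more spammers but boots more legitimate senders, so the middle tier exists to soak up the uncertain cases.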

The great thing about this setup is that good folks are no longer slowed down by vetting in the same way they used to be. If we can verify you’re legit, you move through the system like offal on its way to a sausage casing. Yay!

Back to Freemium

Not just anyone can open up their site to a freemium plan. MailChimp is great fun for the user, but we’re dead serious about keeping the email ecosystem healthy for our users. Over the past three years we’ve been refining our models, and in just a couple of months we’ll be releasing an even better version of Omnivore.

The neat part is that as we grow and gain more users and traffic, our models get smarter, which allows us to safely grow even larger. Our growth as a company and our reputation for great delivery are now intertwined, and I can’t wait for the future…when Omnivore becomes sentient and takes over the earth.