Jan 27, 2010

Project Omnivore: Declassified


In late 2008, MailChimp Labs began Project Omnivore. Our goal was to build a massively scalable tool for our abuse team that could predict bad behavior.

The experiment started with an nVidia Tesla supercomputer, then grew to a cluster of Amazon EC2 servers that ran a genetic optimization program nonstop for 2 weeks, performing over 61 trillion email data comparisons.

This article shares some of the results of our experiment, and where the technology is taking us…

Why Is Omnivore Needed?

You know what the hardest part of running an Email Service Provider (ESP) is? Detecting ignorant spammers. They’re very different from evil spammers. See, it’s pretty easy to detect "evil" spam. You know, the pharmaceutical appendage enhancing stuff, phishing scams, and Nigerian prince (419) junk. Spam filters actually do a really good job of catching the evil stuff nowadays (not perfect, but pretty darn good, all things considered). And most ESPs employ some kind of spam filter (usually a variation of SpamAssassin) to scan outgoing emails in their queue. Either to prevent evil spam from tainting our reputation, or to "grade" the spamminess of a message.

But those spam filters aren’t designed to detect when an ignorant marketer doesn’t realize he’s spamming, and sends a mass email without permission (remember, the definition of spam is "unsolicited bulk email").

Lack of permission, in an otherwise perfectly legitimate looking business email, is very subtle and much harder to detect.

I’m talking about when a well-meaning small business owner just wants to get the word out about his new store, and "blasts" an unsolicited email to a list he obtained from his local chamber or from a tradeshow. He didn’t mean harm, and he thinks he’s "just doing business," but he’s actually spamming. While it’s a different flavor of spam, it’s still spam (again, see: definition of spam). This kind of spam is hard to detect because the content is often perfectly fine and doesn’t contain the normal keywords or traits that spam filters are trained to look for. But this flavor of spam can cost an ESP dearly, because it tends to generate the bad kind of engagement (high complaints, high bounces, high unsubs) that can get our IPs blacklisted by email gateways and ISPs.

How exactly does one detect the lack of permission in someone’s account? Across over 230,000 accounts? Sure, we’ve got a well-trained compliance team who can review a new user’s account, and in the blink of an eye, judge whether or not they’re going to cause trouble for us. But as good as we are, a human review team is just not scalable enough to deal with hundreds of thousands of senders. Not to mention that someone we might approve as a "good sender" can eventually become a "bad" sender. Rigorous, 24/7 account review becomes a necessity.

So our abuse desk decided long ago that we had to change the way we think about handling abuse. We began experimenting and analyzing massive amounts of data in 2008, which led to our list activity score feature. The idea here was to stop classifying customers as good or bad (and giving them access to special IP ranges for better deliverability), and start looking at their list management practices instead.

This then led to even more granular analysis: subscriber engagement tracking. We now treat email delivery differently, depending on the engagement level of your subscribers. Which is nice, considering ISPs are also looking at engagement to decide whose emails show up in the inbox or not. As a sender, you can segment your campaigns based on subscriber engagement, or clean out the inactive members.

But it was when we came up with the idea for our freemium plan that we knew we needed a completely automated, intelligent abuse detection system in place. Without a scalable abuse prevention system, there’d be no (scalable) way to protect the deliverability of our servers from the abuse that comes with free. So we stepped up our research and created Omnivore.

What Omnivore Does

Omnivore is a program that runs in the background and analyzes email campaign and user account data. Non-stop.

When it finds anything suspicious about a MailChimp user or his campaigns, it’ll do one of two things:

  1. Send the user a warning for something that looks problematic.
  2. Suspend a user’s account for something bad, send them a warning, and alert our abuse team to investigate the account.
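
The two responses above can be pictured as a simple tiered decision. Here is a minimal sketch in Python, assuming Omnivore produces some kind of numeric risk score. The score, the threshold values, and the action names are all hypothetical; the article doesn't say how Omnivore actually decides between the two.

```python
# Hypothetical cutoffs; the real ones are not described in this article.
WARN_THRESHOLD = 0.5
SUSPEND_THRESHOLD = 0.9

def actions_for(risk_score):
    """Return the list of actions for an account, given an assumed risk score in [0, 1]."""
    if risk_score >= SUSPEND_THRESHOLD:
        # Something bad: suspend, warn the user, and bring in the humans.
        return ["suspend_account", "send_warning", "alert_abuse_team"]
    if risk_score >= WARN_THRESHOLD:
        # Something that merely looks problematic: just warn the user.
        return ["send_warning"]
    return []

print(actions_for(0.95))
print(actions_for(0.60))
print(actions_for(0.10))
```

Note that the suspend branch always ends with a human alert, which matches the point that automated suspensions are reviewed by the abuse team.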

What Omnivore Doesn’t Do

Most important of all, Omnivore doesn’t replace or reduce our human abuse desk team. And despite what some angry people out there might think (or tweet), Omnivore doesn’t shut down "totally innocent, opt-in users" with "absolutely no warning." Humans review reports from Omnivore. If an account’s been suspended or flagged by Omnivore for problems, our team investigates. So long as the user is not obviously an evil spammer, we attempt to contact the sender with some advice or instructions for account reinstatement. If you’re curious about how our abuse team makes its decisions, check out these compliance tips.

How Omnivore Works

Chad, our lead engineer, headed up the Omnivore Project. I’ve asked him to provide some technical insight into how it all works.

Ben: Without revealing too much of the secret sauce, how does Omnivore work? I heard the team discussing something about "genetic optimization?"

Chad: Yes, in a nutshell, genetic optimization is a method of determining the best option from a large set of possible choices. When the universe of possibilities is large enough, it isn’t practical to just try all of them and pick the best – you have to use an optimization algorithm to home in on the best choices. Genetic optimization uses a process that roughly mirrors how natural selection can incrementally produce the fittest candidate over many generations, hence the name. You create a population of possible options, then breed and mutate the top performers until you get a good enough solution to stop. Assuming that choices that are similar to each other will perform similarly, this can get you to a good answer relatively efficiently.
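
For readers who want to see the idea in code, here is a minimal sketch of genetic optimization in Python. It is not Omnivore's code. The bit-string candidates, the toy fitness function (count the 1s), the population size, and the mutation rate are all assumptions picked to keep the example small.

```python
import random

def genetic_optimize(fitness, n_genes, pop_size=30, generations=40,
                     mutation_rate=0.05, seed=0):
    """Toy genetic optimization over bit-string candidates (illustrative only)."""
    rng = random.Random(seed)
    # Create a random population of possible options.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the candidates and keep the top half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            # Breed: splice two parents together at a random cut point.
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)
            child = mom[:cut] + dad[cut:]
            # Mutate: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Hypothetical fitness function: the more 1s, the fitter the candidate.
best = genetic_optimize(fitness=sum, n_genes=20)
print(best, sum(best))
```

Because the parents survive into the next generation unchanged, the best candidate never gets worse from one generation to the next. A fixed number of generations stands in here for "good enough to stop."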

Ben: So how’d you apply that to email marketing and spam?

Chad: We took every bad campaign that had ever been shut down by our human reviewers as well as every bad campaign that managed to get through, and started looking for common patterns.  We know a lot about every campaign that goes through our systems, as well as every list we manage and customer we sign up.  Our human experts had a laundry list of the traits that scream "bad campaign", but for this thing to scale we needed to be absolutely, mathematically certain.  So we used a series of large scale genetic optimization tests running against every campaign we’ve ever sent to confirm which traits were predictive, and how predictive they were.

We did this for both negative reactions (bounces, unsubscribes, abuse complaints) and signs of engagement (opens, clicks) to give our team a complete picture of the likely results of any campaign, before the campaign is ever sent.  If Omnivore sees something that it’s certain will be bad, it alerts the abuse desk to review the campaign before it’s let through the system.

Ben: I hear you tried this on the machines at the office and they were too slow?

Chad: Right – even early small-scale tests would run for weeks before giving good results. The full tests would have taken years to complete. We ended up getting an nVidia Tesla and writing the process in highly-optimized C code, which was able to give us our preliminary results in a couple of hours. After we knew our algorithm was pretty close to what we wanted, we converted the process to a giant Hadoop Map/Reduce program running on a cluster of Amazon EC2 servers for about 20 days to get the final results for the first version.  Smaller optimization processes still run continuously to test new ideas and refine the model.
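
If Map/Reduce is unfamiliar, the pattern looks roughly like the sketch below, written in plain Python rather than Hadoop. The campaign records, trait names, and the choice of complaint rate as the statistic are made up for illustration; this only shows the shape of the pattern, not the actual Omnivore job.

```python
from collections import defaultdict

# Made-up campaign records, for illustration only.
campaigns = [
    {"traits": ["long_subject"], "sent": 1000, "complaints": 4},
    {"traits": ["long_subject", "purchased_list"], "sent": 500, "complaints": 10},
    {"traits": [], "sent": 2000, "complaints": 1},
]

def map_campaign(campaign):
    """Map step: emit (trait, (sent, complaints)) for each trait on a campaign."""
    for trait in campaign["traits"]:
        yield trait, (campaign["sent"], campaign["complaints"])

def reduce_trait(values):
    """Reduce step: total the counts for one trait and compute a complaint rate."""
    sent = sum(s for s, _ in values)
    complaints = sum(c for _, c in values)
    return complaints / sent

# The "shuffle" phase: group the mapped values by key.
grouped = defaultdict(list)
for campaign in campaigns:
    for key, value in map_campaign(campaign):
        grouped[key].append(value)

rates = {trait: reduce_trait(values) for trait, values in grouped.items()}
print(rates)
```

The map step looks at one campaign at a time and the reduce step looks at one trait at a time, so the work can be spread across many machines.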

Ben: So this is totally different than just checking all outgoing campaigns with a spam filter?

Chad: Yes. It’s using the detailed sender information that we have as an ESP to look for that permission "gray area" mentioned above.

More importantly, we needed to be sure that Omnivore would continue to be efficient and predictive as our customer base grew and morphed after the free program was put into place.  Unlike static rules or blacklist-based methods of detecting spam, all of the major Omnivore systems are learning algorithms that keep up with changing user behavior without losing their predictive power.

Ben: After all is said and done, any fun or surprising observations to share?

Chad: Some traits and keywords that we thought we should focus on were actually poor predictors of bad behavior. For example, highly-targeted campaigns don’t do much better than other campaigns when it comes to abuse or unsubscribe rates.  Other things that you’d think are totally irrelevant at first glance turned out to be effective predictors, like the length of the subject line.

Ben: So a subject line that’s too short, or um — too long — would be a sign of trouble?

Chad: Something like that. Keep in mind it takes a combination of traits that add up in order for Omnivore to determine "this looks like lack of permission."
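
One common way to let traits "add up" is a weighted score passed through a logistic function. The sketch below assumes that approach purely for illustration. The trait names, weights, and bias are hypothetical, and the article doesn't say that Omnivore combines traits this way.

```python
import math

# Hypothetical traits and weights, invented for this example.
WEIGHTS = {
    "odd_subject_length": 1.2,
    "vague_permission_wording": 1.5,
    "risky_industry": 1.0,
}
BIAS = -3.0  # assumed baseline: most campaigns are fine

def permission_risk(traits):
    """Combine the traits present on a campaign into a score between 0 and 1."""
    z = BIAS + sum(WEIGHTS[t] for t in traits)
    return 1 / (1 + math.exp(-z))

one_trait = permission_risk({"odd_subject_length"})
all_traits = permission_risk(set(WEIGHTS))
print(round(one_trait, 2), round(all_traits, 2))
```

With these made-up numbers, an odd subject line on its own stays well below 0.5, and only all three traits together push the score past it.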

Ben: Any other interesting observations?

Chad: When we started this process, we went straight to our team of human reviewers to show us the patterns that they were looking at when evaluating a new customer.  A lot of it was right on the money – particular industries definitely have a profile, and the language used when describing where permission came from is crucially important. However, some of the patterns turned out to be less predictive, like having a mailing address displayed prominently in the content and some of the other details of CAN-SPAM compliance.  It was also a bit surprising to discover exactly how bad most spam filters are at predicting permission issues.  Whether or not a campaign passes any of the free or commercial spam filters generally has little impact on its predicted outcomes.

Results So Far

As MailChimp scales and sends more campaigns, Omnivore will collect more data and adapt. It’s by no means complete. There are switches and knobs we haven’t even turned on yet. We’re currently running some of Omnivore’s scanning in "observation mode," and not letting it act on anything. As it gets smarter, we’ll gradually activate more functionality and grant it more decision-making power.

But so far, here are some of the results:

  • As of January 6, 2010, Omnivore has automatically sent 19,581 warnings to 9,349 users for exhibiting bad behavior. Of course, we also include tips and pointers on how they can change their ways.
  • Omnivore has automatically suspended 2,249 users since September 1, 2009.
  • 861 of those users ultimately had to be shut down. We hate losing customers (because we love money), but no customer is worth jeopardizing the deliverability and reputation of our entire system.

Looking Ahead (Literally)

We built Omnivore because we wanted to change the way we think about abuse. The project involved so much data crunching that it resulted in some interesting byproducts. Our subject line suggester is one example, as well as the engagement ranking and segmenting tools we mentioned earlier.

But Omnivore is learning more every day, and is actually getting good at predicting not just bad behavior, but good behavior too. Here’s a snapshot from our internal dashboard:

[Screenshot: Omnivore predictions from our internal dashboard]

As you can see, Omnivore’s predicting open and click rates for this particular campaign, along with the "bad" stuff. As we feed it more data, the margin of error narrows, making it a powerful new feature we could be offering to our customers one day.
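
To see why more data narrows the margin of error, consider the textbook approximation for a rate estimated from n observations. The sketch below uses that standard formula with a 95% confidence level. It is not how Omnivore computes its ranges, and the 20% open rate is a made-up figure.

```python
import math

def margin_of_error(rate, n, z=1.96):
    """Approximate 95% margin of error for a rate observed over n sends.

    Uses the standard normal approximation z * sqrt(p * (1 - p) / n);
    this is an assumption for illustration, not Omnivore's method.
    """
    return z * math.sqrt(rate * (1 - rate) / n)

# Hypothetical 20% open rate at different amounts of data.
for n in (100, 10_000, 1_000_000):
    print(n, round(margin_of_error(0.20, n), 4))
```

With 100 sends the margin is about 7.8 percentage points; with 10,000 it drops to about 0.78. Each hundredfold increase in data shrinks the margin tenfold.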

Omnivore’s predictive reporting is changing the way we deal with abuse, but it might also end up changing the way we think about email marketing in general.