Dec 7, 2011

Fun with Data Science

MailChimp hired a scientist! More specifically, a data scientist. His name is John, and he’s good at calculus.

John’s main focus is our Email Genome Project, which analyzes millions of email lists—and hundreds of millions of email addresses—to find stories and trends in the data. This research helps us understand the email ecosystem, prevent abuse, and create better experiences for you guys.

Nice shirt, John!

Lately, John’s been analyzing how the geographic distribution of a customer’s list relates to her campaign performance. To do this, he uses an agglomerative clustering algorithm, combined with a little calculus, to detect population centers within a list. One of the outputs of the algorithm is something called a dendrogram, which is pretty much a tree diagram showing which clusters got merged together, and in what order. Our dendrograms look like this:

Confused yet? Basically, humans can look at a list and logically group similar items, because spotting similarities comes naturally to us. But computers need a little help with that sometimes, so we’ve been training EGP to calculate distances between our customers’ list members, and then make clusters based on those distances. All this analysis allows us to better predict how a campaign might perform for any given list, based on where the customer is located and where the clusters within that customer’s list are located.
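For the curious, here's a rough sketch of what agglomerative clustering on subscriber locations can look like. This is not EGP's actual code: the single-linkage rule, the great-circle (haversine) distance, the 200 km cutoff, and the sample coordinates are all assumptions made for illustration. Each subscriber starts as its own cluster, and the two closest clusters keep getting merged until the nearest pair is farther apart than the cutoff. The recorded merges are the raw material for a dendrogram.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def agglomerate(points, max_km):
    """Single-linkage agglomerative clustering (illustrative, not optimized).

    Returns (clusters, merges). Each cluster is a list of indices into
    `points`; each merge is (cluster_a, cluster_b, distance_km).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(haversine_km(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_km:
            break  # remaining clusters are too far apart to merge
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters, merges

# Hypothetical subscriber locations as (latitude, longitude).
subscribers = [
    (33.749, -84.388),  # Atlanta
    (33.951, -83.357),  # Athens, Georgia
    (32.840, -83.630),  # Macon
    (51.507, -0.128),   # London
    (50.820, -0.140),   # Brighton
]

clusters, merges = agglomerate(subscribers, max_km=200)
for cluster in clusters:
    print(sorted(cluster))
```

With these made-up points, the three Georgia subscribers end up in one cluster and the two English subscribers in another. The pairwise search is slow on big lists, so a real system would lean on a proper clustering library instead.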

John’s made some helpful discoveries. For example, we now know that if a majority of a list’s members are located in a different country than the list owner, then a campaign sent to that list will have about two percent more bounces than a campaign from a publisher who lives in the same country as her subscribers. (Keep in mind that when it comes to bounces, two percent is more significant than it sounds.) And of course the percentage of your list that we’ve geolocated is itself a predictor of list performance, because subscribers have to open or click before we can geolocate them at all. So if we don’t have location data for a list that’s already received a campaign, then it’s a surefire sign that the next campaign isn’t going to perform so well.

John’s also been working on bounce predictions. We’re learning stuff like this: The more subscribers you have with Gmail addresses, the fewer bounces, unsubscribes, and abuse reports you’ll have. It makes sense, because Gmail addresses are generally recent, so the more Gmail addresses you have on your list, the less room you have for old (and therefore bad) domains.
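If you want to check the Gmail share of your own list, it takes only a few lines. This is a minimal sketch, not how EGP computes anything: the function name and the sample addresses are made up, and it only counts addresses whose domain is exactly gmail.com.

```python
def domain_share(addresses, domain="gmail.com"):
    """Fraction of addresses whose domain matches `domain` (case-insensitive)."""
    if not addresses:
        return 0.0
    hits = sum(1 for a in addresses
               if a.rpartition("@")[2].strip().lower() == domain)
    return hits / len(addresses)

# Hypothetical list members.
sample_list = [
    "freddie@gmail.com",
    "ANNA@Gmail.com",
    "bob@example.com",
    "carol@example.org",
]

gmail_share = domain_share(sample_list)
print(f"{gmail_share:.0%} of this list uses Gmail")
```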

Another fun fact: We’ve been purchasing email lists. This might come as a shock, considering how much we hate purchased lists—but rest assured that it’s all in the name of abuse prevention. Buying and analyzing dirty lists makes it easier for us to detect other dirty lists. We feed that data to EGP, EGP becomes even smarter, and we all benefit from a cleaner system and better deliverability.

There’s a lot more where this came from. We’re constantly slicing and dicing public data, uncovering trends, and figuring out how these discoveries can prevent abuse and make our customers’ lives easier. Because we love you.