Dec 7, 2011
Fun with Data Science
MailChimp hired a scientist! More specifically, a data scientist. His name is John, and he’s good at calculus.
John’s main focus is our Email Genome Project, which analyzes millions of email lists—and hundreds of millions of email addresses—to find stories and trends in the data. This research helps us understand the email ecosystem, prevent abuse, and create better experiences for you guys.

Nice shirt, John!
Lately, John’s been analyzing how the geographic distribution of a customer’s list relates to her campaign performance. To do this, he uses an agglomerative clustering algorithm, combined with a little calculus, to detect population centers within a list. One of the outputs of the algorithm is something called a dendrogram, which is pretty much a scientific illustration of the clusters. Our dendrograms look like this:

Confused yet? Basically, humans can look at a list and naturally group similar items, because organizing by similarity comes easily to people. But computers need a little help with that sometimes, so we’ve been training EGP to calculate distances between our customers’ list members, and then make clusters based on those locations. All this analysis allows us to better predict how a campaign might perform for any given list, based on where the customer’s located and where the clusters within that customer’s list are located.
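For the curious, here’s a rough idea of what agglomerative clustering on locations can look like. This is only a minimal sketch, not EGP’s actual code: the single-linkage rule, the haversine distance, the `max_km` cutoff, and the function names are all assumptions made for illustration. Every point starts as its own cluster, and the two closest clusters are merged over and over until the closest pair is too far apart. The recorded merge history is the information a dendrogram draws.

```python
import math
from itertools import combinations


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def agglomerate(points, max_km):
    """Single-linkage agglomerative clustering (hypothetical helper).

    Returns (clusters, merges). Clusters are lists of indices into `points`;
    merges is the ordered merge history as (cluster_a, cluster_b, distance_km).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for x, y in combinations(range(len(clusters)), 2):
            d = min(haversine_km(points[i], points[j])
                    for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        if d > max_km:
            break  # remaining clusters are separate population centers
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y
    return [sorted(c) for c in clusters], merges


# Made-up subscriber locations: two near Atlanta, two near Auckland.
subscribers = [
    (33.749, -84.388),
    (33.775, -84.296),
    (-36.848, 174.763),
    (-36.870, 174.777),
]
centers, history = agglomerate(subscribers, max_km=50)
```

With a 50 km cutoff, the four made-up subscribers collapse into two population centers, one per city. This naive version recomputes every pairwise distance on each pass, so it’s only suitable for tiny examples.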
John’s made some helpful discoveries. For example, we now know that if a majority of a list’s members are located in a different country than the list owner, then a campaign sent to that list will have about two percent more bounces than a campaign from a publisher who lives in the same country as their subscribers. (Keep in mind that when it comes to bounces, two percent is more significant than it sounds.) And of course the percentage of your list that we’ve geolocated is a predictor of list performance itself, because subscribers have to open or click before we can geolocate them at all. So if we don’t have that location data for someone who’s already sent a campaign, then it’s a surefire sign that the next campaign isn’t going to perform so well.
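To make those two signals concrete, here’s a small sketch of how they could be computed for a single list. The function name, the input shape, and the choice to measure the “majority” among geolocated members only are assumptions for illustration, not a description of how EGP does it.

```python
def list_geo_features(owner_country, member_countries):
    """Compute two hypothetical list-level signals.

    member_countries has one entry per subscriber: a country code,
    or None when the subscriber has not been geolocated yet.
    """
    located = [c for c in member_countries if c is not None]
    total = len(member_countries)
    # Share of the list we could place on a map at all.
    geolocated_share = len(located) / total if total else 0.0
    # Assumption: the majority is taken over geolocated members only.
    foreign = sum(1 for c in located if c != owner_country)
    majority_foreign = bool(located) and foreign > len(located) / 2
    return {
        "geolocated_share": geolocated_share,
        "majority_foreign": majority_foreign,
    }


features = list_geo_features("US", ["NZ", "NZ", "US", None])
```

Here three of the four subscribers are geolocated, and two of those three are outside the owner’s country, so the list would be flagged as majority-foreign.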
John’s also been working on bounce predictions. We’re learning stuff like this: The more subscribers you have with Gmail addresses, the fewer bounces, unsubscribes, and abuse reports you’ll have. It makes sense, because Gmail addresses are generally recent, so the more Gmail addresses you have on your list, the less room you have for old (and therefore bad) domains.
Another fun fact: We’ve been purchasing email lists. This might come as a shock, considering how much we hate purchased lists—but rest assured that it’s all in the name of abuse prevention. Buying and analyzing dirty lists makes it easier for us to detect other dirty lists. We feed that data to EGP, EGP becomes even smarter, and we all benefit from a cleaner system and better deliverability.
There’s a lot more where this came from. We’re constantly slicing and dicing public data, uncovering trends, and figuring out how these discoveries can prevent abuse and make our customers’ lives easier. Because we love you.
Darren
John has a boffin beard. This is a good thing. Make sure he never shaves it off or he will lose his boffin skills and resort to using crayons and counting M&Ms.
Cool article – I’m surprised you are only deploying one beard for something that sounds so complex.
12.07.2011
Luci
This is really cool. Can we see the dendro-whatevers?
12.08.2011
John Wolff
Hi Kate,
What is the source of the data for geographic location?
If it is the stated location of the recipient’s IP address as we see when we peruse our list, then the stated location can be hundreds of miles from the recipient’s known residential location.
To cite an example, xtra.co.nz is the most common domain name on our list for customers all over New Zealand. Irrespective of their residential address, the location is most often stated as Newmarket which is a suburb of Auckland.
If there is a better way of obtaining location then I’d like to hear about that!
Cheers,
John
12.08.2011