Oct 23, 2012

Digging Deeper into Wavelength and EGP Data: Finding Interest Clusters in MailChimp’s Network

Update (6/27/17): This post contains information about features and workflows that are no longer available within the MailChimp application. To learn more about publicity settings for your list, visit our Knowledge Base.

MailChimp recently launched Wavelength, which allows users to find publishers like themselves and discover at a high level what other content their readership is engaged with. MailChimp has more than two million customers and sends more than three billion emails a month, so we’ve got all kinds of data in the Email Genome Project to bring to bear in understanding our vast network of publishers and readers.

But I wanted to find out if we could take Wavelength’s data one step further, one step deeper.

Who are my neighbors? (a short math lesson)

At a high level, Wavelength finds lists with membership that’s a lot like yours. I wanted to drill down to the individual address level, look at any address on your list, and find out which addresses also on the list were most similar to it, based on the newsletters they were subscribed to across MailChimp. Let’s take a look at 3 fake email addresses:

OK, so on this fake list I’ve got three addresses, and just as in Wavelength, I can see other publishers who send to those addresses. 1 means "subscribed" and 0 means "not subscribed." Let’s take a look at Coach Eric Taylor. Is Veronica his neighbor more than Lord Grantham?

Well, Veronica and Eric share three subscriptions, while Lord Grantham and Eric share two, so it’s Veronica for the win, right? Mmmm, not really, because Veronica is subscribed to everything. It’s hard to say that her interests truly align with Eric’s. Instead we’re going to use a simple calculation called cosine similarity. Basically, the weight of the connection between Eric and Veronica is their three shared subscriptions divided by the product of the square root of Eric’s total subscriptions (square root of 3) and the square root of Veronica’s total subscriptions (square root of 7).

If you’re taking notes, that means Eric Taylor and Veronica Mars get a neighborliness score of 3/sqrt(21) ≈ 0.65, whereas he and Lord Grantham, who has three subscriptions of his own, get 2/sqrt(9) ≈ 0.67.

Lord Grantham wins by a hair!
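For the code-inclined, here’s a minimal Python sketch of that calculation. The newsletter names (n1 through n8) are made up for illustration; only the counts (3, 7, and 3 subscriptions, with 3 and 2 shared) come from the example above.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two subscription sets (i.e. binary 0/1 vectors)."""
    if not a or not b:
        return 0.0
    # Shared subscriptions divided by the product of the square roots
    # of each address's total subscriptions.
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))


# Hypothetical newsletter ids; only the counts match the example.
eric = {"n1", "n2", "n3"}
veronica = {"n1", "n2", "n3", "n4", "n5", "n6", "n7"}  # subscribed to everything
grantham = {"n1", "n2", "n8"}

print(round(cosine_similarity(eric, veronica), 2))  # 0.65
print(round(cosine_similarity(eric, grantham), 2))  # 0.67
```

Treating each address as a set works because every entry is a 1 or a 0, so the dot product is just the count of shared subscriptions.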

Let’s do more of that

So what if we ran these similarity calculations on every pair of addresses on your list? Then we’d know who was close to whom on your list. We could find little pockets or communities of subscribers that are different from the rest.

Well, we tried it. Lo and behold, we get some cool stuff:

Click the image for a zoom.it-hosted deep zoom image of the graph. Crazy.

The number of subscription checks we have to perform to make this graph for just one user is in the billions. It’s truly a big data problem that only works because of our vast user base.
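To give a feel for the all-pairs calculation, here’s a rough Python sketch. It is not the EGP implementation, just one way to do it on a toy scale: an inverted index (newsletter to readers) means we only score pairs that share at least one subscription. The function name and the sample data are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pairwise_similarities(subscriptions):
    """Map each pair (a, b), a < b, that shares a newsletter to its cosine score.

    subscriptions: dict of address -> set of newsletter ids.
    """
    # Inverted index: newsletter -> addresses subscribed to it.
    readers = defaultdict(set)
    for address, newsletters in subscriptions.items():
        for newsletter in newsletters:
            readers[newsletter].add(address)

    # Count shared newsletters, touching only pairs that co-occur somewhere.
    shared = defaultdict(int)
    for addresses in readers.values():
        for a, b in combinations(sorted(addresses), 2):
            shared[(a, b)] += 1

    return {
        (a, b): count / sqrt(len(subscriptions[a]) * len(subscriptions[b]))
        for (a, b), count in shared.items()
    }


# Hypothetical toy list.
toy_list = {
    "eric": {"n1", "n2", "n3"},
    "veronica": {"n1", "n2", "n3", "n4", "n5", "n6", "n7"},
    "grantham": {"n1", "n2", "n8"},
    "loner": {"n9"},
}
scores = pairwise_similarities(toy_list)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Pairs with nothing in common (like anything involving "loner" here) never show up, which is where most of the savings come from on sparse data.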

Like most projects in EGP, this one started out as weapons development. We’re able to use this technique to find communities of blog comment spammers. Their scripts end up signing up for the blog’s newsletter whenever they sign up for an account to comment, and that’s when they find themselves in users’ lists. But by clustering, we can locate them. Here’s an image of a comment spam cluster detected on one of our publishers’ lists:

The red is our tumor of bad email addresses (each point is an address). And digging further into EGP data, guess where these addresses came from? The Sony breach where Lulzsec made a bunch of addresses and passwords public. Many of those email accounts were then breached and used for spam purposes. Whoops.
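This post doesn’t spell out the clustering method, so take the following as a stand-in rather than what EGP runs: a small Python sketch that groups addresses into connected components, linking two addresses whenever their similarity clears a threshold. The threshold value, function name, and sample scores are all assumptions.

```python
def clusters_from_edges(edges, threshold=0.5):
    """Group addresses into connected components using union-find.

    edges: dict of (address_a, address_b) -> similarity score.
    Two addresses are linked only if their score is >= threshold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in edges.items():
        root_a, root_b = find(a), find(b)  # registers both addresses
        if score >= threshold:
            parent[root_a] = root_b

    groups = {}
    for address in list(parent):
        groups.setdefault(find(address), set()).add(address)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical similarity scores.
sample_edges = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.8,
    ("c", "d"): 0.1,  # too weak to link the two groups
    ("d", "e"): 0.7,
}
sample_clusters = clusters_from_edges(sample_edges)
print(sample_clusters)
```

A tight, unusually well-connected component is the kind of thing you would then inspect by hand, the way we did with the red blob above.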

But there are a lot more interesting ways for email addresses to cluster than around comment spam. What about all our good readers?

Here are the results for a couple different users. Note: As I go through these results, I’ll overlay snapshots from the campaigns of the top lists the cluster is interested in.

Reader clustering for a fashion and beauty sender

We looked at a user with a large list that sends fashion and beauty tips. We wanted to know if there were clusters on their list that would help them:

  1. Understand unique pockets of their audience
  2. Better target individuals based on their interests

So we calculated these email address neighbors, created a graph of the top three connections of each address to others on the list, and loaded it into Gephi. (Gephi is network analysis and visualization software that can handle large graphs. Fun stuff for nerds.) We were able to suss out quite a few clusters.
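If you want to try the "top three connections" step yourself, here’s a hedged Python sketch. It keeps each address’s strongest k neighbors and writes an edge list with Source, Target, and Weight columns, which is a layout Gephi’s spreadsheet import understands. The function names and sample scores are made up, and this is not necessarily how we built our file.

```python
import csv
import io
from collections import defaultdict


def top_k_edges(similarities, k=3):
    """Keep, for every address, only its k strongest connections."""
    neighbors = defaultdict(list)
    for (a, b), score in similarities.items():
        neighbors[a].append((score, b))
        neighbors[b].append((score, a))

    edges = {}
    for address, scored in neighbors.items():
        for score, other in sorted(scored, reverse=True)[:k]:
            # Store each undirected edge once, under a sorted key.
            edges[tuple(sorted((address, other)))] = score
    return edges


def write_edge_csv(edges, handle):
    """Write an undirected edge list as CSV (Source, Target, Weight)."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), score in sorted(edges.items()):
        writer.writerow([a, b, f"{score:.4f}"])


# Hypothetical scores; k=1 here just to keep the demo small.
sample = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.5,
    ("a", "d"): 0.4,
    ("a", "e"): 0.3,
    ("b", "c"): 0.2,
}
kept = top_k_edges(sample, k=1)
buffer = io.StringIO()
write_edge_csv(kept, buffer)
print(buffer.getvalue())
```

Capping the edges per address keeps the graph sparse enough to lay out, and an edge survives if either end ranks it among its top k.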

Let’s start with this one (click the image for a full-size look):

The cluster has unified interests in knitting, wedding music, wedding invitations, and customer jewelry. These interests also clue us in to probable age, gender, and stage in life, right? If I’m this sender, perhaps I should keep that in mind when creating content for these folks.

Here are a couple of fun ones:

We’ve got organic shampoo, organic clothing, wellness, and world peace. These are our Whole-Foodies. Let’s contrast them with this cluster on the same list:

Tea party stuff. And volumizing shampoo.

Needless to say, these two groups have wildly divergent interests. You probably wouldn’t do well trying to market fashion tips to both clusters in the same way.

Now, let’s contrast all the previous clusters with this sucker:

Whoa. I see fashion, fashion, fashion, fashion, fashion, and fashion. These are the people that made Nina Garcia into a rock star (and rightfully so—who doesn’t love Nina?).

To take a look at more clusters from this list, you can check out this swoopy, zoomy presentation over at Prezi.com. Beware: The presentation is large. Only open it in Firefox (not Chrome!) on a box with more than 8GB of RAM. May the odds be ever in your favor.

Let’s move on to another user.

Reader clustering for an e-commerce sender

We checked out one of our E-commerce360 users that sells snarky t-shirts. How would their list break down? Here’s an image of it at a high level:

Alright. I’m seeing some clusters there. Let’s investigate a few. Here’s the first one:

Fantasy sports! Guns! And flowers, for what I can only assume are apologies for doing something stupid with the first two.

Then we’ve got this group:

A bong newsletter, outdoor gear, and a dubstep festival that’s booked these guys. What, no Aphex Twin? Kids these days…

I can see, then, if I’m sending newsletters selling snarky t-shirts, that thematically different ones are going to appeal to these two groups. How about this group?

Fitness, fitness, really nice knives, high-end sunglasses, and "Dance Floor Filth 2."

I could go on, but I’m gonna stop the fun there. To view all the clusters in another zoomy presentation, go over to Prezi. This one’s a bit more manageable than the previous one.

So what can we do with all this?

This is very much Skunk Works stuff here. We’re in the early days of data exploration. Maybe you could use this for ad placement ("Ah, lemme advertise in the Vogue Knitting Twitter feed"). If we combined it with engagement data, we could figure out which clusters are engaged more than your list as a whole, and then you could go after those types of readers with your ad dollars. Maybe gun fans spend more than potheads. You could segment and view your E-commerce360 data. We’re working on that data overlay piece, as well as clustering by click vectors rather than subscription vectors.
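Purely as a thought experiment, here’s what the "which clusters are engaged more than your list as a whole" idea might look like in Python. None of this exists as a feature; the cluster names, the open and send counts, and the choice of open rate as the engagement measure are all assumptions.

```python
def engagement_lift(clusters, opens, sends):
    """Compare each cluster's open rate with the list-wide open rate.

    clusters: dict of cluster name -> set of addresses.
    opens, sends: dict of address -> counts.
    Returns cluster name -> (cluster open rate / list-wide open rate).
    """

    def open_rate(addresses):
        sent = sum(sends.get(a, 0) for a in addresses)
        opened = sum(opens.get(a, 0) for a in addresses)
        return opened / sent if sent else 0.0

    baseline = open_rate(sends.keys())
    return {
        name: (open_rate(members) / baseline if baseline else 0.0)
        for name, members in clusters.items()
    }


# Hypothetical engagement data for a four-address list.
sends = {"a": 10, "b": 10, "c": 10, "d": 10}
opens = {"a": 8, "b": 6, "c": 2, "d": 0}
clusters = {"knitters": {"a", "b"}, "everyone else": {"c", "d"}}

lift = engagement_lift(clusters, opens, sends)
for name, value in lift.items():
    print(name, round(value, 2))
```

A lift above 1 means the cluster opens more than the list as a whole; swapping in click or purchase counts would follow the same shape.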

What do you think? How might you use such a feature? Or what would you need to make it work for your MailChimp account? I set up a little feedback form to take your suggestions if you’ve got them.

P.S. For more discussion on this clustering technique, there’s a tutorial here on doing it in Excel.

