May 15, 2013

# Detecting Click Clusters in Your Data Using Spreadsheet Math

Most of us call our own shots when it comes to segmenting email lists. We tell the computer what to do. We decide that it’s gender and age that matter, and we submit that query—segment my list into old dudes and young ‘uns. Or maybe it’s domain that matters. Or subscription date. Or something else.

And that’s cool. But what about asking the computer to use your data to decide how it would segment your list for you?

Every time we send a newsletter, we generate interactions. Specifically, clicks are a great source of data for understanding our subscribers’ different interests, and those interests can define segments. In this post, we’re going to investigate how to get Microsoft Excel (or LibreOffice if that’s your poison) to suggest list segments based on past clicks.

This is called *data mining *or *unsupervised learning. *It’s called *data mining* because we’re going to get the computer to dig into your click data and discover some interesting segments on its own. It’s called *unsupervised learning*, which is a branch of artificial intelligence, because we’re not going to give the computer any past examples of how *we’ve* segmented. We’ll let it choose what it feels is right.

What can you do with some computer detected segments? Two things:

1. You can dig into the segments it creates to try and understand why. We’re going to use click data in this post, so why would a computer separate these people who clicked these URLs from these other people who clicked these URLs? Cluster analysis can often divide your audience in ways you hadn’t thought about.

2. Even if you don’t understand the why, you can still test out the segments. If I have a list and a computer decides to divide it in two, even without knowing anything about the segments, any A/B test I do on content sent to these two segments will do as well or better when sent at the segment level than at the list level. Sure, both groups might like A better than B. But now you have an opportunity to distinguish a group that may like A better and a group that may like B better.

Let’s get started. But first, some notes:

This post will be a bit longer than most on the blog. I’d encourage you to work along with me in the example spreadsheet (available here for download). If you’re using Chrome and you click the link, Google may try to open the sheet in their Google Drive software rather than downloading. Select "Download Original" from the Google Drive spreadsheet file menu to get the file.

A bit about software: This works in Excel and LibreOffice, but for the love of all things holy, don’t use Google Drive. It won’t work. Google spreadsheets are garbage. Yes, I’m a spreadsheet primadonna, but they just can’t handle what we’re about to do. For the figures in this post, I’m using Excel 2011 for Mac, but this works just as easily in 07 and 10 for Windows. Haven’t tried 2013, but it should be fine. And if you’re using Excel 2003, that’s just sad.

## Pulling down some input data

I’m going to work through an example of this technique using click data from The Next Web, which is one of the most influential tech blogs on the web. If you’ve never been over to TNW, check it out. They send blog updates through MailChimp each day, so their account is a data scientist’s idea of a playground. Of course they gave us permission to use their data here.

Now, why am I using click data instead of open data? Open data is, after all, more prevalent. Cluster analysis depends on the reliability of these interactions at an individual level, and since some mail clients can generate false opens (Apple Mail) while others don’t (Gmail), it’s best not to use interactions that are dubious indicators of individual interest.

So in order to cluster email addresses using click data, I’m going to pull down into a spreadsheet clicks from the last three months of daily emails. In MailChimp, you can export these clicks url-by-url by logging in and viewing the clicks under the subscriber activity section of each report. Alternatively, you can pull this data via the MailChimp API or by hitting the SQLite copy of your data created by Hairball. If you export CSVs from the MailChimp reports section of the website, you’ll need to open them up and copy paste them into a single spreadsheet.

For The Next Web, I’m going to lop off the long tail of the data and consider only the 40 best performing articles out of those they’ve sent in the last 90 days. What I end up with is a sheet in Excel that looks like so:

Note that in this list of who-clicked-what I’ve purposefully obfuscated the email addresses to protect TNW and their readership. I haven’t obfuscated the URLs, because I’m sure they’d be happy for you to click them. Also, I’ve removed the tracking code from the URLs (anything after the & in the URL), leaving just the original link.

## Creating a matrix of click data

The first thing we’re going to do with this data is create click vectors for each email address. A click vector is just a list of 1s and 0s for an email address where a 1 means that they clicked a URL and 0 means they didn’t.

For instance, for my email address, let’s say I clicked google.com and news.ycombinator.com today but not excite.com (which sadly still exists). Then I could write that as [1,1,0] for the URLs in that order.

To do this in Excel, I can just highlight the first two columns of data in the "TNW_40″ tab I showed in the figure above and create a pivot table (Insert–>Pivot Table on Windows, Data–>Pivot Table on Mac) from them.

In the pivot table builder interface, I can place my email addresses in the rows, URLs in the columns, and counts of the URL instances by email in the values section. This gives the builder setup and table shown in the picture:

This is a pretty awesome presentation of click data. The differences in the rows expose the individuality of each address, and as we scroll through the table, outliers pop out faster than Rick Steves at an Insane Clown Posse concert.

For instance, there’s row 2483 (see below) where the email address clicks nearly everything sent. Turns out, the email address maps back to a domain without a website (just receives mail apparently) whose name servers point back to a marketing intelligence firm. They’re likely registering bots to popular email lists to scape content or measure inbox placement. Gee, thanks for that guys. Just FYI, your bot missed a few clicks.

## Getting started with 2-Medians clustering

Let’s set up a sheet where we’re going to divide our list of clickers in 2 using a technique called k-Medians clustering using cosine distance. Scary name, but it’s not that bad.

Essentially, we’re going to create two archetypal click vectors, "cluster center 1″ and "cluster center 2," and each email address will be assigned to whichever of these two clusters has the **nearest** center. These assignments are our two segments.

OK, so let’s create a tab called "TNW 2-Med" and copy-paste values of the pivot table into the top left of the tab. Let’s also add blank rows in row 2 and 3 to the sheet for our two cluster centers (right click row 2 and select "Insert" to insert a blank row).

This gives us the following sheet:

These cluster centers start off empty (empty cells are 0s in Excel) meaning they’re both defined by no clicks whatsoever.

Now, in columns AP and AQ to the right of each click vector row, let’s measure the distance between each email address and the two centers. In the end, each address will get assigned to whichever center is closer.

In order to understand our distance calculation, it’s best to first understand how to measure the cosine similarity between two click vectors. Cosine similarity is calculated as:

# of click matches between center and email address / (sqrt(email address’s total # of clicks) * sqrt(cluster center’s total # of clicks))

On the top of the fraction, we get points for every matched URL between the two vectors. On the bottom of the faction, we divide by the product of the square root of one click count times the square root of the other click count. When the two vectors are identical, this calculation is equal to 1 meaning "completely similar." When the two vectors have no matches, this similarity value goes to 0. If they only overlap on *some* but not all, then we get a value in the middle.

The cool thing about this calculation is that it focuses on clicks rather than non-clicks, because let’s be honest, when someone clicks they’re showing more intentionality than when they don’t. There are a multitude of reasons for not clicking something (non-interest, didn’t understand the title, didn’t see the email, etc.), but a click is most often due to interest. At the same time, the denominator of this fraction penalizes vectors that click everything, essentially watering down the intentionality of every click (like that bot up there).

Since this is cosine similarity, that makes cosine distance simply:

1 – (# of click matches between center and email address / (sqrt(email address’s total # of clicks) * sqrt(cluster center’s total # of clicks)))

Yep, we just subtract the value from 1 to flip everything around. That way, when you’re identical to something, you have a distance of 0 that grows as your clicks begin to differ.

Then in cell AP4 in Excel we can calculate the distance between the first email address and the cluster in row 2 by writing out this distance formula as:

*=IFERROR(1-SUMPRODUCT($B4:$AO4,$B$2:$AO$2)/(SQRT(SUM($B4:$AO4))*SQRT(SUM($B$2:$AO$2))),1)*

The SUMPRODUCT of the two binary vectors just sums up matches. The whole distance calculation is wrapped in IFERROR to account for the fact that if a center were set to all 0s instead of creating a divide-by-zero error, we’ll just specify a maximum distance of 1. Note that I’ve used absolute references ("$") so when I copy the formula to another row, the email address changes but the cluster center’s row stays put.

Similarly in AQ4 we can measure the distance to the second cluster center:

*=IFERROR(1-SUMPRODUCT($B4:$AO4,$B$3:$AO$3)/(SQRT(SUM($B4:$AO4))*SQRT(SUM($B$3:$AO$3))),1)*

In the next cell over, AR4, we can calculate the closest distance of the two centers as:

*=MIN(AP4:AQ4)*

And then in AS4 we can assign the email address to the closest of the two clusters using the MATCH formula:

*=MATCH(AR4,AP4:AQ4,0)*

This just spits out a "1" or a "2" depending on which of the two columns has the same value as the minimum.

We can then copy these cells, highlight AP5:AS2634, and paste them to send the calculations to the other email addresses:

Every distance starts at 1 (as far as it gets), because our centers are both zeroed out.

Let’s label cell AU1 as "Total distance" and below it in AU2 let’s total up the min distance column (AR) as:

*=SUM(AR4:AR2634)*

Initially, this sum is as bad as it gets: 2631.

## Optimizing our two centers

This is where we get segmenting! We need to find the best 0-1 values for our cluster centers on rows 2 and 3. To optimize this problem we’re going to use Solver. Solver comes with Excel, but you have to turn it on. (2007, 2010, Mac 2011) Once solver is added in, it can be opened from the Data tab in the Excel ribbon.

Solver allows us to specify an optimization problem for Excel to solve. In this case, we want to find the best 0-1 vectors for our cluster centers that minimize the total distance in AU2. So opening up Solver, we specify AU2 as our objective and make sure we select "Min," because we want to minimize distance. We’re going to change our cluster center cells *$B$2:$AO$3*, and we’re going to add a constraint that these cells remain binary.

To add a constraint, click Add, give the cell reference ($B$2:$AO$3) and select "Bin" from the drop-down before hitting OK. The solving methods given are algorithms for finding optimal solutions to optimization problems. Our particular problem is non-linear (the decisions are used in the distance calculations inside of denominators and square root calculations), so we can either select "GRG Nonlinear" or "Evolutionary" as our algorithms to solve. If you use the evolutionary solver, bump the maximum time without improvement up from 30 seconds to something more reasonable, like 5 minutes.

This gives the solver setup pictured below:

Pressing solve, we can watch Excel try things out until the distance falls to an optimal level. Now, since this is a non-linear problem, Excel can’t *prove* its solution is optimal. Feel free to run the problem again, maybe using the other algorithm. In my runs, I ended up at 1832:

## Interpreting the results

As we scroll through the cluster assignments in column AR, we see that some emails go to cluster 1 and some to cluster 2. All right. Why?

One way to understand this is to count up the most popular URLs within each cluster. Scroll to the bottom of the sheet, and label rows 2635 and 2636 as "Count 1″ and "Count 2."

Let’s start in B2635. In this cell we’d like to sum up all the clicks for our first URL by cluster 1 email addresses if indeed this URL is part of the cluster center definition. To do that, we use this formula:

*=B2*SUMPRODUCT(B4:B2634,IF($AS4:$AS2634=1,1,0))*

We have B2, which is whether or not the URL is part of the archetypal cluster vector, times the sum of all clicks of the URL from those whose assignment is to cluster 1. Now, unfortunately, this formula doesn’t run in Excel as-is, because the IF formula is operating on an entire range of values. This only works if we make the whole thing an array formula. To make that happen just press CONTROL + ENTER (CMD + RETURN on a Mac) when you put in the formula. This will make curly braces appear around the formula and now everything will work just fine.

The formula in B2636 is the same except we check the center for cluster 2 (B3), and in the IF statement, we look for emails assigned to cluster 2:

*{=B3*SUMPRODUCT(B4:B2634,IF($AS4:$AS2634=2,1,0))}*

Note that I’ve used absolute references on the cluster assignment column AS, so we can copy these two cells, select range C2635:AO2636 and paste them to get all of our counts. This gives the sheet shown below:

Now, let’s create a new tab called "2-Med Interpretation". First thing first, let’s paste our URLs in the first column. Since they’re in a row on the previous tab, copy the URLs and then paste-special-values making sure to check off the "Transpose" option in the Paste Special dialog box. In columns B and C, do the same for the count values from the previous tab. This yields:

Applying Auto-filtering to the 3 columns and sorting by the counts in the first cluster, the top stories are all "shareables" (general human interest) or Google/Android related. In contrast, sorting by the counts in the second cluster, the top four stories are all iOS related. See the screenshot below:

Pretty cool right? One thing that TNW could try based on this is A/B testing subject lines *per segment* where Android vs. iOS is played up where relevant (such as in the AppMachine post).

## Adding more clusters

And we don’t have to just do two segments. We can add another cluster center to the clustering sheet, tack on a third distance column, modify the solver formulation, and get some more counts at the bottom of the sheet. In the downloadable example workbook, I’ve added a "TNW 3-Med" sheet that does just that. So how do things change with three clusters?

Well, one cluster that the model chooses to separate out is a new "shareables" article cluster, defining a group of email addresses that mostly click the general interest stories (5 of the top 6 articles are "shareable" with the outlier being one about Google glass). See the screenshot below:

We could keep going. Vanilla Excel Solver allows for 200 decision variables and 200 constraints on non-linear problems. With 40 URLs and 3 clusters, we’re clocking in at 120 decision variables and 120 binary constraints. So we could go up to 5 clusters on this problem. The fewer the URLs you consider, the more clusters you can do. Note that the number of email addresses on your list does not affect the size of the problem.

## Getting this data back in

Once we have some clusters we’d like to act on, we can copy/paste values of the email addresses and their assignments to a new sheet:

Auto-filtering the assignments, we can copy the addresses from the cluster we’re interested in. Then we can navigate to the Lists section of MailChimp, and under the List Tools menu we can create a static segment of our list by pasting this cluster of addresses into the static segment box:

We can save that segment off and then send whatever we want to it!

## Wrapping up

OK, so I know that wasn’t the easiest worksheet you’ve ever put together, but I hope it demonstrates the power of your own campaign data, which can be mined by the computer even when it may be too large for you to understand with your own two eyeballs.

Feel free to try this out, and reach out if you find something interesting in your own data. And if you think this stuff is pretty awesome (that’s probably about two of you at this point), I’ve got a whole book on spreadsheet data science coming out this fall. Sign up for my newsletter to be notified about the pre-order.

Good night. Happy clustering.

## Adam

I understood some of those words.

05.15.2013

## mikeschinkel

I feel like I’m back at Georgia Tech in 6th quarter Calculus, and completely overwhelmed again. :)

So what’s a “nearest center” and what do these clusters represent? Now it seems I know

howto use this analysis (kind of) but it’s not clearwhyI’d use it. In practical terms what does the distinction between clustersmean, ya know, like a double rainbow? :)05.15.2013

## John MailChimp

Let me explain by way of analogy. Imagine you’re at a middle school dance, and you want to segment the attendees into 2 segments. The cluster centers are like flags you’d plant on the dance floor, and each dancer is assigned to the flag closest to his or herself. The algorithm searches the space and finds that the best way to plant those flags is in the middle of two groups of dancers: the boys and the girls. Because in their acne-fueled nervousness, neither of them are dancing with each other.

The clusters in this example represent the best way, algorithmically, to divide up the email list. And the centers are just flags planted in the ground that “define” the segment. The section I give above on interpreting the segments shows the best way to attribute meaning to them. Essentially in the 2 segment case, what the result is saying is that the most natural separation you’re going to get in your list, if you want to segment by interest, is to segment the iOS folks from the Android folks. This gives a better separation of interests than if you segmented by, say, gender. And that nice separation allows for better targeting.

05.16.2013

## David Mann

Like a voronoi diagram!

05.30.2013

## Lavneet Sharma

Thanks for the information.. Got to learn something new regarding data mining. ;) Well, I didn’t have time to read complete tutorial.. :p But I’ve bookmarked it and m gonna read it after some time.

05.16.2013

## Allen

John….dude…you sir just blew my mind!

First, didn’t know Excel had the cajones to do such complex decisions / algorithms. Definitely learned something!

Have a question, if I’ve got this right – when you imported your findings back into Mailchimp, that created a static segment right? So, when TNW gets a new subscriber and sends them some updates they’d have to consistently rerun this analysis and re-pop the new list segmentation?

Why not just associate a tag with a news story (shareables, ios, android, etc.) and then auto update the subscriber’s profile with the tag if they clicked the story? Real time, indicates interest, and allows multiple segmentation.

05.17.2013

## Leo

Very interesting read! I want to try this myself, but the dropbox link 404’s. Is it still up or am I doing something wrong?

06.12.2013

## John MailChimp

Apologies! Give it a shot now.

06.12.2013

## Leo

Thanks, I got it now. Much appreciated!

06.12.2013

## The Hitting Project

I thought I was an expert with mail chimp on my site, but blogs like this have taught me I have a long way to go with email marketing.

08.09.2013