Why Are You Doing This in Excel? A Primer on K-means Clustering


Introduction

A few years back, when Data Science was still Big Data (but after it was called Artificial Intelligence), I was tasked with helping to prepare a primer on different techniques that could be useful for cost analysis. One of the techniques, K-means clustering, seemed simple enough to tackle and I was soon distracted enough by it to implement it. Recently, I came across the topic again and was inspired to see if I'd learned enough in the interim to do a better version. Ignoring common sense, I once more set out to solve a non-existent problem using a 30-year-old language that hasn't been supported in a decade.

This is the result.

Generating the Data

This is a solution in search of a problem, so generating that problem is the first step. I like problems that I already know the answer to, so I generated three groups of 2 data type (x,y) sets. I used a simple random distribution on the x's and y's with a set 'deviation' to get a nice spread. 

The result looks like this: 


I picked clusters that would intentionally be a little confusing to keep myself honest - hence, closely packed. Shaping the sets as circles would have been more optimal for this implementation, but since I'm not arguing for this particular technique I kept the generated problem simple. 

The Approach

K-Means clustering is an extraordinarily simple algorithm that has 2 real steps:

Initilization: Make k guesses for each potential central point (mean). I used a random set of points for fun graphics, but something like min, max, average, and regular intervals would be more scalable.
  1. Group the data according to the closest mean
  2. Calculate the means of the new groups, and use these as the next guess
Rinse and repeat until your data achieves the desired sheen. 

Here's what the data looks like after the first iteration:


Nice and messy.

If you use this for an actual problem, a tolerance should probably be defined, but in the spirit of this blog, I just stopped at 15 steps since the data is known to be well behaved in the assumed structure (3 means). At that point, it already looks like this: 


As you can see, this is more or less the original grouping. The actual means themselves are totally different than the original, but the fun thing about this sort of analysis is that it doesn't really matter.

I hope you enjoyed this overcomplicated solution to a problem that doesn't exist. If you'd like to view the source code, the link to the Excel sheet (VBA script included) is here.

Bonus Useless Solution - Refreshing charts in the middle of a VBA loop

You shouldn't use Excel to animate things, but I wanted to watch my groups evolve to the solution through each iteration. Here's the code I inserted at the end of the loop to get it to repopulate graphs without re-drawing every single point individually:

'Refresh graph display

        DoEvents
        ActiveSheet.ChartObjects("Chart 3").Chart.Refresh
        Application.Calculate
        DoEvents

I don't know why charts don't refresh automatically, I don't know why Chart.Refresh doesn't work without DoEvents, I don't know why DoEvents needs to be in there twice, but I'm pretty sure no one else does either. 

Fortunately, I don't have to, so if you run the macro you can watch the Solved graph evolve into the solution. Enjoy the show.










Comments

Popular posts from this blog

What does it take to win at Kaggle? An Introduction to Data Strategy

Cutting Logs Part 1: Catching Ideas and Z's

Creating a Dynamic Calendar