Instructions and tips for (def shef 33) kNN classifier dojo
There’s a presentation to give you context and get you started.
Essentially, the naive kNN classifier considers data as points in a space - each point has co-ordinates and a label. The co-ordinates are known as “features”. This approach means you can measure the distance between any two points in that space, and so you can find the “nearest neighbours” of any point. The algorithm performs this calculation for an unlabelled datum and classifies the unlabelled point the same as the majority of its “k” nearest neighbours. Simple!
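The idea above can be sketched in a few lines of Clojure. This is a minimal illustration, not the dojo's reference implementation — the `{:features [...] :label "..."}` map shape and the names `euclidean-distance` and `classify` are my assumptions for the example:

```clojure
;; A minimal naive kNN sketch. Each datum is assumed to look like
;; {:features [5.1 3.5 1.4 0.2] :label "Iris-setosa"}.
(defn euclidean-distance [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(defn classify [k training-set features]
  (->> training-set
       (sort-by #(euclidean-distance features (:features %)))
       (take k)                 ; the k nearest neighbours
       (map :label)
       frequencies              ; count votes per label
       (apply max-key val)      ; majority vote wins
       key))
```

Sorting the whole training set is wasteful for large data, but with 150 rows it keeps the sketch simple.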
We’ll use the classic “Iris” data set to build and test our algorithms. It’s provided by the UCI Machine Learning Repository and it’s a comma-separated table with rows like this:
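For reference, the first line of the UCI iris.data file looks like this:

```
5.1,3.5,1.4,0.2,Iris-setosa
```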
The four numbers are features of the flower: sepal length, sepal width, petal length and petal width. The last string is a label - this row records measurements of Iris setosa.
I suggest this dataset because you don’t need to know what the data means to get the algorithm working, we know we should be able to get good predictive power from it, and it’s small - 150 rows - so we can start by writing a completely naive classifier and focus on the machine learning and functional programming aspects without worrying about performance.
Your task, should you choose to accept it, is:
The following is a suggested breakdown - feel free to completely ignore it!
There’s a quick (and dirty!) implementation I did in Clojure included in this project. I provide it in case you get completely stuck on something! As an indication of target performance on the iris dataset, this implementation typically achieves a mean proportion of correct predictions above 0.9, with some variation in exact accuracy between runs.
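One way to measure that kind of accuracy figure is a simple hold-out test: shuffle the data, train on part of it, and score predictions on the rest. A sketch, assuming you have some classifier function of `[training-set features] -> label` (the names here are illustrative, not from the dojo code):

```clojure
;; Hold-out evaluation sketch: shuffle, split roughly 2/3 train / 1/3 test,
;; then return the proportion of test rows classified correctly.
;; `classify-fn` is any function taking a training set and a feature vector.
(defn accuracy [classify-fn data]
  (let [shuffled     (shuffle data)
        n-train      (quot (* 2 (count shuffled)) 3)
        [train test] (split-at n-train shuffled)]
    (/ (count (filter #(= (:label %) (classify-fn train (:features %))) test))
       (double (count test)))))
```

Because `shuffle` randomises the split, repeated runs give slightly different scores - which is exactly the run-to-run variation mentioned above.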
Once you have a working classifier, there’s a whole bunch of things you can play with, for example:
Copyright © 2017 Paul Brabban
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.