R Language: Implementing K Nearest Neighbours
Intro
A generalized function for calculating nearest neighbours for any value of k. This is a set-based solution that aims to optimise for speed by avoiding as many loops as possible.
Code
Ok this page is a work in progress and to begin with I’m just going to dump all my code here.
Now this is an implementation of KNN that I made in Q4 2016 and it was THE first thing I built in R. I tried to make it rely on as few loops as possible, since explicit loops in R are really slow.
I do think it’s imperfect in the way it tries to use merge as an analogue to SQL’s CROSS JOIN and INNER JOIN.
The problem is that merge returns a set whose ordering is completely different from the original ordering.
I mean it’s just a waste of time.
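To make the join analogy concrete, here is a minimal sketch on made-up data. The table and column names (train, test, labels, trainX, testX) are hypothetical and not taken from knn_general itself. It shows merge acting as a CROSS JOIN when by = NULL, and how an INNER JOIN comes back sorted by the key rather than in the original row order.

```r
# Hypothetical toy tables (names and values are made up for illustration)
train  <- data.frame(trainID = c(3, 1, 2), trainX = c(9.0, 1.0, 4.0))
test   <- data.frame(testID = c(1, 2), testX = c(2.0, 5.0))
labels <- data.frame(trainID = c(1, 2, 3), label = c("a", "b", "c"))

# CROSS JOIN: by = NULL pairs every training row with every test row
pairs <- merge(train, test, by = NULL)

# A distance can then be computed for all pairs at once, without a loop
pairs$distance <- abs(pairs$trainX - pairs$testX)

# INNER JOIN on trainID: with the default sort = TRUE the result is
# sorted by the key, not kept in the original row order of `train`
joined <- merge(train, labels, by = "trainID")

print(nrow(pairs))      # 3 training rows x 2 test rows
print(joined$trainID)   # key order after the join
```

If the original order matters, one option is to carry an explicit ID column through the join, which appears to be what the ID columns created further down are for.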
Well I had some design constraints, which is why I went about doing it this way.
Design constraints:
- Not allowed to use sort or order.
- Minimise the use of for-loops.
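As an illustration of what these constraints allow, here is one possible way to pick the k smallest distances without sort, order or a for-loop. This is a sketch under my own assumptions, not necessarily how knn_general does it; the distances vector and variable names are made up.

```r
# Hypothetical distances from one test point to five training points
distances <- c(4.2, 0.7, 3.1, 1.5, 2.8)
k <- 3

# Compare every distance with every other one (a self cross join).
# smaller_count[i] is the number of distances strictly smaller
# than distances[i].
smaller_count <- rowSums(outer(distances, distances, ">"))

# The k nearest neighbours are those with fewer than k smaller distances
nearest_idx <- which(smaller_count < k)
print(nearest_idx)
```

Note that with tied distances this can return more than k indices, so a real implementation would need a tie-breaking rule.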
knn_general
Class Creation
First I create my function. It takes four parameters:
- trainObject: a vector of training data consisting of Objects.
- testObject: a vector of test data. This vector would probably just be a singleton.
- trainLabel: a vector of training data consisting of Labels.
- kValue: the number of nearest neighbours to consider (the k in KNN).
knn_general <- function
( trainObject
, testObject
, trainLabel
, kValue
){
Computing Mode
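R does not appear to have a built-in function for finding modal values (the built-in mode reports an object's storage mode, not the statistical mode), so I wrote my own.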
modlab <- function(xx){
  # Count how many times each distinct value occurs
  counts <- aggregate(
      as.numeric(xx)
    , by = list(as.numeric(xx))
    , FUN = length
  )
  # Return the value with the highest count (the first one wins ties)
  return(counts[which.max(counts$x), 1])
}
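For comparison, a more compact way to get the mode is to count values with table. This is an alternative sketch, not the code used in knn_general; the labels vector and the name modal_label are made up for illustration.

```r
# Hypothetical vector of neighbour labels
labels <- c(2, 3, 3, 1, 3, 2)

# table() counts occurrences of each distinct value;
# its names are the values themselves, stored as strings
counts <- table(labels)

# which.max() picks the first position with the highest count,
# so ties resolve to the smallest value
modal_label <- as.numeric(names(counts)[which.max(counts)])
print(modal_label)
```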
INITIALISE DATA FRAMES AND CREATE IDENTITIES
trainObject <- data.frame(rownames(trainObject),trainObject);
names(trainObject)[1] <- paste("trainID");
testObject <- data.frame(rownames(testObject),testObject);
names(testObject)[1] <- paste("testID");
trainLabel <- data.frame(trainLabel);
trainLabel <- data.frame(rownames(trainLabel),trainLabel);
names(trainLabel)[1] <- paste("labelID");
CREATE DATA FRAME FOR STORING PREDICTED LABEL
This stores the predicted labels for our testObject of every nearest neighbour at any level of k.
predicted <- data.frame(testObject[,1]);
names(predicted)[1] <- paste("testID");
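This converts testID from factor to numeric to maintain ordering. Going through as.character also works when testID is already a character vector, which is what data.frame produces by default from R 4.0.0 onwards.
predicted$testID <- as.numeric(as.character(predicted$testID))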
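The reason this conversion needs care is that calling as.numeric directly on a factor returns the internal level codes, not the values. A small sketch with made-up IDs (the ids vector is hypothetical) shows the difference:

```r
# Hypothetical IDs stored as a factor; levels sort as strings: "10", "2", "33"
ids <- factor(c("33", "2", "10"))

# Direct conversion gives the level codes, not the IDs
codes <- as.numeric(ids)

# Two ways to recover the actual numeric values
via_levels <- as.numeric(levels(ids))[ids]
via_char   <- as.numeric(as.character(ids))

print(codes)
print(via_levels)
print(via_char)
```

Both recovery idioms give the same result on a factor. Only the as.character version also works if the column is already a character vector.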