Implementation of simple K-NN from scratch

Bhagyaraju Pati
3 min read · Sep 27, 2021
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch

A small intro to the K-Nearest Neighbors algorithm

Here I will give a brief introduction to the K-NN working procedure.

We start with an existing set of example data, our training set, in which every data point has a label. We are then given a new data point without a label. We compare the new point with every point in the existing data, take the nearest neighbors, and look at their labels. Among the top k most similar points (nearest neighbors), KNN follows democracy 😉: based on the majority of votes, we return a label as the predicted value.
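The procedure above can be sketched in a few lines of Python (a minimal illustration with made-up 2-D points and the standard library's math.dist, not the full implementation of this article):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point with its distance to the query point
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Vote among the labels of the k closest points
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (5, 5), (6, 5)]
labels = ['A', 'A', 'B', 'B']
print(knn_predict(points, labels, (1.5, 1.5)))  # two of the three nearest points are 'A'
```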

About dataset and problem

Before we get into the actual coding, let me explain the dataset. Our dataset contains three files:

  • train.csv (contains Weight, Colour and Label)
  • test.csv (contains only Weight and Colour)
  • test-labels.csv (contains only labels)

As this is a simple implementation, our training set contains only numerical data to keep the computation easy.
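For a concrete picture, train.csv might look like the rows below (hypothetical, illustrative values only — Colour is already encoded as a number, and Label is one of the two classes A/B; test.csv has the same columns minus Label):

```
Weight,Colour,Label
150,4,A
148,3,A
290,9,B
310,8,B
```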

Implementation Begins

Since our data lives in CSV files, we read them using pandas.

import pandas as pd

# Read train data
train_data = pd.read_csv("dataset/train.csv")
# Read test data
test_data = pd.read_csv("dataset/test.csv")

In this implementation we are using Euclidean distance as the distance measure, so we first have to implement a function for it.

(Image source: https://aigents.co/blog/publication/distance-metrics-for-machine-learning)
import numpy as np

def euclidean_distance(a, b):
    # a = (x1, y1), b = (x2, y2)
    if not (len(a) == 2 and len(b) == 2):
        raise ValueError('length of a and b expected is 2, 2 but got {} and {} respectively'.format(len(a), len(b)))
    return np.sqrt((b[0] - a[0])**2 + (b[1] - a[1])**2)

Now we have to calculate the distances between the points using this function.

As we are doing a simple implementation, we fix our hyper-parameter at k = 5. In the code below we compute the Euclidean distance from each point in the test data to each point in the training data, save the distances together with the training labels into a temporary list, then sort and keep only the k nearest.

test_dict = test_data.to_dict('records')
train_dict = train_data.to_dict('records')
# Taking K as default value 5
k = 5  # hyper-parameter
predicted_labels = []
for test_row in test_dict:
    test_point = (test_row['Weight'], test_row['Colour'])
    measured_distances = []

    for train_row in train_dict:
        train_point = (train_row['Weight'], train_row['Colour'])
        distance = euclidean_distance(test_point, train_point)
        measured_distances.append((distance, train_row['Label']))
    # Sort by distance; as the first element of each tuple is a distance (float), the default sort works here
    measured_distances.sort()
    measured_distances = measured_distances[:k]
    predicted_labels.append(majority_votes(measured_distances))
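As a side note, the nested loops can also be replaced by a vectorized computation with NumPy broadcasting — a sketch assuming the same DataFrame columns (Weight, Colour, Label), not part of the original implementation:

```python
import numpy as np
import pandas as pd

def knn_predict_vectorized(train_df, test_df, k=5):
    """Predict a label for every test row using broadcast distance computation."""
    train_pts = train_df[['Weight', 'Colour']].to_numpy(dtype=float)
    test_pts = test_df[['Weight', 'Colour']].to_numpy(dtype=float)
    train_labels = train_df['Label'].to_numpy()
    # (n_test, 1, 2) - (1, n_train, 2) broadcasts to (n_test, n_train, 2)
    diffs = test_pts[:, None, :] - train_pts[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Column indices of the k nearest training points for each test row
    nearest = np.argsort(distances, axis=1)[:, :k]
    predictions = []
    for neighbour_labels in train_labels[nearest]:
        values, counts = np.unique(neighbour_labels, return_counts=True)
        predictions.append(values[counts.argmax()])
    return predictions
```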

Majority Voting

# https://stackoverflow.com/questions/47843707/count-frequency-of-item-in-a-list-of-tuples
from collections import Counter

def majority_votes(a):
    """
    Returns the label with the higher count in the given list.

    Parameters
    ----------
    a : list, required
        list of tuples whose first element is a distance and second element is a label
    """
    counts = Counter(x[1] for x in a)
    return 'B' if counts['B'] > counts['A'] else 'A'
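Hardcoding 'A' and 'B' works for this two-class dataset; for arbitrary labels, Counter.most_common gives a more general version (a sketch — on a tie it returns whichever label was counted first):

```python
from collections import Counter

def majority_votes_any(neighbours):
    """Return the most frequent label among (distance, label) tuples."""
    counts = Counter(label for _, label in neighbours)
    # most_common(1) returns a one-element list holding the top (label, count) pair
    return counts.most_common(1)[0][0]

print(majority_votes_any([(0.5, 'A'), (0.7, 'B'), (0.9, 'B')]))  # prints B
```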

Now, after predicting the labels using majority votes, we have a small sanity check.

We have a third file containing only labels, which is very useful for checking our predictions: comparing our predicted labels with the values in that file gives us the accuracy.

Note: as this is a simple example with a very small and easy dataset, we get 100% accuracy; in real scenarios that is not the case.

defined_labels = pd.read_csv('dataset/test-labels.csv')
defined_labels_dict = defined_labels.to_dict('records')
count = 0
# Compare each prediction with the true label at the same position
for row, predicted in zip(defined_labels_dict, predicted_labels):
    if row['Label'] == predicted:
        count += 1
accuracy = count / len(test_dict)
print('Accuracy: {} %'.format(accuracy * 100.0))

This is a simple implementation of KNN — easy to understand and implement, and a great algorithm to start with as a first step into Machine Learning.

Thanks for your time.

The full code with the dataset is available here.

Kindly let me know if you find any mistakes.

References:

https://appliedroots.com

Machine Learning in Action, by Peter Harrington
