Nearest Neighbor Regression#
Learning from training data#
A key concept in machine learning is using a subset of a dataset to train an algorithm, then making estimates on a separate set of test data. The quality of the trained algorithm can be assessed by the accuracy of its predictions on the test data. Often there are additional parameters, termed hyper-parameters, which can be optimized through an iterative approach on test or validation data. In practice, a dataset is randomly split into training and test sets by sampling.
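As a minimal sketch of the idea (using numpy; the `datascience` Table method `split` used later handles the sampling for us), one can shuffle the row indices and cut at a fixed fraction. The dataset size below is hypothetical.

import numpy as np
n_rows = 1000                              # hypothetical dataset size
shuffled = np.random.permutation(n_rows)   # row indices in random order
cut = int(0.8 * n_rows)                    # 80% train / 20% test split
train_idx, test_idx = shuffled[:cut], shuffled[cut:]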
k nearest neighbor#
We will examine one machine learning algorithm in this laboratory, k nearest neighbor. Many of the concepts apply across the broad range of machine learning algorithms available.
Nearest neighbor concept#
The algorithm examines the characteristics of the k nearest neighbors to the data point for which a prediction will be made. Nearness can be measured with several different metrics; Euclidean distance is a common choice for numerical attributes.
Euclidean distance, in two dimensions (1), in three dimensions (2), and in general between two points $p$ and $q$ with $n$ attributes (3):

$$D = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2} \tag{1}$$

$$D = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2 + (z_1 - z_0)^2} \tag{2}$$

$$D(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{3}$$
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Distance function implementing equations (1-3) above#
def distance(pt1, pt2):
"""The distance between two points, represented as arrays."""
return np.sqrt(np.sum((pt2-pt1)**2))
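Other metrics follow the same pattern. As an illustrative sketch (not used later in this lab), Manhattan distance sums absolute coordinate differences instead of squaring them:

def manhattan_distance(pt1, pt2):
    """Manhattan (L1) distance between two points, represented as arrays."""
    return np.sum(np.abs(pt2 - pt1))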
Nearest neighbor Functions#
These cells build up the complete algorithm and can be reused as part of a nearest neighbor toolbox
def row_distance(row1, row2):
"""The distance between two rows of a table."""
return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays
def distances(training, test, target, features):
    """Compute the distance from the test example to each row in training,
    using only the feature columns. (target is passed through for interface
    consistency with the other toolbox functions.)"""
    dists = []
    attributes = training.select(features)   # restrict to the feature columns
    for row in attributes.rows:
        dists.append(row_distance(row, test))
    return training.with_column('Distance', dists)
def closest(training, test, k, target, features):
"""Return a table of the k closest neighbors to example row from test data."""
return distances(training, test, target, features).sort('Distance').take(np.arange(k))
Prediction Functions#
def predict_knn(row, train, test, k=5, pr=False):
    """Return the predicted value for the given row of the test table,
    averaging the target over its k nearest training neighbors;
    pr=True prints details. Relies on the globals `target` and `features`."""
    predict = np.average(closest(train, test.select(features).row(row), k, target, features).column(target[0]))
    if pr:
        print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
        print(f'Actual value: {test.select(target).take(row)[0][0]:.2f}')
        print(f'Predicted value: {predict:.2f}')
        print(f'Closest neighbor values: {closest(train, test.select(features).row(row), k, target, features).column(target[0])}')
    return predict
def predict_knn_class(row, train, test, k=5, pr=False):
    """Return the predicted class for the given row of the test table,
    taking a majority vote among its k nearest training neighbors;
    pr=True prints details. Relies on the globals `target` and `features`."""
    closestclass = list(closest(train, test.select(features).row(row), k, target, features).column(target[0]))
    if pr:
        print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
        print(f'Actual classification: {test.select(target).take(row)[0][0]}')
        print(f'Predicted classification: {max(closestclass, key=closestclass.count)}')
        print(f'Closest classifications: {closestclass}')
    return max(closestclass, key=closestclass.count)
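This lab exercises only the regression predictor, so as a hedged sketch of the classification variant, here is a 3-nearest-neighbor majority vote on made-up toy tables (the data, labels, and column names are hypothetical; the globals `target` and `features` are reassigned before the house example below):

target = ['Label']      # hypothetical toy target
features = ['x', 'y']   # hypothetical toy features
toy_train = Table().with_columns(
    'Label', make_array('a', 'a', 'b', 'b', 'a'),
    'x', make_array(0.0, 0.1, 5.0, 5.1, 0.2),
    'y', make_array(0.0, 0.2, 5.0, 4.9, 0.1))
toy_test = Table().with_columns(
    'Label', make_array('a'),
    'x', make_array(0.05),
    'y', make_array(0.05))
predict_knn_class(0, toy_train, toy_test, k=3)   # majority of ['a', 'a', 'a'] -> 'a'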
Regression Functions#
Use these as part of a toolbox for later analysis and for the project
def standard_units(any_array):
"Convert any array of numbers to standard units."
return (any_array - np.mean(any_array))/np.std(any_array)
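As a quick check with illustrative values, the result of standard_units always has mean 0 and standard deviation 1:

standard_units(np.array([1, 2, 3, 4, 5]))
# array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356])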
def correlation(t, label_x, label_y):
    """Compute the correlation between two variables from a Table with columns label_x and label_y."""
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))
def slope(t, label_x, label_y):
    """Compute the slope of the regression line between two variables from a Table with columns label_x and label_y."""
    r = correlation(t, label_x, label_y)
    return r*np.std(t.column(label_y))/np.std(t.column(label_x))
def intercept(t, label_x, label_y):
    """Compute the intercept of the regression line between two variables from a Table with columns label_x and label_y."""
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))
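As a hedged sketch of how slope and intercept combine into a prediction (toy values, not the house data), fit the line y = slope*x + intercept and evaluate it at a new x:

toy = Table().with_columns('x', make_array(1., 2., 3., 4.),
                           'y', make_array(2., 4., 6., 8.))
a = slope(toy, 'x', 'y')       # 2.0 for this perfectly linear toy data
b = intercept(toy, 'x', 'y')   # 0.0
print(a * 5 + b)               # predicted y at x = 5 -> 10.0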
Nearest neighbor regression example#
We will look at home sales in Ames, Iowa from 2006-2010. The dataset is described here#
HOUSE = Table.read_table('data/house.csv')
HOUSE
Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 526301100 | 20 | RL | 141 | 31770 | Pave | nan | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | BrkFace | Plywood | Stone | 112 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639 | Unf | 0 | 441 | 1080 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1960 | Fin | 2 | 528 | TA | TA | P | 210 | 62 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 5 | 2010 | WD | Normal | 215000 |
2 | 526350040 | 20 | RH | 80 | 11622 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | nan | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | nan | MnPrv | nan | 0 | 6 | 2010 | WD | Normal | 105000 |
3 | 526351010 | 20 | RL | 81 | 14267 | Pave | nan | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | nan | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | nan | nan | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
4 | 526353030 | 20 | RL | 93 | 11160 | Pave | nan | Reg | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1968 | 1968 | Hip | CompShg | BrkFace | BrkFace | None | 0 | Gd | TA | CBlock | TA | TA | No | ALQ | 1065 | Unf | 0 | 1045 | 2110 | GasA | Ex | Y | SBrkr | 2110 | 0 | 0 | 2110 | 1 | 0 | 2 | 1 | 3 | 1 | Ex | 8 | Typ | 2 | TA | Attchd | 1968 | Fin | 2 | 522 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 4 | 2010 | WD | Normal | 244000 |
5 | 527105010 | 60 | RL | 74 | 13830 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | nan | MnPrv | nan | 0 | 3 | 2010 | WD | Normal | 189900 |
6 | 527105030 | 60 | RL | 78 | 9978 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 6 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 20 | TA | TA | PConc | TA | TA | No | GLQ | 602 | Unf | 0 | 324 | 926 | GasA | Ex | Y | SBrkr | 926 | 678 | 0 | 1604 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 470 | TA | TA | Y | 360 | 36 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 6 | 2010 | WD | Normal | 195500 |
7 | 527127150 | 120 | RL | 41 | 4920 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 2001 | 2001 | Gable | CompShg | CemntBd | CmentBd | None | 0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 616 | Unf | 0 | 722 | 1338 | GasA | Ex | Y | SBrkr | 1338 | 0 | 0 | 1338 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 6 | Typ | 0 | nan | Attchd | 2001 | Fin | 2 | 582 | TA | TA | Y | 0 | 0 | 170 | 0 | 0 | 0 | nan | nan | nan | 0 | 4 | 2010 | WD | Normal | 213500 |
8 | 527145080 | 120 | RL | 43 | 5005 | Pave | nan | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | HdBoard | None | 0 | Gd | TA | PConc | Gd | TA | No | ALQ | 263 | Unf | 0 | 1017 | 1280 | GasA | Ex | Y | SBrkr | 1280 | 0 | 0 | 1280 | 0 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 0 | nan | Attchd | 1992 | RFn | 2 | 506 | TA | TA | Y | 0 | 82 | 0 | 0 | 144 | 0 | nan | nan | nan | 0 | 1 | 2010 | WD | Normal | 191500 |
9 | 527146030 | 120 | RL | 39 | 5389 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1995 | 1996 | Gable | CompShg | CemntBd | CmentBd | None | 0 | Gd | TA | PConc | Gd | TA | No | GLQ | 1180 | Unf | 0 | 415 | 1595 | GasA | Ex | Y | SBrkr | 1616 | 0 | 0 | 1616 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 1 | TA | Attchd | 1995 | RFn | 2 | 608 | TA | TA | Y | 237 | 152 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 3 | 2010 | WD | Normal | 236500 |
10 | 527162130 | 60 | RL | 60 | 7500 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1999 | 1999 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | TA | TA | No | Unf | 0 | Unf | 0 | 994 | 994 | GasA | Gd | Y | SBrkr | 1028 | 776 | 0 | 1804 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | TA | Attchd | 1999 | Fin | 2 | 442 | TA | TA | Y | 140 | 60 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 6 | 2010 | WD | Normal | 189000 |
... (2920 rows omitted)
House price prediction#
Let’s try to predict house price using features available in the extensive dataset.
Define target and features#
target = ['SalePrice']
features = ['1st Flr SF', 'Full Bath', '2nd Flr SF', 'TotRms AbvGrd']
sHOUSE = HOUSE.select(target[0])
Standardize#
for label in features:
    print('Standardizing:', label)
    sHOUSE = sHOUSE.with_columns(label, standard_units(HOUSE[label]))
sHOUSE
Standardizing: 1st Flr SF
Standardizing: Full Bath
Standardizing: 2nd Flr SF
Standardizing: TotRms AbvGrd
SalePrice | 1st Flr SF | Full Bath | 2nd Flr SF | TotRms AbvGrd |
---|---|---|---|---|
215000 | 1.267 | -1.02479 | -0.783185 | 0.354167 |
105000 | -0.672643 | -1.02479 | -0.783185 | -0.917535 |
172000 | 0.432445 | -1.02479 | -0.783185 | -0.281684 |
244000 | 2.42569 | 0.784028 | -0.783185 | 0.990018 |
189900 | -0.590974 | 0.784028 | 0.853432 | -0.281684 |
195500 | -0.596078 | 0.784028 | 0.799734 | 0.354167 |
213500 | 0.455414 | 0.784028 | -0.783185 | -0.281684 |
191500 | 0.307389 | 0.784028 | -0.783185 | -0.917535 |
236500 | 1.16492 | 0.784028 | -0.783185 | -0.917535 |
189000 | -0.335757 | 0.784028 | 1.02853 | 0.354167 |
... (2920 rows omitted)
Train, test split#
trainH, testH = sHOUSE.split(int(0.8*sHOUSE.num_rows))
print(trainH.num_rows, 'training and', testH.num_rows, 'test instances.')
trainH.show(3)
2344 training and 586 test instances.
SalePrice | 1st Flr SF | Full Bath | 2nd Flr SF | TotRms AbvGrd |
---|---|---|---|---|
243500 | 0.307389 | 0.784028 | 2.05346 | 1.62587 |
250000 | 1.21086 | 0.784028 | -0.783185 | 0.354167 |
132000 | 0.276763 | -1.02479 | -0.783185 | 0.354167 |
... (2341 rows omitted)
predict_knn(16, trainH, testH, k=8, pr=True)
Predicting target value, SalePrice, for row = 16 using k=8 with features: ['1st Flr SF', 'Full Bath', '2nd Flr SF', 'TotRms AbvGrd']
Actual value: 147110.00
Predicted value: 151731.50
Closest neighbor values: [148800 151000 144152 149900 151000 149500 155000 164500]
151731.5
Test prediction accuracy using specified features#
k = 8
error = []
for i in np.arange(testH.num_rows):
    predict = predict_knn(i, trainH, testH, k, pr=False)
    error.append(predict - testH[target[0]][i])
print(f'Mean signed error: {np.mean(np.array(error)):.2f}')
print(f'Root mean squared error (RMSE): {np.sqrt(np.mean((np.array(error))**2)):.2f}')
Mean signed error: -6495.56
Root mean squared error (RMSE): 52968.99
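For reference, with predicted prices $\hat{y}_i$, actual prices $y_i$, and $n$ test rows, the two summaries printed above are

$$\text{mean signed error} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i), \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$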
Pretty good prediction; let's see if we can do better with additional features
Add additional features#
target = ['SalePrice']
features = ['Lot Area', '1st Flr SF', '2nd Flr SF', 'Full Bath', 'TotRms AbvGrd', 'Overall Qual']
sHOUSE = HOUSE.select(target[0])
for label in features:
    print('Standardizing:', label)
    sHOUSE = sHOUSE.with_columns(label, standard_units(HOUSE[label]))
sHOUSE
Standardizing: Lot Area
Standardizing: 1st Flr SF
Standardizing: 2nd Flr SF
Standardizing: Full Bath
Standardizing: TotRms AbvGrd
Standardizing: Overall Qual
SalePrice | Lot Area | 1st Flr SF | 2nd Flr SF | Full Bath | TotRms AbvGrd | Overall Qual |
---|---|---|---|---|---|---|
215000 | 2.74438 | 1.267 | -0.783185 | -1.02479 | 0.354167 | -0.0672537 |
105000 | 0.187097 | -0.672643 | -0.783185 | -1.02479 | -0.917535 | -0.776079 |
172000 | 0.522814 | 0.432445 | -0.783185 | -1.02479 | -0.281684 | -0.0672537 |
244000 | 0.128458 | 2.42569 | -0.783185 | 0.784028 | 0.990018 | 0.641571 |
189900 | 0.467348 | -0.590974 | 0.853432 | 0.784028 | -0.281684 | -0.776079 |
195500 | -0.0215673 | -0.596078 | 0.799734 | 0.784028 | 0.354167 | -0.0672537 |
213500 | -0.663554 | 0.455414 | -0.783185 | 0.784028 | -0.281684 | 1.3504 |
191500 | -0.652765 | 0.307389 | -0.783185 | 0.784028 | -0.917535 | 1.3504 |
236500 | -0.604026 | 1.16492 | -0.783185 | 0.784028 | -0.917535 | 1.3504 |
189000 | -0.336087 | -0.335757 | 1.02853 | 0.784028 | 0.354167 | 0.641571 |
... (2920 rows omitted)
Train, test split#
trainH, testH = sHOUSE.split(int(0.8*sHOUSE.num_rows))
print(trainH.num_rows, 'training and', testH.num_rows, 'test instances.')
trainH.show(3)
2344 training and 586 test instances.
SalePrice | Lot Area | 1st Flr SF | 2nd Flr SF | Full Bath | TotRms AbvGrd | Overall Qual |
---|---|---|---|---|---|---|
150000 | -0.374165 | -0.345966 | 0.533579 | -1.02479 | -0.281684 | -0.776079 |
310013 | 0.262618 | -0.522065 | 1.3764 | 0.784028 | 0.990018 | 1.3504 |
85000 | -0.959035 | -1.59142 | 0.561595 | -1.02479 | -1.55339 | -0.776079 |
... (2341 rows omitted)
predict_knn(16, trainH, testH, k=8, pr=True)
Predicting target value, SalePrice, for row = 16 using k=8 with features: ['Lot Area', '1st Flr SF', '2nd Flr SF', 'Full Bath', 'TotRms AbvGrd', 'Overall Qual']
Actual value: 87500.00
Predicted value: 125187.50
Closest neighbor values: [107000 135000 98000 132000 167500 174000 50000 138000]
125187.5
Test prediction accuracy using specified features#
k = 5
error = []
for i in np.arange(testH.num_rows):
    predict = predict_knn(i, trainH, testH, k, pr=False)
    error.append(predict - testH[target[0]][i])
print(f'Mean signed error: {np.mean(np.array(error)):.2f}')
print(f'Root mean squared error (RMSE): {np.sqrt(np.mean((np.array(error))**2)):.2f}')
Mean signed error: -2410.57
Root mean squared error (RMSE): 34929.69
Improved: with the additional features, RMSE dropped from about 52,969 to 34,930.
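The value of k is itself a hyper-parameter of the kind mentioned at the start. As a minimal sketch of tuning it, reusing predict_knn on the same split (the candidate values of k are arbitrary):

for k in [1, 3, 5, 8, 15]:
    errs = np.array([predict_knn(i, trainH, testH, k) - testH[target[0]][i]
                     for i in np.arange(testH.num_rows)])
    print(f'k = {k:2d}   RMSE: {np.sqrt(np.mean(errs**2)):.2f}')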