Nearest Neighbor Regression#


Learning from training data#

A key concept in machine learning is using one subset of a dataset to train an algorithm and a separate set of test data to evaluate the estimates it makes. The quality of the machine learning algorithm can be assessed from the accuracy of its predictions on the test data. Often there are also tuning parameters, termed hyper-parameters, which can be optimized through an iterative approach on test or validation data. In practice, a dataset is randomly split into training and test sets using sampling.
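As a minimal sketch of such a split, assuming data is a hypothetical datascience Table and numpy has been imported as np (the Ames example below uses the built-in Table.split method instead):

# Hypothetical 80/20 random split of a datascience Table named `data`
shuffled = data.sample(with_replacement=False)  # random permutation of the rows
n_train = int(0.8 * shuffled.num_rows)
train = shuffled.take(np.arange(n_train))  # first 80% of the shuffled rows
test = shuffled.take(np.arange(n_train, shuffled.num_rows))  # remaining 20%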

k nearest neighbor#

We will examine one machine learning algorithm in the laboratory, k nearest neighbor. Many of the concepts are applicable to the broad range of machine learning algorithms available.

Nearest neighbor concept#

To make a prediction for a data point, the algorithm examines the characteristics of that point's k nearest neighbors in the training data. Nearness can be measured with several different metrics; Euclidean distance is a common choice for numerical attributes.
Euclidean distance in one, two, and multiple dimensions:

\[\begin{align} d(p,q) = \sqrt{(p-q)^{2}} \label{eq:distance_1D} \tag{1} \end{align}\]
\[\begin{align} d(p,q) = \sqrt{(p_1-q_1)^{2}+(p_2-q_2)^{2}} \label{eq:distance_2D} \tag{2} \end{align}\]
\[\begin{align} d(p,q) = \sqrt{\sum_{i}{(p_i-q_i)^{2}}} \label{eq:distance_multiD} \tag{3} \end{align}\]
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Distance function inspired by equations (1-3) above#

def distance(pt1, pt2):
    """The distance between two points, represented as arrays."""
    return np.sqrt(np.sum((pt2-pt1)**2))
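
For example, the distance between the points (0, 0) and (3, 4) of a 3-4-5 right triangle should be 5:

distance(make_array(0, 0), make_array(3, 4))  # returns 5.0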

Nearest Neighbor Functions#

These cells build the complete algorithm for use as part of a nearest neighbor toolbox.

def row_distance(row1, row2):
    """The distance between two rows of a table."""
    return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays

def distances(training, test, target, features):
    """Return training with a 'Distance' column giving the distance from the
    test example to each row, computed over the feature columns only.
    target is accepted for a uniform signature but is not used here."""
    dists = []
    attributes = training.select(features)  # compare features only, not the target
    for row in attributes.rows:
        dists.append(row_distance(row, test))
    return training.with_column('Distance', dists)

def closest(training, test, k, target, features):
    """Return a table of the k rows of training closest to the test example."""
    return distances(training, test, target, features).sort('Distance').take(np.arange(k))
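
As a quick check of the toolbox on a hypothetical toy table (not the Ames data used later):

toy = Table().with_columns(
    'x', make_array(0, 1, 2, 5),
    'y', make_array(0, 1, 2, 5),
    'label', make_array(10, 11, 12, 20))
# The two nearest neighbors of the point (1.5, 1.5) are the rows at (1, 1)
# and (2, 2), each at distance sqrt(0.5) ≈ 0.71
closest(toy, make_array(1.5, 1.5), 2, ['label'], ['x', 'y'])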

Prediction Functions#

def predict_knn(row, train, test, k=5, pr=False):
    """Return the predicted value for the given test row, averaged over the
    k nearest neighbors in train; pr=True prints details.
    Relies on the global names target and features defined below."""
    predict = np.average(closest(train, test.select(features).row(row), k, target, features).column(target[0]))
    if pr:
        print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
        print(f'Actual value: {test.select(target).take(row)[0][0]:.2f}')
        print(f'Predicted value: {predict:.2f}')
        print(f'Closest neighbor values: {closest(train, test.select(features).row(row), k, target, features).column(target[0])}')
    return predict

def predict_knn_class(row, train, test, k=5, pr=False):
    """Return the predicted class for the given test row, chosen by majority
    vote among the k nearest neighbors in train; pr=True prints details.
    Relies on the global names target and features defined below."""
    closestclass = list(closest(train, test.select(features).row(row), k, target, features).column(target[0]))
    if pr:
        print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
        print(f'Actual classification: {test.select(target).take(row)[0][0]}')
        print(f'Predicted classification: {max(closestclass, key=closestclass.count)}')
        print(f'Closest classifications: {closestclass}')
    return max(closestclass, key=closestclass.count)
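
The class prediction is a simple majority vote: max with key=closestclass.count returns the most common element, and ties go to whichever element appears first in the list. For example:

votes = ['Norm', 'Feedr', 'Norm']
max(votes, key=votes.count)  # 'Norm'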

Regression Functions#

These functions form part of a toolbox for later analysis and the project.

def standard_units(any_array):
    "Convert any array of numbers to standard units."
    return (any_array - np.mean(any_array))/np.std(any_array)  
    
def correlation(t, label_x, label_y):
    """Compute the correlation between two variables from a Table with columns label_x and label_y."""
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))

def slope(t, label_x, label_y):
    """Compute the slope between two variables from a Table with column label_x and label_y."""
    r = correlation(t, label_x, label_y)
    return r*np.std(t.column(label_y))/np.std(t.column(label_x))

def intercept(t, label_x, label_y):
    """Compute the intercept between two variables from a Table with columns label_x and label_y."""
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))
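
As a quick sanity check on a hypothetical table of points lying exactly on the line y = 2x + 1, the correlation should be 1, the slope 2, and the intercept 1:

tiny = Table().with_columns('x', make_array(1, 2, 3, 4),
                            'y', make_array(3, 5, 7, 9))
correlation(tiny, 'x', 'y'), slope(tiny, 'x', 'y'), intercept(tiny, 'x', 'y')
# Expected: (1.0, 2.0, 1.0)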


Nearest neighbor regression example#

We will look at home sales in Ames, Iowa from 2006 to 2010. The dataset is described here.

HOUSE=Table().read_table('data/house.csv')
HOUSE
(Display of HOUSE: 82 columns × 2930 rows; the first 10 rows are shown in the notebook, 2920 rows omitted. Columns: Order, PID, MS SubClass, MS Zoning, Lot Frontage, Lot Area, Street, Alley, Lot Shape, Land Contour, Utilities, Lot Config, Land Slope, Neighborhood, Condition 1, Condition 2, Bldg Type, House Style, Overall Qual, Overall Cond, Year Built, Year Remod/Add, Roof Style, Roof Matl, Exterior 1st, Exterior 2nd, Mas Vnr Type, Mas Vnr Area, Exter Qual, Exter Cond, Foundation, Bsmt Qual, Bsmt Cond, Bsmt Exposure, BsmtFin Type 1, BsmtFin SF 1, BsmtFin Type 2, BsmtFin SF 2, Bsmt Unf SF, Total Bsmt SF, Heating, Heating QC, Central Air, Electrical, 1st Flr SF, 2nd Flr SF, Low Qual Fin SF, Gr Liv Area, Bsmt Full Bath, Bsmt Half Bath, Full Bath, Half Bath, Bedroom AbvGr, Kitchen AbvGr, Kitchen Qual, TotRms AbvGrd, Functional, Fireplaces, Fireplace Qu, Garage Type, Garage Yr Blt, Garage Finish, Garage Cars, Garage Area, Garage Qual, Garage Cond, Paved Drive, Wood Deck SF, Open Porch SF, Enclosed Porch, 3Ssn Porch, Screen Porch, Pool Area, Pool QC, Fence, Misc Feature, Misc Val, Mo Sold, Yr Sold, Sale Type, Sale Condition, SalePrice.)

House price prediction#

Let’s try to predict the sale price of a home using a few of the many features available in this extensive dataset.

Define target and features#

target = ['SalePrice']
features = ['1st Flr SF','Full Bath', '2nd Flr SF','TotRms AbvGrd' ]
sHOUSE = HOUSE.select(target[0])

Standardize#

Convert each feature to standard units (mean 0, SD 1) so that no single feature dominates the distance calculation simply because of its scale.

for label in features:
    print('Standardizing: ',label)
    sHOUSE = sHOUSE.with_columns(label,standard_units(HOUSE[label]))
sHOUSE  
Standardizing:  1st Flr SF
Standardizing:  Full Bath
Standardizing:  2nd Flr SF
Standardizing:  TotRms AbvGrd
SalePrice 1st Flr SF Full Bath 2nd Flr SF TotRms AbvGrd
215000 1.267 -1.02479 -0.783185 0.354167
105000 -0.672643 -1.02479 -0.783185 -0.917535
172000 0.432445 -1.02479 -0.783185 -0.281684
244000 2.42569 0.784028 -0.783185 0.990018
189900 -0.590974 0.784028 0.853432 -0.281684
195500 -0.596078 0.784028 0.799734 0.354167
213500 0.455414 0.784028 -0.783185 -0.281684
191500 0.307389 0.784028 -0.783185 -0.917535
236500 1.16492 0.784028 -0.783185 -0.917535
189000 -0.335757 0.784028 1.02853 0.354167

... (2920 rows omitted)
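A quick check of the standardization (a hypothetical verification cell, not part of the original notebook): each standardized column should have mean ≈ 0 and SD = 1.

np.mean(sHOUSE.column('Full Bath')), np.std(sHOUSE.column('Full Bath'))
# Expected: (≈0.0, 1.0)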

Train, test split#

Table.split randomly samples the requested number of rows into the first table and returns the remaining rows in the second, so the exact rows (and the results below) vary from run to run.

trainH, testH = sHOUSE.split(int(0.8*sHOUSE.num_rows))
print(trainH.num_rows, 'training and', testH.num_rows, 'test instances.')

trainH.show(3)
2344 training and 586 test instances.
SalePrice 1st Flr SF Full Bath 2nd Flr SF TotRms AbvGrd
243500 0.307389 0.784028 2.05346 1.62587
250000 1.21086 0.784028 -0.783185 0.354167
132000 0.276763 -1.02479 -0.783185 0.354167

... (2341 rows omitted)

predict_knn(16, trainH, testH, k=8, pr=True)
Predicting target value, SalePrice, for row = 16 using k=8 with features: ['1st Flr SF', 'Full Bath', '2nd Flr SF', 'TotRms AbvGrd']
Actual value: 147110.00
Predicted value: 151731.50
Closest neighbor values: [148800 151000 144152 149900 151000 149500 155000 164500]
151731.5

Test prediction accuracy using specified features#

k = 8
error = []
for i in np.arange(testH.num_rows):
    predict = predict_knn(i, trainH, testH, k=k, pr=False)
    error.append(predict - testH[target[0]][i])
print(f'Mean signed error: {np.mean(np.array(error)):.2f}')
print(f'Root mean squared error (RMSE): {np.sqrt(np.mean((np.array(error))**2)):.2f}')
Mean signed error: -6495.56
Root mean squared error (RMSE): 52968.99

A reasonable prediction; let’s see if we can do better with additional features.

Add additional features#

target = ['SalePrice']
features = ['Lot Area','1st Flr SF','2nd Flr SF','Full Bath','TotRms AbvGrd', 'Overall Qual' ]
sHOUSE = HOUSE.select(target[0])
for label in features:
    print('Standardizing: ',label)
    sHOUSE = sHOUSE.with_columns(label,standard_units(HOUSE[label]))
sHOUSE  
Standardizing:  Lot Area
Standardizing:  1st Flr SF
Standardizing:  2nd Flr SF
Standardizing:  Full Bath
Standardizing:  TotRms AbvGrd
Standardizing:  Overall Qual
SalePrice Lot Area 1st Flr SF 2nd Flr SF Full Bath TotRms AbvGrd Overall Qual
215000 2.74438 1.267 -0.783185 -1.02479 0.354167 -0.0672537
105000 0.187097 -0.672643 -0.783185 -1.02479 -0.917535 -0.776079
172000 0.522814 0.432445 -0.783185 -1.02479 -0.281684 -0.0672537
244000 0.128458 2.42569 -0.783185 0.784028 0.990018 0.641571
189900 0.467348 -0.590974 0.853432 0.784028 -0.281684 -0.776079
195500 -0.0215673 -0.596078 0.799734 0.784028 0.354167 -0.0672537
213500 -0.663554 0.455414 -0.783185 0.784028 -0.281684 1.3504
191500 -0.652765 0.307389 -0.783185 0.784028 -0.917535 1.3504
236500 -0.604026 1.16492 -0.783185 0.784028 -0.917535 1.3504
189000 -0.336087 -0.335757 1.02853 0.784028 0.354167 0.641571

... (2920 rows omitted)

Train, test split#

trainH, testH = sHOUSE.split(int(0.8*sHOUSE.num_rows))
print(trainH.num_rows, 'training and', testH.num_rows, 'test instances.')

trainH.show(3)
2344 training and 586 test instances.
SalePrice Lot Area 1st Flr SF 2nd Flr SF Full Bath TotRms AbvGrd Overall Qual
150000 -0.374165 -0.345966 0.533579 -1.02479 -0.281684 -0.776079
310013 0.262618 -0.522065 1.3764 0.784028 0.990018 1.3504
85000 -0.959035 -1.59142 0.561595 -1.02479 -1.55339 -0.776079

... (2341 rows omitted)

predict_knn(16, trainH, testH, k=8, pr=True)
Predicting target value, SalePrice, for row = 16 using k=8 with features: ['Lot Area', '1st Flr SF', '2nd Flr SF', 'Full Bath', 'TotRms AbvGrd', 'Overall Qual']
Actual value: 87500.00
Predicted value: 125187.50
Closest neighbor values: [107000 135000  98000 132000 167500 174000  50000 138000]
125187.5

Test prediction accuracy using specified features#

k = 5
error = []
for i in np.arange(testH.num_rows):
    predict = predict_knn(i, trainH, testH, k, pr=False)
    error.append(predict - testH[target[0]][i])
print(f'Mean signed error: {np.mean(np.array(error)):.2f}')
print(f'Root mean squared error (RMSE): {np.sqrt(np.mean((np.array(error))**2)):.2f}')
Mean signed error: -2410.57
Root mean squared error (RMSE): 34929.69

Improved… with the additional features, the RMSE dropped from about 52,969 to 34,930 (exact values vary with the random split).
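
Finally, k itself is one of the hyper-parameters mentioned at the start. A minimal sketch of tuning it by iterating over candidate values, assuming trainH, testH, predict_knn, and target are defined as above (the candidate list is arbitrary, and in practice a separate validation set should be used rather than the final test set):

for k in [1, 3, 5, 8, 12, 20]:  # arbitrary candidate values
    errors = np.array([predict_knn(i, trainH, testH, k, pr=False) - testH[target[0]][i]
                       for i in np.arange(testH.num_rows)])
    print(f'k={k}: RMSE = {np.sqrt(np.mean(errors**2)):.2f}')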