Nearest Neighbor Regression#
Learning from training data#
A key concept in machine learning is using a subset of a dataset to train an algorithm, then making estimates on a separate set of test data. The quality of the trained algorithm can be assessed by the accuracy of its predictions on the test data. Often there are additional parameters, termed hyper-parameters, which can be optimized through an iterative approach on test or validation data. In practice, a dataset is randomly split into training and test sets by sampling.
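As a minimal sketch of the idea (using numpy; the `datascience` Table method `split` used later handles the sampling for us), one can shuffle the row indices and cut at a fixed fraction. The dataset size below is hypothetical.

import numpy as np
n_rows = 1000                              # hypothetical dataset size
shuffled = np.random.permutation(n_rows)   # row indices in random order
cut = int(0.8 * n_rows)                    # 80% train / 20% test split
train_idx, test_idx = shuffled[:cut], shuffled[cut:]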
k nearest neighbor#
We will examine one machine learning algorithm in this laboratory, k nearest neighbor. Many of the concepts apply across the broad range of machine learning algorithms available.
Nearest neighbor concept#
The algorithm examines the characteristics of the k nearest neighbors to the data point for which a prediction will be made. Nearness can be measured with several different metrics; Euclidean distance is a common choice for numerical attributes.
Euclidean distance, in two dimensions (1), in three dimensions (2), and in general between two points $p$ and $q$ with $n$ attributes (3):

$$D = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2} \tag{1}$$

$$D = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2 + (z_1 - z_0)^2} \tag{2}$$

$$D(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{3}$$
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Distance function implementing equations (1-3) above#
def distance(pt1, pt2):
"""The distance between two points, represented as arrays."""
return np.sqrt(np.sum((pt2-pt1)**2))
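Other metrics follow the same pattern. As an illustrative sketch (not used later in this lab), Manhattan distance sums absolute coordinate differences instead of squaring them:

def manhattan_distance(pt1, pt2):
    """Manhattan (L1) distance between two points, represented as arrays."""
    return np.sum(np.abs(pt2 - pt1))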
Nearest neighbor Functions#
These cells build up the complete algorithm and can be reused as part of a nearest neighbor toolbox
def row_distance(row1, row2):
"""The distance between two rows of a table."""
return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays
def distances(training, test, target, features):
    """Compute the distance from the test example to each row in training,
    using only the feature columns. (target is passed through for interface
    consistency with the other toolbox functions.)"""
    dists = []
    attributes = training.select(features)   # restrict to the feature columns
    for row in attributes.rows:
        dists.append(row_distance(row, test))
    return training.with_column('Distance', dists)
def closest(training, test, k, target, features):
"""Return a table of the k closest neighbors to example row from test data."""
return distances(training, test, target, features).sort('Distance').take(np.arange(k))
Prediction Functions#
def predict_knn(row, train, test, k=5, pr=False):
    """Return the predicted value for the given row of the test table,
    averaging the target over its k nearest training neighbors;
    pr=True prints details. Relies on the globals `target` and `features`."""
    predict = np.average(closest(train, test.select(features).row(row), k, target, features).column(target[0]))
    if pr:
        print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
        print(f'Actual value: {test.select(target).take(row)[0][0]:.2f}')
        print(f'Predicted value: {predict:.2f}')
        print(f'Closest neighbor values: {closest(train, test.select(features).row(row), k, target, features).column(target[0])}')
    return predict
def predict_knn_class(row, train, test, k=5, pr=False):
    """Return the predicted class for the given row of the test table,
    taking a majority vote among its k nearest training neighbors;
    pr=True prints details. Relies on the globals `target` and `features`."""
    closestclass = list(closest(train, test.select(features).row(row), k, target, features).column(target[0]))
    if pr:
        print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
        print(f'Actual classification: {test.select(target).take(row)[0][0]}')
        print(f'Predicted classification: {max(closestclass, key=closestclass.count)}')
        print(f'Closest classifications: {closestclass}')
    return max(closestclass, key=closestclass.count)
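This lab exercises only the regression predictor, so as a hedged sketch of the classification variant, here is a 3-nearest-neighbor majority vote on made-up toy tables (the data, labels, and column names are hypothetical; the globals `target` and `features` are reassigned before the house example below):

target = ['Label']      # hypothetical toy target
features = ['x', 'y']   # hypothetical toy features
toy_train = Table().with_columns(
    'Label', make_array('a', 'a', 'b', 'b', 'a'),
    'x', make_array(0.0, 0.1, 5.0, 5.1, 0.2),
    'y', make_array(0.0, 0.2, 5.0, 4.9, 0.1))
toy_test = Table().with_columns(
    'Label', make_array('a'),
    'x', make_array(0.05),
    'y', make_array(0.05))
predict_knn_class(0, toy_train, toy_test, k=3)   # majority of ['a', 'a', 'a'] -> 'a'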
Regression Functions#
Use these as part of a toolbox for later analysis and for the project
def standard_units(any_array):
"Convert any array of numbers to standard units."
return (any_array - np.mean(any_array))/np.std(any_array)
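As a quick check with illustrative values, the result of standard_units always has mean 0 and standard deviation 1:

standard_units(np.array([1, 2, 3, 4, 5]))
# array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356])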
def correlation(t, label_x, label_y):
    """Compute the correlation between two variables from a Table with columns label_x and label_y."""
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))
def slope(t, label_x, label_y):
    """Compute the slope of the regression line between two variables from a Table with columns label_x and label_y."""
    r = correlation(t, label_x, label_y)
    return r*np.std(t.column(label_y))/np.std(t.column(label_x))
def intercept(t, label_x, label_y):
    """Compute the intercept of the regression line between two variables from a Table with columns label_x and label_y."""
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))
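As a hedged sketch of how slope and intercept combine into a prediction (toy values, not the house data), fit the line y = slope*x + intercept and evaluate it at a new x:

toy = Table().with_columns('x', make_array(1., 2., 3., 4.),
                           'y', make_array(2., 4., 6., 8.))
a = slope(toy, 'x', 'y')       # 2.0 for this perfectly linear toy data
b = intercept(toy, 'x', 'y')   # 0.0
print(a * 5 + b)               # predicted y at x = 5 -> 10.0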
Nearest neighbor regression example#
We will look at home sales in Ames, Iowa from 2006-2010. The dataset is described here#
HOUSE = Table.read_table('data/house.csv')
HOUSE
Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 526301100 | 20 | RL | 141 | 31770 | Pave | nan | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | BrkFace | Plywood | Stone | 112 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639 | Unf | 0 | 441 | 1080 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1960 | Fin | 2 | 528 | TA | TA | P | 210 | 62 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 5 | 2010 | WD | Normal | 215000 |
2 | 526350040 | 20 | RH | 80 | 11622 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | nan | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | nan | MnPrv | nan | 0 | 6 | 2010 | WD | Normal | 105000 |
3 | 526351010 | 20 | RL | 81 | 14267 | Pave | nan | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | nan | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | nan | nan | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
4 | 526353030 | 20 | RL | 93 | 11160 | Pave | nan | Reg | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1968 | 1968 | Hip | CompShg | BrkFace | BrkFace | None | 0 | Gd | TA | CBlock | TA | TA | No | ALQ | 1065 | Unf | 0 | 1045 | 2110 | GasA | Ex | Y | SBrkr | 2110 | 0 | 0 | 2110 | 1 | 0 | 2 | 1 | 3 | 1 | Ex | 8 | Typ | 2 | TA | Attchd | 1968 | Fin | 2 | 522 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 4 | 2010 | WD | Normal | 244000 |
5 | 527105010 | 60 | RL | 74 | 13830 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | nan | MnPrv | nan | 0 | 3 | 2010 | WD | Normal | 189900 |
6 | 527105030 | 60 | RL | 78 | 9978 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 6 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 20 | TA | TA | PConc | TA | TA | No | GLQ | 602 | Unf | 0 | 324 | 926 | GasA | Ex | Y | SBrkr | 926 | 678 | 0 | 1604 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 470 | TA | TA | Y | 360 | 36 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 6 | 2010 | WD | Normal | 195500 |
7 | 527127150 | 120 | RL | 41 | 4920 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 2001 | 2001 | Gable | CompShg | CemntBd | CmentBd | None | 0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 616 | Unf | 0 | 722 | 1338 | GasA | Ex | Y | SBrkr | 1338 | 0 | 0 | 1338 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 6 | Typ | 0 | nan | Attchd | 2001 | Fin | 2 | 582 | TA | TA | Y | 0 | 0 | 170 | 0 | 0 | 0 | nan | nan | nan | 0 | 4 | 2010 | WD | Normal | 213500 |
8 | 527145080 | 120 | RL | 43 | 5005 | Pave | nan | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | HdBoard | None | 0 | Gd | TA | PConc | Gd | TA | No | ALQ | 263 | Unf | 0 | 1017 | 1280 | GasA | Ex | Y | SBrkr | 1280 | 0 | 0 | 1280 | 0 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 0 | nan | Attchd | 1992 | RFn | 2 | 506 | TA | TA | Y | 0 | 82 | 0 | 0 | 144 | 0 | nan | nan | nan | 0 | 1 | 2010 | WD | Normal | 191500 |
9 | 527146030 | 120 | RL | 39 | 5389 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1995 | 1996 | Gable | CompShg | CemntBd | CmentBd | None | 0 | Gd | TA | PConc | Gd | TA | No | GLQ | 1180 | Unf | 0 | 415 | 1595 | GasA | Ex | Y | SBrkr | 1616 | 0 | 0 | 1616 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 1 | TA | Attchd | 1995 | RFn | 2 | 608 | TA | TA | Y | 237 | 152 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 3 | 2010 | WD | Normal | 236500 |
10 | 527162130 | 60 | RL | 60 | 7500 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1999 | 1999 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | TA | TA | No | Unf | 0 | Unf | 0 | 994 | 994 | GasA | Gd | Y | SBrkr | 1028 | 776 | 0 | 1804 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | TA | Attchd | 1999 | Fin | 2 | 442 | TA | TA | Y | 140 | 60 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 6 | 2010 | WD | Normal | 189000 |
... (2920 rows omitted)
House price prediction#
Let’s try to predict house price using features available in the extensive dataset.
Define target and features#
target = ['SalePrice']
features = ['1st Flr SF', 'Full Bath', '2nd Flr SF', 'TotRms AbvGrd']
sHOUSE = HOUSE.select(target[0])
Standardize#
for label in features:
    print('Standardizing:', label)
    sHOUSE = sHOUSE.with_columns(label, standard_units(HOUSE[label]))
sHOUSE
Standardizing: 1st Flr SF
Standardizing: Full Bath
Standardizing: 2nd Flr SF
Standardizing: TotRms AbvGrd
SalePrice | 1st Flr SF | Full Bath | 2nd Flr SF | TotRms AbvGrd |
---|---|---|---|---|
215000 | 1.267 | -1.02479 | -0.783185 | 0.354167 |
105000 | -0.672643 | -1.02479 | -0.783185 | -0.917535 |
172000 | 0.432445 | -1.02479 | -0.783185 | -0.281684 |
244000 | 2.42569 | 0.784028 | -0.783185 | 0.990018 |
189900 | -0.590974 | 0.784028 | 0.853432 | -0.281684 |
195500 | -0.596078 | 0.784028 | 0.799734 | 0.354167 |
213500 | 0.455414 | 0.784028 | -0.783185 | -0.281684 |
191500 | 0.307389 | 0.784028 | -0.783185 | -0.917535 |
236500 | 1.16492 | 0.784028 | -0.783185 | -0.917535 |
189000 | -0.335757 | 0.784028 | 1.02853 | 0.354167 |
... (2920 rows omitted)
Train, test split#
trainH, testH = sHOUSE.split(int(0.8*sHOUSE.num_rows))
print(trainH.num_rows, 'training and', testH.num_rows, 'test instances.')
trainH.show(3)
2344 training and 586 test instances.
SalePrice | 1st Flr SF | Full Bath | 2nd Flr SF | TotRms AbvGrd |
---|---|---|---|---|
243500 | 0.307389 | 0.784028 | 2.05346 | 1.62587 |
250000 | 1.21086 | 0.784028 | -0.783185 | 0.354167 |
132000 | 0.276763 | -1.02479 | -0.783185 | 0.354167 |
... (2341 rows omitted)
predict_knn(16, trainH, testH, k=8, pr=True)
Predicting target value, SalePrice, for row = 16 using k=8 with features: ['1st Flr SF', 'Full Bath', '2nd Flr SF', 'TotRms AbvGrd']
Actual value: 147110.00
Predicted value: 151731.50
Closest neighbor values: [148800 151000 144152 149900 151000 149500 155000 164500]
151731.5
Test prediction accuracy using specified features#
k = 8
error = []
for i in np.arange(testH.num_rows):
    predict = predict_knn(i, trainH, testH, k, pr=False)
    error.append(predict - testH[target[0]][i])
print(f'Mean signed error: {np.mean(np.array(error)):.2f}')
print(f'Root mean squared error (RMSE): {np.sqrt(np.mean((np.array(error))**2)):.2f}')
Mean signed error: -6495.56
Root mean squared error (RMSE): 52968.99
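For reference, with predicted prices $\hat{y}_i$, actual prices $y_i$, and $n$ test rows, the two summaries printed above are

$$\text{mean signed error} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i), \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$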
Pretty good prediction; let's see if we can do better with additional features
Add additional features#
target = ['SalePrice']
features = ['Lot Area', '1st Flr SF', '2nd Flr SF', 'Full Bath', 'TotRms AbvGrd', 'Overall Qual']
sHOUSE = HOUSE.select(target[0])
for label in features:
    print('Standardizing:', label)
    sHOUSE = sHOUSE.with_columns(label, standard_units(HOUSE[label]))
sHOUSE
Standardizing: Lot Area
Standardizing: 1st Flr SF
Standardizing: 2nd Flr SF
Standardizing: Full Bath
Standardizing: TotRms AbvGrd
Standardizing: Overall Qual
SalePrice | Lot Area | 1st Flr SF | 2nd Flr SF | Full Bath | TotRms AbvGrd | Overall Qual |
---|---|---|---|---|---|---|
215000 | 2.74438 | 1.267 | -0.783185 | -1.02479 | 0.354167 | -0.0672537 |
105000 | 0.187097 | -0.672643 | -0.783185 | -1.02479 | -0.917535 | -0.776079 |
172000 | 0.522814 | 0.432445 | -0.783185 | -1.02479 | -0.281684 | -0.0672537 |
244000 | 0.128458 | 2.42569 | -0.783185 | 0.784028 | 0.990018 | 0.641571 |
189900 | 0.467348 | -0.590974 | 0.853432 | 0.784028 | -0.281684 | -0.776079 |
195500 | -0.0215673 | -0.596078 | 0.799734 | 0.784028 | 0.354167 | -0.0672537 |
213500 | -0.663554 | 0.455414 | -0.783185 | 0.784028 | -0.281684 | 1.3504 |
191500 | -0.652765 | 0.307389 | -0.783185 | 0.784028 | -0.917535 | 1.3504 |
236500 | -0.604026 | 1.16492 | -0.783185 | 0.784028 | -0.917535 | 1.3504 |
189000 | -0.336087 | -0.335757 | 1.02853 | 0.784028 | 0.354167 | 0.641571 |
... (2920 rows omitted)
Train, test split#
trainH, testH = sHOUSE.split(int(0.8*sHOUSE.num_rows))
print(trainH.num_rows, 'training and', testH.num_rows, 'test instances.')
trainH.show(3)
2344 training and 586 test instances.
SalePrice | Lot Area | 1st Flr SF | 2nd Flr SF | Full Bath | TotRms AbvGrd | Overall Qual |
---|---|---|---|---|---|---|
150000 | -0.374165 | -0.345966 | 0.533579 | -1.02479 | -0.281684 | -0.776079 |
310013 | 0.262618 | -0.522065 | 1.3764 | 0.784028 | 0.990018 | 1.3504 |
85000 | -0.959035 | -1.59142 | 0.561595 | -1.02479 | -1.55339 | -0.776079 |
... (2341 rows omitted)
predict_knn(16, trainH, testH, k=8, pr=True)
Predicting target value, SalePrice, for row = 16 using k=8 with features: ['Lot Area', '1st Flr SF', '2nd Flr SF', 'Full Bath', 'TotRms AbvGrd', 'Overall Qual']
Actual value: 87500.00
Predicted value: 125187.50
Closest neighbor values: [107000 135000 98000 132000 167500 174000 50000 138000]
125187.5
Test prediction accuracy using specified features#
k = 5
error = []
for i in np.arange(testH.num_rows):
    predict = predict_knn(i, trainH, testH, k, pr=False)
    error.append(predict - testH[target[0]][i])
print(f'Mean signed error: {np.mean(np.array(error)):.2f}')
print(f'Root mean squared error (RMSE): {np.sqrt(np.mean((np.array(error))**2)):.2f}')
Mean signed error: -2410.57
Root mean squared error (RMSE): 34929.69
Improved: with the additional features, RMSE dropped from about 52,969 to 34,930.
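The value of k is itself a hyper-parameter of the kind mentioned at the start. As a minimal sketch of tuning it, reusing predict_knn on the same split (the candidate values of k are arbitrary):

for k in [1, 3, 5, 8, 15]:
    errs = np.array([predict_knn(i, trainH, testH, k) - testH[target[0]][i]
                     for i in np.arange(testH.num_rows)])
    print(f'k = {k:2d}   RMSE: {np.sqrt(np.mean(errs**2)):.2f}')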