Nearest Neighbor Classification

Nearest Neighbor Classification#

Nearest neighbor

Learning from training data#

A key concept in machine learning is using a subset of a dataset to train an algorithm to make estimates on a separate set of test data. The quality of the machine learning and algorithm can be assesed based on the accuracy of the predictions made on test data. Many times there are also parameters sometimes termed hyper-parameters which can be optimized through an iterative approach on test or validation data. In practice a dataset is randomly split into training and test sets using sampling.

k nearest neighbor#

We will examine one machine learning algorithm in the laboratory, k nearest neighbor. Many of the concepts are applicable to the broad range of machine learning algorithms available.

Nearest neighbor concept #

The training examines the characteristics of k nearest neighbors to the data point for which a prediction will be made. Nearness is measured using several different metrics with Euclidean distance being a common one for numerical attributes.
Euclidean distance:
1-D:

(1)#\[\begin{align} d(p,q) = \sqrt{(p-q)^{2}} \end{align}\]

2-D:

(2)#\[\begin{align} d(p,q) = \sqrt{(p_1-q_1)^{2}+(p_2-q_2)^{2}} \end{align}\]

For multiple points (rows, multidimensional):

(3)#\[\begin{align} d(p_i,q_i) = \sqrt{\sum_{i}{((p_i-q_i)^{2})}} \end{align}\]

import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Distance function inspired from above equations (1-3).#

def distance(pt1, pt2):
    """The distance between two points, represented as arrays."""
    return np.sqrt(np.sum((pt2-pt1)**2))

Nearest neighbor Functions#

These cells create the complete algorithm and use as part of a nearest neighbor toolbox

def row_distance(row1, row2):
    """The distance between two rows of a table."""
    return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays

def distances(training, test, target, features):
    """Compute the distance from test for each row in training."""
    dists = []
    attributes = training.select(features)
    for row in attributes.rows:
        dists.append(row_distance(row, test))
    return training.with_column('Distance', dists)

def closest(training, test, k, target, features):
    """Return a table of the k closest neighbors to example row from test data."""
    return distances(training, test, target, features).sort('Distance').take(np.arange(k))

Prediction Functions#

def predict_knn(row, train, test, k=5, pr=False):
    """Return the predicting value or class among the 
     k nearest neighbors, pr=1 prints"""
    if pr:
            print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
    return np.average(closest(train, test.select(features).row(row), k , target, features).column(target[0]))

def predict_knn_class(row, train, test, k=5, pr=False):
    """Return the predicting value or class among the 
     k nearest neighbors, pr=1 prints"""
    closestclass = list(closest(train, test.select(features).row(row), k , target, features).column(target[0]))
    if pr:
            print(f'Predicting target value, {target[0]}, for row = {row} using k={k} with features: {features}')
            print(f'Actual classification: {test.select(target).take(row)[0][0]}')
            print(f'Predicted classification: {max(closestclass, key=closestclass.count)}')
            print(f'Closest classifications: {closestclass}')
    return max(closestclass, key=closestclass.count)

Regression Functions#

Use as part of a toolbox for later analysis and the project

def standard_units(any_array):
    "Convert any array of numbers to standard units."
    return (any_array - np.mean(any_array))/np.std(any_array)  
    
def correlation(t, label_x, label_y):
    """Compute the correlation between two variables from a Table with column label_x and label_y.."""
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))

def slope(t, label_x, label_y):
    """Compute the slope between two variables from a Table with column label_x and label_y."""
    r = correlation(t, label_x, label_y)
    return r*np.std(t.column(label_y))/np.std(t.column(label_x))

def intercept(t, label_x, label_y):
    """Compute the slope between two variables from a Table with column label_x and label_y."""
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))

Classification example#

CKD=Table().read_table('data/ckd.csv')
CKD

Age	Blood Pressure	Specific Gravity	Albumin	Sugar	Red Blood Cells	Pus Cell	Pus Cell clumps	Bacteria	Blood Glucose Random	Blood Urea	Serum Creatinine	Sodium	Potassium	Hemoglobin	Packed Cell Volume	White Blood Cell Count	Red Blood Cell Count	Hypertension	Diabetes Mellitus	Coronary Artery Disease	Appetite	Pedal Edema	Anemia	Class
48	70	1.005	4	0	normal	abnormal	present	notpresent	117	56	3.8	111	2.5	11.2	32	6700	3.9	yes	no	no	poor	yes	yes	1
53	90	1.02	2	0	abnormal	abnormal	present	notpresent	70	107	7.2	114	3.7	9.5	29	12100	3.7	yes	yes	no	poor	no	yes	1
63	70	1.01	3	0	abnormal	abnormal	present	notpresent	380	60	2.7	131	4.2	10.8	32	4500	3.8	yes	yes	no	poor	yes	no	1
68	80	1.01	3	2	normal	abnormal	present	present	157	90	4.1	130	6.4	5.6	16	11000	2.6	yes	yes	yes	poor	yes	no	1
61	80	1.015	2	0	abnormal	abnormal	notpresent	notpresent	173	148	3.9	135	5.2	7.7	24	9200	3.2	yes	yes	yes	poor	yes	yes	1
48	80	1.025	4	0	normal	abnormal	notpresent	notpresent	95	163	7.7	136	3.8	9.8	32	6900	3.4	yes	no	no	good	no	yes	1
69	70	1.01	3	4	normal	abnormal	notpresent	notpresent	264	87	2.7	130	4	12.5	37	9600	4.1	yes	yes	yes	good	yes	no	1
73	70	1.005	0	0	normal	normal	notpresent	notpresent	70	32	0.9	125	4	10	29	18900	3.5	yes	yes	no	good	yes	no	1
73	80	1.02	2	0	abnormal	abnormal	notpresent	notpresent	253	142	4.6	138	5.8	10.5	33	7200	4.3	yes	yes	yes	good	no	no	1
46	60	1.01	1	0	normal	normal	notpresent	notpresent	163	92	3.3	141	4	9.8	28	14600	3.2	yes	yes	no	good	no	no	1

... (148 rows omitted)

Define target and features#

target = ['Class']
features = ['Blood Pressure','Blood Glucose Random','Hemoglobin','Serum Creatinine' ]

sCKD = CKD.select(target[0])

Standardize#

for label in features:
    print('Standardizing: ',label)
    sCKD = sCKD.with_columns(label,standard_units(CKD[label]))
sCKD   

Standardizing:  Blood Pressure
Standardizing:  Blood Glucose Random
Standardizing:  Hemoglobin
Standardizing:  Serum Creatinine

Class	Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine
1	-0.363613	-0.221549	-0.865744	0.52525
1	1.43173	-0.947597	-1.45745	1.63351
1	-0.363613	3.84123	-1.00497	0.166693
1	0.534056	0.396364	-2.81488	0.623038
1	0.534056	0.643529	-2.08395	0.557846
1	0.534056	-0.561402	-1.35303	1.79649
1	-0.363613	2.04928	-0.413266	0.166693
1	-0.363613	-0.947597	-1.28342	-0.420035
1	0.534056	1.87936	-1.10939	0.786018
1	-1.26128	0.489051	-1.35303	0.36227

... (148 rows omitted)

Train, test split#

trainK, testK = sCKD.split(int(0.8*CKD.num_rows))
print(trainK.num_rows, 'training and', testK.num_rows, 'test instances.')

trainK.show(3)

126 training and 32 test instances.

Class	Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine
0	0.534056	-0.576849	0.630914	-0.322247
0	0.534056	-0.762223	0.770138	-0.517823
1	-0.363613	2.04928	-0.413266	0.166693

... (123 rows omitted)

predict_knn_class(16, trainK, testK, k=8, pr=True)

Predicting target value, Class, for row = 16 using k=8 with features: ['Blood Pressure', 'Blood Glucose Random', 'Hemoglobin', 'Serum Creatinine']
Actual classification: 0
Predicted classification: 0
Closest classifications: [0, 0, 0, 0, 0, 0, 0, 0]

Test prediction accuracy using specified features#

correct = 0
predict_list = []
for i in np.arange(testK.num_rows):
    predict = predict_knn_class(i, trainK, testK, k=8, pr=False)
    predict_list.append(predict)
    correct += 1*(predict == testK[target[0]][i])
print(f'Percent correct: {correct/testK.num_rows*100:.1f}%')

Percent correct: 96.9%

Convert test data to original units and plot using below functions#

Examine relationship between two variables and prediction (0 = no CKD, 1 = CKD)

def original_val(x_z,xmean,xstd):
    """Convert standard z-value back to original"""
    x = x_z*xstd + xmean
    return x

def plot_relate(tbl, test, predict_list, labelx, labely):
    """Plot prediction in original units"""
    testvals = Table().with_columns(target[0],testK[target[0]], 'predict',predict_list)
    for label in features:
        xmean = np.mean(tbl[label])
        xstd = np.std(tbl[label])
        x = original_val(test[label],xmean,xstd)
        testvals=testvals.with_columns(label,x)
    scatter = plt.scatter(testvals[labelx],testvals[labely], c=testvals['predict'])
    plt.xlabel(labelx)
    plt.ylabel(labely)
    plt.legend(*scatter.legend_elements())
    return testvals
    

plot_relate(CKD, testK, predict_list, "Blood Glucose Random", "Hemoglobin")

Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine
70	100	16.4	0.5
60	109	15.8	1.1
70	102	15	1.2
80	122	17	1.2
70	83	16.2	1.2
80	85	15.6	1.1
80	132	17.8	0.8
80	78	16.1	0.6
70	113	16.5	0.6
80	96	13.8	0.5

... (22 rows omitted)

_images/96318e89316994842c8b53a6aa36d904118159ba66f5f62f47e97bc5af353eda.png

Interpretation: Patients with high hemoglobin numbers and low blood glucose are less likely to be predicted to have CKD

plot_relate(CKD, testK, predict_list, "Blood Glucose Random", "Serum Creatinine")

Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine
70	100	16.4	0.5
60	109	15.8	1.1
70	102	15	1.2
80	122	17	1.2
70	83	16.2	1.2
80	85	15.6	1.1
80	132	17.8	0.8
80	78	16.1	0.6
70	113	16.5	0.6
80	96	13.8	0.5

... (22 rows omitted)

_images/784b19e5be0ae4e90fb64c53504198b8fcea4c6c6125db6c3d86a748520d2a0b.png

Interpretation: Patients with low creatinine and blood glucose are less likely to be predicted to have CKD

Pretty good prediction, let’s see if we can do better with additional features

Add additional features#

features

['Blood Pressure', 'Blood Glucose Random', 'Hemoglobin', 'Serum Creatinine']

target = ['Class']
features = ['Age','Blood Pressure','Blood Glucose Random','Hemoglobin','Serum Creatinine', 'Sodium', 'Blood Urea' ]

sCKD = CKD.select(target[0])

for label in features:
    print('Standardizing: ',label)
    sCKD = sCKD.with_columns(label,standard_units(CKD[label]))
sCKD   

Standardizing:  Age
Standardizing:  Blood Pressure
Standardizing:  Blood Glucose Random
Standardizing:  Hemoglobin
Standardizing:  Serum Creatinine
Standardizing:  Sodium
Standardizing:  Blood Urea

Class	Age	Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine	Sodium	Blood Urea
1	-0.101098	-0.363613	-0.221549	-0.865744	0.52525	-3.73015	0.0724741
1	0.222253	1.43173	-0.947597	-1.45745	1.63351	-3.32831	1.15195
1	0.868954	-0.363613	3.84123	-1.00497	0.166693	-1.05122	0.157139
1	1.19231	0.534056	0.396364	-2.81488	0.623038	-1.18517	0.792125
1	0.739614	0.534056	0.643529	-2.08395	0.557846	-0.515439	2.01976
1	-0.101098	0.534056	-0.561402	-1.35303	1.79649	-0.381492	2.33726
1	1.25698	-0.363613	2.04928	-0.413266	0.166693	-1.18517	0.728626
1	1.51566	-0.363613	-0.947597	-1.28342	-0.420035	-1.8549	-0.435514
1	1.51566	0.534056	1.87936	-1.10939	0.786018	-0.1136	1.89277
1	-0.230439	-1.26128	0.489051	-1.35303	0.36227	0.288239	0.834457

... (148 rows omitted)

trainK, testK = sCKD.split(int(0.8*CKD.num_rows))
print(trainK.num_rows, 'training and', testK.num_rows, 'test instances.')

trainK.show(3)

126 training and 32 test instances.

Age	Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine	Sodium	Blood Urea
-0.424449	-1.26128	-0.221549	-0.239236	-0.485227	0.288239	-0.160354
0.610274	-1.26128	-0.283341	0.561302	-0.354843	0.0203463	-0.62601
0.351593	0.534056	0.025616	-0.169624	-0.322247	-0.515439	-0.753007

... (123 rows omitted)

Test prediction accuracy using specified features#

correct = 0
predict_list = []
for i in np.arange(testK.num_rows):
    predict = predict_knn_class(i, trainK, testK, k=8, pr=False)
    predict_list.append(predict)
    correct += 1*(predict == testK[target[0]][i])
print(f'Percent correct: {correct/testK.num_rows*100:.1f}%')

Percent correct: 93.8%

plot_relate(CKD, testK, predict_list, "Blood Urea", "Sodium")

Class	Age	Blood Pressure	Blood Glucose Random	Hemoglobin	Serum Creatinine	Sodium	Blood Urea
0	43	80	81	13.9	0.6	135	46
0	35	60	105	14.7	0.5	135	39
1	62	70	122	12.6	1.7	136	42
0	25	80	121	15	1.2	142	19
0	43	80	130	15.9	1.1	143	30
0	55	80	104	17.3	0.9	142	28
0	58	70	88	16.4	1.1	147	16
0	20	70	123	14.6	1	135	44
0	68	60	125	17.4	1.1	139	41
0	12	80	100	15.8	0.6	137	26

... (22 rows omitted)

_images/e9ed86607316b559217abcb3717574e014264825ab71591910ef693e08199147.png

Interpretation: Patients with high sodium and low urea are less likely to be predicted to have CKD

Improved…

Nearest Neighbor Classification

Contents

Nearest Neighbor Classification#

Learning from training data#

k nearest neighbor#

Nearest neighbor concept#

Distance function inspired from above equations (1-3).#

Nearest neighbor Functions#

Prediction Functions#

Regression Functions#

Classification example#

Define target and features#

Standardize#

Train, test split#

Test prediction accuracy using specified features#

Convert test data to original units and plot using below functions#

Add additional features#

Test prediction accuracy using specified features#

Nearest neighbor concept #