开篇
前段时间在b站上看斯坦福大学cs231n计算机视觉的公开课,其实这门课并不是一门进阶课程,他只是一门基础的CV课程,如果大家想入门计算机视觉的可以去b站上搜一下这门课,简介中前序课程建议学习cs229和cs131,但本人亲测没学过这两门也关系不大,因为cs231n讲的还是很基础的(我学过cs229,就是吴恩达老师的机器学习,但是没上过cs131,但感觉影响和知识预备关联不是很大,cs231n实在很基础)。虽然课程很基础,但是它的三个大作业确实是备受好评,我之所以会每天再花费一个多小时去学一遍这门较为基础的入门课只是因为去做一遍它的官方作业而已,如果大家有兴趣可以和我一起做cs231n作业1。这是github上的作业要求和一些指导,我们先从作业1开始做起,我现在做到了作业2,未来会在博客上更新整套三个作业的所有代码。我们开始吧。
最近邻KNN算法
最近邻算法很容易理解,在空间中某个样本点所属的类别与它周围出现的最多类别相同。
举个例子,假如现在样本点A,选出距离它最近的5个样本B,C,D,E,F,分别对应类别1,1,2,3,3.我们可以看到最近的5个点中,出现最多的类别是1,所以我们预测A的类别是1.就这么简单。
这里的参数我们一目了然,
1.k值,选出最近的k个点;2.用什么距离公式来衡量最近距离呢?我们可以选择欧氏距离,即L2距离;或者曼哈顿距离,即L1距离,d(i,j)=|X1-X2|+|Y1-Y2|.
欧氏距离为:
当p=2时,就是我们的欧氏距离。
我们确定了我们的可选参数和主要算法(选最近的k个点中出现的最多类别),现在我们来看作业1KNN的代码。
代码
import numpy as np
class KNearestNeighbor(object):
""" a kNN classifier with L2 distance """
def __init__(self):
pass
def train(self, X, y):
"""
Train the classifier. For k-nearest neighbors this is just
memorizing the training data.
Inputs:
- X: A numpy array of shape (num_train, D) containing the training data
consisting of num_train samples each of dimension D.
- y: A numpy array of shape (N,) containing the training labels, where
y[i] is the label for X[i].
"""
self.X_train = X
self.y_train = y
def predict(self, X, k=1, num_loops=0):
"""
Predict labels for test data using this classifier.
Inputs:
- X: A numpy array of shape (num_test, D) containing test data consisting
of num_test samples each of dimension D.
- k: The number of nearest neighbors that vote for the predicted labels.
- num_loops: Determines which implementation to use to compute distances
between training points and testing points.
Returns:
- y: A numpy array of shape (num_test,) containing predicted labels for the
test data, where y[i] is the predicted label for the test point X[i].
"""
# 表示的是计算距离的方法
if num_loops == 0:
dists = self.compute_distances_no_loops(X)
elif num_loops == 1:
dists = self.compute_distances_one_loop(X)
elif num_loops == 2:
dists = self.compute_distances_two_loops(X)
else:
raise ValueError('Invalid value %d for num_loops' % num_loops)
return self.predict_labels(dists, k=k)
def compute_distances_two_loops(self, X):
"""
Compute the distance between each test point in X and each training point
in self.X_train using a nested loop over both the training data and the
test data.
Inputs:
- X: A numpy array of shape (num_test, D) containing test data.
Returns:
- dists: A numpy array of shape (num_test, num_train) where dists[i, j]
is the Euclidean distance between the ith test point and the jth training
point.
"""
num_test = X.shape[0]
num_train = self.X_train.shape[0]
# 这是一个距离矩阵
dists = np.zeros((num_test, num_train))
for i in range(num_test):
for j in range(num_train):
#####################################################################
# TODO: #
# Compute the l2 distance between the ith test point and the jth #
# training point, and store the result in dists[i, j]. You should #
# not use a loop over dimension. #
#####################################################################
dists[i][j] = np.linalg.norm((X[i] - self.X_train[j]), ord=2)
#####################################################################
# END OF YOUR CODE #
#####################################################################
return dists
def compute_distances_one_loop(self, X):
"""
Compute the distance between each test point in X and each training point
in self.X_train using a single loop over the test data.
Input / Output: Same as compute_distances_two_loops
"""
num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))
for i in range(num_test):
# Compute the l2 distance between the ith test point and all training
# points, and store the result in dists[i, :].
# 有点儿向量化的意思
dists[i] = np.linalg.norm(X[i] - self.X_train, ord=2, axis=1)
return dists
def compute_distances_no_loops(self, X):
"""
Compute the distance between each test point in X and each training point
in self.X_train using no explicit loops.
Input / Output: Same as compute_distances_two_loops
"""
num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))
#########################################################################
# TODO: #
# Compute the l2 distance between all test points and all training #
# points without using any explicit loops, and store the result in #
# dists. #
# #
# You should implement this function using only basic array operations; #
# in particular you should not use functions from scipy. #
# #
# HINT: Try to formulate the l2 distance using matrix multiplication #
# and two broadcast sums. #
#########################################################################
# 广播特性进行相加
# 由于我们的距离数组的行数是num_test,列数是num_train
# 所以我们应该让X_train的维度是(1,num_train)
# 而test的维度是(num_test,1)
dists += np.sum(self.X_train ** 2, axis=1).reshape(1, num_train)
dists += np.sum(X ** 2, axis=1).reshape(num_test, 1) # reshape for broadcasting
dists -= 2 * np.dot(X, self.X_train.T)
dists = np.sqrt(dists)
#########################################################################
# END OF YOUR CODE #
#########################################################################
return dists
def predict_labels(self, dists, k=1):
"""
Given a matrix of distances between test points and training points,
predict a label for each test point.
Inputs:
- dists: A numpy array of shape (num_test, num_train) where dists[i, j]
gives the distance betwen the ith test point and the jth training point.
Returns:
- y: A numpy array of shape (num_test,) containing predicted labels for the
test data, where y[i] is the predicted label for the test point X[i].
"""
num_test = dists.shape[0]
y_pred = np.zeros(num_test)
for i in range(num_test):
# A list of length k storing the labels of the k nearest neighbors to
# the ith test point.
closest_y = []
#########################################################################
# TODO: #
# Use the distance matrix to find the k nearest neighbors of the ith #
# testing point, and use self.y_train to find the labels of these #
# neighbors. Store these labels in closest_y. #
# Hint: Look up the function numpy.argsort. #
#########################################################################
# 将距离从小到大进行排列,然后取前k的索引放入列表
closest_y = self.y_train[np.argsort(dists[i])[: k]]
#########################################################################
# TODO: #
# Now that you have found the labels of the k nearest neighbors, you #
# need to find the most common label in the list closest_y of labels. #
# Store this label in y_pred[i]. Break ties by choosing the smaller #
# label. #
#########################################################################
# y_pred[i]是closest_y中出现最多的数字,即我们要找的那个类别
y_pred[i] = np.argmax(np.bincount(closest_y))
#########################################################################
# END OF YOUR CODE #
#########################################################################
return y_pred
这里我们的主要代码就是这三个计算距离的函数,我们采取的计算公式都是欧氏距离,但是具体细节不同,一种是直接计算两个向量之间的距离,另一种是取测试集中的每一个样本,计算与训练集向量之间的距离;第三种是一个二维矩阵,分别计算测试集中每一个样本与训练集中每个样本的距离并填入矩阵中。
然后我们将距离从小到大排序,选取前k个最小距离,然后找出出现做多的类别标签即可。(y_train中存储着每个训练集的正确标签)。
这里的英文提示都是原版作业给出的提示,对我们理解算法以及填充代码很有帮助。建议大家做完作业以后再自己实现一遍,把除了核心算法的细节也弄清楚。
总结
KNN算法很简单所以实现起来很easy,明天我们说linear_classifier线性分类器,一种简单且基础的分类模型。
说明:这一系列博客和我介绍的tensorflow或者pytorch实现的文章是不同,用pytorch实现的模型是利用了框架,而这里是一种从底层开始的造轮子过程,可以看到,我们只引入numpy这个库。两种方法都是有效的,只不过从底层实现可以更好地理解算法。
大家可以先参考下Pytorch实现线性回归