KNN Algorithm - Python Implementation

I. Algorithm Principle
  1. Compute the distance between the current point and every point in the dataset with known class labels (Euclidean distance, Mahalanobis distance, etc.)
  2. Sort the points in order of increasing distance
  3. Select the k points closest to the current point
  4. Determine the frequency of each class among those k points
  5. Return the class that appears most frequently among the k points as the predicted class of the current point
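
As a concrete sketch of these five steps (knn_classify and the toy data below are our own illustration, not from the original post):

import numpy as np

def knn_classify(newInput, dataSet, labels, k):
    # step 1: distance from newInput to every labeled sample (one row each)
    diff = dataSet - newInput
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # steps 2-3: sort by distance and keep the labels of the k nearest points
    nearestLabels = [labels[i] for i in distances.argsort()[:k]]
    # steps 4-5: return the most frequent label among those k neighbors
    counts = {}
    for label in nearestLabels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

# toy usage: four 2-D points in two classes
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.1, 0.2]), train, labels, k=3))  # -> B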

Notes:

  1. The choice of k depends on the data. In general, for classification a larger k reduces the influence of noise, but it also blurs the boundaries between classes.

    • A good k can be selected with various heuristic techniques, e.g. cross-validation (a sketch follows these notes)
    • Noise and irrelevant feature dimensions reduce the accuracy of the k-nearest-neighbor algorithm
  2. The nearest-neighbor algorithm has strong consistency results: as the amount of data tends to infinity, the error rate of the nearest-neighbor rule is at most twice the Bayes error rate, and for suitably chosen k the k-nearest-neighbor error rate approaches the Bayes error rate.
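
To make the cross-validation heuristic concrete, here is a minimal leave-one-out sketch (choose_k is a hypothetical helper of ours, reusing knn_classify from the sketch above):

import numpy as np

def choose_k(dataSet, labels, candidates=(1, 3, 5, 7, 9)):
    # for each candidate k, classify every sample against all the others
    # and keep the k with the fewest misclassifications
    n = dataSet.shape[0]
    best_k, best_errors = candidates[0], n + 1
    for k in candidates:
        errors = 0
        for i in range(n):
            rest = np.delete(dataSet, i, axis=0)
            rest_labels = labels[:i] + labels[i+1:]
            if knn_classify(dataSet[i], rest, rest_labels, k) != labels[i]:
                errors += 1
        if errors < best_errors:
            best_k, best_errors = k, errors
    return best_k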

II. Source Code

(Note: despite the title, the code below implements k-means clustering, not a KNN classifier; KNN itself has no training step beyond storing the labeled samples, as in the sketch above.)
from numpy import *
import matplotlib.pyplot as plt


# calculate Euclidean distance  
def euclDistance(vector1, vector2):  
    return sqrt(sum(power(vector2 - vector1, 2)))  

# init centroids with random samples  
def initCentroids(dataSet, k):  
    numSamples, dim = dataSet.shape  
    centroids = zeros((k, dim))  
    for i in range(k):  
        index = int(random.uniform(0, numSamples))  
        centroids[i, :] = dataSet[index, :]  
    return centroids  

# k-means cluster  
def kmeans(dataSet, k):  
    numSamples = dataSet.shape[0]  
    # first column stores which cluster this sample belongs to,  
    # second column stores the error between this sample and its centroid  
    clusterAssment = mat(zeros((numSamples, 2)))  
    clusterChanged = True  

    ## step 1: init centroids  
    centroids = initCentroids(dataSet, k)  

    while clusterChanged:  
        clusterChanged = False  
        ## for each sample  
        for i in range(numSamples):
            minDist  = float('inf')
            minIndex = 0  
            ## for each centroid  
            ## step 2: find the centroid who is closest  
            for j in range(k):  
                distance = euclDistance(centroids[j, :], dataSet[i, :])  
                if distance < minDist:  
                    minDist  = distance  
                    minIndex = j  

            ## step 3: update its cluster  
            if clusterAssment[i, 0] != minIndex:  
                clusterChanged = True  
                clusterAssment[i, :] = minIndex, minDist**2  

        ## step 4: update centroids  
        for j in range(k):
            pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
            if pointsInCluster.shape[0] > 0:  # guard: mean of an empty cluster would be NaN
                centroids[j, :] = mean(pointsInCluster, axis = 0)

    print('Congratulations, cluster complete!')
    return centroids, clusterAssment  

# show your clusters (only works with 2-D data)
def showCluster(dataSet, k, centroids, clusterAssment):  
    numSamples, dim = dataSet.shape  
    if dim != 2:
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']  
    if k > len(mark):
        print("Sorry! Your k is too large!")
        return 1

    # draw all samples  
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])  
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])  

    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']  
    # draw the centroids  
    for i in range(k):  
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize = 12)  

    plt.show()
III. Test Code
from numpy import *
import matplotlib.pyplot as plt

## step 1: load data
print("step 1: load data...")
dataSet = []
with open('E:/Python/Machine Learning in Action/testSet.txt') as fileIn:
    for line in fileIn.readlines():
        lineArr = line.strip().split('\t')
        dataSet.append([float(lineArr[0]), float(lineArr[1])])

## step 2: clustering...  
print "step 2: clustering..."  
dataSet = mat(dataSet)  
k = 4  
centroids, clusterAssment = kmeans(dataSet, k)  

## step 3: show the result  
print "step 3: show the result..."  
showCluster(dataSet, k, centroids, clusterAssment)  
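
If testSet.txt from Machine Learning in Action is not at hand, a few lines of NumPy can generate a comparable 2-D dataset (the blob centers below are arbitrary, for illustration only):

from numpy import *

# four Gaussian blobs, 25 points each, in place of testSet.txt
random.seed(0)
centers = array([[3.0, 3.0], [-3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
dataSet = mat(vstack([c + random.randn(25, 2) for c in centers]))
centroids, clusterAssment = kmeans(dataSet, 4)
showCluster(dataSet, 4, centroids, clusterAssment)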
Author: DianaCody's Shell
Original post: KNN算法-python实现. Thanks to the original author for sharing.
