# KNN Classification and Regression

KNN stands for K Nearest Neighbors. Although it shares the letter K with K-means, the two serve different tasks. K-means takes a set of unlabeled data and, using the notion of distance, partitions it into K clusters; it can be applied directly to clustering, analyzing grouping tendencies, compressing information, and so on. KNN instead works from existing data: for a new sample, it finds the K existing samples closest to it and uses those neighbors to predict the new sample's class or value.
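
The neighbor-voting idea can be sketched in a few lines of plain NumPy. This is a hypothetical from-scratch illustration (the function name `knn_predict` and the sample points are made up for this sketch), not what sklearn does internally:

```python
import numpy as np

def knn_predict(X, y, x_new, k=3):
    # Euclidean distance from x_new to every known sample
    dists = np.linalg.norm(X - x_new, axis=1)
    # indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # majority vote over the neighbors' labels
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy (height, waist) samples with labels -1, 0, 1
X = np.array([[171, 110], [157, 90], [182, 75], [199, 68], [179, 67]])
y = np.array([-1, -1, 0, 1, 1])
print(knn_predict(X, y, np.array([180, 70]), k=3))   # prints 1
```

The three samples nearest to (180, 70) carry labels 1, 0, 1, so the vote yields 1.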

KNN can be used to predict a class. Suppose, for example, that a data set is stored in height_waist2.csv, where each row holds a height, a waist measurement, and a label (the code below treats -1 as overweight, 0 as normal weight, and 1 as underweight):

``````
171,110,-1
157,90,-1
164,115,-1
182,75,0
160,103,-1
199,68,1
152,103,-1
179,67,1
164,83,0
...
``````

sklearn provides `sklearn.neighbors.KNeighborsClassifier`; let's see how to use it:

``````
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the data: each row is height, waist, label
data = np.loadtxt('height_waist2.csv', delimiter = ',')

height_waist = data[:,0:2]
label = data[:,2]

# Split into training and test data
height_waist_training, height_waist_test, lb_training, lb_test = train_test_split(
    height_waist, label, stratify = label, random_state = 1
)

height = height_waist_training[:,0]
waist = height_waist_training[:,1]

normal_weight = lb_training == 0
overweight = lb_training == -1
rundown_weight = lb_training == 1

plt.xlabel('height')
plt.ylabel('waist')
plt.gca().set_aspect(1)

# Plot the training data
plt.scatter(height_waist_training[rundown_weight, 0], height_waist_training[rundown_weight, 1], marker = 'x')
plt.scatter(height_waist_training[normal_weight, 0], height_waist_training[normal_weight, 1], marker = 'o')
plt.scatter(height_waist_training[overweight, 0], height_waist_training[overweight, 1], marker = '^')

# KNN classification prediction
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(height_waist_training, lb_training)
predicted = knn.predict(height_waist_test)

normal_weight = predicted == 0
overweight = predicted == -1
rundown_weight = predicted == 1

# Plot the test data, marked by predicted class
plt.scatter(height_waist_test[rundown_weight, 0], height_waist_test[rundown_weight, 1], marker = 'x', c = 'red')
plt.scatter(height_waist_test[normal_weight, 0], height_waist_test[normal_weight, 1], marker = 'o', c = 'red')
plt.scatter(height_waist_test[overweight, 0], height_waist_test[overweight, 1], marker = '^', c = 'red')

# Score on the test data
plt.text(150, 118,
    'score: ' + str(knn.score(height_waist_test, lb_test)))

plt.show()
``````
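
The score above depends on the choice of `n_neighbors`. One common way to compare K values is cross-validation; the sketch below uses made-up synthetic clusters in place of height_waist2.csv (which isn't included here), so the numbers are illustrative only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in for the CSV: two (height, waist) clusters
X = np.vstack([rng.normal([160, 100], 5, (50, 2)),
               rng.normal([185, 70], 5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# 5-fold cross-validation accuracy for several K values
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())
```

On well-separated clusters like these, every K scores near 1.0; on real data the curve usually peaks at some intermediate K.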

The KNN idea can also be used for regression. In that case, nearest neighbors are found after dropping one dimension of the original data. For example, if the original data are three-dimensional points of the form (x, y, z) and a new sample (xn, yn) arrives, KNN can predict its z value by computing the distance between each original (x, y) and (xn, yn), finding the K neighbors with the smallest distances, and taking the mean of their z values.
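
A minimal from-scratch sketch of this averaging scheme (the helper name `knn_regress` and the sample points are hypothetical) could look like:

```python
import numpy as np

def knn_regress(xy, z, xy_new, k=3):
    # distances are measured in the (x, y) plane only
    dists = np.linalg.norm(xy - xy_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # the prediction is the mean z of the k nearest neighbors
    return z[nearest].mean()

xy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]])
z = 2 * xy[:, 0] + xy[:, 1] + 10   # points on the plane z = 2x + y + 10
print(knn_regress(xy, z, np.array([1.0, 1.0]), k=3))   # prints 13.0
```

The three points nearest to (1, 1) have z values 13, 10, and 16, so the prediction is their mean, 13.0.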

sklearn provides `sklearn.neighbors.KNeighborsRegressor`; let's see how to use it:

``````
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def points(start, end, step, noise, f):
    n = (end - start) // step
    x = np.arange(start, end, step) + np.random.rand(n) * noise
    y = np.arange(start, end, step) + np.random.rand(n) * noise
    z = f(x, y) + np.random.rand(n) * noise
    return np.dstack((x, y, z))[0]

# The plane function used to generate the data
def f(x, y):
    return 2 * x + y + 10

# Data source
data = points(0, 300, 1, 200, f)

xy = data[:,0:2]   # list of [x, y] pairs
z = data[:,2]      # list of z values

# Split into training and test data
xy_training, xy_test, z_training, z_test = train_test_split(
    xy, z, random_state = 1
)

ax = plt.axes(projection='3d')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
ax.set_box_aspect((1, 1, 1))

# Plot the training data
ax.scatter(xy_training[:,0], xy_training[:,1], z_training)

# KNN regression
knn = KNeighborsRegressor(n_neighbors = 5)
knn.fit(xy_training, z_training)
predicted = knn.predict(xy_test)

# Plot the test data with predicted z values
ax.scatter(xy_test[:,0], xy_test[:,1], predicted)

# Score on the test data
ax.text(150, 118, 2000,
    'score: ' + str(knn.score(xy_test, z_test)))

plt.show()
``````