1. Data loading and overview

Import the required libraries and modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

Read the data with pandas and use the info() function to get an overview of the dataset

file_name = 'data.csv'
data = pd.read_csv(file_name)
print('**** Overview of the dataset ****')
data.info()  # info() prints its summary itself and returns None

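[Image: Kobe career data analysis: classification with a random forest (1. Data loading and overview, 2. Feature visualization, 3. Data preprocessing, 4. Building the model with scikit-learn)]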

Print the first five rows of the data


Get the number of rows and columns of the dataset

Output: the dataset has 30,697 records and 25 features
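The code for these two steps is not shown in the text. A minimal sketch of what they might look like, using a small made-up DataFrame (`demo` is a hypothetical stand-in; on the real dataset the same calls would be made on `data`):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, only to make the sketch runnable
demo = pd.DataFrame({
    'loc_x': [0, -157, 101, 138, 0, -145],
    'loc_y': [0, 0, 135, 175, 0, -11],
    'shot_made_flag': [None, 0.0, 1.0, 0.0, 1.0, None],
})

# First five rows
print(demo.head())

# shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape
print('The dataset has {0} records and {1} features'.format(n_rows, n_cols))
```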

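The results above show that the shot_made_flag field has many missing values. This field is the label: 0 means the shot missed and 1 means the shot went in. Rows where shot_made_flag is NaN therefore need to be dropped.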

data=data[data['shot_made_flag'].notnull()]
data.info()
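To see how many labels are missing before filtering, the missing values can be counted per column. A small sketch on made-up data (`demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five labels are missing
demo = pd.DataFrame({
    'shot_id': [1, 2, 3, 4, 5],
    'shot_made_flag': [np.nan, 0.0, 1.0, np.nan, 1.0],
})

# Number of missing values per column
missing = demo.isnull().sum()
print(missing)

# Keep only the labelled rows; .copy() gives an independent frame,
# which avoids chained-assignment warnings when columns are added later
labelled = demo[demo['shot_made_flag'].notnull()].copy()
print(len(labelled))
```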

2. Feature visualization

Plot the shot locations relative to the basket, using (loc_x, loc_y) and (lat, lon)

# set the figure size
plt.figure(figsize=(12,12))
# draw the first subplot
plt.subplot(121)
plt.title('the location of the shot')
plt.xlabel('loc_x')
plt.ylabel('loc_y')
plt.scatter(data['loc_x'], data['loc_y'], color='g', alpha = 0.02)
# draw the second subplot
plt.subplot(122)
plt.title('the site of the shot')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.scatter(data['lon'], data['lat'], color='r', alpha = 0.02)


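3. Data preprocessing

The plots above show that Kobe's shot locations form a roughly semicircular pattern, so we construct two new fields, dist and angle, where dist = sqrt(loc_x^2 + loc_y^2) is the distance from the basket and angle = arctan(loc_y / loc_x) is the corresponding angle (set to pi/2 when loc_x = 0).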

data['dist'] = np.sqrt(data['loc_x']**2 + data['loc_y']**2)

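# angle = arctan(loc_y / loc_x); use pi/2 where loc_x == 0 to avoid dividing by zero
loc_x_zero = data['loc_x'] == 0
data['angle'] = 0.0
# .loc assigns in place; chained indexing such as data['angle'][mask] = ... may write to a copy
data.loc[~loc_x_zero, 'angle'] = np.arctan(data.loc[~loc_x_zero, 'loc_y'] / data.loc[~loc_x_zero, 'loc_x'])
data.loc[loc_x_zero, 'angle'] = np.pi / 2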
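A small worked example of the dist and angle features on three made-up points (the coordinates are hypothetical and chosen so the results are easy to check by hand):

```python
import numpy as np
import pandas as pd

# Hypothetical shot coordinates relative to the basket
demo = pd.DataFrame({'loc_x': [3, 0, -3], 'loc_y': [4, 10, 4]})

# Euclidean distance from the basket
demo['dist'] = np.sqrt(demo['loc_x']**2 + demo['loc_y']**2)

# Same convention as above: arctan(loc_y / loc_x), and pi/2 where loc_x == 0
zero = demo['loc_x'] == 0
demo['angle'] = np.pi / 2
demo.loc[~zero, 'angle'] = np.arctan(demo.loc[~zero, 'loc_y'] / demo.loc[~zero, 'loc_x'])
print(demo)
```

Because arctan only returns values between -pi/2 and pi/2, shots to the left of the basket (negative loc_x) get a negative angle here. np.arctan2(loc_y, loc_x) is a possible alternative that handles loc_x = 0 without a special case, though it produces a different angle range.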

Construct a new field, remaining_time
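The code for this step is not shown in the text. One plausible sketch, assuming remaining_time means the time left in the period in seconds and is built from the minutes_remaining and seconds_remaining columns (both of which are dropped later as redundant):

```python
import pandas as pd

# Hypothetical values for the two time columns
demo = pd.DataFrame({'minutes_remaining': [10, 0, 5],
                     'seconds_remaining': [27, 3, 0]})

# Assumption: remaining_time = time left in the period, in seconds
demo['remaining_time'] = demo['minutes_remaining'] * 60 + demo['seconds_remaining']
print(demo)
```

On the real data the same expression would be applied to `data` before the two source columns are dropped.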

Print the unique values of the action_type, combined_shot_type and shot_type fields, and the value counts of shot_type

print(data.action_type.unique())
print(data.combined_shot_type.unique())
print(data.shot_type.unique())
print(data.shot_type.value_counts())

Print the unique values of the season field

Output: array(['2000-01', '2001-02', '2002-03', '2003-04', '2004-05', '2005-06',
       '2006-07', '2007-08', '2008-09', '2009-10', '2010-11', '2011-12',
       '2012-13', '2013-14', '2014-15', '2015-16', '1996-97', '1997-98',
       '1998-99', '1999-00'], dtype=object)

Rebuild the season column as an integer, keeping only the two-digit part after the hyphen

data['season'] = data['season'].apply(lambda x: int(x.split('-')[1]) )
data['season'].unique()
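To make the effect of this conversion concrete, here is a short sketch applying the same lambda to a few of the season strings listed above:

```python
import pandas as pd

seasons = pd.Series(['1996-97', '1999-00', '2000-01', '2015-16'])

# Keep the part after the hyphen and parse it as an integer
season_num = seasons.apply(lambda x: int(x.split('-')[1]))
print(season_num.tolist())  # [97, 0, 1, 16]
```

The resulting integers are not in chronological order (97 comes before 0). That does not matter here because season is later treated as a categorical field and one-hot encoded.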

Visualize the relationship between shot_distance and dist

plt.figure(figsize=(5,5))
plt.scatter(data['dist'], data['shot_distance'], color='blue')
plt.title('dist and shot_distance')
plt.xlabel('dist')
plt.ylabel('shot_distance')


Drop the redundant fields

drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
         'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
         'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
# drop all listed columns in one call; the positional axis argument (drop(name, 1)) is not accepted by recent pandas
data = data.drop(columns=drops)

Convert the categorical fields to numeric (one-hot) columns

categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period', 'season']
for var in categorical_vars:
    # append one indicator column per category, then drop the original column
    data = pd.concat([data, pd.get_dummies(data[var], prefix=var)], axis=1)
    data = data.drop(columns=var)
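A toy illustration of what the one-hot encoding loop does to the columns. The DataFrame is made up for illustration; only two of the six fields are used to keep it short:

```python
import pandas as pd

# Hypothetical rows with two categorical fields
demo = pd.DataFrame({
    'shot_type': ['2PT Field Goal', '3PT Field Goal', '2PT Field Goal'],
    'period': [1, 4, 1],
})

for var in ['shot_type', 'period']:
    # one indicator column per distinct value, named <var>_<value>
    dummies = pd.get_dummies(demo[var], prefix=var)
    demo = pd.concat([demo.drop(columns=var), dummies], axis=1)

print(demo.columns.tolist())
```

Each original column is replaced by one indicator column per distinct value, so fields with many categories (such as action_type) add many columns to the training set.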

4. Building the model with scikit-learn

Build the training set

train_kobe = data.copy()
train_kobe = train_kobe.drop(axis=1, columns='shot_made_flag')
train_label = data['shot_made_flag']

Import the required libraries and modules

import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

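The machine learning steps are as follows:

We use a random forest, an ensemble algorithm, to classify whether Kobe made each shot. The code has two main goals:
Find the best number of trees for the random forest
Find the best tree depth, to guard against overfitting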

# find the best n_estimators for RandomForestClassifier
print('Finding best n_estimators for RandomForestClassifier...')
min_score = 100000
best_n = 0
scores_n = []
range_n = np.logspace(0,2,num=3).astype(int)
for n in range_n:
    print("the number of trees : {0}".format(n))
    t1 = time.time()    
    rfc_score = 0.
    rfc = RandomForestClassifier(n_estimators=n)
    for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
        pred = rfc.predict(train_kobe.iloc[test_k])
        rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
    scores_n.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_n = n
        
    t2 = time.time()
    print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2-t1))
print(best_n, min_score)
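The loop above passes the hard 0/1 output of predict() to log_loss. Since log_loss is defined on predicted probabilities, hard labels are treated as probabilities of exactly 0 or 1, and every misclassified shot receives a very large penalty. A small sketch with made-up numbers (the arrays are hypothetical, not results from this dataset) shows the difference. In scikit-learn, probabilities of this kind come from predict_proba(X)[:, 1].

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
proba = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical P(shot made)
hard = (proba >= 0.5).astype(float)       # the labels predict() would give

loss_proba = log_loss(y_true, proba)      # mean of -log(probability of the true class)
loss_hard = log_loss(y_true, hard)        # one wrong label dominates the average
print(loss_proba, loss_hard)
```

With hard labels the score depends almost entirely on the error rate, so the ranking of models is close to ranking by accuracy. Scoring the probabilities also rewards well-calibrated confidence.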

# find best max_depth for RandomForestClassifier
print('Finding best max_depth for RandomForestClassifier...')
min_score = 100000
best_m = 0
scores_m = []
range_m = np.logspace(0,2,num=3).astype(int)
for m in range_m:
    print("the max depth : {0}".format(m))
    t1 = time.time()   
    rfc_score = 0.
    rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
    for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        #rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/10
        pred = rfc.predict(train_kobe.iloc[test_k])
        rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
    scores_m.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_m = m
    
    t2 = time.time()
    print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2-t1))
print(best_m, min_score)
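As an alternative to the two hand-written loops, scikit-learn's GridSearchCV can search both parameters together. The following is only a sketch under stated assumptions: it runs on synthetic data from make_classification as a stand-in for train_kobe and train_label, and it uses a smaller grid than the loops above so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for train_kobe / train_label
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid; the article's loops try 1, 10 and 100 for each parameter
param_grid = {'n_estimators': [1, 10], 'max_depth': [1, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='neg_log_loss',   # log loss on predicted probabilities, negated so higher is better
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Unlike the sequential search above, which fixes the number of trees before tuning the depth, this evaluates every combination in the grid, at the cost of more model fits.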

The output is as follows:


The results show that the best number of trees is 100 and the best maximum tree depth is 10

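Plot the log loss (cross-entropy) of the random forest for different numbers of trees and different tree depths. The higher the log loss, the more uncertain the predictions are, that is, the lower the prediction accuracy.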
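The plotting code is not shown in the text. A minimal sketch of how the two curves might be drawn is given below. The score lists here are placeholder values only, so that the sketch runs on its own; in the article they are the scores_n and scores_m lists filled by the two loops.

```python
import matplotlib.pyplot as plt
import numpy as np

range_n = np.logspace(0, 2, num=3).astype(int)   # candidate numbers of trees
range_m = np.logspace(0, 2, num=3).astype(int)   # candidate maximum depths

# Placeholder values, NOT real results; replace with the lists from the loops
scores_n = [3.0, 2.0, 1.0]
scores_m = [3.0, 1.0, 2.0]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(range_n, scores_n)
axes[0].set_xscale('log')            # candidates are spaced on a log scale
axes[0].set_xlabel('number of trees')
axes[0].set_ylabel('log loss')

axes[1].plot(range_m, scores_m)
axes[1].set_xscale('log')
axes[1].set_xlabel('max depth')
axes[1].set_ylabel('log loss')

fig.tight_layout()
```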
