Introduction
When getting started with machine learning, one indispensable tool is xgboost. Here we use xgboost's most basic features to complete a Kaggle competition, Boston Housing, in fewer than 40 lines of code, which is enough to show how powerful xgboost is!
The Competition
Predict house prices from the given data attributes.
Competition page: https://www.kaggle.com/c/boston-housing#description
Data
Since this is an introductory walkthrough, we do not attempt any further optimization here; we simply take a quick look at the data (open train.csv):
(screenshot of the first rows of train.csv omitted)
- The training set has 15 columns; the first is ID and the last is medv (the value to predict), so these two columns are dropped before training.
- 70% of the training data is used for training and 30% is held out for evaluation.
Open the test set (test.csv):
- The prediction target medv is missing.
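The 70/30 split described above is done with scikit-learn's train_test_split; here is a minimal sketch on synthetic stand-in data (the real train.csv columns are not reproduced):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for the 13 feature columns left after dropping ID and medv
X = np.random.rand(100, 13)
y = np.random.rand(100)

# test_size=0.3 holds out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print(X_train.shape, X_test.shape)  # (70, 13) (30, 13)
```

Fixing random_state makes the split reproducible across runs, which is why the full script below also passes random_state=123.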
Code
The code is very simple, but it assumes some familiarity with the pandas and numpy libraries; the relevant methods can be looked up directly in their documentation.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/9/10 22:51
# @Author : likewind
# @mail : [email protected]
# @File : BostonHousing.py
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
#load dataset
dataset_folder = r'E:/kaggle_race/BostonHousing/'
dataset_train = dataset_folder + 'train.csv'
dataset_test = dataset_folder + 'test.csv'
data_train = pd.read_csv(dataset_train)
data_test = pd.read_csv(dataset_test)
#drop irrelevant properties
X = data_train.drop(['ID', 'medv'], axis=1)
#medv is train label
y = data_train.medv
#split the training data into 70% train / 30% validation subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
#use a gradient-boosted regressor to predict medv
#(the original 'reg:linear' objective is deprecated in recent xgboost versions;
# 'reg:squarederror' is the equivalent replacement)
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=10, alpha=10, n_estimators=500, reg_lambda=2)
#training the regressor
xg_reg.fit(X_train, y_train)
#use the trained regressor to predict test_data
preds = xg_reg.predict(X_test)
#compute RMSE on the held-out validation data
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("Validation RMSE: %f" % rmse)
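Note that RMSE is simply the square root of the mean squared error, so the sklearn call used above is equivalent to computing it by hand (illustrated with made-up values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([24.0, 21.6, 34.7])
y_pred = np.array([25.0, 20.0, 33.0])

# sklearn's helper and the direct formula give the same result
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_sklearn)
```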
#use the trained regressor to predict test dataset.
x_test = data_test.drop(['ID'], axis=1)
predictions = xg_reg.predict(x_test)
ID = (data_test.ID).astype(int)
result = np.c_[ID, predictions]
#output results
np.savetxt(dataset_folder + 'xgb_submission.csv', result, fmt="%d,%.4f" ,header='ID,medv', delimiter=',', comments='')
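An alternative to the np.c_/np.savetxt formatting above is to build the submission with pandas; this is a sketch with hypothetical IDs and predictions (writing to a temporary directory for demonstration):

```python
import os
import tempfile
import pandas as pd

# hypothetical stand-ins for data_test.ID and the model's predictions
ids = [1, 2, 4]
preds = [23.5123, 31.0047, 18.2501]

# to_csv handles the header, the delimiter, and the float formatting in one call
submission = pd.DataFrame({'ID': ids, 'medv': preds})
path = os.path.join(tempfile.gettempdir(), 'xgb_submission.csv')
submission.to_csv(path, index=False, float_format='%.4f')

print(open(path).read().splitlines()[0])  # ID,medv
```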
Results
After submitting to Kaggle, the score was 3.84706,
which ranked around 19th at the time.
In completing this competition we did not consider how each feature affects the result, so there is still plenty of room for optimization!