
Watermelon Book Exercise 4.3: Decision Trees

The exercise asks us to implement a decision-tree algorithm that selects splits by information entropy, and to generate a decision tree for watermelon dataset 3.0.

Put plainly, a decision tree is just a recursively generated tree: every internal node splits on an attribute, and every leaf node gives a classification.


Looking at the algorithm flow, we clearly need a Node class to store the tree structure. It carries four attributes: the node's class (-1 for an internal node, 0 for the negative class, 1 for the positive class), the node's list of children, the node's optimal splitting attribute, and the attribute value that leads to this node.

class Node:
    def __init__(self,ntype):
        self.ntype = ntype
        self.children = []
        self.a = -1
        self.limit = "None"
    def __str__(self):
        return "node->a : %s node->limit : %s node->ntype : %s"%(self.a,self.limit,self.ntype)
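For instance, a freshly created internal node with one positive leaf attached prints like this (the class is reproduced here so the snippet runs on its own):

```python
class Node:
    # same structure as above: ntype -1 = internal, 0 = negative leaf, 1 = positive leaf
    def __init__(self, ntype):
        self.ntype = ntype
        self.children = []
        self.a = -1          # optimal splitting attribute (-1 = none chosen)
        self.limit = "None"  # attribute value that leads to this node

    def __str__(self):
        return "node->a : %s node->limit : %s node->ntype : %s" % (self.a, self.limit, self.ntype)

root = Node(-1)              # internal node, no split attribute chosen yet
leaf = Node(1)               # positive-class leaf
root.children.append(leaf)
print(root)                  # node->a : -1 node->limit : None node->ntype : -1
```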
           

Then comes the main flow of the algorithm.

def treeGenerate(x,y,a):
    node = Node(-1)
    if(sum(y)==len(y)):  #if every sample belongs to the positive class,
        node.ntype = 1   #the recursion ends here and we return a leaf
        return node
    if(sum(y)==0):       #likewise, check whether every sample is negative
        node.ntype = 0
        return node
    if(len(a)==0 or judge(x,a)==1): #judge checks whether all samples take the same value on every attribute
        node.ntype = maxy(y)        #if so, label the node with the majority class
        return node
    a_ = getMaxAv(x,y,a)            #pick the optimal splitting attribute a*
    d_ = getdict(x,a_)              #dict mapping each value of a* to its count in the dataset
    node.ntype = -1                 #node is an internal node
    node.a = a_
    for ai in set(x[a_]):
        if(d_[ai] == 0):            #empty subset for value ai: make a leaf labelled with
            child = Node(-1)        #the majority class (cannot actually trigger here,
            child.a = a_            #since ai is drawn from set(x[a_]); kept to mirror
            child.ntype = maxy(y)   #the book's pseudocode)
            child.limit = ai
            node.children.append(child)
            continue                #keep building the remaining branches
        else:
            xt,yt = getdata(x,y,a_,ai)        #take the sample subset
            xt.reset_index(drop = True,inplace = True) #rebuild the index
            at = [av for av in a if av != a_]
            child = treeGenerate(xt,yt,at)             #recurse
            child.limit = ai
            node.children.append(child)
    return node
           

Next come the individual helper functions.

judge compares, for each column, the column's first element against every element in that column, which amounts to checking whether all values in the column are identical.

def judge(x,a):
    for ar in a:
        if((x.loc[:,ar][0]==x.loc[:,ar]).all() == False):
            return 0
    return 1
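The same check can be written with pandas' nunique; a small sketch (judge_nunique is an illustrative name, not from the post), assuming x is the usual DataFrame of attributes:

```python
import pandas as pd

def judge_nunique(x, a):
    # 1 if every listed attribute column holds a single distinct value, else 0
    return int(all(x[ar].nunique() <= 1 for ar in a))

x = pd.DataFrame({"v1": ["青綠", "青綠"], "v2": ["蜷縮", "稍蜷"]})
print(judge_nunique(x, ["v1"]))        # 1
print(judge_nunique(x, ["v1", "v2"]))  # 0
```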
           

maxy is simple: it returns the class with the most samples (ties go to the positive class).

def maxy(y):
    if(sum(y)*2 >= len(y)):
        return 1
    else:
        return 0
           

getMaxAv finds the attribute with the largest information gain; the gain itself is computed by calGain.

def getMaxAv(x,y,a):
    Max = -1
    MaxAv = a[0]
    for av in a:
        temp = calGain(x,y,av)
        if(temp > Max):
            Max = temp
            MaxAv = av
    return MaxAv
           

Information gain is defined as the parent node's entropy minus the weighted sum of the child nodes' entropies. Entropy measures the purity of a sample set: the smaller the entropy, the purer the set. Splitting a node can never increase the weighted child entropy, and the gain quantifies how much it decreases, so we look for the split with the largest gain.
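In symbols, with p_k the fraction of class k in D and D^v the subset of D taking value v on attribute a:

```latex
\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k ,\qquad
\mathrm{Gain}(D,a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
```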

def calGain(x,y,av):
    d = getdict(x,av)
    p1 = getp1(x,y,av)
    
    ans = calEnt(sum(y)/len(y))
    for key,value in d.items():
        ans -= (value/len(y))*calEnt(p1[key]/value) # Gain(D,a) = Ent(D) - sum(|Dv|/|D| * Ent(Dv))
    return ans
           
#compute the (binary) information entropy
def calEnt(p1):  #p1 = sum(y)/len(y)
    p0 = 1 - p1
    if(p1 == 0 or p0 == 0):
        return 0
    return -1*(math.log(p1,2)*p1+math.log(p0,2)*p0)
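A quick sanity check of calEnt (reproduced here so the check runs standalone): entropy peaks at 1 bit for a 50/50 split and drops to 0 for a pure set:

```python
import math

def calEnt(p1):
    # binary entropy of a positive-class fraction p1
    p0 = 1 - p1
    if p1 == 0 or p0 == 0:
        return 0
    return -(math.log(p1, 2) * p1 + math.log(p0, 2) * p0)

print(calEnt(0.5))   # 1.0  (maximally impure)
print(calEnt(1.0))   # 0    (pure set)
```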

#helper: count occurrences of each value of attribute av
def getdict(x,av):
    d = dict.fromkeys(set(x[av]),0)
    arr = np.array(x[av])
    for i in range(len(arr)):
        d[arr[i]] += 1
    return d
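getdict is effectively a value count, so collections.Counter produces the same dictionary in one line:

```python
from collections import Counter
import pandas as pd

x = pd.DataFrame({"v6": ["硬滑", "軟粘", "硬滑"]})
d = dict(Counter(x["v6"]))   # same result as getdict(x, "v6")
print(d)                     # {'硬滑': 2, '軟粘': 1}
```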

#helper: positive-sample count for each value of av (the numerator of each branch's p1)
def getp1(x,y,av):
    p1 = dict.fromkeys(set(x[av]),0)
    arr = np.array(x[av])
    for i in range(len(arr)):
        if(y[i] == 1):
            p1[arr[i]] += 1
    return p1
           

Finally there is getdata, which is straightforward: it extracts the sample subset used in the recursive call.

def getdata(x,y,av,ai):
    xt = copy.deepcopy(x)
    yt = [] 
    for i in range(x.shape[0]):
        if(x[av][i] != ai):                #drop samples whose value on a* is not ai
            xt.drop(i,axis = 0,inplace = True)
        if(x[av][i] == ai):
            yt.append(y[i])
    yt = pd.Series(yt,name = "target",dtype = "int64")
    return xt,yt
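The row-by-row drop can also be done with a single boolean mask; a sketch (getdata_mask is an illustrative name), assuming the same x/y layout as in the post:

```python
import pandas as pd

def getdata_mask(x, y, av, ai):
    # keep only the rows whose value on attribute av equals ai
    mask = x[av] == ai
    xt = x[mask].reset_index(drop=True)
    yt = pd.Series(list(y[mask]), name="target", dtype="int64")
    return xt, yt

x = pd.DataFrame({"v1": ["青綠", "烏黑", "青綠"], "v2": ["蜷縮", "蜷縮", "硬挺"]})
y = pd.Series([1, 1, 0], name="target")
xt, yt = getdata_mask(x, y, "v1", "青綠")
print(list(xt["v2"]), list(yt))   # ['蜷縮', '硬挺'] [1, 0]
```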
           

Last of all, the imports and the main function.

import numpy as np
import pandas as pd
import math
import copy


def display(node):              #print the tree (depth-first)
    print(node)
    if(len(node.children) > 0):
        for i in range(len(node.children)):
            display(node.children[i])
    return 0

def main():
    data = pd.read_csv("edata.txt",sep = ',')
    y = data.target
    x = data.drop(['target'],axis = 1)
    a = np.array(x.columns)
    #print(x)
    #print(y)
    node = treeGenerate(x,y,a)
    display(node)
if __name__ == "__main__":
    main()
           
The printout is a depth-first traversal, so whenever it meets an internal node (ntype -1) it keeps printing downward.
node.a == -1 means the node has not chosen a splitting attribute, i.e. it is a leaf.

Final result:
node->a : v4 node->limit : None node->ntype : -1
node->a : -1 node->limit : 模糊 node->ntype : 0
node->a : v6 node->limit : 稍糊 node->ntype : -1
node->a : -1 node->limit : 硬滑 node->ntype : 0
node->a : -1 node->limit : 軟粘 node->ntype : 1
node->a : v2 node->limit : 清晰 node->ntype : -1
node->a : v1 node->limit : 稍蜷 node->ntype : -1
node->a : -1 node->limit : 青綠 node->ntype : 1
node->a : v6 node->limit : 烏黑 node->ntype : -1
node->a : -1 node->limit : 軟粘 node->ntype : 0
node->a : -1 node->limit : 硬滑 node->ntype : 1
node->a : -1 node->limit : 蜷縮 node->ntype : 1
node->a : -1 node->limit : 硬挺 node->ntype : 0
           

Note that this decision tree does not handle continuous attributes; when the data contains continuous attributes the code needs a few changes, which are also simple.
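Concretely, the changes implement the book's bi-partition rule: for a continuous attribute a with n sorted values, the candidate thresholds T_a are the midpoints of adjacent values, and the gain maximizes over them:

```latex
T_a = \Bigl\{ \tfrac{a^i + a^{i+1}}{2} \;\Bigm|\; 1 \le i \le n-1 \Bigr\},\qquad
\mathrm{Gain}(D,a) = \max_{t \in T_a}\; \mathrm{Ent}(D) -
\sum_{\lambda \in \{-,+\}} \frac{|D_t^{\lambda}|}{|D|}\,\mathrm{Ent}(D_t^{\lambda})
```

getTb generates T_a, and divideByT evaluates a single threshold t.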

def getTb(x,av):
    Ta = []
    Tb = []
    for i in range(x.shape[0]):
        Ta.append(x[av][i])
    list.sort(Ta)
    
    for i in range(len(Ta)-1):
        Tb.append((Ta[i] + Ta[i+1])/2)
    return Tb

def divideByT(x,y,av,tb):
    t = f = 0
    pt1 = pf1 = 0
    for i in range(x.shape[0]):
        if(x[av][i] >= tb):
            t += 1
            if(y[i] == 1):
                pt1 += 1
        else:
            f += 1
            if(y[i] == 1):
                pf1 +=1
    return t,f,pt1,pf1

def calGain(x,y,av):
    if(x[av].dtype != "float64"):
        d = getdict(x,av)
        p1 = getp1(x,y,av)
    
        ans = calEnt(sum(y)/len(y))
        for key,value in d.items():
            ans -= (value/len(y))*calEnt(p1[key]/value)
        return ans,None
    else:
        Tb = getTb(x,av)
        Max = -1
        Maxt = Tb[0]
        for tb in Tb:
            ans = calEnt(sum(y)/len(y))
            t,f,pt1,pf1 = divideByT(x,y,av,tb)
            ans -= ((t/len(y))*calEnt(pt1/t) + (f/len(y))*calEnt(pf1/f))
            if ans > Max:
                Max = ans
                Maxt = tb
        return Max,Maxt  
  
def getdata(x,y,av,ai):
    xt = copy.deepcopy(x)
    yt = [] 
    for i in range(x.shape[0]):
        if(x[av][i] != ai):
            xt.drop(i,axis = 0,inplace = True)
        if(x[av][i] == ai):
            yt.append(y[i])
    yt = pd.Series(yt,name = "target",dtype = "int64")
    return xt,yt

def getcontinuousdata(x,y,av,ai,flag):
    xt = copy.deepcopy(x)
    xf = copy.deepcopy(x)
    yt = [] 
    yf = []
    for i in range(x.shape[0]):
        if(x[av][i] > ai):
            xt.drop(i,axis = 0,inplace = True)
            yf.append(y[i])
        if(x[av][i] <= ai):
            yt.append(y[i])
            xf.drop(i,axis = 0,inplace = True)
    yt = pd.Series(yt,name = "target",dtype = "int64")
    yf = pd.Series(yf,name = "target",dtype = "int64")
    if(flag == 0):
        return xt,yt
    if(flag == 1):
        return xf,yf

def treeGenerate(x,y,a):
    node = Node(-1)
    if(sum(y)==len(y)):
        node.ntype = 1
        return node
    if(sum(y)==0):
        node.ntype = 0
        return node
    if(len(a)==0 or judge(x,a)==1):
        node.ntype = maxy(y)
        return node
    a_,t_ = getMaxAv(x,y,a)         #getMaxAv must now return (attribute, threshold)
    d_ = getdict(x,a_)
    node.ntype = -1
    node.a = a_
    if(x[a_].dtype != "float64"):
        for ai in set(x[a_]):
            child = Node(-1)
            child.a = a_
            if(d_[ai] == 0):
                child.ntype = maxy(y)
                node.children.append(child)
                return node
            else:
                xt,yt = getdata(x,y,a_,ai)
                xt.reset_index(drop = True,inplace = True)
                at = [av for av in a if av != a_]
                child = treeGenerate(xt,yt,at)
                child.limit = ai
                node.children.append(child)
        return node
    else:
        t,f,pt1,pf1 = divideByT(x,y,a_,t_)
        if(t == 0 or f == 0):       #degenerate split: all samples fall on one side
            child = Node(-1)        #create the leaf child (it was missing here)
            child.ntype = maxy(y)
            node.children.append(child)
            return node
        else:
            for i in range(2):
                xt,yt = getcontinuousdata(x,y,a_,t_,i)
                xt.reset_index(drop = True,inplace = True)
                at = [av for av in a]          #continuous attributes may be reused in deeper splits
                child = treeGenerate(xt,yt,at)
                if(i == 0):
                    child.limit = "<=" + str(t_)
                else :
                    child.limit = ">" + str(t_)
                node.children.append(child)
        return node
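The listing above unpacks two values from getMaxAv, but the modified getMaxAv is not shown. A minimal sketch, assuming the continuous-aware calGain returns a (gain, threshold) pair; the stub calGain here exists only so the sketch runs on its own (the real one is defined above):

```python
def getMaxAv(x, y, a):
    # pick the attribute with the largest gain, along with its best
    # threshold (None for discrete attributes)
    Max = -1
    MaxAv = a[0]
    Maxt = None
    for av in a:
        gain, t = calGain(x, y, av)   # calGain now returns (gain, threshold)
        if gain > Max:
            Max, MaxAv, Maxt = gain, av, t
    return MaxAv, Maxt

# stub calGain with made-up gains, only for a standalone check
def calGain(x, y, av):
    fake = {"v1": (0.1, None), "v7": (0.3, 0.3815)}
    return fake[av]

print(getMaxAv(None, None, ["v1", "v7"]))   # ('v7', 0.3815)
```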
           
Result:
node->a : v4 node->limit : None node->ntype : -1
node->a : -1 node->limit : 模糊 node->ntype : 0
node->a : v7 node->limit : 清晰 node->ntype : -1
node->a : -1 node->limit : <=0.38149999999999995 node->ntype : 0
node->a : -1 node->limit : >0.38149999999999995 node->ntype : 1
node->a : v6 node->limit : 稍糊 node->ntype : -1
node->a : -1 node->limit : 軟粘 node->ntype : 1
node->a : -1 node->limit : 硬滑 node->ntype : 0
           
The dataset used here (edata.txt):
v1,v2,v3,v4,v5,v6,target
青綠,蜷縮,濁響,清晰,凹陷,硬滑,1
烏黑,蜷縮,沉悶,清晰,凹陷,硬滑,1
烏黑,蜷縮,濁響,清晰,凹陷,硬滑,1
青綠,蜷縮,沉悶,清晰,凹陷,硬滑,1
淺白,蜷縮,濁響,清晰,凹陷,硬滑,1
青綠,稍蜷,濁響,清晰,稍凹,軟粘,1
烏黑,稍蜷,濁響,稍糊,稍凹,軟粘,1
烏黑,稍蜷,濁響,清晰,稍凹,硬滑,1
烏黑,稍蜷,沉悶,稍糊,稍凹,硬滑,0
青綠,硬挺,清脆,清晰,平坦,軟粘,0
淺白,硬挺,清脆,模糊,平坦,硬滑,0
淺白,蜷縮,濁響,模糊,平坦,軟粘,0
青綠,稍蜷,濁響,稍糊,凹陷,硬滑,0
淺白,稍蜷,沉悶,稍糊,凹陷,硬滑,0
烏黑,稍蜷,濁響,清晰,稍凹,軟粘,0
淺白,蜷縮,濁響,模糊,平坦,硬滑,0
青綠,蜷縮,沉悶,稍糊,稍凹,硬滑,0

Watermelon dataset 3.0:
v1,v2,v3,v4,v5,v6,v7,v8,target
青綠,蜷縮,濁響,清晰,凹陷,硬滑,0.697,0.46,1
烏黑,蜷縮,沉悶,清晰,凹陷,硬滑,0.774,0.376,1
烏黑,蜷縮,濁響,清晰,凹陷,硬滑,0.634,0.264,1
青綠,蜷縮,沉悶,清晰,凹陷,硬滑,0.608,0.318,1
淺白,蜷縮,濁響,清晰,凹陷,硬滑,0.556,0.215,1
青綠,稍蜷,濁響,清晰,稍凹,軟粘,0.403,0.237,1
烏黑,稍蜷,濁響,稍糊,稍凹,軟粘,0.481,0.149,1
烏黑,稍蜷,濁響,清晰,稍凹,硬滑,0.437,0.211,1
烏黑,稍蜷,沉悶,稍糊,稍凹,硬滑,0.666,0.091,0
青綠,硬挺,清脆,清晰,平坦,軟粘,0.243,0.267,0
淺白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,0
淺白,蜷縮,濁響,模糊,平坦,軟粘,0.343,0.099,0
青綠,稍蜷,濁響,稍糊,凹陷,硬滑,0.639,0.161,0
淺白,稍蜷,沉悶,稍糊,凹陷,硬滑,0.657,0.198,0
烏黑,稍蜷,濁響,清晰,稍凹,軟粘,0.36,0.37,0
淺白,蜷縮,濁響,模糊,平坦,硬滑,0.593,0.042,0
青綠,蜷縮,沉悶,稍糊,稍凹,硬滑,0.719,0.103,0