轉載請附上連結http://blog.csdn.net/iemyxie/article/details/39520537
分類回歸樹算法:CART(Classification And Regression Tree)算法采用一種二分遞歸分割的技術,将目前的樣本集分為兩個子樣本集,使得生成的的每個非葉子節點都有兩個分支。是以,CART算法生成的決策樹是結構簡潔的二叉樹。
分類樹兩個基本思想:第一個是将訓練樣本進行遞歸地劃分自變量空間進行建樹的想法,第二個想法是用驗證資料進行剪枝。
CART 與 C4.5 的不同之處是節點在分裂時使用 GINI 指數。 GINI 名額主要是度量資料劃分或訓練資料集 D 的不純度為主,系數值的屬性作為測試屬性, GINI 值越小,表明樣本的純淨度越高(即該樣本隻屬于同一類的機率越高)。選擇該屬性産生最小的 GINI 名額的子集作為它的分裂子集。
算法步驟:
CART_classification(DataSet,featureList, alpha,):
建立根節點R
如果目前DataSet中的資料的類别相同,則标記R的類别标記為該類
如果決策樹高度大于alpha,則不再分解,标記R的類别classify(DataSet)
遞歸情況:
标記R的類别classify(DataSet)
從featureList中選擇屬性F(選擇Gini(DataSet,F)最小的屬性劃分,連續屬性參考C4.5的離散化過程(以Gini最小作為劃分标準))
根據F,将DataSet做二進制劃分DS_L和DS_R:
如果DS_L或DS_R為空,則不再分解
如果DS_L和DS_R都不為空,節點
C_L= CART_classification(DS_L,featureList, alpha);
C_R= CART_classification(DS_RfeatureList, alpha)
将節點C_L和C_R添加為R的左右子節點
使用SQL實作核心代碼:
rr:while (1=1) do
set @weather = (select id from weather where class = 0 limit 0,1);
set @feature =(select parent from finalgini where statetemp=1 limit 0,1);
if (@weather is null ) then
leave rr;
else if(@feature is null) then
update finalgini set statetemp = state;
end if;
end if;
if (@weather is not null) then
b:begin
set current_gini = (select min(gini) from finalgini where statetemp=1);
set current_class = (select parent from finalgini where gini = current_gini);
drop table if exists aa;
create temporary table aa (namee varchar(100));
insert into aa select class from finalgini where parent=current_class;
insert into aa select class2 from finalgini where parent=current_class;
tt:while (1=1) do
set @x = (select namee from aa limit 0,1);
if (@x is not null) then
a0:begin
drop table if exists bb;
set @b=concat('create temporary table bb as \(select id from ', current_table,' where ',current_class,' regexp \'',@x,'\' and class = 0 \)');
prepare stmt2 from @b;
execute stmt2;
set @count = (select count(distinct play) from bb left join weather on bb.id = weather.id);
if (@count =1) then
a1:begin
update bb left join weather on bb.id=weather.id set class = current_num;
set current_num = current_num+1;
if (current_table ='cc') then
delete from cc where id in (select id from bb);
end if;
set @f=(select play from cc limit 0,1);
if (@f is null) then
set current_table='weather';
update finalgini set statetemp=state;
end if;
delete from aa where namee = @x;
end a1;
end if;
if (@count>1) then
set @id = (select count(id) from bb);
if(@id = 2) then
w:begin
update bb left join weather on bb.id=weather.id set class = current_num where play='yes';
set current_num = current_num+1;
update bb left join weather on bb.id=weather.id set class = current_num where play='no';
set current_num = current_num+1;
if (current_table ='cc') then
delete from cc where id in (select id from bb);
end if;
set @f=(select play from cc limit 0,1);
if (@f is null) then
set current_table='weather';
update finalgini set statetemp=state;
end if;
delete from aa where namee = @x;
end w;
end if;
if(@id > 2) then
drop table if exists cc;
create temporary table cc select * from weather inner join bb using(id);
set current_table = 'cc';
leave tt;
end if;
end if;
if(@count=0) then
delete from aa where namee = @x;
end if;
end a0;
else
update finalgini set state=0 where parent=current_class;
leave tt;
end if;
end while;
update finalgini set statetemp=0 where parent=current_class;
end b;
end if;
end while;
end |
delimiter ;
程式中表的解釋:
• 表 2 classgini 各 屬性不同分類集合的 gini 值 • 表 3 finalgini 存放各個屬性的最優分類及對應 gini 值
轉載請附上連結http://blog.csdn.net/iemyxie/article/details/39520537