天天看點

資料挖掘算法學習(六)CART算法

轉載請附上連結http://blog.csdn.net/iemyxie/article/details/39520537

分類回歸樹算法:CART(Classification And Regression Tree)算法采用一種二分遞歸分割的技術,将目前的樣本集分為兩個子樣本集,使得生成的的每個非葉子節點都有兩個分支。是以,CART算法生成的決策樹是結構簡潔的二叉樹。

分類樹兩個基本思想:第一個是将訓練樣本進行遞歸地劃分自變量空間進行建樹的想法,第二個想法是用驗證資料進行剪枝。

CART 與 C4.5 的不同之處是節點在分裂時使用 GINI 指數。 GINI 名額主要是度量資料劃分或訓練資料集 D 的不純度為主,系數值的屬性作為測試屬性, GINI 值越小,表明樣本的純淨度越高(即該樣本隻屬于同一類的機率越高)。選擇該屬性産生最小的 GINI 名額的子集作為它的分裂子集。

算法步驟:

CART_classification(DataSet,featureList, alpha,):

建立根節點R

如果目前DataSet中的資料的類别相同,則标記R的類别标記為該類

如果決策樹高度大于alpha,則不再分解,标記R的類别classify(DataSet)

遞歸情況:

标記R的類别classify(DataSet)

從featureList中選擇屬性F(選擇Gini(DataSet,F)最小的屬性劃分,連續屬性參考C4.5的離散化過程(以Gini最小作為劃分标準))

根據F,将DataSet做二進制劃分DS_L和DS_R:

如果DS_L或DS_R為空,則不再分解

如果DS_L和DS_R都不為空,節點

    C_L= CART_classification(DS_L,featureList, alpha);

    C_R= CART_classification(DS_RfeatureList, alpha)

将節點C_L和C_R添加為R的左右子節點

使用SQL實作核心代碼:

rr:while (1=1) do
	set @weather = (select id from weather where class = 0 limit 0,1);
	set @feature =(select parent from finalgini where statetemp=1 limit 0,1);
	if (@weather is null ) then
		leave rr;
	else if(@feature is null) then
		update finalgini set statetemp = state; 
	end if;
	end if;
	if (@weather is not null) then
		b:begin
			set current_gini = (select min(gini) from finalgini where statetemp=1);
			set current_class = (select parent from finalgini where gini = current_gini);
			drop table if exists aa;
			create temporary table aa (namee varchar(100));
			insert into aa select class from finalgini where parent=current_class;
			insert into aa select class2 from finalgini where parent=current_class;
			tt:while (1=1) do
				set @x = (select namee from aa limit 0,1);
				if (@x is not null) then
					a0:begin
						drop table if exists bb;
						set @b=concat('create temporary table bb as \(select id from ', current_table,' where ',current_class,' regexp \'',@x,'\' and class = 0 \)');
						prepare stmt2 from @b;
						execute stmt2;
						set @count = (select count(distinct play) from bb left join weather on bb.id = weather.id); 
						if (@count =1) then
							a1:begin
								update bb left join weather on bb.id=weather.id set class = current_num;
								set current_num = current_num+1;
								if (current_table ='cc') then
									delete from cc where id in (select id from bb);
								end if;
								set @f=(select play from cc limit 0,1);
								if (@f is null) then
									set current_table='weather';
									update finalgini set statetemp=state; 
								end if;
							delete from aa where namee = @x;
							end a1;
							end if;
						if (@count>1) then
								set @id = (select count(id) from bb); 
								if(@id = 2) then
									w:begin
									update bb left join weather on bb.id=weather.id set class = current_num where play='yes';
									set current_num = current_num+1;
									update bb left join weather on bb.id=weather.id set class = current_num where play='no';
									set current_num = current_num+1;
									if (current_table ='cc') then
										delete from cc where id in (select id from bb);
									end if;
									set @f=(select play from cc limit 0,1);
									if (@f is null) then
										set current_table='weather';
										update finalgini set statetemp=state; 	
									end if;
									delete from aa where namee = @x;
									end w;
									end if;
								if(@id > 2) then 
									drop table if exists cc;
									create temporary table cc select * from weather inner join bb using(id);
									set current_table = 'cc';
									leave tt;
								end if;
							end if;
						if(@count=0) then
								delete from aa where namee = @x; 	
							end if;
				end a0;
				else 
					update finalgini set state=0 where parent=current_class;
					leave tt;
				end if;
			end while;	
			update finalgini set statetemp=0 where parent=current_class; 
	 end b;
end if;
end while;
end |
delimiter ;  
           

程式中表的解釋:

• 表 2 classgini 各 屬性不同分類集合的 gini 值 • 表 3 finalgini 存放各個屬性的最優分類及對應 gini 值

轉載請附上連結http://blog.csdn.net/iemyxie/article/details/39520537