R-Modeling(step 4)

[I]Regression

OLSregression

Description	Function
simple linear regression	lm(Y~X1,data)
polynomial regression	lm(Y~X1+...+I(X^2),data)
multiple linear regression	lm(Y~X1+X2+...+Xk)
multiple linear regression with interaction terms	lm(Y~X1+X2+X1:X2)

selection the optimal regression model


anova(fit1,fit2)	nested model fit
AIC(fit1,fit2)	statistical fit
stepAIC(fit,direction=)	stepwise method:"forward""backward"
regsubsets()	all-subsets regression

test


plot()	plot
qqplot()	quantile comparison
durbinWatsonTest()	Durbin-Waston
crPlots()	component and residual
ncvTest()	score test of non-constant error variance
spreadLevelPlot()	dispersion level
outlierTest()	Bonferroni outlier
avPlots()	added variable
inluencePlot()	regression
scatterplot()	enhanced scatter plot
scatterplotMatrix()	enhanced scatter plot matrix
vif()	variance expansion factor

abnormal

abnormal observation

1.outlier:outlierTest()

2.high leverage point:hat.plot()

3.strong influence point:Cook's D

improvement measures

1.delete observation point

2.variable transformation

3.addition and deletion of variables

Generalized Linear

generalized linear

glm()

Distribution Family | Default Connection Function

binomial	(link = "logit")
gaussian	(link = "identity")
gamma	(link = "inverse")
inverse.gaussian	(link = "1/mu^2")
poisson	(link = "log")
quasi	(link = "identity",variance = "constant")
quasibinomial
quasipoisson

function with glm()

Function | Description

summary()	show details of the fitting model
coefficients()/coef()	list the parameters of the fitting model
confint()	give a confidence interval for the model paramenters
residuals()	list residual value of the fitting model
anova()	generate an analysis table of variance for two fitting model
	generate a diagnostic map of the evaluation fitting model
predict()	forecast new datasets with a fitting model
deviance()	the deviation of the fitting model
df.residual()	the residual freedom of the fitting model

logistic

> data(Affairs,package="ARE")
> summary(Affairs)
> table(Affair$affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <-1
> Affairs$ynaffair[Affairs$affairs == 0] <-0
> Affairs$naffair <- factor(Affairs$ynaffair),levels=c(0,1),labels=c("No","Yes"))
> table(Affairs$ynaffair)
> fit.full <-glm(ynaffair~gender + age + yearmarried +children + 
             religiousness + education + occupation + rating,family = binomial()
             data = Affairs)
> fit.reduced <-glm(ynaffair ~ age + yearmarried + religiousness +rating,
                      family = binomial(),data = Affairs)
> anova(fit.reduced,fit.full,test="Chisq")
> coef(fit.reduced)
> exp(coef(fit.reduced))
> testdata<-data.frame(rating=c(1,2,3,4,5),age=mean(Affair$age),
                                   yearsmarried=mean(Affairs$yearsmsrraied),
                                   religiousness=mean(Affairs$religiousness))
> teatdata
> testdata$prob <- predict(fit.reduced,newdata=testdata,type="response")
> testdata <- data.frame(rating = mean(Affair$rating),
                                    age=seq(17,57,10),
                                    yearsmarried=mean(Affair$yearsmarried),
                                    religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob<-predict(fit.reduced,newdata=testdata,type="response")
> testdata
> deviance(fit.reduced)/df.residual(fit.reduced)
> fit <- glm(ynaffair ~ age + yearmarried + religiousness + rating,
                 family = quasibinomial(),data=Affairs)
> pchisq(summary(fit.od)$dispersion * fit$df.residual,
             fit$df.residual,lower=F)

> data(breslow.dat,package="robust")
> name(breslow.dat)
> summary(breslow.dat[c(6,7,8,10)])
> opar<-par(no.readonly=TRUE)
> par(mfrow=c(1,2))
> attach(breakslow.dat)
> hist(sumY,breaks=20,xlab="Seizure Count",main="Distribution of Seizures")
> boxplot(sumY~Trt,xlab="Treatment",main="Group Comparisons")
> par(opar)
> fit<-glm(sumY~Base + Age +Trt,data=breslow.dat,family=poisson())
> summary(fit)
> coef(fit)
> exp(coef(fit))
> deviance(fit)/df.residual(fit)
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY,type="poisson")
> fit.od<-glm(sumY~Base + Age + Trt,data=breslow.dat,
       family=quassipossion())
> summary(fit.od)
> fit glm(sumY~Base + Age +Trt,data=breslow.dat,offset=log(time),family=poisson)

[II]Cluster

1.cluster analysis step

1.choose the right variable

2.scale data

3.looking for anomalies

4.calculated distance

5.selection clustering algorithm

6.obtain one or more clustering methods

7.determine the number of classes

8.get the ultimate clustering solution

9.visualization of results

10.interpretation class

11.validation results

2.calculated distance

> data(nurtrient,package="flexclust")
> head(nutrient,4)
> d<-dist(nutrient)
> as.martrix(d)[1:4,1:4]

3.hierarchical clustering analysis

> data(nutrient,package="flexclust")
> row.name(nutrient) <-tolower(row.names(nutrient))
> nutrient.scaled<-scale(nutrient)

> d<-dist(nutrient.scaled)

> fit.average <-hclust(d,method="average")
> plot(fit.average,hang=-1,cex=.8,main="Average Linkage Clustering")
> library(Nbcluster)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(nutrient,scaled,distance="euclidean",
                       min.nc=2,max.nc=15,method="average")
> table(nc$Best.n[1,])
> barplot(table(nc$Best,n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Cluster Chosen by 26 Criteria")  
> clusters<-cutree(fit.average,k=5)
> table(clusters)
> aggregate(nutrient,by=list(cluster=clusters),median)
> aggregate(as.data.frame(nutrient.scaled),bu=list(cluster=clusters),median)
> plot(fit.average,hang=-1,cex=.8,
         main="Average Linkage Cluster\n5 Cluster Solution")
> rect.hclust(fit.average,k=5)

4.Clustering analysis

K-means clustering

> wssplot<-function(data,nc=15,seed=1234)(
> head(wine)
> df<-scale(wine[-1])
> wssplot(df)
> library(NcClust)
> set.seed(1234)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> table(nc$Best.n[1,)
> barplot(table(nc$Best,n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> set.seed(1234)
> fit.km<-kmeans(df,3,nstart=25)
> fit.km$size
> fit.km$centers
> aggregate(wine[-1],by=list(cluster=fit.km$cluster),mean)

Division around the center point

> library(cluster)
> set.seed(1234)
> fit.pam<-pam(wine[-1],k=3,stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam,main="Bivariate Cluster Plot")
> ct.pam<-table(wine$Type,fit.pam$clustering)
> randIndex(ct.pam)

5.avoid non-existing classes

> library(fMultivar)
> set.seed(1234)
> df<-rnom2d(1000,rho=.5)
> df<-as.data.frame(df)
> plot(df,main="Bivariate Normal Distribution with rho=0.5")
> wssplot(df)
> library(NbClust)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmean")
> dev.new()
> barplot(table(nc$Best,n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> library(ggplot2)
> library(cluster)
> fit<-pam(df,k=2)
> df$clustering<-factor(fit$clustering)
> ggplot(data=df,aes(x=V1,y=V2,color=clustering,shape=clustering)) +
             geom_point() + ggtitle("Clustering of Bivariate Normal Data")
> plot(nc$All,index[,4],type="o",ylab="CCC",xlab="Number of clusters",col="blue")

[III]Classification

data preparation

> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> url<-paste(loc,ds,sep="")
> breast<-read.table(url,sep=",",header=FALSE,na.string="?")
> names(breast)<-c("ID","clumpThickness","sizeUniformity",
                              "shapeUniformity","maginalAdhesion",
                              "singleEpitheliaCellSize","bareNuclei",
                              "blandChromatin","normalNucleoli","mitosis","class")
> df<-breast[-1]
> df$class<-factor(df$class,levels=c(2,4),
                           labels=c("benign","malignant"))
> set.seed(1234)
> train<-sample(nrow(df),0.7*nrow(df))
> df.train<-df[train,]
> df.validate<-df[-train,]
> table(df.train$class)
> table(df.validate$class)

logistic regression

> fit.logit<-glm(class~.,data=df.train,family=binomial())
> summary(fit.logit)
> prob<-predict(fit.logit,df.validate,type="response")
> logit.perd<-factor(prob>.5,levels=c(FALSE,TRUE),
                             labels=c("benign","malignant"))
> logit.perf<-table(df.validate$class,logit.pred,
                            dnn=c("Actual","Predicted"))
> logit.perf

decision tree

classic decision tree

> library(repart)
> set.seed(1234)
> dtree<-repart(class~.,data=df.train,method="class",
                       parms=list(split="information"))
> dtree$cptable
> plotcp(dtree)
> dtree.pruned<-prune(dtree,cp=.0125)
> library(repart.plot)
> prp(dtree.pruned,type=2,extra=104,
        fallen.leaves=TRUE,main="Decesion Tree")
> dtree.pred<-predict(dtree.pruned,df.validate,type="class")
> dtree.perf<-table(df.validate$class,dtree.pred,
                             dnn=c("Actual","Predicted"))
> dtree.perf

conditional inference tree

> library(party)
> fit.ctree<-ctree(class~.,data=sf.train)
> plot(fit.ctree,main="Conditional Inference Tree")
> ctree.pred<-predict(fit.ctree,df.validate,type="response")
> ctree.perf<-table(df.validate$class,ctree.pred
                            dnn=c("Actual","Predicted"))
> stree.perf

random forest

> library(randomForest)
> set.seed(1234)
> fit.forest<-randomForest(class~.,data=df.train,na.action=na.roughfix,importance=TRUE)
> fit.forest
> importance(fit.forest,type=2)
> forest.pred<-predict(fit.forest,df.validate)
> forest.perf<-table(df.validate$class,forest.pred,
                              dnn=c("Actual","Predicted"))
> forest.perf

support vector machines

> library(e1071)
> set.seed(1234)
> fit.svm<-svm(class~.,data=df.train)
> fit.svm
> svm,pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                           svm,pred,dnn=c("Actual","Predicted"))
> svm.perf

svm model with rbf core

> set.seed(1234)
> tuned<-tune.svm(class~.,data=df.train,gamma=10^(-6:1),cost=10^(-10:10))
> turned
> fit.svm<-svm(class~.,data=df.train,gamma=.01,cost=1)
> svm.pred<-predict(fit.svm,na.omit(df,validate))
> svm.perf<-table(na.omit(df.validate)$class,
                            svm.pred,dnn=c("Actual","Predicted"))
> svm.perf

choose the best solution for forecasting

> performance<-function(table,n=2){
        if(!all(dim(table)==c(2,2)))
              stop("Must be a 2 × 2 table")
   tn=table[1,1]
   fp=table[1,2]
   fn=table[2,1]
   tp=table[2,2]
   sensitivity=tp/(tp+fn)
   specificity=tn/(tn+fp)
   ppp=tp/(tp+fp)
   npp=tn/(tn+fn)
   hitrate=(tp+tn)/(tp+tn+fp+fn)
   result<-paste("Sensitivity=",round(ppp,n),
           "\nSpecificity=",round(specificity,n),
           "\nPositive Predictive Value=",round(ppp,n),
           "\nNegative Predictive Value=",round(npp,n),
           "\nAccuracy=",round(hitrate,n),"\n",seq="")
   cat(result)
  }

data mining with the rattle package

> install.packages("rattle")
> rattle()
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"pima-indians-diabetes/pima-indians-diabetes.data"
> url <-paste(loc,ds,sep="")
> diabetes<-read.table(url,seq=",",header=FALSE)
> names(diabetes)<-c("npregant","plasma","bp","triceps",
                                  "insulin","bmi","pedigree","age","class")
> diabetes$class<-factor(diabetes$class,levels=c(0,1),
                                      labels=c("normal","diabetic"))
> library(rattle)
> rattle()
> cv<-matrix(c(145,50,8,27),nrow=2)
> performance(as.table(cv))

[IV]Time Series

	Package
ts()	stats	generate timing objects
	graphics	draw a line graph of the time series
start()		return the start time of the time series
end()		return the end time of the time series
frequency()		return the number of time points of the time series
windows()		subset a sequence object
ma()	forecast	fit a simple moving average model
stl()		use LOESS smooth to decompose timing into seasonal terms,trend terms and random terms
monthplot()		draw seasonal terms in time series
seasonplot() forecast	generate seasonal plot
HoltWinters()	ststs	fit exponential smoothing model
forecast()		predicting the future value of the timing
accuracy()		return time-of-fit goodness variable
ets()		fit exponential smoothing model and automatically select the optimal model
lag()		return the timing after the specified hysteresis is taken
Acf()		estimated autocorrelation function
Pacf()		estimated partial autocorrelation function
diff()	base	return the sequence after the lag term or difference
ndiffs()		find the optimal difference count to remove the trend terms in the sequence
adf.test()	tseries	perform an ADF test on the sequence to determine if it is stable
arima		fit the ARIMA model
Box.test()		perform a Ljung-Box test to determine if the model's residuals are independent
bds.test()	teeries	perform a BDS test to determine whether random variables in the sequence are subject to independent and identical distribution
auto.arima		automatically select the ARIMA model

END!

R-Modeling(step 4)

[I]Regression

OLSregression

test

abnormal

Generalized Linear

generalized linear

logistic

[II]Cluster

1.cluster analysis step

2.calculated distance

3.hierarchical clustering analysis

4.Clustering analysis

5.avoid non-existing classes

[III]Classification

data preparation

logistic regression

decision tree

random forest

support vector machines

choose the best solution for forecasting

data mining with the rattle package

[IV]Time Series

繼續閱讀

如果你想要學習深度學習，但是不知道從何入手，那麼《每天五分鐘深度學習》專欄一定是你不容錯過的學習資源。這個專欄包含了神經

tensorflow學習——keras進階API——序列模型Sequential

SVD原理和案例(奇異值分解)

連續兩年入圍全球Gartner ABI魔力象限，Quick BI在商業智能領域究竟有何魔力？1、互動式的分析和可視化2、建構資料故事3、釘釘內建4、增強分析Quick BI

技術解密｜阿裡雲多媒體 AI 團隊是憑借什麼拿下 CVPR2021 5冠1亞的？頂級挑戰賽戰績顯赫四大挑戰的關鍵技術探索基于視訊了解技術打造多媒體 AI 雲産品

算法專家解讀 | 開放搜尋教育搜題能力和實踐

Keras使用分批疊代（fit_generate）的方式訓練資料

圖像分割UNet系列------UNet3+（UNet3plus）詳解

圖像分割UNet系列------UNet詳解

特征：什麼是特征和特征選擇？

Pytorch(二) Tensor Tensor的建立Tensor是什麼Tensor的建立

2023了，學習深度學習架構哪個比較好？

VGGNet------超經典神經網絡結構與PyTorch實作

tensorflow學習——（imdb資料集）文本分類first_2.py

Matlab深度學習-手寫體數字識别Matlab深度學習前言一、MNIST手寫體數字資料二、用到的深度學習架構-LeNet5三、代碼最後

K-近鄰算法以及圖像分類應用