天天看點

R-Modeling(step 4)

[I]Regression

OLSregression

Description Function
simple linear regression lm(Y~X1,data)
polynomial regression lm(Y~X1+...+I(X^2),data)
multiple linear regression lm(Y~X1+X2+...+Xk)
multiple linear regression with interaction terms lm(Y~X1+X2+X1:X2)

selection the optimal regression model

anova(fit1,fit2) nested model fit
AIC(fit1,fit2) statistical fit
stepAIC(fit,direction=) stepwise method:"forward""backward"
regsubsets() all-subsets regression

test

plot() plot
qqplot() quantile comparison
durbinWatsonTest() Durbin-Waston
crPlots() component and residual
ncvTest() score test of non-constant error variance
spreadLevelPlot() dispersion level
outlierTest() Bonferroni outlier
avPlots() added variable
inluencePlot() regression
scatterplot() enhanced scatter plot
scatterplotMatrix() enhanced scatter plot matrix
vif() variance expansion factor

abnormal

  • abnormal observation

1.outlier:outlierTest()

2.high leverage point:hat.plot()

3.strong influence point:Cook's D

  • improvement measures

1.delete observation point

2.variable transformation

3.addition and deletion of variables

Generalized Linear

generalized linear

  • glm()

    Distribution Family | Default Connection Function

binomial (link = "logit")
gaussian (link = "identity")
gamma (link = "inverse")
inverse.gaussian (link = "1/mu^2")
poisson (link = "log")
quasi (link = "identity",variance = "constant")
quasibinomial
quasipoisson
  • function with glm()

    Function | Description

summary() show details of the fitting model
coefficients()/coef() list the parameters of the fitting model
confint() give a confidence interval for the model paramenters
residuals() list residual value of the fitting model
anova() generate an analysis table of variance for two fitting model
generate a diagnostic map of the evaluation fitting model
predict() forecast new datasets with a fitting model
deviance() the deviation of the fitting model
df.residual() the residual freedom of the fitting model

logistic

> data(Affairs,package="ARE")
> summary(Affairs)
> table(Affair$affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <-1
> Affairs$ynaffair[Affairs$affairs == 0] <-0
> Affairs$naffair <- factor(Affairs$ynaffair),levels=c(0,1),labels=c("No","Yes"))
> table(Affairs$ynaffair)
> fit.full <-glm(ynaffair~gender + age + yearmarried +children + 
             religiousness + education + occupation + rating,family = binomial()
             data = Affairs)
> fit.reduced <-glm(ynaffair ~ age + yearmarried + religiousness +rating,
                      family = binomial(),data = Affairs)
> anova(fit.reduced,fit.full,test="Chisq")
> coef(fit.reduced)
> exp(coef(fit.reduced))
> testdata<-data.frame(rating=c(1,2,3,4,5),age=mean(Affair$age),
                                   yearsmarried=mean(Affairs$yearsmsrraied),
                                   religiousness=mean(Affairs$religiousness))
> teatdata
> testdata$prob <- predict(fit.reduced,newdata=testdata,type="response")
> testdata <- data.frame(rating = mean(Affair$rating),
                                    age=seq(17,57,10),
                                    yearsmarried=mean(Affair$yearsmarried),
                                    religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob<-predict(fit.reduced,newdata=testdata,type="response")
> testdata
> deviance(fit.reduced)/df.residual(fit.reduced)
> fit <- glm(ynaffair ~ age + yearmarried + religiousness + rating,
                 family = quasibinomial(),data=Affairs)
> pchisq(summary(fit.od)$dispersion * fit$df.residual,
             fit$df.residual,lower=F)           

> data(breslow.dat,package="robust")
> name(breslow.dat)
> summary(breslow.dat[c(6,7,8,10)])
> opar<-par(no.readonly=TRUE)
> par(mfrow=c(1,2))
> attach(breakslow.dat)
> hist(sumY,breaks=20,xlab="Seizure Count",main="Distribution of Seizures")
> boxplot(sumY~Trt,xlab="Treatment",main="Group Comparisons")
> par(opar)
> fit<-glm(sumY~Base + Age +Trt,data=breslow.dat,family=poisson())
> summary(fit)
> coef(fit)
> exp(coef(fit))
> deviance(fit)/df.residual(fit)
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY,type="poisson")
> fit.od<-glm(sumY~Base + Age + Trt,data=breslow.dat,
       family=quassipossion())
> summary(fit.od)
> fit glm(sumY~Base + Age +Trt,data=breslow.dat,offset=log(time),family=poisson)           

[II]Cluster

1.cluster analysis step

1.choose the right variable

2.scale data

3.looking for anomalies

4.calculated distance

5.selection clustering algorithm

6.obtain one or more clustering methods

7.determine the number of classes

8.get the ultimate clustering solution

9.visualization of results

10.interpretation class

11.validation results

2.calculated distance

> data(nurtrient,package="flexclust")
> head(nutrient,4)
> d<-dist(nutrient)
> as.martrix(d)[1:4,1:4]           

3.hierarchical clustering analysis

> data(nutrient,package="flexclust")
> row.name(nutrient) <-tolower(row.names(nutrient))
> nutrient.scaled<-scale(nutrient)

> d<-dist(nutrient.scaled)

> fit.average <-hclust(d,method="average")
> plot(fit.average,hang=-1,cex=.8,main="Average Linkage Clustering")
> library(Nbcluster)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(nutrient,scaled,distance="euclidean",
                       min.nc=2,max.nc=15,method="average")
> table(nc$Best.n[1,])
> barplot(table(nc$Best,n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Cluster Chosen by 26 Criteria")  
> clusters<-cutree(fit.average,k=5)
> table(clusters)
> aggregate(nutrient,by=list(cluster=clusters),median)
> aggregate(as.data.frame(nutrient.scaled),bu=list(cluster=clusters),median)
> plot(fit.average,hang=-1,cex=.8,
         main="Average Linkage Cluster\n5 Cluster Solution")
> rect.hclust(fit.average,k=5)           

4.Clustering analysis

  • K-means clustering
> wssplot<-function(data,nc=15,seed=1234)(
> head(wine)
> df<-scale(wine[-1])
> wssplot(df)
> library(NcClust)
> set.seed(1234)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> table(nc$Best.n[1,)
> barplot(table(nc$Best,n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> set.seed(1234)
> fit.km<-kmeans(df,3,nstart=25)
> fit.km$size
> fit.km$centers
> aggregate(wine[-1],by=list(cluster=fit.km$cluster),mean)           
  • Division around the center point
> library(cluster)
> set.seed(1234)
> fit.pam<-pam(wine[-1],k=3,stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam,main="Bivariate Cluster Plot")
> ct.pam<-table(wine$Type,fit.pam$clustering)
> randIndex(ct.pam)           

5.avoid non-existing classes

> library(fMultivar)
> set.seed(1234)
> df<-rnom2d(1000,rho=.5)
> df<-as.data.frame(df)
> plot(df,main="Bivariate Normal Distribution with rho=0.5")
> wssplot(df)
> library(NbClust)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmean")
> dev.new()
> barplot(table(nc$Best,n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> library(ggplot2)
> library(cluster)
> fit<-pam(df,k=2)
> df$clustering<-factor(fit$clustering)
> ggplot(data=df,aes(x=V1,y=V2,color=clustering,shape=clustering)) +
             geom_point() + ggtitle("Clustering of Bivariate Normal Data")
> plot(nc$All,index[,4],type="o",ylab="CCC",xlab="Number of clusters",col="blue")           

[III]Classification

data preparation

> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> url<-paste(loc,ds,sep="")
> breast<-read.table(url,sep=",",header=FALSE,na.string="?")
> names(breast)<-c("ID","clumpThickness","sizeUniformity",
                              "shapeUniformity","maginalAdhesion",
                              "singleEpitheliaCellSize","bareNuclei",
                              "blandChromatin","normalNucleoli","mitosis","class")
> df<-breast[-1]
> df$class<-factor(df$class,levels=c(2,4),
                           labels=c("benign","malignant"))
> set.seed(1234)
> train<-sample(nrow(df),0.7*nrow(df))
> df.train<-df[train,]
> df.validate<-df[-train,]
> table(df.train$class)
> table(df.validate$class)           

logistic regression

> fit.logit<-glm(class~.,data=df.train,family=binomial())
> summary(fit.logit)
> prob<-predict(fit.logit,df.validate,type="response")
> logit.perd<-factor(prob>.5,levels=c(FALSE,TRUE),
                             labels=c("benign","malignant"))
> logit.perf<-table(df.validate$class,logit.pred,
                            dnn=c("Actual","Predicted"))
> logit.perf           

decision tree

  • classic decision tree
> library(repart)
> set.seed(1234)
> dtree<-repart(class~.,data=df.train,method="class",
                       parms=list(split="information"))
> dtree$cptable
> plotcp(dtree)
> dtree.pruned<-prune(dtree,cp=.0125)
> library(repart.plot)
> prp(dtree.pruned,type=2,extra=104,
        fallen.leaves=TRUE,main="Decesion Tree")
> dtree.pred<-predict(dtree.pruned,df.validate,type="class")
> dtree.perf<-table(df.validate$class,dtree.pred,
                             dnn=c("Actual","Predicted"))
> dtree.perf           
  • conditional inference tree
> library(party)
> fit.ctree<-ctree(class~.,data=sf.train)
> plot(fit.ctree,main="Conditional Inference Tree")
> ctree.pred<-predict(fit.ctree,df.validate,type="response")
> ctree.perf<-table(df.validate$class,ctree.pred
                            dnn=c("Actual","Predicted"))
> stree.perf           

random forest

> library(randomForest)
> set.seed(1234)
> fit.forest<-randomForest(class~.,data=df.train,na.action=na.roughfix,importance=TRUE)
> fit.forest
> importance(fit.forest,type=2)
> forest.pred<-predict(fit.forest,df.validate)
> forest.perf<-table(df.validate$class,forest.pred,
                              dnn=c("Actual","Predicted"))
> forest.perf           

support vector machines

  • svm
> library(e1071)
> set.seed(1234)
> fit.svm<-svm(class~.,data=df.train)
> fit.svm
> svm,pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                           svm,pred,dnn=c("Actual","Predicted"))
> svm.perf           
  • svm model with rbf core
> set.seed(1234)
> tuned<-tune.svm(class~.,data=df.train,gamma=10^(-6:1),cost=10^(-10:10))
> turned
> fit.svm<-svm(class~.,data=df.train,gamma=.01,cost=1)
> svm.pred<-predict(fit.svm,na.omit(df,validate))
> svm.perf<-table(na.omit(df.validate)$class,
                            svm.pred,dnn=c("Actual","Predicted"))
> svm.perf           

choose the best solution for forecasting

> performance<-function(table,n=2){
        if(!all(dim(table)==c(2,2)))
              stop("Must be a 2 × 2 table")
   tn=table[1,1]
   fp=table[1,2]
   fn=table[2,1]
   tp=table[2,2]
   sensitivity=tp/(tp+fn)
   specificity=tn/(tn+fp)
   ppp=tp/(tp+fp)
   npp=tn/(tn+fn)
   hitrate=(tp+tn)/(tp+tn+fp+fn)
   result<-paste("Sensitivity=",round(ppp,n),
           "\nSpecificity=",round(specificity,n),
           "\nPositive Predictive Value=",round(ppp,n),
           "\nNegative Predictive Value=",round(npp,n),
           "\nAccuracy=",round(hitrate,n),"\n",seq="")
   cat(result)
  }           

data mining with the rattle package

> install.packages("rattle")
> rattle()
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"pima-indians-diabetes/pima-indians-diabetes.data"
> url <-paste(loc,ds,sep="")
> diabetes<-read.table(url,seq=",",header=FALSE)
> names(diabetes)<-c("npregant","plasma","bp","triceps",
                                  "insulin","bmi","pedigree","age","class")
> diabetes$class<-factor(diabetes$class,levels=c(0,1),
                                      labels=c("normal","diabetic"))
> library(rattle)
> rattle()
> cv<-matrix(c(145,50,8,27),nrow=2)
> performance(as.table(cv))           

[IV]Time Series

Package
ts() stats generate timing objects
graphics draw a line graph of the time series
start() return the start time of the time series
end() return the end time of the time series
frequency() return the number of time points of the time series
windows() subset a sequence object
ma() forecast fit a simple moving average model
stl() use LOESS smooth to decompose timing into seasonal terms,trend terms and random terms
monthplot() draw seasonal terms in time series
seasonplot() forecast generate seasonal plot
HoltWinters() ststs fit exponential smoothing model
forecast() predicting the future value of the timing
accuracy() return time-of-fit goodness variable
ets() fit exponential smoothing model and automatically select the optimal model
lag() return the timing after the specified hysteresis is taken
Acf() estimated autocorrelation function
Pacf() estimated partial autocorrelation function
diff() base return the sequence after the lag term or difference
ndiffs() find the optimal difference count to remove the trend terms in the sequence
adf.test() tseries perform an ADF test on the sequence to determine if it is stable
arima fit the ARIMA model
Box.test() perform a Ljung-Box test to determine if the model's residuals are independent
bds.test() teeries

perform a BDS test to determine whether random variables in the sequence

are subject to independent and identical distribution

auto.arima automatically select the ARIMA model

END!

繼續閱讀