[I]Regression
OLS regression
Description | Function |
---|---|
simple linear regression | lm(Y~X1,data) |
polynomial regression | lm(Y~X1+I(X1^2)+...+I(X1^k),data) |
multiple linear regression | lm(Y~X1+X2+...+Xk,data) |
multiple linear regression with interaction terms | lm(Y~X1+X2+X1:X2,data) |
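A minimal sketch of these calls using the built-in mtcars data (dataset and variable choices are illustrative, not from the source):
> fit1 <- lm(mpg ~ wt, data=mtcars)                # simple linear regression
> fit2 <- lm(mpg ~ wt + I(wt^2), data=mtcars)      # polynomial regression
> fit3 <- lm(mpg ~ wt + hp + disp, data=mtcars)    # multiple linear regression
> fit4 <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)   # with an interaction term
> summary(fit4)                                    # coefficients, R-squared, F test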
selecting the optimal regression model
Function | Description |
---|---|
anova(fit1,fit2) | compare nested models |
AIC(fit1,fit2) | compare models by AIC (lower is better) |
stepAIC(fit,direction=) | stepwise selection: "forward"/"backward"/"both" (MASS) |
regsubsets() | all-subsets regression (leaps) |
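For example (model choices are illustrative; stepAIC() lives in MASS, regsubsets() in leaps):
> fit1 <- lm(mpg ~ wt + hp, data=mtcars)
> fit2 <- lm(mpg ~ wt + hp + disp + drat, data=mtcars)
> anova(fit1, fit2)                    # nested model comparison
> AIC(fit1, fit2)                      # lower AIC indicates better fit
> library(MASS)
> stepAIC(fit2, direction="backward")  # stepwise selection
> library(leaps)
> subsets <- regsubsets(mpg ~ wt + hp + disp + drat, data=mtcars, nbest=4)
> plot(subsets, scale="adjr2")         # all-subsets regression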
regression diagnostics
Function | Description |
---|---|
plot() | basic diagnostic plots |
qqPlot() | quantile-comparison plot |
durbinWatsonTest() | Durbin-Watson test for autocorrelated errors |
crPlots() | component-plus-residual plots |
ncvTest() | score test for non-constant error variance |
spreadLevelPlot() | spread-level plot |
outlierTest() | Bonferroni outlier test |
avPlots() | added-variable plots |
influencePlot() | influence plot |
scatterplot() | enhanced scatter plot |
scatterplotMatrix() | enhanced scatter plot matrix |
vif() | variance inflation factor |
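Most of these functions come from the car package; a minimal sketch (model choice is illustrative):
> library(car)
> fit <- lm(mpg ~ wt + hp, data=mtcars)
> par(mfrow=c(2,2)); plot(fit)      # base R diagnostic plots
> qqPlot(fit)                       # normality of residuals
> durbinWatsonTest(fit)             # independence of errors
> crPlots(fit)                      # linearity
> ncvTest(fit)                      # constant error variance
> outlierTest(fit)                  # Bonferroni-adjusted outlier test
> vif(fit)                          # multicollinearity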
unusual observations
- identifying unusual observations
1.outliers: outlierTest()
2.high-leverage points: hat values, e.g. a custom hat.plot() (sketched after this list)
3.influential observations: Cook's distance
- corrective measures
1.delete the observation
2.transform variables
3.add or delete variables
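hat.plot() is not a base-R function; a minimal sketch of such a helper, plus a Cook's distance plot using the common 4/(n-k-2) cutoff (continuing the model fit above):
> hat.plot <- function(fit) {
    p <- length(coefficients(fit))     # number of model parameters
    n <- length(fitted(fit))           # number of observations
    plot(hatvalues(fit), main="Index Plot of Hat Values")
    abline(h=c(2,3)*p/n, col="red", lty=2)   # common leverage cutoffs
    identify(1:n, hatvalues(fit), names(hatvalues(fit)))
  }
> hat.plot(fit)
> cutoff <- 4/(nrow(mtcars) - length(fit$coefficients) - 2)
> plot(fit, which=4, cook.levels=cutoff)     # Cook's distance
> abline(h=cutoff, lty=2, col="red")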
Generalized Linear Models
- glm()
Distribution Family | Default Link Function |
---|---|
binomial | (link = "logit") |
gaussian | (link = "identity") |
Gamma | (link = "inverse") |
inverse.gaussian | (link = "1/mu^2") |
poisson | (link = "log") |
quasi | (link = "identity", variance = "constant") |
quasibinomial | (link = "logit") |
quasipoisson | (link = "log") |
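An illustrative call showing how family and link are specified (simulated data; the variable names are not from the source):
> set.seed(1)
> d <- data.frame(x=rnorm(100))
> d$y <- rbinom(100, 1, plogis(0.5*d$x))   # simulated binary outcome
> fit <- glm(y ~ x, family=binomial(link="logit"), data=d)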
- functions used with glm()
Function | Description |
---|---|
summary() | show details of the fitted model |
coefficients()/coef() | list the parameters of the fitted model |
confint() | give confidence intervals for the model parameters |
residuals() | list the residuals of the fitted model |
anova() | generate an analysis-of-variance table comparing two fitted models |
plot() | generate diagnostic plots for evaluating the fitted model |
predict() | predict outcomes for new data from the fitted model |
deviance() | return the deviance of the fitted model |
df.residual() | return the residual degrees of freedom of the fitted model |
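Continuing the simulated-data sketch above, these helpers apply as:
> summary(fit)                   # model details
> confint(fit)                   # profile-likelihood confidence intervals
> exp(coef(fit))                 # coefficients as odds ratios
> deviance(fit); df.residual(fit)
> predict(fit, newdata=data.frame(x=c(-1,0,1)), type="response")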
logistic regression
> data(Affairs,package="ARE")
> summary(Affairs)
> table(Affair$affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <-1
> Affairs$ynaffair[Affairs$affairs == 0] <-0
> Affairs$naffair <- factor(Affairs$ynaffair),levels=c(0,1),labels=c("No","Yes"))
> table(Affairs$ynaffair)
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
                  religiousness + education + occupation + rating,
                  family = binomial(), data = Affairs)
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                     family = binomial(), data = Affairs)
> anova(fit.reduced,fit.full,test="Chisq")
> coef(fit.reduced)
> exp(coef(fit.reduced))
> testdata <- data.frame(rating=c(1,2,3,4,5), age=mean(Affairs$age),
                         yearsmarried=mean(Affairs$yearsmarried),
                         religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob <- predict(fit.reduced,newdata=testdata,type="response")
> testdata <- data.frame(rating=mean(Affairs$rating),
                         age=seq(17,57,10),
                         yearsmarried=mean(Affairs$yearsmarried),
                         religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob<-predict(fit.reduced,newdata=testdata,type="response")
> testdata
> deviance(fit.reduced)/df.residual(fit.reduced)
> fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                family = quasibinomial(), data = Affairs)
> pchisq(summary(fit.od)$dispersion * fit.reduced$df.residual,
         fit.reduced$df.residual, lower=F)
poisson regression
> data(breslow.dat,package="robust")
> names(breslow.dat)
> summary(breslow.dat[c(6,7,8,10)])
> opar<-par(no.readonly=TRUE)
> par(mfrow=c(1,2))
> attach(breslow.dat)
> hist(sumY,breaks=20,xlab="Seizure Count",main="Distribution of Seizures")
> boxplot(sumY~Trt,xlab="Treatment",main="Group Comparisons")
> par(opar)
> fit<-glm(sumY~Base + Age +Trt,data=breslow.dat,family=poisson())
> summary(fit)
> coef(fit)
> exp(coef(fit))
> deviance(fit)/df.residual(fit)
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY,type="poisson")
> fit.od <- glm(sumY ~ Base + Age + Trt, data=breslow.dat,
                family=quasipoisson())
> summary(fit.od)
> fit <- glm(sumY ~ Base + Age + Trt, data=breslow.dat, offset=log(time), family=poisson())
[II]Cluster
1.cluster analysis steps
1.choose appropriate variables
2.scale the data
3.screen for outliers
4.calculate distances
5.select a clustering algorithm
6.obtain one or more cluster solutions
7.determine the number of clusters
8.obtain the final clustering solution
9.visualize the results
10.interpret the clusters
11.validate the results
2.calculating distances
> data(nutrient,package="flexclust")
> head(nutrient,4)
> d <- dist(nutrient)
> as.matrix(d)[1:4,1:4]
3.hierarchical clustering analysis
> data(nutrient,package="flexclust")
> row.names(nutrient) <- tolower(row.names(nutrient))
> nutrient.scaled<-scale(nutrient)
> d<-dist(nutrient.scaled)
> fit.average <-hclust(d,method="average")
> plot(fit.average,hang=-1,cex=.8,main="Average Linkage Clustering")
> library(NbClust)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(nutrient.scaled,distance="euclidean",
min.nc=2,max.nc=15,method="average")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters",ylab="Number of Criteria",
main="Number of Cluster Chosen by 26 Criteria")
> clusters<-cutree(fit.average,k=5)
> table(clusters)
> aggregate(nutrient,by=list(cluster=clusters),median)
> aggregate(as.data.frame(nutrient.scaled),by=list(cluster=clusters),median)
> plot(fit.average,hang=-1,cex=.8,
main="Average Linkage Cluster\n5 Cluster Solution")
> rect.hclust(fit.average,k=5)
4.partitioning cluster analysis
- K-means clustering
> wssplot <- function(data, nc=15, seed=1234){
    wss <- (nrow(data)-1)*sum(apply(data,2,var))
    for (i in 2:nc){
      set.seed(seed)
      wss[i] <- sum(kmeans(data, centers=i)$withinss)}
    plot(1:nc, wss, type="b", xlab="Number of Clusters",
         ylab="Within groups sum of squares")}
> data(wine,package="rattle")
> head(wine)
> df<-scale(wine[-1])
> wssplot(df)
> library(NbClust)
> set.seed(1234)
> devAskNewPage(ask=TRUE)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters",ylab="Number of Criteria",
main="Number of Clusters Chosen by 26 Criteria")
> set.seed(1234)
> fit.km<-kmeans(df,3,nstart=25)
> fit.km$size
> fit.km$centers
> aggregate(wine[-1],by=list(cluster=fit.km$cluster),mean)
- partitioning around medoids (PAM)
> library(cluster)
> set.seed(1234)
> fit.pam<-pam(wine[-1],k=3,stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam,main="Bivariate Cluster Plot")
> ct.pam<-table(wine$Type,fit.pam$clustering)
> library(flexclust)
> randIndex(ct.pam)
5.avoiding non-existent clusters
> library(fMultivar)
> set.seed(1234)
> df<-rnorm2d(1000,rho=.5)
> df<-as.data.frame(df)
> plot(df,main="Bivariate Normal Distribution with rho=0.5")
> wssplot(df)
> library(NbClust)
> nc<-NbClust(df,min.nc=2,max.nc=15,method="kmean")
> dev.new()
> barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters",ylab="Number of Criteria",
main="Number of Clusters Chosen by 26 Criteria")
> library(ggplot2)
> library(cluster)
> fit<-pam(df,k=2)
> df$clustering<-factor(fit$clustering)
> ggplot(data=df,aes(x=V1,y=V2,color=clustering,shape=clustering)) +
geom_point() + ggtitle("Clustering of Bivariate Normal Data")
> plot(nc$All.index[,4],type="o",ylab="CCC",xlab="Number of clusters",col="blue")
[III]Classification
data preparation
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> url<-paste(loc,ds,sep="")
> breast<-read.table(url,sep=",",header=FALSE,na.string="?")
> names(breast)<-c("ID","clumpThickness","sizeUniformity",
"shapeUniformity","maginalAdhesion",
"singleEpitheliaCellSize","bareNuclei",
"blandChromatin","normalNucleoli","mitosis","class")
> df<-breast[-1]
> df$class<-factor(df$class,levels=c(2,4),
labels=c("benign","malignant"))
> set.seed(1234)
> train<-sample(nrow(df),0.7*nrow(df))
> df.train<-df[train,]
> df.validate<-df[-train,]
> table(df.train$class)
> table(df.validate$class)
logistic regression
> fit.logit<-glm(class~.,data=df.train,family=binomial())
> summary(fit.logit)
> prob<-predict(fit.logit,df.validate,type="response")
> logit.pred<-factor(prob>.5,levels=c(FALSE,TRUE),
                     labels=c("benign","malignant"))
> logit.perf<-table(df.validate$class,logit.pred,
dnn=c("Actual","Predicted"))
> logit.perf
decision tree
- classic decision tree
> library(rpart)
> set.seed(1234)
> dtree<-rpart(class~.,data=df.train,method="class",
parms=list(split="information"))
> dtree$cptable
> plotcp(dtree)
> dtree.pruned<-prune(dtree,cp=.0125)
> library(rpart.plot)
> prp(dtree.pruned,type=2,extra=104,
      fallen.leaves=TRUE,main="Decision Tree")
> dtree.pred<-predict(dtree.pruned,df.validate,type="class")
> dtree.perf<-table(df.validate$class,dtree.pred,
dnn=c("Actual","Predicted"))
> dtree.perf
- conditional inference tree
> library(party)
> fit.ctree<-ctree(class~.,data=df.train)
> plot(fit.ctree,main="Conditional Inference Tree")
> ctree.pred<-predict(fit.ctree,df.validate,type="response")
> ctree.perf<-table(df.validate$class,ctree.pred,
                    dnn=c("Actual","Predicted"))
> ctree.perf
random forest
> library(randomForest)
> set.seed(1234)
> fit.forest<-randomForest(class~.,data=df.train,na.action=na.roughfix,importance=TRUE)
> fit.forest
> importance(fit.forest,type=2)
> forest.pred<-predict(fit.forest,df.validate)
> forest.perf<-table(df.validate$class,forest.pred,
dnn=c("Actual","Predicted"))
> forest.perf
support vector machines
- svm
> library(e1071)
> set.seed(1234)
> fit.svm<-svm(class~.,data=df.train)
> fit.svm
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                  svm.pred,dnn=c("Actual","Predicted"))
> svm.perf
- svm model with rbf core
> set.seed(1234)
> tuned<-tune.svm(class~.,data=df.train,gamma=10^(-6:1),cost=10^(-10:10))
> tuned
> fit.svm<-svm(class~.,data=df.train,gamma=.01,cost=1)
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
svm.pred,dnn=c("Actual","Predicted"))
> svm.perf
choosing the best predictive solution
> performance<-function(table,n=2){
if(!all(dim(table)==c(2,2)))
stop("Must be a 2 × 2 table")
tn=table[1,1]
fp=table[1,2]
fn=table[2,1]
tp=table[2,2]
sensitivity=tp/(tp+fn)
specificity=tn/(tn+fp)
ppp=tp/(tp+fp)
npp=tn/(tn+fn)
hitrate=(tp+tn)/(tp+tn+fp+fn)
result<-paste("Sensitivity=",round(ppp,n),
"\nSpecificity=",round(specificity,n),
"\nPositive Predictive Value=",round(ppp,n),
"\nNegative Predictive Value=",round(npp,n),
"\nAccuracy=",round(hitrate,n),"\n",seq="")
cat(result)
}
data mining with the rattle package
> install.packages("rattle")
> rattle()
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"pima-indians-diabetes/pima-indians-diabetes.data"
> url <-paste(loc,ds,sep="")
> diabetes<-read.table(url,seq=",",header=FALSE)
> names(diabetes)<-c("npregant","plasma","bp","triceps",
"insulin","bmi","pedigree","age","class")
> diabetes$class<-factor(diabetes$class,levels=c(0,1),
labels=c("normal","diabetic"))
> library(rattle)
> rattle()
> cv<-matrix(c(145,50,8,27),nrow=2)
> performance(as.table(cv))
[IV]Time Series
Function | Package | Description |
---|---|---|
ts() | stats | generate a time-series object |
plot() | graphics | draw a line graph of the time series |
start() | stats | return the start time of the time series |
end() | stats | return the end time of the time series |
frequency() | stats | return the number of observations per unit time |
window() | stats | subset a time-series object |
ma() | forecast | fit a simple moving-average model |
stl() | stats | decompose a series into seasonal, trend, and irregular components using LOESS |
monthplot() | stats | plot the seasonal components of a time series |
seasonplot() | forecast | generate a season plot |
HoltWinters() | stats | fit an exponential smoothing model |
forecast() | forecast | predict future values of a series |
accuracy() | forecast | return goodness-of-fit measures for a time-series model |
ets() | forecast | fit an exponential smoothing model, automatically selecting the optimal model |
lag() | stats | return a lagged version of a series |
Acf() | forecast | estimate the autocorrelation function |
Pacf() | forecast | estimate the partial autocorrelation function |
diff() | base | return lagged differences of a series |
ndiffs() | forecast | find the number of differences needed to remove trend from a series |
adf.test() | tseries | perform an ADF test of whether a series is stationary |
arima() | stats | fit ARIMA models |
Box.test() | stats | perform a Ljung-Box test of whether a model's residuals are independent |
bds.test() | tseries | perform a BDS test of whether a series consists of i.i.d. random variables |
auto.arima() | forecast | automatically select an ARIMA model |
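A minimal end-to-end sketch with the built-in AirPassengers series (dataset choice is illustrative):
> library(forecast)
> library(tseries)
> plot(AirPassengers)                                  # line graph of the series
> start(AirPassengers); end(AirPassengers); frequency(AirPassengers)
> plot(stl(log(AirPassengers), s.window="periodic"))   # seasonal decomposition
> ndiffs(AirPassengers)                                # differences needed to detrend
> adf.test(diff(log(AirPassengers)))                   # stationarity check
> fit <- auto.arima(AirPassengers)                     # automatic ARIMA selection
> forecast(fit, h=12)                                  # 12-step-ahead forecast
> accuracy(fit)
> Box.test(residuals(fit), type="Ljung-Box")           # residual independence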
END!