These are notes on the book “Applied Multivariate Methods for Data Analysts” [1].
Outline:
- Data: response variables vs. experimental units;
- Types of methods: variable-directed techniques vs. individual-directed techniques;
- Notations and definitions;
- Estimator statistics;
- About multivariate computing: outliers, missing values, standardization.
1. Data
Multivariate data are common but complex, so a major goal of multivariate analysis is to simplify them.
Two aspects of data:
- Response variables;
- The experimental units;
Multivariate methods study three kinds of relationships: among the response variables, among the experimental units, and between the response variables and the experimental units.
2. Types of Methods
2.1 Variable-directed techniques
- PCA (Principal component analysis)
- FA (Factor analysis)
- Regression (e.g., logistic regression)
- CCA (Canonical correlation analysis)
These methods mainly operate on the correlation matrix and focus on the columns of the data matrix: the response variables.
2.2 Individual-directed techniques
- DA (Discriminant analysis)
- CA (Cluster analysis)
- MANOVA (Multivariate analysis of variance)
Remarks:
- These methods focus on the rows of the data matrix: the observations, or experimental units;
- Many multivariate analysis (MA) methods require the experimental units to be independent.
3. Notations and definitions
Notation | Explanation |
---|---|
$p$ | number of response variables |
$N$ | sample size (number of experimental units) |
$X=(x_{rj})_{N\times p},\ r=1,\dots,N;\ j=1,\dots,p$ | data matrix |
$x_r=(x_{r1},\dots,x_{rp})'$ | the $r$-th observation |
$r,s,t$ | subscripts for experimental units |
$i,j,k$ | subscripts for response variables |
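
As a small illustration of this notation, the sketch below stores a data matrix as a list of rows, so that each row is one experimental unit and each column is one response variable. The numbers and the helper names `observation` and `variable` are hypothetical and chosen only for this example.

```python
# Hypothetical data matrix X with N = 4 experimental units (rows)
# and p = 3 response variables (columns).
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]

N = len(X)      # sample size
p = len(X[0])   # number of response variables


def observation(X, r):
    """Return x_r, the r-th observation (1-based, as in the table)."""
    return X[r - 1]


def variable(X, j):
    """Return the N measured values of the j-th response variable (1-based)."""
    return [row[j - 1] for row in X]


print(N, p)
print(observation(X, 2))
print(variable(X, 3))
```

Variable-directed techniques work with the outputs of `variable`, individual-directed techniques with the outputs of `observation`.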
3.1 Multivariate normal distribution
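(Def): $x=(x_1,\dots,x_p)'$ follows a multivariate normal distribution if, for every $a=(a_1,\dots,a_p)'\in\mathbb{R}^p$,
$$a'x=\sum_{i=1}^{p}a_i x_i$$
follows a univariate normal distribution.
Denote it by $x\sim N_p(\mu,\Sigma)$. When $\Sigma$ is positive definite, the p.d.f. is:
$$f_x(x;\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^p}\,|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right\}$$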
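
To make the density formula concrete, here is a minimal sketch that evaluates it for the special case $p=2$, where the determinant and inverse of $\Sigma$ have simple closed forms. The function name `bivariate_normal_pdf` is hypothetical, and the sketch assumes $\Sigma$ is a symmetric positive definite $2\times 2$ matrix.

```python
import math


def bivariate_normal_pdf(x, mu, sigma):
    """Density of N_2(mu, sigma) at the point x.

    x and mu are length-2 sequences; sigma is a 2x2 matrix given as
    nested lists and is assumed symmetric positive definite.
    """
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if s11 <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)' sigma^{-1} (x - mu), using the
    # closed-form inverse of a 2x2 matrix.
    q = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    p = 2
    norm_const = math.sqrt((2 * math.pi) ** p) * math.sqrt(det)
    return math.exp(-0.5 * q) / norm_const


# Standard bivariate normal evaluated at its mean: the value is 1 / (2 * pi).
print(bivariate_normal_pdf([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

For general $p$ one would normally rely on a linear algebra library rather than hand-coded inverses.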
3.2 Numerical characteristics
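- Mean vector: $\mu=E(x)=(E(x_1),\dots,E(x_p))'=(\mu_1,\dots,\mu_p)'$;
- Variance-covariance matrix:
$$\Sigma=\operatorname{cov}(x)=E[(x-\mu)(x-\mu)']=(\sigma_{ij})_{p\times p}=\begin{pmatrix}\sigma_{11} & \sigma_{12} & \dots & \sigma_{1p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma_{pp}\end{pmatrix}$$
  - $\sigma_{ii}=E[(x_i-\mu_i)^2]$;
  - $\sigma_{ij}=\operatorname{cov}(x_i,x_j)$;
- Correlation matrix, with $\rho_{ij}=\sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$:
$$P=\begin{pmatrix}1 & \rho_{12} & \dots & \rho_{1p}\\ \rho_{21} & 1 & \dots & \rho_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \rho_{p1} & \rho_{p2} & \dots & 1\end{pmatrix}$$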
4. Estimator Statistics
4.1. Unbiased estimators:
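$$\hat\mu=\frac{1}{N}\sum_{r=1}^{N}x_r$$
(the vector of column means of the $N\times p$ data matrix $X$, i.e. `colMeans(X)` in R);
$$\hat\Sigma=\frac{1}{N-1}\sum_{r=1}^{N}(x_r-\hat\mu)(x_r-\hat\mu)'=(\hat\sigma_{ij})_{p\times p};$$
$$\hat\sigma_{ij}=\frac{1}{N-1}\sum_{r=1}^{N}(x_{ri}-\hat\mu_i)(x_{rj}-\hat\mu_j).$$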
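
The two unbiased estimators can be sketched in plain Python as follows. The data matrix and the function names `mean_vector` and `sample_covariance` are hypothetical; the sketch assumes $X$ is stored as a list of $N$ rows of length $p$ with $N\ge 2$.

```python
def mean_vector(X):
    """Estimate mu by averaging each column (response variable) of X."""
    N, p = len(X), len(X[0])
    return [sum(row[j] for row in X) / N for j in range(p)]


def sample_covariance(X):
    """Unbiased estimate of Sigma, using the divisor N - 1."""
    N, p = len(X), len(X[0])
    mu = mean_vector(X)
    return [
        [
            sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in X) / (N - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


# Hypothetical data: N = 3 experimental units, p = 2 response variables.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
mu_hat = mean_vector(X)
sigma_hat = sample_covariance(X)
print(mu_hat)
print(sigma_hat)
```

The resulting matrix is symmetric, and its diagonal holds the sample variances of the individual variables.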
4.2. Biased but commonly used estimators:
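$$r_{ij}=\hat\rho_{ij}=\frac{\hat\sigma_{ij}}{\sqrt{\hat\sigma_{ii}\hat\sigma_{jj}}};$$
$$R=\hat P=\begin{pmatrix}1 & r_{12} & \dots & r_{1p}\\ r_{21} & 1 & \dots & r_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ r_{p1} & r_{p2} & \dots & 1\end{pmatrix};$$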
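
The formula for $r_{ij}$ amounts to rescaling each covariance by the two corresponding standard deviations. A minimal sketch, assuming a hypothetical sample covariance matrix `S` with strictly positive diagonal entries (the function name is also an assumption):

```python
import math


def correlation_from_covariance(S):
    """Convert a covariance matrix S into the correlation matrix R."""
    p = len(S)
    return [
        [S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(p)]
        for i in range(p)
    ]


# Hypothetical 2x2 sample covariance matrix.
S = [[1.0, 3.5], [3.5, 13.0]]
R = correlation_from_covariance(S)
print(R)
```

The diagonal of `R` is 1 by construction, and the off-diagonal entries lie between -1 and 1 for any valid covariance matrix.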
5. About multivariate computing: outliers, missing values, standardization
5.1 Outliers
- Detection: inspect plots of the data, or use PCA;
- Handling: assess the impact of the outliers by comparing the results obtained with and without them.
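
The with/without comparison can be sketched as below for one simple statistic, the Pearson correlation between two variables. The data, the flagged index in `suspect`, and the function name `pearson_r` are all hypothetical; the sketch assumes the suspicious observation has already been identified, for instance from a scatter plot.

```python
import math


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: the last observation is far from the others.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, -10.0)]
suspect = {5}  # 0-based indices of observations flagged as possible outliers

r_all = pearson_r(data)
r_clean = pearson_r([pt for k, pt in enumerate(data) if k not in suspect])

print("with outlier:   ", round(r_all, 3))
print("without outlier:", round(r_clean, 3))
```

A large gap between the two results signals that the conclusions depend heavily on the flagged observations, which should then be reported alongside the analysis.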
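5.2 Missing values
- Replace each missing value with the mean of the corresponding variable, or with a k-nearest-neighbours (KNN) estimate; or
- Remove the rows (experimental units) that contain missing values.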
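
Both options can be sketched in a few lines. This sketch assumes missing entries are encoded as `None`, that the replacement value is the mean of the observed values of the same variable, and that every variable has at least one observed value; the function names and data are hypothetical. KNN imputation is not shown.

```python
def drop_incomplete_rows(X):
    """Keep only the experimental units with no missing value."""
    return [row for row in X if None not in row]


def impute_variable_means(X):
    """Replace each missing entry by the mean of the observed values
    of the same response variable (column)."""
    p = len(X[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in X
    ]


# Hypothetical data with two missing entries.
X = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 30.0]]
print(drop_incomplete_rows(X))
print(impute_variable_means(X))
```

Deleting rows shrinks the sample, while mean imputation keeps $N$ unchanged but understates the variability of the imputed variables.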
5.3 Standardization
$$z_{rj}=\frac{x_{rj}-\hat\mu_j}{\sqrt{\hat\sigma_{jj}}};\quad r=1,\dots,N;\ j=1,\dots,p.$$
Many multivariate procedures in statistical software standardize the variables by default.
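
The standardization formula can be sketched as follows, using the sample mean and the unbiased sample variance (divisor $N-1$) of each variable. The function name `standardize` and the data are hypothetical; the sketch assumes $N\ge 2$ and that no variable is constant, since a zero variance would cause a division by zero.

```python
import math


def standardize(X):
    """Return Z with z_rj = (x_rj - mean_j) / sd_j for every entry of X."""
    N, p = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / N for j in range(p)]
    sd = [
        math.sqrt(sum((row[j] - mu[j]) ** 2 for row in X) / (N - 1))
        for j in range(p)
    ]
    return [[(row[j] - mu[j]) / sd[j] for j in range(p)] for row in X]


# Hypothetical data: the two variables are on very different scales.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
Z = standardize(X)
print(Z)
```

After this step every variable has sample mean 0 and sample variance 1, so the covariance matrix of `Z` equals the correlation matrix of `X`.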
References
- [1] Johnson D E. Applied multivariate methods for data analysts[M]. Pacific Grove, CA: Duxbury Press, 1998.