
Multivariate Analysis ch1 (Overview)

These are notes on the book "Applied Multivariate Methods for Data Analysts" [1].

Outlines:

  • Data: response variables vs. experimental units;
  • Type of Methods: variable-directed techniques vs. individual-directed techniques;
  • Notations and definitions;
  • Estimator Statistics;
  • About multivariate computing: outliers, missing values, standardization;

1. Data

Multivariate data are common but complex, which makes their analysis important; the major goal is to simplify the data.

Two aspects of data:

  1. Response variables;
  2. The experimental units;

Methods focus on the relationships among the response variables, the relationships among the experimental units, and the relationships between the response variables and the experimental units.

2. Type of Methods

2.1 Variable-directed techniques

  • PCA (Principal component analysis)
  • FA (Factor analysis)
  • Regression (logistic regression)
  • CCA (Canonical correlation analysis)

These methods mainly operate on the correlation matrix and focus on the columns of the data matrix: the response variables.

2.2 Individual-directed techniques

  • DA (Discriminant analysis)
  • CA (Cluster analysis)
  • MANOVA (Multivariate analysis of variance)

Remarks:

  1. These methods focus on the rows of the data matrix: the observations, or experimental units;
  2. Many multivariate analysis (MA) methods require the experimental units to be independent;

3. Notations and definitions

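| Notation | Explanation |
| --- | --- |
| $p$ | number of response variables |
| $N$ | sample size |
| $X = (x_{rj})_{N \times p},\ r = 1, \dots, N;\ j = 1, \dots, p$ | data matrix |
| $x_r = (x_{r1}, \dots, x_{rp})'$ | the $r$-th observation |
| $r, s, t$ | subscripts for experimental units |
| $i, j, k$ | subscripts for response variables |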
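
As a minimal sketch of this notation, the data matrix can be held in Python as a list of rows, one row per experimental unit. The toy values and the helper names `observation` and `variable` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: N = 4 experimental units (rows),
# p = 2 response variables (columns).
X = [
    [1.0, 2.0],
    [2.0, 4.5],
    [3.0, 5.5],
    [4.0, 8.0],
]

N = len(X)     # sample size
p = len(X[0])  # number of response variables


def observation(X, r):
    """Return x_r, the r-th observation (1-based, as in the notation)."""
    return X[r - 1]


def variable(X, j):
    """Return the N values of the j-th response variable (1-based)."""
    return [row[j - 1] for row in X]
```

Here `observation(X, 2)` picks a row (an experimental unit) and `variable(X, 1)` picks a column (a response variable).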

3.1 Multivariate normal distribution

(Def): $x = (x_1, \dots, x_p)'$ follows a multivariate normal distribution if, for every vector $a = (a_1, \dots, a_p)'$,

$$a'x = \sum_{i=1}^{p} a_i x_i$$

follows a univariate normal distribution.

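Denote it by $x \sim N_p(\mu, \Sigma)$. The p.d.f. is:

$$f_x(x, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^p}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}\left[(x-\mu)'\Sigma^{-1}(x-\mu)\right]\right\};$$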
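
To make the density concrete, here is a rough sketch that evaluates it for the special case $p = 2$, where the determinant and inverse of $\Sigma$ have closed forms. The function name `mvn_pdf_2d` is hypothetical, and the code assumes $\Sigma$ is symmetric and positive definite, which the density needs in order to exist.

```python
import math


def mvn_pdf_2d(x, mu, sigma):
    """Density of N_2(mu, sigma) at the point x.

    sigma is a 2x2 covariance matrix given as nested lists; it is
    assumed to be symmetric and positive definite.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu).
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    p = 2
    norm = math.sqrt((2 * math.pi) ** p) * math.sqrt(det)
    return math.exp(-0.5 * q) / norm


# Standard bivariate normal evaluated at its mean: 1 / (2 * pi).
density_at_mean = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

For general $p$ a linear-algebra library would normally compute the determinant and the inverse; the explicit 2x2 formulas are used here only to keep the sketch self-contained.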

3.2 Numerical characteristics:

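  • Mean vector: $\mu = E(x) = (E(x_1), \dots, E(x_p))' = (\mu_1, \dots, \mu_p)'$;
  • Variance-covariance matrix:

    $$\Sigma = \operatorname{cov}(x) = E[(x-\mu)(x-\mu)'] = (\sigma_{ij})_{p \times p} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \dots & \sigma_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma_{pp} \end{pmatrix}$$

    • $\sigma_{ii} = E[(x_i - \mu_i)^2]$;
    • $\sigma_{ij} = \operatorname{cov}(x_i, x_j)$;
  • Correlation matrix, with $\rho_{ij} = \sigma_{ij} / \sqrt{\sigma_{ii}\sigma_{jj}}$:

    $$P = \begin{pmatrix} 1 & \rho_{12} & \dots & \rho_{1p} \\ \rho_{21} & 1 & \dots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \dots & 1 \end{pmatrix}$$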

4. Estimator Statistics

4.1. Unbiased estimators:

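$$\hat{\mu} = \frac{1}{N}\sum_{r=1}^{N} x_r = \mathrm{colMeans}(X);$$

$$\hat{\Sigma} = \frac{1}{N-1}\sum_{r=1}^{N} (x_r - \hat{\mu})(x_r - \hat{\mu})' = (\hat{\sigma}_{ij})_{p \times p};$$

$$\hat{\sigma}_{ij} = \frac{1}{N-1}\sum_{r=1}^{N} (x_{ri} - \hat{\mu}_i)(x_{rj} - \hat{\mu}_j);$$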
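
The estimators above can be sketched in plain Python as follows. Since each row of $X$ is one observation, averaging the observations amounts to averaging each column. The function names and the toy data are hypothetical.

```python
def mean_vector(X):
    """Sample mean vector: the average of each column of the N x p matrix."""
    N = len(X)
    p = len(X[0])
    return [sum(row[j] for row in X) / N for j in range(p)]


def covariance_matrix(X):
    """Unbiased sample variance-covariance matrix (divisor N - 1)."""
    N = len(X)
    p = len(X[0])
    mu = mean_vector(X)
    return [
        [
            sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in X) / (N - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


# Hypothetical toy data: N = 4 units, p = 2 variables.
X = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]]
mu_hat = mean_vector(X)
sigma_hat = covariance_matrix(X)
```

The resulting `sigma_hat` is symmetric, with the sample variances on its diagonal.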

4.2. Biased but commonly used estimators:

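$$r_{ij} = \hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}\hat{\sigma}_{jj}}};$$

$$R = \hat{P} = \begin{pmatrix} 1 & r_{12} & \dots & r_{1p} \\ r_{21} & 1 & \dots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & 1 \end{pmatrix};$$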
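
A small sketch of how the sample correlation matrix follows from a sample covariance matrix: every entry is divided by the two corresponding standard deviations. The function name `correlation_matrix` and the example matrix `S` are hypothetical.

```python
import math


def correlation_matrix(S):
    """Convert a p x p covariance matrix S into the correlation matrix R."""
    p = len(S)
    # Standard deviations are the square roots of the diagonal entries.
    sd = [math.sqrt(S[j][j]) for j in range(p)]
    return [[S[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]


# Hypothetical covariance matrix: variances 4 and 9, covariance 3.
S = [[4.0, 3.0], [3.0, 9.0]]
R = correlation_matrix(S)
```

With these numbers the off-diagonal entry is $3 / (2 \cdot 3) = 0.5$, and the diagonal entries are 1.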

5. About multivariate computing: outliers, missing values, standardization

5.1 Outliers

  • Detection: use plots or PCA;
  • Handling: analyze the impact of the outliers on the results (with outliers vs. without outliers);
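
The "with vs. without" comparison can be sketched as follows, using the sample correlation between two variables as the result being compared. The data, the flagged row index `suspect`, and the helper names are all hypothetical; in practice the suspect unit would come from a plot or a PCA.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def column(rows, j):
    """Values of variable j (0-based) across all rows."""
    return [row[j] for row in rows]


# Hypothetical data; the last unit is assumed to be a suspected outlier.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, -10.0]]
suspect = 4  # 0-based row index of the suspected outlier

r_with = pearson_r(column(X, 0), column(X, 1))
X_clean = [row for k, row in enumerate(X) if k != suspect]
r_without = pearson_r(column(X_clean, 0), column(X_clean, 1))
```

In this toy example the single suspect unit flips the sign of the correlation, which is the kind of impact the comparison is meant to reveal.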
           

5.2 Missing values

  • Replace them with the mean of the corresponding variable (or a KNN-based estimate);
  • Or remove the corresponding rows;
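
Both options can be sketched in a few lines. This sketch assumes missing entries are stored as `None` and that the replacement value is the mean of the observed values of the same variable (the same column of the data matrix); the KNN variant is not shown. The function names are hypothetical.

```python
def impute_column_means(X):
    """Replace each missing entry (None) with the mean of its variable."""
    p = len(X[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else row[j] for j in range(p)]
        for row in X
    ]


def drop_incomplete_rows(X):
    """Keep only the experimental units with no missing values."""
    return [row for row in X if all(v is not None for v in row)]


# Hypothetical data with one missing value in the second variable.
X = [[1.0, 2.0], [2.0, None], [3.0, 6.0]]
X_imputed = impute_column_means(X)
X_complete = drop_incomplete_rows(X)
```

Imputation keeps all $N$ units but invents a value; row deletion keeps only observed values but reduces the sample size.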
           

5.3 Standardization

$$z_{rj} = \frac{x_{rj} - \hat{\mu}_j}{\sqrt{\hat{\sigma}_{jj}}}; \quad r = 1, \dots, N;\ j = 1, \dots, p.$$

Many computer programs for multivariate analysis standardize the variables by default.
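
As a minimal sketch, standardization subtracts each variable's sample mean and divides by its sample standard deviation (divisor $N - 1$, matching the unbiased variance estimator). The function name `standardize` and the toy data are hypothetical.

```python
import math


def standardize(X):
    """Return Z with z_rj = (x_rj - mean_j) / sqrt(var_j) for each column j."""
    N = len(X)
    p = len(X[0])
    mu = [sum(row[j] for row in X) / N for j in range(p)]
    var = [sum((row[j] - mu[j]) ** 2 for row in X) / (N - 1) for j in range(p)]
    if any(v == 0 for v in var):
        raise ValueError("a constant variable cannot be standardized")
    return [
        [(row[j] - mu[j]) / math.sqrt(var[j]) for j in range(p)]
        for row in X
    ]


# Hypothetical toy data: N = 4 units, p = 2 variables.
X = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]]
Z = standardize(X)
```

After this step every column of `Z` has sample mean 0 and sample variance 1, so the covariance matrix of `Z` equals the correlation matrix of `X`.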

References

  1. Johnson, D. E. Applied Multivariate Methods for Data Analysts. Pacific Grove, CA: Duxbury Press, 1998.
