Yishuo School District (35) | SPSS Statistical Analysis (45): Two-Step Cluster Analysis

Author: LearningYard学苑

"Share interests, spread happiness, broaden your horizons, and leave good memories!" Hello everyone, this is the editor. Welcome back to LearningYard. We will do our best to bring you more and better content.

In the last issue, we covered the basic concepts of cluster analysis and discriminant analysis. In this issue, we look specifically at two-step clustering (二阶聚类, the TwoStep Cluster procedure in SPSS). Two-step clustering is an exploratory tool designed to reveal natural groupings (clusters) within a dataset that would otherwise not be apparent. It is a relatively new hierarchical clustering algorithm.

Besides the traditional Euclidean distance, the two-step procedure offers a log-likelihood distance measure so that it can handle categorical and continuous variables together. This measure assumes that the variables in the model are independent, that each categorical variable follows a multinomial distribution, and that each continuous variable follows a normal distribution.
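For reference, the log-likelihood distance is commonly written as below in descriptions of the two-step algorithm. The notation is introduced here and does not come from the article: $N_v$ is the number of cases in cluster $v$, $K^A$ and $K^B$ are the numbers of continuous and categorical variables, $\hat\sigma_k^2$ is the variance of continuous variable $k$ over all cases, $\hat\sigma_{vk}^2$ is its variance within cluster $v$, $N_{vkl}$ is the number of cases in cluster $v$ that fall in category $l$ of categorical variable $k$ (which has $L_k$ categories), and $\langle i, j \rangle$ denotes the cluster formed by merging clusters $i$ and $j$.

```latex
d(i, j) = \xi_i + \xi_j - \xi_{\langle i, j \rangle}

\xi_v = -N_v \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left( \hat\sigma_k^2 + \hat\sigma_{vk}^2 \right)
        + \sum_{k=1}^{K^B} \hat{E}_{vk} \right)

\hat{E}_{vk} = -\sum_{l=1}^{L_k} \frac{N_{vkl}}{N_v} \log \frac{N_{vkl}}{N_v}
```

The first sum reflects the normality assumption for continuous variables and the second the multinomial assumption for categorical variables. The two simply add because the variables are assumed independent.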

These assumptions can be checked with other SPSS procedures. Use the Bivariate Correlations procedure to test the independence of two continuous variables, the Crosstabs procedure to test the independence of two categorical variables, and the Means procedure to test the independence of a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable, and the Chi-Square Test procedure to test whether a categorical variable follows a multinomial distribution.
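Outside SPSS, two of these checks can be sketched in a few lines of Python. This is only an illustration, using the standard library: the function names and the tiny data sets are made up, and the code returns test statistics without p-values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two continuous variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a crosstab.

    `table` is a list of rows of observed counts; every row and column
    total is assumed to be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Made-up values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(x, y)

crosstab = [[10, 10], [10, 10]]  # perfectly balanced counts
stat, dof = chi_square_independence(crosstab)
print(r, stat, dof)
```

A correlation near zero, or a chi-square statistic that is small relative to its degrees of freedom, is consistent with independence. A formal decision still needs the p-value that the SPSS procedures report.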

Two-step clustering is carried out in two steps. The first step builds a cluster feature (CF) tree: the procedure scans the cases one by one and, following the principle that similar cases belong to the same cluster, computes distances and assigns each case to the cluster whose center is nearest. It starts by placing the first case at the root of the tree, in a leaf node that holds the variable information for that case. Each subsequent case is then compared with the existing nodes, using the distance measure as the similarity criterion. If the case is similar enough, it is added to an existing node and becomes a leaf of that node. If not, it forms a new node.
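The sequential idea behind this first step can be sketched as follows. This is a deliberately simplified illustration, not the SPSS implementation: it keeps a flat list of leaf entries instead of a real tree, handles continuous variables only, uses Euclidean distance, and the `threshold` parameter and all names are assumptions made for the example.

```python
import math

class ClusterFeature:
    """Summary of one sub-cluster: case count, per-variable sum and sum of squares."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = [v * v for v in point]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def add(self, point):
        """Absorb one more case by updating the summary statistics."""
        self.n += 1
        for k, v in enumerate(point):
            self.linear_sum[k] += v
            self.square_sum[k] += v * v

def build_leaf_entries(points, threshold):
    """Scan the cases once; each joins the nearest entry or starts a new one."""
    entries = []
    for p in points:
        nearest, best = None, math.inf
        for entry in entries:
            d = math.dist(p, entry.centroid())
            if d < best:
                nearest, best = entry, d
        if nearest is not None and best <= threshold:
            nearest.add(p)                      # similar: absorb into existing entry
        else:
            entries.append(ClusterFeature(p))   # not similar: start a new entry
    return entries

# Made-up points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9), (1.1, 1.1)]
entries = build_leaf_entries(points, threshold=1.0)
print([(e.n, e.centroid()) for e in entries])
```

Because each entry stores only counts and sums, a case never has to be revisited once it has been absorbed, which is what lets the first step finish in a single pass over the data.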

The second step groups the nodes of the CF tree. To determine the best number of clusters, the candidate clustering solutions are compared using the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), and the final clustering is chosen on that basis.
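The comparison can be illustrated with a short sketch. It assumes the usual forms BIC = -2 log L + m log N and AIC = -2 log L + 2m, where m counts a mean and a variance per continuous variable and (number of categories - 1) probabilities per categorical variable in each cluster. The log-likelihood values below are hypothetical, and SPSS's automatic selection applies further refinements beyond simply taking the minimum.

```python
import math

def n_parameters(n_clusters, n_continuous, category_counts):
    """Free parameters: per cluster, 2 per continuous variable plus (L_k - 1) per categorical one."""
    per_cluster = 2 * n_continuous + sum(levels - 1 for levels in category_counts)
    return n_clusters * per_cluster

def bic(cluster_log_likelihoods, n_cases, n_continuous, category_counts):
    m = n_parameters(len(cluster_log_likelihoods), n_continuous, category_counts)
    return -2.0 * sum(cluster_log_likelihoods) + m * math.log(n_cases)

def aic(cluster_log_likelihoods, n_continuous, category_counts):
    m = n_parameters(len(cluster_log_likelihoods), n_continuous, category_counts)
    return -2.0 * sum(cluster_log_likelihoods) + 2 * m

# Hypothetical per-cluster log-likelihoods for solutions with 1 to 4 clusters.
candidates = {
    1: [-200.0],
    2: [-80.0, -70.0],
    3: [-40.0, -35.0, -30.0],
    4: [-30.0, -28.0, -25.0, -20.0],
}
n_cases = 50              # sample size in the article's example
n_continuous = 1          # assumed: one continuous variable
category_counts = [2, 8]  # assumed: gender (2 levels) and discipline (8 levels)

bic_by_k = {k: bic(ll, n_cases, n_continuous, category_counts) for k, ll in candidates.items()}
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, best_k)
```

Both criteria reward a better fit and penalize extra parameters, so adding clusters only pays off while the gain in log-likelihood outweighs the penalty.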

Next, let's look at an example. An institution wanted to investigate how students' gender and major relate to their first company after graduation. The survey sampled 50 students, and the data are shown in the figure below (discipline codes: 1 = agriculture, 2 = architecture, 3 = geology, 4 = business, 5 = forestry, 6 = education, 7 = engineering, 8 = art). The task is to run a cluster analysis on these sample variables.

[Figure: survey data for the 50 students]

Step 1: analyze and organize the data. Because the variables include both continuous and categorical ones, two-step cluster analysis is used. Define the variables as shown in the figure above, enter the data, and save the file.

[Figures: variable definitions and entered data]
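For readers who prepare the data outside SPSS, the variable layout can be sketched as below. Only the discipline codes come from the article. The record type, the field names, the gender coding, the `outcome` field standing in for the continuous variable, and the two sample rows are all hypothetical.

```python
from dataclasses import dataclass

# Discipline codes as listed in the article.
DISCIPLINE_LABELS = {
    1: "agriculture", 2: "architecture", 3: "geology", 4: "business",
    5: "forestry", 6: "education", 7: "engineering", 8: "art",
}

@dataclass
class StudentRecord:
    gender: str       # categorical; the "F"/"M" coding is an assumption
    discipline: int   # categorical, codes 1-8
    outcome: float    # placeholder name for the continuous variable

    def __post_init__(self):
        if self.discipline not in DISCIPLINE_LABELS:
            raise ValueError(f"unknown discipline code: {self.discipline}")

def split_by_type(records):
    """Separate categorical and continuous variables, as the clustering dialog expects."""
    categorical = [(r.gender, r.discipline) for r in records]
    continuous = [r.outcome for r in records]
    return categorical, continuous

# Illustrative rows only; these are not the survey data.
records = [StudentRecord("F", 4, 1.0), StudentRecord("M", 7, 2.5)]
categorical, continuous = split_by_type(records)
print(DISCIPLINE_LABELS[records[0].discipline])  # business
```

Keeping the two variable types apart from the start mirrors the SPSS dialog, where categorical and continuous variables are entered in separate lists.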

Step 2: configure the two-step cluster analysis. Choose the settings shown in the figures below.

[Figures: two-step cluster analysis settings]

Step 3: main results and analysis. The procedure groups all cases into 3 clusters. The average silhouette value is 0.6, which indicates that the cluster quality is fairly good.

[Figures: cluster analysis output]
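To make the silhouette value concrete, here is a minimal sketch of the textbook definition: for each case, a is the mean distance to the other cases in its own cluster, b is the smallest mean distance to the cases of any other cluster, and the score is (b - a) / max(a, b). The measure SPSS reports may be computed differently, and the points below are made up. The function assumes at least two clusters.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette over all cases, using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from case i to every other case, grouped by cluster.
        dist_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dist_by_cluster.get(labels[i])
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for lab, d in dist_by_cluster.items() if lab != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up points forming two tight, well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(good, bad)
```

Values close to 1 mean that cases sit much nearer to their own cluster than to any other, values near 0 mean overlapping clusters, and negative values suggest that cases have been assigned to the wrong cluster.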

Preview of the next issue: in this issue, we learned the theory and basic application of two-step clustering. In the next issue, we will learn the theory of K-means clustering and work through an example.

That's all for today. If you have any thoughts on today's article, please leave us a message. See you tomorrow, and have a happy day!

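References: Baidu Baike (百度百科); 《SPSS 23 统计分析实用教程》

Translation: Baidu Translate (百度翻译)

This article is an original work by learningyard新学苑. Some of the text and images come from other sources. If there is any infringement, please contact us for removal.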
