
Chen Yunsong | Macro Quantitative Sociology: Humanities and Social Sciences Applications of Big Data

Author | Build the Tower of Babel again

This article is reprinted from | Southern Governance


Macro Quantitative Sociology: Humanities and Social Sciences Applications of Big Data

Keynote speech by Professor Chen Yunsong at the "High-end Forum on the Integration and Innovation of Computing, Humanities and Social Sciences" and the inaugural meeting of the "Computational Social Science Research Center", held by the University of Chinese Academy of Social Sciences

Speaker introduction


Professor Chen Yunsong holds a Ph.D. in Sociology from the University of Oxford and is Director of the Center for Chinese and American Studies at Nanjing University-Johns Hopkins University, Professor and Doctoral Supervisor in the Department of Sociology, a "Young Changjiang Scholar" of the Ministry of Education, Chief Expert of Major Projects of the National Social Science Foundation, Executive Director of the International Chinese Sociological Association, and an editorial board member of Social Science Research and other journals. His main research areas are computational sociology, big data, social networks, and social governance. He has published many papers in important journals at home and abroad, such as the British Journal of Sociology, Social Networks, Poetics, Social Sciences in China, and Sociological Studies, as well as in media such as People's Daily and Guangming Daily. He has won the first "Young Scholar Award" of the China Urban 100 Forum and a first prize for Outstanding Achievements in Philosophy and Social Sciences of Jiangsu Province.

The overall points of Professor Chen's presentation were:

Between the field of traditional quantitative sociological analysis and the field of social computing, which directly presents and dissects complex data and complex phenomena, lies a middle zone that is an important, much-needed, and promising new research area: constructing macro sociological indicators that were difficult to obtain in earlier social surveys from big data, using machine learning, social network analysis, and other means, and then feeding those indicators into classical econometric models for causal inference. This hybrid approach to computational sociology helps to form a genuine theory-data dual drive.

The title of my keynote speech is "Macro Quantitative Sociology: How to Apply Big Data in the Humanities and Social Sciences". My entry point is deliberately small and specific. Sociology develops along quantitative, qualitative, and theoretical lines, three carriages driven together; how can we let big data and the methods of social computing help sociology develop? My feeling is this: the advent of big data provides one very important function, namely that it supplies indicators that traditional research could not measure.

Why do I focus on this feature? Because big data and social computing can provide many paradigm breakthroughs for quantitative sociology. Consider, for example, the structure of an article, or how to make predictions with machine learning. Our team is doing this too: we are using machine learning to predict the sexual orientation of college students in the Beijing capital region, because a direct questionnaire may not yield reliable data. But when such a method is applied, it reaches very few people; many are not used to such articles, cannot understand them, and are unwilling to read them. We hope to build a transitional zone, a field between traditional quantitative analysis and the methods of big data and social computing.

This field is exactly what contemporary sociology needs most. I call it macro quantitative sociology, and I will report on it specifically below.

1. Levels of different variables in quantitative sociological research

As shown in the image above.

The hierarchy of different variables in quantitative sociological research is easy to understand: there are explanatory variables (X) and explained variables (Y). When both X and Y are at the micro level, quantitative sociological analysis is simple. How? For example, suppose I ask: does a person's income affect his happiness? The method is very simple. I run a questionnaire survey, sample 3,000, 5,000, or 10,000 people, and then do a regression analysis on them. That is the standard method of quantitative sociology: questionnaire methodology plus regression analysis.
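A minimal sketch of this standard micro-micro workflow, using simulated questionnaire data (the income and happiness figures and the effect size are invented for illustration) and an ordinary least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                # hypothetical questionnaire sample size
income = rng.normal(50, 15, n)          # simulated annual income (arbitrary units)
# toy data-generating process: happiness rises with income plus individual noise
happiness = 2.0 + 0.05 * income + rng.normal(0, 1, n)

# ordinary least squares: regress happiness on income with an intercept
X = np.column_stack([np.ones(n), income])
beta, *_ = np.linalg.lstsq(X, happiness, rcond=None)
print(f"intercept={beta[0]:.2f}, income coefficient={beta[1]:.3f}")
```

With a few thousand respondents the estimated income coefficient lands close to the value built into the simulation, which is all the standard questionnaire-plus-regression design requires.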

The second type of problem: when the explained variable Y is at the micro individual level and the explanatory variable X is at the macro group level, traditional quantitative methods still work. For example, we analyze urban income inequality, such as whether each city's Gini coefficient affects individual happiness. The same questionnaire is used, but the model goes from a single level to multiple levels: questionnaire analysis plus multilevel regression.
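The macro-to-micro design can be sketched by attaching each city's Gini coefficient to its residents. For simplicity the sketch below fits a pooled OLS with the city-level variable repeated across individuals; a real analysis would use a multilevel (random-intercept) model, and all numbers here are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cities, n_per = 50, 100
gini = rng.uniform(0.3, 0.5, n_cities)        # city-level Gini coefficient
city = np.repeat(np.arange(n_cities), n_per)  # each respondent's city index
income = rng.normal(50, 15, n_cities * n_per)
# toy process: happiness rises with own income, falls with city inequality
happiness = 3.0 + 0.04 * income - 5.0 * gini[city] + rng.normal(0, 1, income.size)

# pooled OLS with the macro variable merged onto each individual;
# a proper multilevel model would add city-level random intercepts
X = np.column_stack([np.ones(income.size), income, gini[city]])
beta, *_ = np.linalg.lstsq(X, happiness, rcond=None)
print(f"income coefficient={beta[1]:.3f}, Gini coefficient={beta[2]:.2f}")
```

The key design point is the merge step: one macro value per city is broadcast to all of that city's respondents before estimation.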

The third category: Y is at the macro group level while X is at the micro individual level. For example, when we want to understand how personal income affects the overall well-being of cities, traditional questionnaires and regressions cannot handle it; there is nothing to regress. This reflects the transition of social phenomena from the individual to the group, from the micro to the macro level. How, then? Scholars who promote computational science know that we should use simulation, that is, agent-based modeling (Agent-Based Simulation).
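A minimal agent-based sketch (the income-comparison rule is invented for illustration and is not the speaker's model) shows how a macro outcome, a city's average happiness, emerges from repeated micro interactions rather than from a regression equation:

```python
import random

random.seed(42)

class Agent:
    """A resident whose happiness depends on income comparisons with others."""
    def __init__(self):
        self.income = random.gauss(50, 15)
        self.happiness = 0.0

def step(agents):
    # toy rule: each agent compares income with one randomly chosen agent;
    # happiness rises when ahead, falls when behind
    for a in agents:
        other = random.choice(agents)
        a.happiness += 1 if a.income >= other.income else -1

agents = [Agent() for _ in range(1000)]
for _ in range(100):
    step(agents)

# the macro outcome (mean city happiness) emerges from micro interactions
mean_happiness = sum(a.happiness for a in agents) / len(agents)
print(f"mean city happiness after 100 steps: {mean_happiness:.2f}")
```

Changing the micro rule and re-running the simulation is how such models explore which individual behaviors produce which city-level outcomes.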

Finally, when both X and Y are at the macro group level, how do we do quantitative research in the social sciences? For example, I may care about how income inequality at the city level affects a city's overall average happiness. Sociology has paid very little attention to such questions, especially compared with economics, which has a great deal of macro analysis at the municipal and provincial levels; in sociology there is very little.

Why is there so little macro quantitative social analysis? As shown in the figure above, I summarize three reasons. First, there are not many macro social indicators. What does that mean? Statistical departments at all levels, from the provincial and municipal bureaus to the central statistical bureau, mostly count economic indicators; they rarely ask: are you happy? Do you trust others? What is the level of trust? Indicators of sociological concern are scarce, and we lack such data at the county, city, provincial, national, and societal levels. Second, the sample size of such analyses is limited. There are a little over 30 provincial-level administrative regions and a little over 300 municipal-level regions in the country, so N is small, though time series or panel data can help. Third, macro analysis faces the ecological fallacy: two variables X and Y that are positively correlated at the individual level may be uncorrelated at the municipal, provincial, or national level. The ecological fallacy is an issue that deserves attention in macro research. As shown in the figure below:
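The ecological fallacy can be reproduced in a few lines of simulation: in the invented data below, X and Y are positively correlated across individuals, yet the city means of X and Y are negatively correlated, so a macro regression would point the wrong way:

```python
import numpy as np

rng = np.random.default_rng(7)
n_groups, n = 30, 200                  # e.g., 30 cities, 200 residents each
gx = np.linspace(0, 2, n_groups)       # city means of X (slightly increasing)
gy = 2 - gx                            # city means of Y (slightly decreasing)

xs, ys, gs = [], [], []
for g in range(n_groups):
    x = gx[g] + rng.normal(0, 3, n)                       # wide within-city spread
    y = gy[g] + 0.8 * (x - gx[g]) + rng.normal(0, 1, n)   # within-city slope is positive
    xs.append(x); ys.append(y); gs.append(np.full(n, g))
x, y, group = np.concatenate(xs), np.concatenate(ys), np.concatenate(gs)

r_individual = np.corrcoef(x, y)[0, 1]                    # micro-level correlation
xbar = np.array([x[group == g].mean() for g in range(n_groups)])
ybar = np.array([y[group == g].mean() for g in range(n_groups)])
r_group = np.corrcoef(xbar, ybar)[0, 1]                   # macro-level correlation
print(f"individual-level r = {r_individual:.2f}, group-level r = {r_group:.2f}")
```

Because the within-city relationship and the between-city relationship have opposite signs, inferring individual behavior from aggregated data alone is exactly the fallacy macro analysis must guard against.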

Early sociologists attached great importance to macro research. Precisely because of the possibility of ecological fallacies, coupled with the maturation of household questionnaire survey techniques at the time, by the middle of the 20th century quantitative sociologists had shifted from studying the macro level of states, counties, and provinces to studying the individual level. As a result, those of us who do quantitative sociology differ from economists: the overall routine and models of quantitative sociology analyze samples of individuals. A study is a sample of 10,000, 5,000, or 500,000 people; X and Y are basically at the individual level, and there is no macro-level data. This leads to three disadvantages. The first is that it is not conducive to developing grand theories. Without data on large temporal and spatial scales, it is difficult to verify grand theories empirically, which invites the criticism that quantitative research is addicted to technique, confined to a very narrow individual level, and poorly matched to large theories. As shown below:

Second, there are also problems with causal inference. More importantly, I think it is not conducive to understanding the social transition that Coleman first proposed: from individual phenomena to group phenomena, and then group phenomena affecting individual phenomena. What is the intermediate process? This is an area well worth studying. For example, if X and Y are positively correlated at the individual level but negatively correlated at the group level, why? Sociologists should study this, but traditional data collection methods and questionnaires can hardly provide such social indicators.

2. Restarting macro quantitative social analysis at scale

So I now propose that the advent of computational methods, especially big data, can restart macro quantitative social analysis. As shown below:

What is the value of restarting macro quantitative social analysis? First, it can provide macro data that earlier questionnaire surveys could not obtain, such as the ideological map of ordinary people in Chinese society over the past 100 years, or social trust in American society over the past 200 years; such large indicators could not be measured before. More importantly, as I just mentioned, it can form a transitional stage between approaches: extract such important indicators from big data with computational social science methods, then feed the packaged and revised indicators into traditional econometric models, such as OLS models, time series models, and panel data models, and use those models for regression analysis. In this way a transitional field forms between traditional quantitative sociology and the full use of computational methods such as machine learning, social network analysis, and agent-based simulation. I think this transitional field is of great significance to the development of contemporary sociology, especially quantitative sociology.

3. Restarting macro quantitative social analysis, example 1: time series analysis

Let me give a few simple examples. Why do I say this?

In particular, as Mr. Luo Jiaojiao, a senior scholar, said just now, judging from domestic publications, Chinese sociologists are still at the stage of describing big data, and relatively few directly use big data for analysis. Our team uses the macro quantitative sociological method I just described: extract indicators that traditional econometric models can analyze from big data, and then conduct meaningful, theoretically valuable, theoretically grounded sociological analysis. The work is mainly published in English-language journals, which also brings Chinese scholars' contemporary research in big data and computational sociology to the world.

Let me give a few examples. The first, which we published in Social Science Research, is a study of the class consciousness of ordinary Americans over the past 100 years. As shown in the figure:

The starting point for this study is simple: the year before last was the 200th anniversary of Marx's birth. What did Marx observe when he developed the theory of class consciousness? Britain and Germany in the 19th century. But can such a grand theory also explain the developed United States of the 20th century? Can it explain the 21st?

For example, we want to analyze whether the class consciousness of Americans over the past 100 years, say from 1900 to 2000, is related to the Gini coefficient and income inequality of the United States as a whole over the same period. Data on social inequality in the United States are relatively easy to obtain, such as Gini data for American society over the past 100 years.

But it is difficult to run social surveys on the class consciousness of American society across 100 years: many of those people have died, and you cannot go to the United States now, find the people of 1920 or 1930, and survey them. What do we do? We use a cultural big data database, the Google Ngram Viewer, as shown in the figure above, and extract a large number of class-related words, as shown in the following table:


We use the frequency of these words in books to represent the American public's attention to the phenomenon of class, as shown in the figure below. Why is this possible? Because books are an important carrier of almost all human knowledge and ideas.

So we analyze it like this (as shown in the two figures below).

Then we use statistical methods to compress these word-frequency indicators into a single index of American society's class concern over the past 100 years, the red line you can see in the figure below, and then we analyze it.
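The speech does not name the exact compression method, so as an illustrative assumption the sketch below uses the first principal component, a common way to collapse several correlated word-frequency series into one index; the yearly frequencies are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1900, 2001)

# hypothetical yearly relative frequencies of five class-related words,
# each tracking a common latent trend plus word-specific noise
trend = np.sin((years - 1900) / 100 * np.pi) + 0.005 * (years - 1900)
freqs = np.column_stack([trend + rng.normal(0, 0.1, years.size) for _ in range(5)])

# first principal component of the standardized series as the combined index
z = (freqs - freqs.mean(0)) / freqs.std(0)
eigvals, eigvecs = np.linalg.eigh(np.cov(z.T))
pc1 = z @ eigvecs[:, -1]               # component with the largest eigenvalue
if np.corrcoef(pc1, trend)[0, 1] < 0:  # eigenvector sign is arbitrary; align it
    pc1 = -pc1
print("share of variance explained:", eigvals[-1] / eigvals.sum())
```

When the word series share one underlying movement, the first component captures most of the joint variance, which justifies reading it as a single "class concern" curve.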

Similarly, there is the well-known Werther effect in suicide research: when celebrities commit suicide, others imitate them. As shown below:

Taking American society as an example: over the past 100 years, is the suicide that circulates in American books related to real suicide? We take the same big data approach. We extract people's macro social consciousness from 100 or 50 years of book big data, an indicator that traditional questionnaires cannot obtain, and then put it into a traditional standard econometric model, such as a time series model, for analysis (Figure 3 below). That is the first aspect I talked about.
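A hedged sketch of such a time-series analysis, with simulated data standing in for the book-based "suicide salience" index and the real suicide rate (the speaker's actual specification is not given, so a simple dynamic regression with one lag is assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100                                     # e.g., 100 yearly observations
salience = np.cumsum(rng.normal(0, 1, T))   # simulated book-based salience index
# toy process: this year's rate depends on last year's rate and last year's salience
rate = np.zeros(T)
for t in range(1, T):
    rate[t] = 0.5 * rate[t - 1] + 0.3 * salience[t - 1] + rng.normal(0, 0.5)

# dynamic regression: rate_t on a constant, rate_{t-1}, and salience_{t-1}
X = np.column_stack([np.ones(T - 1), rate[:-1], salience[:-1]])
beta, *_ = np.linalg.lstsq(X, rate[1:], rcond=None)
print(f"lagged rate={beta[1]:.2f}, lagged salience={beta[2]:.2f}")
```

The lagged-salience coefficient is the quantity of interest here: a positive estimate is the time-series analogue of "what circulates in books this year is related to real behavior next year". Real work would also test stationarity and try richer lag structures.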

4. Restarting macro quantitative social analysis, example 2: panel data analysis

The second is the analysis of panel data.

Following what I just described, the time series can be extended from the national level to the state and provincial levels.

For example, we sociologists have done research in the fields of economics and finance, studying what is related to foreign investment in China's provinces. Economists have done much research here, and their explanatory variables are economic indicators such as industrial agglomeration, labor costs, and education levels. What do we care about? We believe that, other things being equal, because investment is risky behavior, the degree of a region's, city's, or province's international salience, how often it is mentioned, is related to investment. So we used similar data to build a 20-year panel data model for each province of China. Our method is to extract each province's international salience indicators from massive big data, and then feed them into the panel models familiar to econometricians and quantitative sociologists, such as dynamic panel models and two-way fixed effects models, so as to analyze the impact of cultural factors on economic behavior. That is the second aspect: the use of panel data. This is shown in the four figures below.
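A minimal sketch of the two-way fixed-effects (within) estimator on simulated province-by-year data; the salience index, the fixed effects, and the true coefficient are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
P, T = 31, 20                                        # provinces x years
salience = rng.normal(0, 1, (P, T)).cumsum(axis=1)   # simulated salience index
prov_fe = rng.normal(0, 2, (P, 1))                   # unobserved province effects
year_fe = rng.normal(0, 1, (1, T))                   # unobserved year shocks
invest = 0.4 * salience + prov_fe + year_fe + rng.normal(0, 0.5, (P, T))

# two-way fixed effects via double demeaning: subtracting province and year
# means (and adding back the grand mean) removes both additive effects exactly
def demean(a):
    return a - a.mean(1, keepdims=True) - a.mean(0, keepdims=True) + a.mean()

xd, yd = demean(salience), demean(invest)
beta = (xd * yd).sum() / (xd ** 2).sum()
print(f"within estimate of the salience effect: {beta:.3f}")
```

The point of the within transformation is that any province-specific or year-specific confounder that is constant along its dimension drops out, so the estimate isolates the within-province, within-year association.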

What else can big data offer?

5. Restarting macro quantitative social analysis, example 3: network structure data

Big data can also provide us with network data. Let me give an example. We know there are flows of people and goods between cities and between regions. As shown below:

In the information society, we care about the flow of information between regions. What phenomena do we care about? For example, I invite the teachers and experts here to consider two provinces, Shanghai and Anhui. Do Shanghainese search for "Anhui" more, or do Anhui people search for "Shanghai" more? If we think about it, Anhui probably searches for "Shanghai" more, because besides travel to Shanghai, searches may involve employment, going to university, and so on; Anhui people already account for more than a third of Shanghai's floating population. In the information space, I can take the volume of Shanghai searching for "Anhui" and of Anhui searching for "Shanghai" and multiply them to construct the cultural attraction between the two provinces in the information flow space. Of course, an "attraction" indicator, like gravity, is not interesting enough. What concerns me is the gap between Shanghai and Anhui in mutual search: for example, divide Anhui's searches for "Shanghai" by Shanghai's searches for "Anhui". This number must be greater than 1, but what does it represent? It means that in the Internet information flow space, Shanghai is culturally narcissistic or involuted, or simply does not care much about Anhui, while Anhui cares very much about Shanghai. In this way we can propose corresponding concepts and new theories to study social and cultural phenomena.
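With hypothetical search volumes (the numbers below are invented; the speech gives no actual figures), the two indicators just described can be computed directly:

```python
# hypothetical cross-province search volumes (arbitrary units):
# searches[a][b] = volume of searches for province b made from province a
searches = {
    "Shanghai": {"Anhui": 120},
    "Anhui": {"Shanghai": 480},
}

# "attraction" between two provinces: product of the two directed volumes
attraction = searches["Shanghai"]["Anhui"] * searches["Anhui"]["Shanghai"]

# "gap": the larger directed volume divided by the smaller (always >= 1);
# a large gap means attention flows mostly one way
a_to_b = searches["Anhui"]["Shanghai"]
b_to_a = searches["Shanghai"]["Anhui"]
gap = max(a_to_b, b_to_a) / min(a_to_b, b_to_a)
print(f"attraction={attraction}, gap={gap:.1f}")  # attraction=57600, gap=4.0
```

The attraction index is symmetric, like a gravity term, while the gap index is directional: here it says Anhui searches for Shanghai four times as much as the reverse, which is the asymmetry the speaker reads as involution or one-way cultural attention.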

We did an interesting analysis. This graph puts together the interactions between each pair of provinces as retrieved on the Internet, their mutual searches on Baidu.

As shown in the figure above, the thicker the line, the stronger the interaction. Which two provincial administrative units interact the most? Beijing and Hebei.

In the picture above, you can see the line traced by the mouse. You can imagine why the attraction between Beijing and Hebei is strongest: you care about me, and I care about you.

I just spoke of cultural involution and penetration. Where is Shanghai's penetration of Anhui reflected? We did another analysis of the search gap and distance between each pair of provinces. As shown in the figure below:

As shown in the picture above, what do we find? Which two provinces have the largest inter-provincial search gap? Beijing and Tianjin. This means Beijingers may not pay much attention to Tianjin on the Internet and do not search for it much, while Tianjin people are highly concerned about Beijing and search for it in large volumes.

We then build such indicators for each province and conduct socio-economic analysis, establishing a traditional econometric model from the perspective of mechanisms: we analyze whether involution, penetration, and attraction are related to the province's per capita income, urban residents' disposable income, per capita GDP, average education level, and so on. This is shown in the figure below.

In this way, we again use the same method: extract and construct sociologically meaningful indicators from the big data of Internet searches, and then return to traditional econometric models for analysis. This sits between traditional quantitative sociological analysis and computational sociology as a new paradigm in the full sense, building a transitional field. Such a field, macro quantitative sociological analysis, is not only a supplement to traditional standard quantitative sociology but also an important field and an important stage in the development of computational sociology.

I will end my 15-minute report here today. Thank you!

(This article is based on the recording of Professor Chen Yunsong's keynote speech at the forum)
