
Out of One, Many: Simulating a Sample of Survey Respondents with a Large Language Model

Author: Build the Tower of Babel Again

Editor's Recommendations:

The innovation of this paper is that it starts from the working principles of large language models and shows that, with appropriate conditioning, the fine-grained biases within a language model can be harnessed to simulate the responses of specific populations with reasonable accuracy. The authors call this accuracy "algorithmic fidelity" and set out four criteria for testing whether GPT-3 is accurate enough. The paper demonstrates that GPT-3 can simulate the responses of various subpopulations in the context of U.S. elections with broadly acceptable accuracy, illustrating the potential of large language models in political science and social science research, such as running low-cost, reasonably accurate simulated experiments or surveys on a specific population. It is important to note that the "algorithmic fidelity" established in this paper is limited to U.S. elections and American political opinion. For research on other countries and regions, model-generated responses may diverge from human responses, and further research is needed.


Summary:

The application of AI is sometimes limited by biases within models, such as racist associations, which are often treated as uniform, model-wide properties. This study shows that the biases in the GPT-3 language model are instead fine-grained and demographically correlated, which means that appropriate conditioning allows the model to simulate the response distributions of a wide range of human subpopulations with considerable accuracy. The article calls this property "algorithmic fidelity" and explores its extent in GPT-3. The authors created "silicon samples" conditioned on the socio-demographic backstories of thousands of real human participants and compared them with the human samples, showing that the correspondence between GPT-3's output and human responses goes far beyond surface similarity: it reflects the complex interplay among ideas, attitudes, and sociocultural contexts that characterizes human attitudes. Language models with algorithmic fidelity are therefore a powerful tool for advancing the understanding of humans and society across disciplines.

About the Authors:

Lisa P. Argyle, Assistant Professor, Brigham Young University

Ethan C. Busby, Assistant Professor, Brigham Young University

Nancy Fulda, Director of the DRAGN Lab, Brigham Young University

Joshua Gubler, Associate Professor, Brigham Young University

Christopher Rytting, PhD, Brigham Young University

David Wingate, Associate Professor, Brigham Young University

Source:

Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337–351.


Nancy Fulda, one of the authors of this paper

1. Introduction

In recent years, machine learning tools have greatly advanced social science research. However, the potential of large language models like GPT-3 to deepen our understanding of human social and political behavior has been largely overlooked. The authors argue that these models can partially substitute for human respondents in a variety of social science studies.

AI models tend to absorb their creators' biases about race, gender, economics, and other topics, and most discussions treat such bias as a single, macro-level property of the model. The authors argue that it is better understood as a reflection of the patterns of association among human ideas, attitudes, and contexts. Their research shows that the same language model, when appropriately conditioned, produces outputs that both favor and oppose specific groups and viewpoints, and that these outputs closely match the response patterns of humans with the corresponding characteristics. By "conditioning" simulated "individuals" on target identities and personality traits (conditioning here means supplying the model with inputs that steer it toward a desired kind of output), it is possible to select among diverse and often distinct response distributions within the model, each closely corresponding to a real human subpopulation. The authors refer to the degree to which the model accurately reflects these distributions as "algorithmic fidelity."

The "algorithmic fidelity" of a language model is critical for its application in the social sciences, as it allows researchers to gain insight into the different attitudes and perception patterns of many groups and combinations of those groups from a single language model. In three studies, GPT-3 was trained on the characteristics of respondents in multiple large surveys in the U.S., resulting in evidence that GPT-3 met the criteria for "algorithmic fidelity." These surveys include data from the American National Election Studies (ANES) and Rothschild et al.'s "Pigeonholing Partisans." By training the model, an AI-simulated "Silicon Subject" (referring to virtual respondents trained using a language model) was generated for each human study participant in the three sets of studies, and then these virtual respondents were asked to complete the same tasks as the human respondents. To assess the fidelity of the algorithm, the authors discuss the extent to which the complex patterns of relationships between thoughts, attitudes, and situations in "silicon-based people" reflect this relationship in human groups.

In Study 1, the authors asked GPT-3 silicon subjects to list words describing members of the two major U.S. parties and examined how closely these words match the corresponding human word lists. In Studies 2 and 3, the authors explored the relationships among demographics, attitudes, and reported behaviors; the results show that the silicon subjects generated by GPT-3 exhibit patterns of interaction among ideas, attitudes, and contexts similar to those of humans. As a result, at least in the context of U.S. politics, researchers can use GPT-3 for "silicon sampling" (using a conditioned language model to generate large numbers of virtual respondents) to explore research hypotheses before running studies on human subjects.

2. Principles of the GPT-3 model

Formally, a language model like GPT-3 defines a conditional probability distribution p(xn | x1, ..., xn-1) over tokens, where each xi comes from a fixed vocabulary (a token is the smallest unit the model works with when representing text or speech; in natural language processing a token may be a word, a punctuation mark, or a word fragment). By iteratively sampling from this distribution, the language model can generate arbitrarily long text sequences. Before it can generate text, however, a model like GPT-3 must be conditioned, that is, supplied with an initial sequence of tokens {x1, ..., xn-1}. The authors refer to this conditioning text as the model's "context." For example, given the context {x1, x2, x3} = "Can you come", a language model might assign x4 = "home" a high probability and x4 = "bananas" a low probability, but changing one word of the context to {x1, x2, x3} = "Can you eat" reverses this. At each generation step, the model estimates a probability distribution over the vocabulary, corresponding to how likely each token would be as the next observed xi if the model were reading pre-written text. Using this distribution, it selects one of the most likely candidates, appends the new xi to the context, and repeats the process. Generation continues until a prespecified number of tokens has been produced or an external intervention stops the process.
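The generation loop just described can be illustrated with a minimal sketch. The function next_token_distribution below is a hypothetical stand-in for the model's estimated distribution, not part of any real GPT-3 API:

```python
import random

def next_token_distribution(context):
    """Hypothetical stand-in for the language model: given the context
    (a list of tokens), return a dict {token: probability} over the vocabulary."""
    raise NotImplementedError

def generate(context, max_new_tokens=50, stop_token=None):
    """Iteratively sample from p(x_n | x_1, ..., x_{n-1}) to extend the context."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        # Sample the next token in proportion to its estimated probability.
        next_tok = random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
        tokens.append(next_tok)
        if next_tok == stop_token:  # e.g., an end-of-text marker or other stop condition
            break
    return tokens
```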

3. Algorithmic fidelity

The authors define algorithmic fidelity as the extent to which the complex patterns of relationships among ideas, attitudes, and sociocultural contexts within the model accurately reflect the corresponding patterns in human subgroups. The central assumption behind this concept is that the text generated by the model is not drawn from a single overall probability distribution but from a combination of many distributions. The authors argue that the high-level, human-like output of language models rests on underlying associations among concepts in the model that parallel those of humans. This means that, given basic human demographic context, the model reveals latent patterns of association among concepts, opinions, and attitudes that mirror those recorded from humans with matching backgrounds. To demonstrate algorithmic fidelity, a language model must therefore provide repeated, consistent evidence that it meets the following four criteria:

Criterion 1 (Social Science Turing Test): Responses generated by the model are indistinguishable from parallel human texts.

Criterion 2 (Backward Continuity): The responses generated by the model are consistent with the attitudinal and socio-demographic information in its conditioning context, such that a human viewing the responses can infer those elements of the input.

Criterion 3 (Forward Continuity): Responses follow naturally from the conditioning context provided, reliably reflecting its form, tone, and content.

Criterion 4 (Pattern Correspondence): The responses generated reflect the underlying patterns of relationships between thoughts, demographics, and behaviors that can be observed in human-generated similar data.

The authors do not propose specific metrics or numerical thresholds for these criteria, because the appropriate statistics depend on the data structure and disciplinary standards involved. In their view, the best evidence is repeated satisfaction of each criterion across multiple data sources, different measures, and multiple populations.

4. Silicon sampling

Applying language models to social science research raises a problem: the Internet users whose text the model is trained on are neither representative of the populations of interest nor demographically balanced, and the model is trained on a snapshot of the Internet taken at a fixed point in time.

The authors propose a method called "silicon sampling" to correct the skew in the model's marginal statistics. GPT-3 jointly models voting pattern V and demographic backstory B_GPT-3 as P(V, B_GPT-3) = P(V | B_GPT-3) P(B_GPT-3). However, for most populations of interest to social scientists (for example, all citizens eligible to vote), the model's backstory distribution P(B_GPT-3) does not match the true distribution P(B_true); left uncorrected, this skews any conclusion about the marginal voting pattern P(V) = ∫_B P(V | B) P(B_GPT-3) dB.

To overcome this problem, the authors exploit the conditional nature of the language model: they draw backstories from a known, nationally representative sample (e.g., ANES) and estimate P(V) from those sampled backstories, which amounts to computing P(V | B_ANES) P(B_ANES). As long as the conditional distribution P(V | B) is modeled well, the patterns of any given population can be studied this way. Note, however, that the ability to sample from GPT-3's constituent text distributions does not by itself guarantee that these distributions faithfully reflect the behavior of a particular human subgroup. For this reason, the researcher must first check the model's algorithmic fidelity for the field of study and the relevant population groups.
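A minimal sketch of this reweighting idea, assuming a hypothetical helper p_vote_given_backstory that returns the model's conditional probability of a given vote choice for one backstory (one way such a probability can be obtained is sketched under Study 2 below):

```python
def p_vote_given_backstory(backstory: str) -> float:
    """Hypothetical helper: the model's probability P(V = Republican | B = backstory)."""
    raise NotImplementedError

def silicon_sample_estimate(representative_backstories):
    """Estimate the marginal P(V) by averaging the model's conditional probabilities
    over backstories drawn from a representative sample (e.g., ANES),
    rather than from the model's own backstory distribution."""
    probs = [p_vote_given_backstory(b) for b in representative_backstories]
    return sum(probs) / len(probs)
```

Because the backstories come from the representative sample, the average approximates ∫_B P(V | B) P(B_ANES) dB rather than the skewed marginal discussed above.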

5. The studies

a) Study 1

The first test of GPT-3's algorithmic fidelity uses the "Pigeonholing Partisans" data from Rothschild et al. That survey asked respondents to list four words describing Republicans and Democrats. Rothschild et al. found that people talk about partisans in distinct ways, focusing on traits, political issues, social groups, or a combination of the three, and that people tend to be more positive when describing their own party than the other party. In this study, the authors used silicon sampling to ask whether GPT-3 could generate text about U.S. partisans that is indistinguishable from human-generated word lists. To this end, the authors constructed a first-person backstory for each human subject in the "Pigeonholing Partisans" survey, generating a silicon subject, as shown in Figure 1. Conditioning on these texts, the authors asked GPT-3 to generate new word lists. Given the conditioning context, GPT-3 almost always responds with a list of exactly four words, although, like humans, it occasionally produces longer or shorter lists, short passages, or no response. After processing the generated text with regular expressions, the authors extracted the final dataset from each sample.


Figure 1: Examples of the conditioning context and completions for four silicon subjects in Study 1. Plain text is the conditioning context; underlined words are the demographic information inserted into the template; the words in blue are the four generated words.
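The template-filling and post-processing steps might look like the following sketch. The template wording and field names are illustrative only, not the authors' exact prompt:

```python
import re

# Illustrative first-person backstory template; the authors' actual wording differs
# and is filled from each respondent's survey record.
TEMPLATE = (
    "Ideologically, I describe myself as {ideology}. Politically, I am {party}. "
    "Racially, I am {race}. I am {gender}. "
    "When I am asked to write four words that typically describe {target_party}, I write:\n1."
)

def build_context(respondent: dict, target_party: str) -> str:
    """Fill the template with one respondent's demographic information."""
    return TEMPLATE.format(target_party=target_party, **respondent)

def extract_four_words(completion: str):
    """Pull a numbered four-word list out of the model's completion, if present."""
    words = re.findall(r"\d\.\s*([A-Za-z\-']+)", "1." + completion)
    return words[:4] if len(words) >= 4 else None  # discard malformed completions
```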

Figure 2 compares the words most commonly used in the dataset to describe Democrats and Republicans, broken down by data source (GPT-3 or human responses) and the ideology of the respondent (Democrat vs. Republican). Bubble size represents the relative frequency of each word; the columns represent the ideology of the person who wrote the response. Qualitatively, the vocabulary that both humans and GPT-3 use for the two parties is in line with political scientists' expectations. For example, both GPT-3 and humans use a common set of words to describe Democrats, while few of these words are used to describe Republicans.


Figure 2: The original "Pigeonholing Partisans" dataset and the corresponding GPT-3-generated words. Bubble size represents the relative frequency of each word; the columns represent the ideology of the list writer. GPT-3 uses words similar to those used by humans.

To analyze the data formally, the authors recruited 2,873 people through the survey platform Lucid to evaluate 7,675 survey responses generated by humans and by GPT-3, without indicating which responses came from which source. Each person evaluated 8 randomly assigned responses, and each response was evaluated by 3 different people.

The authors presented evaluators with a four-word survey response preceded by the prompt: "Consider the following description of (Republicans/Democrats):". They were then asked to answer 6 questions. First, the authors asked them to guess the party affiliation of the person who wrote the response (Republican, Democrat, or independent). The authors then asked them to rate the list on 5 dimensions: (1) positive or negative tone, (2) overall extremity, and whether the response mentioned (3) traits, (4) policy issues, or (5) social groups. Subjects then viewed eight additional randomly selected survey responses in turn; they were told that some of the responses were generated by a computer model and were asked to guess whether each list was human- or computer-generated.

With this design, the authors explored two social science variants of the Turing test: (1) whether human evaluators could distinguish human from GPT-3-generated survey responses; and (2) whether humans perceived the survey responses from the two sources as similar. These tests speak to Criterion 1 (Social Science Turing Test) and Criterion 2 (Backward Continuity).

The authors found evidence supporting both criteria: human participants judged 61.7% of human-generated survey responses to be human-generated, and 61.2% of GPT-3 survey responses to be human-generated (two-tailed test of the difference, p = 0.44). Asking participants to decide whether a survey response was human- or computer-generated led them to label some responses as non-human, but this tendency did not vary with the actual source of the response.

This is particularly interesting in light of the second question: whether participants noticed differences in the characteristics of human- versus GPT-3-generated survey responses. To test for such differences, the authors estimated ordinary least squares (OLS) regressions of each of the five evaluated characteristics (positivity, extremity, and mentions of traits, issues, and groups) on a dichotomous source variable (0 = human, 1 = GPT-3) and a set of control variables (the gender, race, income, age, and party affiliation of the original list writers recorded in the Rothschild et al. data). All models included rater fixed effects (because each rater assessed 8 lists) and standard errors clustered on raters and lists (because each list was assessed 3 times).
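A sketch of one such regression on synthetic stand-in data with hypothetical column names. For simplicity this sketch clusters standard errors on raters only; the paper clusters on both raters and lists:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per (rater, list) rating.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "positive": rng.integers(0, 2, n),      # rater's positivity judgment of the list
    "source": rng.integers(0, 2, n),        # 0 = human-written list, 1 = GPT-3 list
    "income": rng.normal(50, 15, n),        # list writer's income (toy values)
    "age": rng.integers(18, 90, n),         # list writer's age
    "party": rng.choice(["Dem", "Rep", "Ind"], n),
    "rater_id": rng.integers(0, 75, n),     # rater identifier
})

# OLS with rater fixed effects; standard errors clustered on raters.
model = smf.ols(
    "positive ~ source + income + age + C(party) + C(rater_id)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["rater_id"]})
print(model.params["source"])  # estimated difference attributable to GPT-3 vs. human source
```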

Figure 3(B) shows the percentage of all survey responses (human and GPT-3) rated as having each characteristic. The results show substantial consistency in evaluated content and tone between human- and GPT-3-generated survey responses. For example, human survey-response writers mentioned personality traits (e.g., "paranoia," "morality") more often than other components (72.3% of survey responses), and the same is true for GPT-3 (66.5% of survey responses). Fewer than half of the responses from either source were rated as extreme (39.8% and 41.0%, respectively). This similarity holds across all 5 characteristics, with all but one hovering around 50%. The only exception is "traits," which appears far more frequently in both the human and the GPT-3 data. This matches the pattern in the original human survey responses, and the fact that GPT-3 reproduces this exception, as well as the pattern for every other characteristic, is strong testimony to the depth of its algorithmic fidelity.

In addition, as shown in Figure 3(A), when the results are broken down further to explore the underlying patterns (Criterion 4, Pattern Correspondence), GPT-3 again mirrors human-like patterns. The similarity between humans and GPT-3 in the use of positive and extreme words is evident when the lists are grouped by the ideology of their authors.


Figure 3: Analysis of human and GPT-3 survey responses evaluated in the Lucid survey

The analysis above shows that (1) human evaluators cannot reliably distinguish human- from GPT-3-generated survey responses, and (2) their assessments of the content and characteristics of the two kinds of responses are very similar. To explore the extent to which participants could use these lists to correctly guess the true partisan leanings of the list authors, the authors estimated a model similar to the one above, regressing a dichotomous variable for whether the participant correctly guessed the list author's party affiliation (1 = yes; 0 = no) on the source of the survey response (GPT-3 vs. human) and the same controls. The bar chart on the far left of Figure 3(B) shows the rate of correct guesses by source type.

Whichever source participants saw, their probability of guessing the author's party affiliation was significantly higher than chance (33%, since subjects could guess Republican, Democrat, or independent), providing strong evidence of GPT-3's algorithmic fidelity. Subjects who saw human-generated survey responses were about 7.3 percentage points more likely to guess correctly (60.1% vs. 52.8%) than those who saw GPT-3 lists, a statistically significant difference (two-tailed test, p < 0.001). Texts from both humans and GPT-3 thus contain the clear affective cues needed to infer the partisan leanings of their creators.

The results of Study 1 indicate a high level of algorithmic fidelity in GPT-3: Criterion 1 (Social Science Turing Test) and Criterion 2 (Backward Continuity) are repeatedly and consistently supported, and there is preliminary evidence for Criterion 4 (Pattern Correspondence). In each case, the authors observe that these criteria are supported across different measurement methods and different subgroups of the U.S. population.

b) Study 2

In Study 2, the authors used the American National Election Studies (ANES) from 2012, 2016, and 2020 as data sources. They first consider how closely the distribution of voting choices reported by GPT-3 silicon samples, constructed from the demographics of ANES participants in 2012, 2016, and 2020, matches that of the corresponding human samples. The study requires GPT-3 to produce a voting choice from a limited set of options (such as voting for Trump in 2016) and to produce different choices depending on the human context provided, so it assesses Criterion 3 (Forward Continuity) and Criterion 4 (Pattern Correspondence). Study 2 also explores the possible effect of GPT-3's temporal limits: the model's training corpus extends no later than 2019, so the 2020 data allow the authors to examine how the language model's algorithmic fidelity holds up outside the period covered by the original training corpus.

The authors used the following ANES variables to condition GPT-3: (1) racial/ethnic self-identification, (2) gender, (3) age, (4) conservative-liberal ideological self-placement, (5) party identification, (6) political interest, (7) church attendance, (8) whether the respondent reported discussing politics with family and friends, (9) patriotic feeling toward the American flag, and (10) state of residence (note: (9) and (10) were not yet available for 2020 at the time of analysis). After conditioning GPT-3 on each backstory, the authors recorded the probability that the model would complete the sentence "In [year], I voted for ..." with the Republican versus the Democratic candidate. Using these ANES variables as the conditioning text lets the authors compare how well the GPT-3 silicon sample replicates the relationship between each variable and voting choice found in the human sample. Vote choice was coded 1 for a vote for the Republican candidate and 0 for a vote for the Democratic candidate, for both human respondents and GPT-3. To match GPT-3's predictions with the observed human data, the predicted probability was dichotomized at 0.50, with higher values indicating a vote for the Republican candidate.
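One way the conditional vote probability described above might be computed is sketched below. The function completion_probability is a hypothetical stand-in for querying the model for the probability of a given continuation, and the renormalization over the two major-party candidates is one reasonable choice, not necessarily the paper's exact procedure:

```python
def completion_probability(context: str, continuation: str) -> float:
    """Hypothetical stand-in: the model's probability of `continuation` given `context`
    (e.g., the product of its token-level probabilities)."""
    raise NotImplementedError

def p_republican_vote(backstory: str, year: int,
                      rep_candidate: str, dem_candidate: str) -> float:
    """Probability of completing the vote sentence with the Republican candidate,
    renormalized over the two major-party candidates (an illustrative choice)."""
    prompt = f"{backstory} In {year}, I voted for"
    p_rep = completion_probability(prompt, f" {rep_candidate}")
    p_dem = completion_probability(prompt, f" {dem_candidate}")
    return p_rep / (p_rep + p_dem)

def binary_vote(p_rep: float) -> int:
    """Dichotomize at 0.50 to match the binary ANES coding
    (1 = Republican candidate, 0 = Democratic candidate)."""
    return 1 if p_rep >= 0.50 else 0
```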

GPT-3 and ANES respondents report very similar proportions of presidential votes for the two parties. Averaged across the sample, GPT-3 gave a probability of 0.391 of voting for Romney in 2012, while the ANES proportion was 0.404. For 2016, GPT-3's probability of voting for Trump was 0.432, versus 0.477 in ANES. For 2020, GPT-3's probability of voting for Trump was 0.472, versus an ANES proportion of 0.412. In all three cases there is a slight overall bias in GPT-3, but the substantive difference between the ANES and GPT-3 estimates is relatively small, consistent with the authors' arguments about algorithmic fidelity and correcting skewed marginals, and it does not preclude a strong correspondence between GPT-3's responses and those of subgroups within the U.S. population.

Table 1 reports two measures of correspondence between ANES respondents' self-reported votes and GPT-3's vote reports, with GPT-3's voting probabilities dichotomized to match the binary ANES measure. There is a strong correspondence between GPT-3 and human respondents across all three years of survey data: the tetrachoric correlation across all respondents is 0.90 in 2012, 0.92 in 2016, and 0.94 in 2020. This consistently high correlation is notable given how different the political context was from year to year.


Table 1: Measures of correspondence between GPT-3 and ANES on the probability of voting for the Republican presidential candidate. "Tetra" is the tetrachoric correlation; "Prop. Agree" is proportion agreement; "GPT-3 vote" is the binary version of GPT-3's predicted probability of voting for the Republican candidate (dichotomized at 0.50).
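Proportion agreement is simply the share of matching binary reports; the tetrachoric correlation treats the two binary variables as thresholded bivariate-normal variables. The sketch below uses a common closed-form approximation (the cosine-pi formula) rather than the full maximum-likelihood estimate the authors may have used:

```python
import math
import numpy as np

def proportion_agreement(gpt3_votes, anes_votes):
    """Share of respondents for whom the binary GPT-3 and ANES vote reports match."""
    gpt3_votes, anes_votes = np.asarray(gpt3_votes), np.asarray(anes_votes)
    return float(np.mean(gpt3_votes == anes_votes))

def tetrachoric_approx(gpt3_votes, anes_votes):
    """Cosine-pi approximation to the tetrachoric correlation of two binary variables."""
    gpt3_votes, anes_votes = np.asarray(gpt3_votes), np.asarray(anes_votes)
    a = np.sum((gpt3_votes == 1) & (anes_votes == 1))  # both report Republican
    b = np.sum((gpt3_votes == 1) & (anes_votes == 0))
    c = np.sum((gpt3_votes == 0) & (anes_votes == 1))
    d = np.sum((gpt3_votes == 0) & (anes_votes == 0))  # both report Democratic
    if b == 0 or c == 0:
        return 1.0  # no disagreement cells: perfect association under this approximation
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))
```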

This pattern correspondence also holds within subgroups of the U.S. population. Comparing the proportions of human votes reported by ANES in 2012, 2016, and 2020 with the corresponding GPT-3 votes, more than half of the tetrachoric correlations for the human subgroups are greater than or equal to 0.90. The proportion-agreement figures in Table 1 likewise show a high level of raw agreement between the two vote-choice reports for 2012, 2016, and 2020. The only exception to this overall pattern is political independents, but this is also the only deviation from the overall trend in Table 1. Existing research in political science shows that this group is especially hard to predict: they are the most ambivalent toward the two parties, the least likely to vote, the least politically knowledgeable, and the least interested in politics. Overall, then, the results in Table 1 provide strong additional evidence of algorithmic fidelity, with Criterion 3 (Forward Continuity) and Criterion 4 (Pattern Correspondence) repeatedly and consistently supported.

c) Study 3

Study 3 examines GPT-3's ability to replicate complex patterns of association among multiple concepts. Given the complexity of this task, the authors used only the 2016 ANES data: building on the vote predictions of Study 2, they scaled up the amount of information GPT-3 was asked to generate and used the resulting data to assess complex correlations between concepts (i.e., Criterion 4, Pattern Correspondence).

The challenge in this study is that for a vote choice in a particular election (i.e., "Trump" vs. "Hillary Clinton"), the set of possible responses is naturally limited, but for many other survey items no such natural restriction exists. The authors therefore developed a method for conditioning GPT-3 to provide specific responses from a set of options: an interview-style conditioning template. This approach serves two purposes. First, it uses the language model's zero-shot learning ability (the ability to handle tasks or categories it was never explicitly trained on) to guide GPT-3 to answer survey questions using token strings drawn from the options offered by the interviewer. Second, the questions in the conditioning text supply the demographic and attitudinal background needed to generate distinct silicon subjects. The study used human responses to 11 questions from the 2016 ANES survey to build the conditioning text and then had GPT-3 predict the response to a 12th question.
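An illustrative sketch of such an interview-style conditioning text; the wording, question order, and answer options here are invented for illustration and are not the authors' exact template:

```python
# Each (question, options, answer) triple comes from one ANES respondent's actual answers;
# the final question is left open for GPT-3 to complete using one of the listed options.
def build_interview_context(answered_items, final_question, final_options):
    lines = []
    for question, options, answer in answered_items:
        lines.append(f"Interviewer: {question} ({', '.join(options)})")
        lines.append(f"Respondent: {answer}")
    lines.append(f"Interviewer: {final_question} ({', '.join(final_options)})")
    lines.append("Respondent:")
    return "\n".join(lines)

context = build_interview_context(
    answered_items=[
        ("What is your gender?", ["Male", "Female"], "Female"),
        ("How interested are you in politics?",
         ["Very interested", "Somewhat interested", "Not interested"],
         "Very interested"),
        # ... nine further question/answer pairs drawn from the same respondent
    ],
    final_question="Who did you vote for in the 2016 presidential election?",
    final_options=["Hillary Clinton", "Donald Trump"],
)
```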

Using both the ANES data ("human") and the silicon-subject data (GPT-3), the authors computed Cramer's V for every pairwise combination of survey items (Cramer's V is a general measure of association that accounts for differing baseline rates in the raw data). Figure 4 compares the Cramer's V values from the two data sources. The correlation patterns in the human survey data correspond closely to those in the GPT-3-generated data (the mean difference in Cramer's V is 0.026). GPT-3's responses are not uniformly high or low in association; rather, they reproduce the stronger and weaker relationships present in the human data. Pairs of concepts that are weakly related in the human data are likewise weakly related in the GPT-3 data, and vice versa. In Figure 4, although the strength of the relationships in GPT-3 does not always match that in ANES exactly, in the vast majority of cases the correspondence between GPT-3 and ANES is striking.
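A minimal sketch of the Cramer's V computation for one pair of survey items, using the standard chi-square formulation (the DataFrame and column names are hypothetical; the authors' exact implementation is not specified):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramer's V between two categorical survey items."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Compare association strength item pair by item pair across the two sources, e.g.:
# v_human = cramers_v(anes_df["church_attendance"], anes_df["vote_2016"])
# v_gpt3  = cramers_v(gpt3_df["church_attendance"], gpt3_df["vote_2016"])
```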


Figure 4: Cramer's V correlations between the ANES data and the GPT-3 data

Because the first-person backstories are built from specific human survey records, the silicon sample is not expected to match human responses exactly at the individual level. For each text completion, the language model uses a random sampling process to select from the distribution of probable next tokens. What can be expected is that, given a large enough sample, the overall distribution of responses in the silicon sample will match the overall distribution of the human data. In addition, as with any stochastic process, some variation is expected across different samplings of the silicon sample.

These results again provide convincing, replicated evidence for Criterion 4 (Pattern Correspondence): GPT-3 reproduces subtle patterns of association. When given real-world survey data as input, GPT-3 can reliably answer closed-ended survey questions in a manner very similar to human respondents, and this statistical similarity extends to the web of interrelationships among measures of individual behavior, demographic characteristics, and complex attitudes. The authors therefore again regard this finding as strong evidence of algorithmic fidelity.

6. Prospects for large language models

So far, the focus has been on demonstrating GPT-3's algorithmic fidelity by comparing its output with human data. The evidence in this paper suggests that algorithmic fidelity is an important attribute of tools like GPT-3 because it means these language models can be used before, or even without, human data collection.

The silicon-sample data from Study 1 alone would suggest that: (1) people use different words to describe Republicans and Democrats, highlighting different stereotypes about the two groups; (2) the emotional content and extremity of these texts are systematically linked to individuals' political beliefs and identities; (3) stereotypes of partisans include content based on issues, groups, and traits; and (4) others can guess an individual's partisan leanings from their stereotypes of Democrats and Republicans. All of this is apparent from the GPT-3 data alone. With this information, researchers can design survey questions, experimental treatments, and codebooks to guide human research using an inexpensive GPT-3 model in place of human data collection.

The same holds for Studies 2 and 3. The ablation analysis in Study 2 indicates which variables researchers should include in public opinion studies if they want to understand Americans' voting behavior accurately. Building on GPT-3's results, social scientists can design experimental or observational studies that confirm and dissect these relationships in a rigorous, causal manner. The results also indicate which variables are potential confounders that should be included in pre-analysis plans for regressions and other causally oriented econometric models. All of these insights would be available to scholars with access to GPT-3 and no human baseline. Once algorithmic fidelity has been established for a specific model on a specific topic or domain, researchers can use insights from silicon samples to experiment with different question wordings, sort through different types of measures, identify the key relationships that need closer evaluation, and develop an analysis plan before collecting any data from human participants.

Reproduced from | Political理论志
