Paper: GPT-3 "Language Models are Few-Shot Learners": Translation and Commentary (Part 4)

6 Broader Impacts

Language models have a wide range of beneficial applications for society, including code and writing auto-completion,  grammar assistance, game narrative generation, improving search engine responses, and answering questions. But  they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over  smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the  potential to advance both the beneficial and harmful applications of language models.  

Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily  greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this  are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in  Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly  discuss issues of energy efficiency (Section 6.3).

6.1 Misuse of Language Models

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing  language models in a very different environment or for a different purpose than researchers intended. To help with this,  we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying  threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact  [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

6.1.1 Potential Misuse Applications

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples  include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing  and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high  quality text. Language models that produce high quality text generation could lower existing barriers to carrying out  these activities and increase their efficacy.

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text (Section 3.9.4) represents a concerning milestone in this regard.

6.1.2 Threat Actor Analysis

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors  who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced  (e.g. state-sponsored) groups with long-term agendas [SBC+19].  

To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat  groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did  find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances  of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated  with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is  not immediate, but significant improvements in reliability could change this.  

Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about  possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible  difference in operations that may see potential gains by using language models. The assessment was that language  models may not be worth investing significant resources in because there has been no convincing demonstration that  current language models are significantly better than current methods for generating text, and because methods for  “targeting” or “controlling” the content of language models are still at a very early stage.  

6.1.3 External Incentive Structures

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their  agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular  among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login  credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs.  The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k  truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot  produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the  amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts  how scalable the operation can be.  
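To make the top-k truncation mentioned above concrete, here is a minimal sketch in Python. The helper name `top_k_sample` and the toy logits are illustrative assumptions and do not come from the paper or from any particular library. The point is that truncation narrows the set of possible outputs but the draw is still random, which is why consistency cannot be guaranteed.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample an index from the k highest-scoring logits (hypothetical helper)."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to the renormalized probabilities.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up scores for five tokens
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))                # only indices among the top two can appear
```

With `k=1` the procedure becomes greedy decoding and always returns the highest-scoring index; larger `k` trades determinism for diversity.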

Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will  eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to  malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on  this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

6.2 Fairness, Bias, and Representation

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3's limitations when it comes to fairness, bias, and representation.

Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and  behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely  present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s  biases even within the studied categories.  

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes  present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race,  and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how  they are different in this dimension.

6.2.1 Gender

In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found  that occupations in general have a higher probability of being followed by a male gender identifier than a female one  (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant).  83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured  this by feeding the model a context such as "The detective was a" and then looking at the probability of the  model following up with male indicating words (eg. man, male etc.) or female indicating words (woman, female etc.).  In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus  were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and  sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist,  housekeeper etc.

We also tested how these probabilities changed when we shifted the context to "The competent {occupation} was a" (Competent Variant), and when we shifted the context to "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a," the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male, with a probability similar to that for our original neutral prompt. The average occupation bias, measured as $\frac{1}{n_{\text{jobs}}} \sum_{\text{jobs}} \log\left(\frac{P(\text{female} \mid \text{Context})}{P(\text{male} \mid \text{Context})}\right)$, was −1.11 for the Neutral Variant, −2.14 for the Competent Variant and −1.15 for the Incompetent Variant.
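The average occupation bias is the mean, over occupations, of the log ratio of the female to the male continuation probability, so negative values indicate a male lean. The sketch below computes that statistic in Python. The function name and the probabilities are invented for illustration and are not the paper's data.

```python
import math

def average_occupation_bias(probs):
    """Mean of log(P(female|context) / P(male|context)) over occupations.

    `probs` maps occupation -> (p_female, p_male); negative results lean male.
    """
    ratios = [math.log(p_f / p_m) for p_f, p_m in probs.values()]
    return sum(ratios) / len(ratios)

# Made-up probabilities for illustration only; these are not the paper's data.
toy_probs = {
    "detective": (0.10, 0.40),
    "nurse": (0.45, 0.15),
    "banker": (0.08, 0.32),
}
print(round(average_occupation_bias(toy_probs), 3))
```

Because the statistic is a mean of log ratios, a score of −1.11 corresponds to the male identifier being about e^1.11 ≈ 3 times as likely as the female one, taking the geometric mean over occupations.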

We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model's tendency to associate most occupations with males. One method measured the model's ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee).

Occupation and participant words often have societal biases associated with them such as the assumption that most  occupants are by default male. We found that the language models learnt some of these biases such as a tendency to  associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of  all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences  where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All  other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns  with the exception of our second largest model- GPT-3 13B - which had the same accuracy (60%) for both. This offers  some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger  models are more robust than smaller models.

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as "beautiful" and "gorgeous" as compared to men who were more often described using adjectives that span a greater spectrum.

Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each  word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a  category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective,  we have also included the average for the number of co-occurrences across all qualifying words for each gender.
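One way to picture the "most favored" ranking is to compare each word's co-occurrence rate in one category against its rate in the other. The sketch below does this in Python. The rate-ratio score with add-one smoothing is an assumed stand-in, since the text does not spell out the exact skew statistic, and the counts are invented.

```python
from collections import Counter

def most_favored(counts_a, counts_b, top_n=3):
    """Rank words by how much more often they co-occur with category A than B.

    The smoothed rate ratio used here is an assumption made for this sketch.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)

    def skew(word):
        # Add-one smoothing keeps the ratio finite for words unseen in B.
        rate_a = (counts_a[word] + 1) / (total_a + len(vocab))
        rate_b = (counts_b[word] + 1) / (total_b + len(vocab))
        return rate_a / rate_b

    ranked = sorted(counts_a, key=skew, reverse=True)
    # Report the raw co-occurrence count next to each word, as the table does.
    return [(w, counts_a[w]) for w in ranked[:top_n]]

# Hypothetical co-occurrence counts, not taken from the paper.
she = Counter({"beautiful": 30, "tall": 10, "quiet": 10})
he = Counter({"beautiful": 5, "tall": 25, "quiet": 10, "lazy": 10})
print(most_favored(she, he, top_n=2))
```

Ranking by a ratio rather than by raw count is what separates "most skewed" from "most frequent": a word used equally often for both categories scores near 1 however common it is.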

6.2.2 Race

To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).
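The scoring step can be sketched as a lexicon lookup followed by an average. In the Python below, the toy lexicon holds only the six example scores quoted in the text. Weighting by co-occurrence count and skipping words missing from the lexicon are assumptions of this sketch, not details given in the paper, and the group counts are hypothetical.

```python
# Scores quoted in the text; a real run would look words up in SentiWordNet.
TOY_LEXICON = {
    "wonderfulness": 100.0, "amicable": 87.5,
    "wretched": -87.5, "horrid": -87.5,
    "sloping": 0.0, "chalet": 0.0,
}

def mean_sentiment(cooccurring, lexicon=TOY_LEXICON):
    """Count-weighted mean sentiment of the words found in the lexicon.

    `cooccurring` maps word -> co-occurrence count. Weighting by count and
    skipping unknown words are assumptions made for this sketch.
    """
    scored = [(lexicon[w], n) for w, n in cooccurring.items() if w in lexicon]
    total = sum(n for _, n in scored)
    if total == 0:
        return 0.0
    return sum(s * n for s, n in scored) / total

group_a = {"amicable": 3, "chalet": 1, "unknownword": 5}   # hypothetical counts
group_b = {"wretched": 2, "sloping": 2}
print(mean_sentiment(group_a), mean_sentiment(group_b))
```

A purely word-level score like this ignores context, so a neutral discussion of a painful historical topic would still pull a group's average down.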

It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that  focused on racial features; these results are not from the models talking about race in the wild but talking about race in  an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply  looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to  a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated  with a negative sentiment under this testing methodology.  

Across the models we analyzed, 'Asian' had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, 'Black' had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.

6.2.3 Religion

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam,  and Judaism, by generating 800 model outputs of length ≈50 with a temperature of 1 and a top p of 0.9 for every  prompt. Our prompts were of the nature "{Religion practitioners} are" (Eg. "Christians are") for each  of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a  corpus of such completions for studying co-occurrence of words.
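The sampling settings quoted here (temperature 1, top p 0.9) describe nucleus sampling: sample only from the smallest set of tokens whose combined probability reaches p. The following Python sketch shows the filtering step under that reading. The function names and toy logits are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def nucleus_filter(logits, temperature=1.0, top_p=0.9):
    """Return {index: prob} for the smallest token set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    kept, mass = [], 0.0
    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    for i in sorted(range(len(probs)), key=lambda j: probs[j], reverse=True):
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

def sample(logits, rng, temperature=1.0, top_p=0.9):
    """Draw one token index from the filtered distribution."""
    dist = nucleus_filter(logits, temperature, top_p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

toy_logits = [3.0, 2.0, 1.0, 0.0, -1.0]    # made-up scores for five tokens
print(sorted(nucleus_filter(toy_logits)))  # indices that survive the cut
print(sample(toy_logits, random.Random(0)))
```

A temperature of 1 leaves the model's distribution unchanged before filtering, so only the unlikely tail is removed.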

Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3.

6.2.4 Future Bias and Fairness Challenges

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research,  and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an  area of continuous research for us and are excited to discuss different methodological approaches with the community.  We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but  we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model  attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18].  

Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric driven objective to 'remove' bias as this has been shown to have blind spots [GG19, NvNvdG19] but in a holistic manner.

6.3 Energy Usage

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3  175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days  for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such  models, as advocated by [SDSE19].  
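As a unit check on the figures above, one petaflop/s-day is the work done by sustaining $10^{15}$ operations per second for one day. The conversion below is plain arithmetic. The "several thousand" is left symbolic as $k$ rather than filled in with a specific number.

```latex
\begin{align*}
1\ \text{petaflop/s-day} &= 10^{15}\ \tfrac{\text{operations}}{\text{s}} \times 86{,}400\ \text{s}
                          = 8.64 \times 10^{19}\ \text{operations}, \\
k\ \text{petaflop/s-days} &= k \times 8.64 \times 10^{19}\ \text{operations}.
\end{align*}
```

For $k$ in the low thousands this is on the order of $10^{23}$ operations, roughly two orders of magnitude more than a run measured in tens of petaflop/s-days.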

The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].
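The "few cents" figure follows directly from the quoted energy use once an electricity price is fixed. The short Python check below uses an assumed price of 0.10 USD per kWh. That price is an illustrative assumption, not a number from the paper.

```python
def energy_cost_usd(kwh, usd_per_kwh):
    """Electricity cost for a given energy use."""
    return kwh * usd_per_kwh

KWH_PER_100_PAGES = 0.4   # figure quoted in the text
ASSUMED_PRICE = 0.10      # USD per kWh; an assumption, not from the paper

cost = energy_cost_usd(KWH_PER_100_PAGES, ASSUMED_PRICE)
print(f"{cost * 100:.1f} cents per 100 pages")
```

At that assumed price the cost is about 4 cents per 100 pages. Doubling or halving the price still leaves the result at a few cents, which matches the text.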

7 Related Work

Several lines of work have focused on increasing parameter count and/or computation in language models as a  means to improve generative or task performance. An early work scaled LSTM based language models to over a  billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling  up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size:  213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters  [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters  [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of  increasing models’ capacity to store information without increased computational cost. These approaches rely on the  conditional computation framework [BLC13] and specifically, the mixture-of-experts method [SMM+17] has been  used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19],  though only a small fraction of the parameters are actually used on each forward pass. A third approach increases  computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and  the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together,  by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ  this strategy.  

Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20, RRBS19, LWS+20, HNA+17] find a smooth power-law trend in loss as autoregressive language models are scaled up. This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all) downstream tasks across 3 orders of magnitude of scaling.
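A power-law trend means the loss behaves like $a \cdot N^{b}$ with $b < 0$, which appears as a straight line on log-log axes. The Python sketch below recovers $a$ and $b$ by ordinary least squares in log space. The data points are synthetic, generated from a known curve rather than taken from any of the cited papers, and the function name is hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line is the exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    # Intercept in log space gives log(a).
    a = math.exp(my - b * mx)
    return a, b

# Synthetic points from y = 10 * x**-0.08 spanning three orders of magnitude.
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [10 * s ** -0.08 for s in sizes]
a, b = fit_power_law(sizes, losses)
print(round(a, 3), round(b, 3))
```

On real measurements, a "slight bending of the curve" would show up as residuals of consistent sign at the largest model sizes.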

Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language  models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and  techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint  of giant models.  
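As a rough illustration of distillation, a small student model can be trained to match the temperature-softened output distribution of a large teacher. The Python sketch below computes such a soft-target loss for a single prediction. It is a simplified reading of the general idea, with made-up logits and hypothetical function names, and it omits the temperature-squared scaling and the hard-label term that full formulations often include.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]                               # made-up teacher logits
print(distillation_loss(teacher, teacher))              # student matches teacher
print(distillation_loss([1.0, 1.0, 1.0], teacher) > 0)  # mismatch is penalized
```

The loss is zero only when the student reproduces the teacher's distribution, including the relative probabilities of wrong answers, which is the extra signal distillation relies on.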

As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19, IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many of these datasets.

Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the  tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model,  and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on  in-context learning but could be combined in the future with those of [GLT+20, LPP+20].  

Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no  systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it  structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including  matching networks [VBL+16], RL2 [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17].  Our approach of stuffing the model’s context with previous examples is most structurally similar to RL2 and also  resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model’s activations  across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training)  updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time.  Few-shot auto-regressive density estimation was explored in [RCP+17] and [GWC+18] studied low-resource NMT as  a few-shot learning problem.  
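The "inner loop" described above needs no gradient step: the solved examples are simply placed in the context ahead of the query. A minimal Python sketch of assembling such a prompt follows. The `=>` separator, the function name, and the translation examples are illustrative assumptions rather than a prescribed format.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Concatenate a task description, K solved examples, and one open query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)

demo = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(demo)
```

Passing an empty `examples` list gives the zero-shot form of the same prompt, and a single pair gives the one-shot form. The model's weights are untouched in every case.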

While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained  language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with  similar goals is semi-supervised learning where approaches such as UDA [XDH+19] also explore methods of fine-tuning  when very little labeled data is available.  

Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18]  and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting  tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for  multi-task fine-tuning rather than for in-context learning without weight updates.

Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18] and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large enough scale appears to offer a "natural" broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92].

虽然我们的小样本方法的机制不同,但之前的工作也探索了使用预训练语言模型结合梯度下降进行小样本学习的方法[SS20]。另一个具有类似目标的子领域是半监督学习,其中像UDA [XDH+19]这样的方法也探索了在可用标记数据很少的情况下进行微调的方法。

以自然语言向多任务模型给出指令的做法,最早由[MKXS18]在监督设置下形式化,并由[RWC+19]在语言模型中用于部分任务(例如摘要)。文本到文本Transformer [RSR+19]也探索了用自然语言呈现任务的思路,不过在那里它被用于多任务微调,而不是用于不更新权重的上下文学习。

另一种提高语言模型通用性和迁移学习能力的方法是多任务学习[Car97],它在多个下游任务的混合数据上一起进行微调,而不是分别为每个任务更新权重。如果成功,多任务学习可以让单一模型在不更新权重的情况下用于多个任务(类似于我们的上下文学习方法),或者可以在为新任务更新权重时提高样本效率。多任务学习已经显示出一些有希望的初步结果[LGH+15, LSP+18],多阶段微调最近已成为某些数据集上SOTA结果的标准组成部分[PFB18],并在某些任务上突破了界限[KKS+20],但仍然受限于需要手动整理数据集集合并设置训练课程。相比之下,足够大规模的预训练似乎提供了一种“自然的”广泛任务分布,这些任务隐含在预测文本本身之中。未来工作的一个方向可能是尝试为多任务学习生成更广泛的显式任务集合,例如通过程序化生成[TFR+17]、人机交互[ZSW+19b]或主动学习[Mac92]。
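
The contrast drawn above between multi-task fine-tuning and in-context learning without weight updates can be made concrete with a small sketch. The Python snippet below is illustrative only: the `build_prompt` helper, the `=>` separator, and the translation examples are assumptions chosen for demonstration, not an interface described in this section. It shows how a natural-language task description and K demonstrations might be packed into a single text prompt, so that the zero-shot and few-shot settings differ only in how many demonstrations are included.

```python
from typing import List, Tuple


def build_prompt(instruction: str, demos: List[Tuple[str, str]], query: str) -> str:
    """Assemble a task description, K demonstrations, and a query into one prompt.

    Hypothetical helper: the "source => target" layout is an assumed convention.
    No model weights are involved; the task is specified purely as text.
    """
    lines = [instruction] if instruction else []
    for source, target in demos:
        # Each demonstration is a completed input/output pair.
        lines.append(f"{source} => {target}")
    # The query is left open so a language model would continue after "=>".
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Illustrative data (assumed, not taken from this section).
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt("Translate English to French:", [], "peppermint")
one_shot = build_prompt("Translate English to French:", demos[:1], "peppermint")
few_shot = build_prompt("Translate English to French:", demos, "peppermint")

print(few_shot)
```

Under these assumptions, the same function covers all three settings; only the length of `demos` changes, which mirrors the idea that the task is conveyed through the context rather than through gradient updates.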

Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3’s performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3’s scale with these algorithmic techniques is a promising direction for future work.

过去两年中,语言模型的算法创新非常巨大,包括基于去噪的双向性[DCLT18]、prefixLM [DL15]和编码器-解码器架构[LLG+19, RSR+19]、训练期间的随机排列[YDY+19]、提高采样效率的架构[DYY+19]、数据和训练过程的改进[LOG+19],以及嵌入参数的效率提升[LCG+19]。其中许多技术为下游任务带来了显著的收益。在这项工作中,我们继续关注纯自回归语言模型,这既是为了专注于上下文学习的性能,也是为了降低大型模型实现的复杂性。然而,结合这些算法上的进步很可能会提高GPT-3在下游任务上的性能,特别是在微调设置中;将GPT-3的规模与这些算法技术相结合是未来工作的一个有前途的方向。

8 Conclusion 结论

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

我们提出了一个拥有1750亿参数的语言模型,它在零样本、单样本和小样本设置下,在许多NLP任务和基准上表现强劲,在某些情况下几乎可以匹敌最先进的微调系统的性能,同时还能生成高质量的样本,并在即时定义的任务上展现出强劲的定性表现。我们记录了在不使用微调的情况下,性能随规模变化的大致可预测的趋势。我们还讨论了这类模型的社会影响。尽管存在许多限制和弱点,这些结果表明,非常大的语言模型可能是开发适应性强的通用语言系统的一个重要组成部分。

Acknowledgements 致谢

The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI’s infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.

作者要感谢Ryan Lowe对论文草稿提供的详细反馈。感谢Jakub Pachocki和Szymon Sidor提出的任务建议,以及Greg Brockman、Michael Petrov、Brooke Chan和Chelsea Voss帮助在OpenAI的基础设施上运行评估。感谢David Luan在扩大本项目规模之初给予的支持,感谢Irene Solaiman就如何处理和评估偏见所做的讨论,感谢Harrison Edwards和Yura Burda就上下文学习所做的讨论和实验,感谢Geoffrey Irving和Paul Christiano对语言模型规模扩展的早期讨论,感谢Long Ouyang就人类评估实验的设计提供建议,感谢Chris Hallacy就数据收集所做的讨论,感谢Shan Carter在视觉设计方面的帮助。感谢数百万创作了被用于模型训练的内容的人们,以及那些参与索引或为内容点赞的人们(就WebText而言)。此外,我们要感谢整个OpenAI基础设施和超级计算团队,是他们使这种规模的模型训练成为可能。