天天看點

Paper:GPT-3《 Language Models are Few-Shot Learners》的翻譯與解讀(四)

6 Broader Impacts  更廣泛的影響

Language models have a wide range of beneficial applications for society, including code and writing auto-completion,  grammar assistance, game narrative generation, improving search engine responses, and answering questions. But  they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over  smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the  potential to advance both the beneficial and harmful applications of language models.  

Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily  greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this  are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in  Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly  discuss issues of energy efficiency (Section 6.3).

語言模型為社會提供了廣泛的有益應用,包括代碼和編寫自動完成、文法幫助、遊戲叙事生成、改進搜尋引擎響應和回答問題。但它們也有潛在的有害用途。相對于較小的模型,GPT-3提高了文本生成的品質和适應性,并增加了區分合成文本和人類書寫文本的難度。是以,它有潛力促進語言模型的有益和有害應用。 

在這裡,我們關注改進後的語言模型的潛在危害,不是因為我們認為這種危害必然更大,而是為了激勵人們努力去研究和減輕它們。這類語言模型的廣泛影響是多方面的。我們關注兩個主要問題:第6.1節中故意誤用像GPT-3這樣的語言模型的可能性,以及第6.2節中像GPT-3這樣的模型中的偏見、公平和表示問題。我們也簡要讨論能源效益的問題(第6.3節)。

6.1 Misuse of Language Models  語言模型的誤用

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing  language models in a very different environment or for a different purpose than researchers intended. To help with this,  we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying  threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact  [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

惡意使用語言模型可能有點難以預料,因為它們通常涉及到在非常不同的環境中重新使用語言模型,或者用于與研究人員預期不同的目的。為了幫助解決這一問題,我們可以從傳統的安全風險評估架構的角度進行思考,這些架構列出了關鍵步驟,如識别威脅和潛在影響、評估可能性以及将風險确定為可能性和影響的組合[Ros12]。我們讨論三個因素:潛在的誤用應用,威脅行動者,和外部激勵結構。

6.1.1 Potential Misuse Applications  潛在的誤用

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples  include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing  and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high  quality text. Language models that produce high quality text generation could lower existing barriers to carrying out  these activities and increase their efficacy.

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to  generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in  3.9.4 represents a concerning milestone in this regard.

任何依賴于生成文本的對社會有害的活動都可以通過強大的語言模型來增強。例如,虛假資訊,垃圾郵件,網絡釣魚,濫用法律和政府程式,欺詐學術論文寫作和社會工程借口。這些應用程式中的許多都阻礙了人們編寫足夠高品質的文本。産生高品質文本生成的語言模型可以降低執行這些活動的現有障礙,并提高其效率。

随着文本合成品質的提高,語言模型的誤用潛力也在增加。GPT-3生成幾段合成内容的能力是這方面的一個重要裡程碑,人們發現這些合成内容很難與3.9.4中人類書寫的文本區分開來。

6.1.2 Threat Actor Analysis  威脅行動者分析

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors  who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced  (e.g. state-sponsored) groups with long-term agendas [SBC+19].  

To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat  groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did  find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances  of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated  with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is  not immediate, but significant improvements in reliability could change this.  

Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about  possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible  difference in operations that may see potential gains by using language models. The assessment was that language  models may not be worth investing significant resources in because there has been no convincing demonstration that  current language models are significantly better than current methods for generating text, and because methods for  “targeting” or “controlling” the content of language models are still at a very early stage.  

威脅參與者可以根據技能和資源級别進行組織,從能夠建構惡意産品的低或中等技能和資源的參與者,到“進階持續威脅”(APTs):高技能和資源充足的(例如。國家資助的)有長期議程的團體[SBC+19]。

為了了解低技能和中等技能的參與者是如何思考語言模型的,我們一直在監視論壇和聊天組,在那裡錯誤資訊政策,惡意軟體的傳播,和計算機欺詐經常被讨論。雖然在2019年春天首次釋出GPT-2之後,我們确實發現了大量關于濫用的讨論,但我們發現,自那以後,實驗的執行個體變少了,也沒有成功的部署。此外,這些誤用的讨論與媒體對語言模型技術的報道有關。從這一點,我們評估的威脅,濫用這些行動者不是立即,但重大改進的可靠性可以改變這一點。

因為APTs通常不公開讨論操作,是以我們就可能涉及語言模型使用的APT活動咨詢了專業的威脅分析師。自從GPT-2釋出以來,在使用語言模型可以獲得潛在收益的操作方面沒有明顯的差異。評估是語言模型可能不值得投入大量資源,因為沒有令人信服的證明目前的語言模型明顯優于現有方法生成文本,因為“目标”或“控制”方法的内容語言模型仍處于早期階段。

6.1.3 External Incentive Structures  外部激勵結構

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their  agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular  among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login  credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs.  The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k  truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot  produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the  amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts  how scalable the operation can be.  

Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will  eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to  malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on  this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

每個威脅行動者組織也有一套戰術、技術和程式(TTPs),他們依靠這些來完成他們的議程。ttp會受到經濟因素的影響,比如可伸縮性和部署的簡便性;網絡釣魚在所有群體中都非常流行,因為它提供了一種低成本、低成本、高收益的部署惡意軟體和竊取登入憑證的方法。使用語言模型來增強現有的ttp可能會導緻部署成本更低。

易用性是另一個重要的激勵因素。擁有穩定的基礎設施對ttp的采用有很大的影響。然而,語言模型的輸出是随機的,盡管開發人員可以限制這些輸出(例如使用top-k truncation),但如果沒有人類的回報,它們無法持續執行。如果一個社交媒體假資訊機器人的輸出在99%的情況下是可靠的,但在1%的情況下輸出的是不連貫的,這就可以減少操作這個機器人所需的人力。但是仍然需要人工篩選輸出,這限制了操作的可伸縮性。 

基于我們對這個模型的分析,以及對威脅參與者和環境的分析,我們懷疑人工智能研究人員最終将開發出具有足夠一緻性和可操控性的語言模型,進而使惡意參與者更感興趣。我們希望這将給更廣泛的研究界帶來挑戰,并希望通過結合緩解研究、原型設計和與其他技術開發人員協調來解決這一問題。

6.2 Fairness, Bias, and Representation  公平、偏見和代表性

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning,  since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and  producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in  the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation. 8  

Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and  behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely  present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s  biases even within the studied categories.  

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes  present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race,  and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how  they are different in this dimension.

訓練資料中的偏差可能導緻模型産生定型或偏見的内容。這是令人擔憂的,因為模型偏見可能以不同的方式傷害相關群體的人,通過加強現有的刻闆印象和産生貶低形象等潛在危害[Cra17]。我們對模型中的偏差進行了分析,以便更好地了解GPT-3在公平性、偏差和代表性方面的局限性。8

我們的目标不是詳盡地描述GPT-3,而是對其局限性和行為進行初步分析。我們關注的是與性别、種族和宗教相關的偏見,盡管可能存在許多其他類别的偏見,可以在後續工作中進行研究。這隻是初步的分析,并沒有反映模型的所有偏差,即使是在研究的類别内。 

總的來說,我們的分析表明,經過網際網路訓練的模型具有網際網路規模偏差;模型傾向于反映訓練資料中呈現的刻闆印象。下面我們将讨論我們在性别、種族和宗教次元上的偏見的初步發現。我們在1750億參數模型和類似較小的模型中探查偏差,看看它們在這個次元上是否和如何不同。

6.2.1 Gender  性别

In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found  that occupations in general have a higher probability of being followed by a male gender identifier than a female one  (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant).  83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured  this by feeding the model a context such as "The detective was a" and then looking at the probability of the  model following up with male indicating words (eg. man, male etc.) or female indicating words (woman, female etc.).  In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus  were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and  sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist,  housekeeper etc.

We also tested how these probabilities changed when we shifted the context to be the "The competent {occupation}  was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a"  (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent  {occupation} was a," the majority of occupations had an even higher probability of being followed by a  male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was  a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male  with a similar probability than for our original neutral prompt. The average occupation bias - measured as  1  njobs  P  jobs log( P (female|Context)  P (male|Context)) ) - was −1.11 for the Neutral Variant, −2.14 for the Competent Variant and −1.15  for the Incompetent Variant. 在我們對GPT-3性别偏見的調查中,我們關注的是性别與職業之間的聯系。我們發現,在給出“該職業是一個”(中性變量)這樣的背景下,一般來說,職業被男性性别辨別符跟随的機率比女性更高(換句話說,她們更傾向于男性)。在我們測試的388種職業中,有83%的職業更有可能被男性的GPT-3尾随。我們通過給模型輸入諸如“偵探是a”這樣的語境來測量這一點,然後觀察模型接着輸入男性暗示詞(如“the detective was a”)的機率。或表示女性的詞(woman, female等)。特别是,具有較高教育水準的職業,如立法者、銀行家或名譽教授,以及需要重體力勞動的職業,如梅森、米爾萊特和治安官,都偏重于男性。更有可能被女性識别的職業包括助産士、護士、接待員、管家等。

我們還測試了當我們将上下文轉換為“勝任的{占職}是一個”(勝任的變體)時,以及當我們将上下文轉換為“不勝任的{占職}是一個”(不勝任的變體)時,這些機率是如何變化的。我們發現,當提示為“勝任的{職業}是a”時,大多數職業後面跟随男性辨別符的機率比跟随女性辨別符的機率還要高,這比我們最初的中性提示為“The{職業}是a”的機率還要高。當提示“the incompetent {career} was a”時,大多數職業仍然傾向于男性,這一機率與我們最初的中性提示相似。以1 njobs P job log(P(女性|環境)P(男性|環境))測量的平均職業偏倚為:中性變異為- 1.11,勝任變異為- 2.14,不勝任變異為- 1.15。

We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further  corroborated the model’s tendency to associate most occupations with males. One method measured the models  ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model  a context such as "The advisor met with the advisee because she wanted to get advice about job  applications. ‘She’ refers to the" and found the option with the lowest probability between the two possible  options (Choices between Occupation Option: advisor; Participant Option: advisee).  

Occupation and participant words often have societal biases associated with them such as the assumption that most  occupants are by default male. We found that the language models learnt some of these biases such as a tendency to  associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of  all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences  where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All  other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns  with the exception of our second largest model- GPT-3 13B - which had the same accuracy (60%) for both. This offers  some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger  models are more robust than smaller models.

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected  words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She  was very", "He would be described as", "She would be described as"9  . We looked at the adjectives and  adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more  often described using appearance oriented words such as ”beautiful” and ”gorgeous” as compared to men who were  more often described using adjectives that span a greater spectrum.  

Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each  word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a  category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective,  we have also included the average for the number of co-occurrences across all qualifying words for each gender.

我們還使用兩種方法對Winogender資料集[RNLVD18]進行代詞解析,這兩種方法進一步證明了該模型将大多數職業與男性聯系起來的傾向。一種方法是測試模型正确配置設定代詞作為職業或參與者的能力。例如,我們為模型提供了一個上下文,例如“顧問與被咨詢者會面,因為她想獲得關于工作申請的建議。”“她”指的是“并在兩種可能的選項(職業選項:顧問;參與者選擇:學生)。

職業和參與者的詞彙通常帶有社會偏見,比如假設大多數居住者預設為男性。我們發現,語言模型學會了一些偏見,比如傾向于将女性代詞與參與者的位置聯系起來,而不是男性代詞。GPT-3 175B在這項任務上的準确率是所有模型中最高的(64.17%)。這也是唯一一個女性的居住者句子(正确答案是職業選項的句子)的準确率高于男性的模型(81.7%對76.7%)。除了我們的第二大模型GPT-3 13B,其他所有模型在男性代詞與職業相關的句子上的準确率都高于女性代詞,但GPT-3 13B在兩個句子上的準确率都相同(60%)。這提供了一些初步證據,表明在存在偏見的地方,語言模型容易出錯,較大的模型比較小的模型更健壯。

我們還進行了共現測試,分析哪些詞可能出現在其他預先選擇的詞附近。通過為資料集中的每個提示生成800個長度為50、溫度為1和頂部p為0.9的輸出,我們建立了一個模型輸出示例集。關于性别,我們有諸如"他非常","她非常","他被描述為","她被描述為"9。我們看了形容詞和副詞在100個最受歡迎的單詞中使用現成的POS标記。我們發現,女性被描述時更多地使用“美麗”和“華麗”等以外表為導向的詞彙,而男性則更多地使用範圍更廣的形容詞來描述。 

表6.1顯示了模型中最受歡迎的10個描述性單詞,以及每個單詞與代詞訓示符共出現的原始次數。這裡的“最受歡迎”指的是那些與某個類别同時出現的詞比另一個類别出現的比率要高。為了更好地了解這些數字,我們還包括了每種性别的所有限定詞中共同出現的次數的平均值。

6.2.2 Race  種族

To investigate racial bias in GPT-3, we seeded the model with prompts such as - "The {race} man was very",  "The {race} woman was very" and "People would describe the {race} person as" and generated 800  samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White  or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that  language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored  how race impacted sentiment. We measured sentiment using Senti WordNet [BES10] for the words which co-occurred  disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive  words (eg. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (eg. wretched: -87.5 , horrid:  -87.5) and a score of 0 indicating neutral words (eg. sloping, chalet).  

It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that  focused on racial features; these results are not from the models talking about race in the wild but talking about race in  an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply  looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to  a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated  with a negative sentiment under this testing methodology.  

Across the models we analyzed, ‘Asian’ had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the  other hand, ’Black’ had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences  narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and  highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data. GPT-3調查種族偏見,我們播種等模型提示——“{種族}男人非常”,“{種族}的女人非常”和“人們将{種族}人描述為“和生成800個樣本對于上面的提示,用{種族}替換為一個術語表明種族類别如白人或亞洲。然後我們在生成的樣本中度量單詞的共同出現。鑒于先前的研究表明,語言模型在不同的特征(如職業)下産生不同的情緒[HZJ+19],我們探究了種族如何影響情緒。我們使用Senti WordNet [BES10]來測量情緒,以确定在每個種族中出現的不相稱的詞彙。每個詞的情緒在100到-100之間變化,積極的分數表示積極的詞。精彩度:100,友好度:87.5),負分數表示否定的詞。猥賤:-87.5,可怕:-87.5)和0分表示中性詞(如:傾斜的小屋)。 

值得注意的是,我們明确地促使模型讨論種族問題,而這反過來産生了關注種族特征的文本;這些結果并不是來自于那些讨論野外競賽的模型,而是來自于他們已經準備好這樣做的實驗設定。此外,由于我們測量情緒通過簡單地看單詞共生,産生的情緒可以反映社會曆史因素——例如,文本有關的讨論奴隸制會經常有負面情緒,這可能會導緻人口與負面情緒在這種測試方法。 

在我們分析的所有模特中,“亞洲人”的人氣一直很高——在7個模特中,有3個排名第一。另一方面,“黑色”的人氣一直很低——在7款車型中,它在5款中排名最低。這些差異在較大的模型尺寸上略微縮小。這個分析給出了不同模型的偏差,并強調了對情緒、實體和輸入資料之間的關系進行更複雜分析的必要性。

6.2.3 Religion  宗教

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam,  and Judaism, by generating 800 model outputs of length ≈50 with a temperature of 1 and a top p of 0.9 for every  prompt. Our prompts were of the nature "{Religion practitioners} are" (Eg. "Christians are") for each  of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a  corpus of such completions for studying co-occurrence of words.

Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect  how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such  as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such  as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in  the top 40 most favored words for Islam in GPT-3. 我們研究了哪些詞與無神論、佛教、基督教、印度教、伊斯蘭教和猶太教等宗教術語共出現,通過生成800個模型輸出,長度≈50,溫度為1,每個提示的p值為0.9。我們的提示屬于“宗教從業者”的性質。“基督徒是”)對應以上列出的六個宗教類别中的每一個。然後,我們允許模型自然地執行補全,并建立這樣補全的語料庫來研究單詞的共現。

與種族相似,我們發現這些模型與宗教術語聯系在一起,顯示出某些傾向來反映這些術語在世界上是如何呈現的。以伊斯蘭教為例,我們發現像ramadan, prophet和mosque這樣的詞出現的頻率比其他宗教要高。我們還發現,“暴力”、“恐怖主義”和“恐怖主義”等詞與“伊斯蘭”相關的比例要高于與其他宗教相關的比例,并在GPT-3中跻身“伊斯蘭”最受歡迎的40個詞彙之列。

6.2.4 Future Bias and Fairness Challenges  未來的偏見和公平挑戰

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research,  and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an  area of continuous research for us and are excited to discuss different methodological approaches with the community.  We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but  we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model  attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18].  

Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this  is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large  language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for  building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for  these models. There is room for more research that engages with the literature outside NLP, better articulates normative  statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20].  Thus, mitigation work should not be approached purely with a metric driven objective to ‘remove’ bias as this has been  shown to have blind spots [GG19, NvNvdG19] but in a holistic manner. 我們提出這一初步分析是為了分享我們發現的一些偏見,以推動進一步的研究,并強調在大規模生成模型中描述偏見的固有困難;我們希望這将是一個持續研究的領域,并很高興與社群讨論不同的方法方法。我們把這部分的工作看作是主觀的路标——我們選擇了性别、種族和宗教作為出發點,但我們認識到這種選擇的内在主觀性。我們的工作受到了描述模型屬性以開發資訊性标簽的文獻的啟發,例如用于模型報告的模型卡片[MWZ+18]。 

最終,重要的不僅僅是描述語言系統中的偏見,還要進行幹預。關于這方面的文獻也很廣泛[QMZH19, HZJ+19],是以我們僅就大型語言模型的未來方向提供一些簡短的評論。為了在通用模型中為有效預防偏倚鋪平道路,有必要建立一個共同的詞彙表,将這些模型在減輕偏倚方面的規範、技術和經驗挑戰結合起來。還有更多的研究空間與NLP以外的文獻相結合,更好地闡明關于傷害的規範性陳述,并與受NLP系統影響的社群的生活經曆相結合[BBDIW20]。是以,應對緩解工作不應單純以一個度量驅動的目标來“消除”偏見,因為這已被證明存在盲點[GG19, NvNvdG19],而應以一種整體的方式。

6.3 Energy Usage  能源使用

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3  175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days  for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such  models, as advocated by [SDSE19].  

The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we  should consider not only the resources that go into training them, but how these resources are amortized over the  lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though  models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even  with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or  only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down  the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient  versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency  of such models over time, similar to trends observed in image recognition and neural machine translation [HB20]. 實際的大規模預訓練需要大量的計算,這是能源密集型的:訓練GPT-3 175B在預訓練期間消耗了數千次petaflop/s天計算,相比之下,1.5B參數的GPT-2模型需要幾十次petaflop/s天計算(圖2.2)。這意味着我們應該認識到這種模式的成本和效率,正如[SDSE19]所倡導的。 

大規模的使用訓練的也給了另一個樣本,通過它觀看大型模型的效率,我們不僅應該考慮去教育訓練他們的資源,但這些資源如何平攤的生命周期模型,随後将被用于各種各樣的目的特定任務來制定和調整。盡管像GPT-3這樣的模型在教育訓練期間消耗了大量的資源,但一旦教育訓練完成,它們的效率會驚人地高:即使使用完整的GPT-3 175B,從一個教育訓練過的模型生成100頁内容的成本大約是0.4千瓦時,或者隻有幾美分的能源成本。此外,像模型蒸餾[LHCG19a]這樣的技術可以進一步降低此類模型的成本,讓我們采用訓練單一、大規模模型的範例,然後建立更有效的版本,以便在适當的上下文中使用。随着時間的推移,算法的發展也會自然地進一步提高這些模型的效率,類似于在圖像識别和神經機器翻譯中觀察到的趨勢[HB20]。

7 Related Work  相關工作

Several lines of work have focused on increasing parameter count and/or computation in language models as a  means to improve generative or task performance. An early work scaled LSTM based language models to over a  billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling  up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size:  213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters  [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters  [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of  increasing models’ capacity to store information without increased computational cost. These approaches rely on the  conditional computation framework [BLC13] and specifically, the mixture-of-experts method [SMM+17] has been  used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19],  though only a small fraction of the parameters are actually used on each forward pass. A third approach increases  computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and  the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together,  by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ  this strategy.  

Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20,  RRBS19, LWS+20, HNA+17], find a smooth power-law trend in loss as autoregressive language models are scaled up.  This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the  curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all)  downstream tasks across 3 orders of magnitude of scaling.   有幾行工作關注于增加語言模型中的參數計數和/或計算,以此作為提高生成或任務性能的手段。早期的工作将基于LSTM的語言模型擴充到超過10億個參數[JVS+16]。一條生産線直接增加了變壓器模型的尺寸,大緻按比例增加了參數和每個令牌的浮動量。該血管的工作使模型規模不斷增大,原論文中有2.13億個參數[VSP+17],有3億個參數[DCLT18], 15億個參數[RWC+19], 80億個參數[SPP+19], 110億個參數[RSR+19],最近又增加了170億個參數[Tur20]。第二行工作集中在增加參數計數而不是計算,作為在不增加計算成本的情況下增加模型存儲資訊的能力的一種方法。這些方法依賴于條件計算架構[BLC13],具體地說,專家混合方法[SMM+17]已經被用于生成1000億個參數模型和最近的500億個參數轉換模型[AJF19],盡管在每次向前傳遞中實際使用的參數隻有一小部分。第三種方法在不增加參數的情況下增加計算量;該方法的執行個體包括自适應計算時間[Gra16]和通用變壓器[DGV+18]。我們的工作集中在第一種方法上(通過直接使神經網絡變大,将計算和參數結合在一起),并将模型的大小比以前采用這種政策的模型增加10倍。 

一些學者也系統地研究了規模對語言模型性能的影響。[KMH+20, RRBS19, LWS+20, HNA+17],随着自回歸語言模型規模的增大,損失呈現平穩的幂律趨勢。這項工作表明,随着模型不斷擴大,這一趨勢在很大程度上繼續下去(盡管在圖3.1中可以檢測到曲線的輕微彎曲),我們還發現,在許多(盡管不是全部)下遊任務中,在3個數量級的擴充中,都出現了相對平穩的增長。 

Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language  models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and  techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint  of giant models.  

As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable  effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19,  IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets  designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many  of these datasets. 另一項工作與擴充的方向相反,試圖在盡可能小的語言模型中保持強大的性能。該方法包括ALBERT [LCG+19]、general [HVD15]和task-specific [SDCW19, JYS+19, KR16]等語言模型精餾方法。這些架構和技術對我們的工作具有潛在的補充作用,可以用于減少大型模型的延遲和記憶體占用。 

由于經過調優的語言模型在許多标準基準測試任務上接近了人類的性能,人們投入了相當多的精力來建構更困難的或開放的任務,包括問題回答[KPR+19, IBGC+14, CCE+18, MCKS18],閱讀了解[CHI+18, RCM19],以及為現有語言模型設計的困難的對立建構資料集[SBBC19, NWD+19]。在這項工作中,我們在許多資料集上測試我們的模型。

Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the  tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model,  and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on  in-context learning but could be combined in the future with those of [GLT+20, LPP+20].  

Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no  systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it  structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including  matching networks [VBL+16], RL2 [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17].  Our approach of stuffing the model’s context with previous examples is most structurally similar to RL2 and also  resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model’s activations  across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training)  updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time.  Few-shot auto-regressive density estimation was explored in [RCP+17] and [GWC+18] studied low-resource NMT as  a few-shot learning problem.  

之前的很多工作都是專門針對問題的回答,這在我們的測試任務中占了很大一部分。最近的努力包括[RSR+19, RRS20],它微調了一個110億參數的語言模型,以及[GLT+20],它關注于在測試時處理大量的資料。我們的工作側重于語境學習,但在未來可以與[GLT+20, LPP+20]的工作相結合。

語言模型中的金屬學習在[RWC+19]中得到了應用,盡管結果有限,也沒有系統的研究。更廣泛地說,語言模型metalearning具有内環-外環結構,這使得它在結構上類似于一般應用于ML的metalearning。這裡有大量的文獻,包括比對網絡[VBL+16], RL2 [DSC+16],學習優化[RL16, ADG+16, LM17]和MAML [FAL17]。填料模型的上下文的我們的方法與以前的例子是最結構類似于RL2上也類似于[HYC01],在适應一個内循環發生在步伐通過計算模型的激活,沒有更新權重,而外層循環(在這種情況下隻是語言模型訓練的)更新權重,和隐式學習能力适應或者至少在inference-time定義識别任務。[RCP+17]探索了小樣本自回歸密度估計,[GWC+18]将低資源NMT作為一個小樣本學習問題進行了研究。

While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained  language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with  similar goals is semi-supervised learning where approaches such as UDA [XDH+19] also explore methods of fine-tuning  when very little labeled data is available.  

Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18]  and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting  tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for  multi-task fine-tuning rather than for in-context learning without weight updates.

Another approach to increasing generality and transfer-learning capability in language models is multi-task learning  [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for  each one. If successful multi-task learning could allow a single model to be used for many tasks without updating the  weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating  the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18] and  multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed  the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and  set up training curricula. By contrast pre-training at large enough scale appears to offer a “natural” broad distribution of  tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate  a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human  interaction [ZSW+19b], or active learning [Mac92].  

雖然我們的小樣本方法的機制不同,但之前的工作也探索了使用預訓練語言模型結合梯度下降進行小樣本學習的方法[SS20]。另一個具有類似目标的子領域是半監督學習,其中像UDA [XDH+19]這樣的方法也探索了在可用标記資料很少的情況下進行微調的方法。

使用自然語言給出多任務模型的指令首先是在一個監督設定中通過[MKXS18]形式化的,并在使用[RWC+19]的語言模型中用于一些任務(比如彙總)。在文本到文本轉換器[RSR+19]中也探索了用自然語言表示任務的概念,盡管它被應用于多任務微調,而不是在沒有權值更新的情況下用于上下文學習。

另一種提高語言模型通用性和轉移學習能力的方法是多任務學習[Car97],它對下遊任務的混合進行微調,而不是分别更新每個任務的權重。如果成功的多任務學習可以允許單一模型在不更新權值的情況下用于多個任務(類似于我們的上下文學習方法),或者可以在更新新任務權值時提高樣本效率。多任務學習了一些初步的結果[LGH + 15, LSP + 18]和多級微調最近成為一個标準化的一部分SOTA結果在一些資料集[PFB18]而且突破某些任務(kk + 20),但仍需要手動牧師收藏有限的資料集和設定教育訓練課程。相比之下,大規模的預訓練似乎提供了一種“自然的”廣泛分布的任務,這種任務隐含在預測文本本身中。未來工作的一個方向可能是嘗試為多任務學習生成更廣泛的明确任務,例如通過程式生成[TFR+17]、人機互動[ZSW+19b]或主動學習[Mac92]。

Algorithmic innovation in language models over the last two years has been enormous, including denoising-based  bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations  during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in  data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of  these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive  language models, both in order to focus on in-context learning performance and to reduce the complexity of our large  model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3’s  performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3’s scale with these  algorithmic techniques is a promising direction for future work. 算法語言的創新模式在過去的兩年裡一直巨大,包括denoising-based雙向性[DCLT18], prefixLM [DL15]和encoder-decoder架構(RSR LLG + 19日+ 19),随機排列在訓練(金波+ 19),架構,提高抽樣效率[DYY + 19],改善資料和訓練程式[日志+ 19],和效率提高嵌入參數(LCG + 19)。許多這些技術為下遊任務提供了顯著的收益。在這項工作中,我們繼續關注純自回歸語言模型,這既是為了關注上下文内的學習性能,也是為了減少大型模型實作的複雜性。然而,結合這些算法的進步很可能會提高GPT-3在下遊任務中的性能,特别是在微調設定中,結合GPT-3的規模與這些算法技術是未來工作的一個有前途的方向。

8 Conclusion 結論

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at  tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning.  We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results  suggest that very large language models may be an important ingredient in the development of adaptable, general  language systems. 我們提出了一個1750億參數語言模型顯示強勁表現在許多NLP zero-shot任務和基準,一次性的,和few-shot設定,在某些情況下幾乎比對最先進的調整系統的性能,以及生成高品質的樣品,在任務定義動态定性表現強勁。我們記錄了大緻可預測的性能擴充趨勢,而不使用微調。我們還讨論了這類模型的社會影響。盡管有許多限制和弱點,這些結果表明,非常大的語言模型可能是開發适應性強的通用語言系統的一個重要成分。

Acknowledgements 緻謝

The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub  Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea  Voss for helping run evaluations on OpenAI’s infrastructure. Thanks to David Luan for initial support in scaling up  this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura  Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early  discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments,  Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of  people who created content that was used in the training of the model, and to those who were involved in indexing or  upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure  and supercomputing teams for making it possible to train models at this scale. 作者要感謝Ryan Lowe對論文草稿提供的詳細回報。感謝Jakub Pachocki和Szymon Sidor提出的任務建議,以及Greg Brockman、Michael Petrov、Brooke Chan和Chelsea Voss幫助運作OpenAI基礎設施的評估。感謝大衛的菜肴最初支援擴大這個項目,艾琳Solaiman讨論的方式方法和評估偏差,哈裡森·愛德華茲和Yura呢Burda與語境的讨論和實驗學習,傑弗裡·歐文和保羅global早期的讨論語言模型縮放、長歐陽的建議設計人類的評估實驗,克裡斯Hallacy讨論資料收集,和山卡特的幫助與視覺設計。感謝數百萬建立内容并用于模型教育訓練的人,感謝那些參與索引或對内容進行向上投票(在WebText的情況下)的人。此外,我們要感謝整個OpenAI基礎設施和超級計算團隊,因為他們使在這種規模上訓練模型成為可能。