
Paper Notes: BERT Masked Language Modeling for Co-reference Resolution

[Original paper](https://www.aclweb.org/anthology/W19-3811.pdf)

Abstract

This paper explains the TALP-UPC participation in the Gendered Pronoun Resolution shared task of the 1st ACL Workshop on Gender Bias for Natural Language Processing. We have implemented two models for masked language modeling using pre-trained BERT, adjusted to work for a classification problem. The proposed solutions are based on the word probabilities of the original BERT model, but using common English names to replace the original test names.


1 Introduction

The Gendered Pronoun Resolution task is a natural language processing task whose objective is to build pronoun resolution systems that identify the correct name a pronoun refers to; this is a co-reference resolution task. Co-reference resolution tackles the problem of different elements of a text referring to the same thing, for example a pronoun and a noun, or multiple nouns that describe the same entity. There are multiple deep learning approaches to this problem. NeuralCoref presents one based on giving every pair of mentions (pronoun + noun) a score representing whether or not they refer to the same entity. In our current task, this approach is not possible, because we do not have the true information for every pair of mentions, only the two names per entry.


The current task also has to deal with the problem of gender. As the GAP researchers point out (Webster et al., 2018), the biggest and most common datasets for co-reference resolution have a bias towards male entities. For example, the OntoNotes dataset, which is used for some of the most popular models, has only 25% female representation (Pradhan and Xue, 2009). This creates a problem, because any machine learning model is only as good as its training set. Biased training sets will create biased models, and this will have repercussions on any uses the model may have.


This task provides an interesting challenge, especially because it is proposed over a gender-neutral dataset. In this sense, the challenge is oriented towards proposing methods that are gender-neutral and do not introduce bias, given that the dataset does not have it.


To face this task, we propose to make use of the recently popular BERT tool (Devlin et al., 2018). BERT is a model trained for masked language modeling (LM) word prediction and sentence prediction using the Transformer network (Vaswani et al., 2017). BERT also provides a group of pre-trained models for different uses, languages and sizes. There are implementations of it for all sorts of tasks, including text classification, question answering, multiple-choice question answering and sentence tagging, among others. BERT is quickly gaining popularity in language tasks, but before this shared task appeared, we were not aware of any implementation of it for co-reference resolution. For this task, we have used an implementation that takes advantage of the masked LM objective BERT is trained for and applies it to a kind of task BERT is not specifically designed for.


In this paper, we are detailing our shared-task participation, which basically includes descriptions of how we used the BERT model and of our 'Name Replacement' technique, which allowed us to reduce the impact of name frequency.


2 Co-reference Resolution System Description

2.1 BERT for Masked LM

This model's main objective is to predict a word that has been masked in a sentence. For this exercise, that word is the pronoun whose referent we are trying to identify. This one pronoun is replaced by the [MASK] tag, and the rest of the sentence is subjected to the different name change rules described in section 2.2.


The text is passed through the pre-trained BERT model. This model keeps all of its weights intact; the only changes made in training are to the network outside of the BERT model. The resulting sequence then passes through what is called the masked language modeling head. This consists of a small neural network that returns, for every position in the sequence, an array the size of the entire vocabulary with a probability for every word. The array for our masked pronoun is extracted, and from that array we get the probabilities of three different words. These three words are: the first replaced name (name 1), the second replaced name (name 2) and the word none for the case where neither name is the referent.

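To make this concrete, here is a minimal sketch of the scoring step using the HuggingFace transformers library. It is an illustration of the pipeline described above, not the authors' code; the helper name and the abridged, re-masked example sentence are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def candidate_probabilities(text_with_mask, name_1, name_2):
    """Return the masked-LM probabilities of [name_1, name_2, 'none']
    at the position of the [MASK]ed pronoun (illustrative helper)."""
    inputs = tokenizer(text_with_mask, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits            # shape (1, seq_len, vocab_size)
    vocab_probs = logits[0, mask_pos].softmax(dim=-1)
    ids = tokenizer.convert_tokens_to_ids([name_1, name_2, "none"])
    return vocab_probs[ids]                        # tensor with 3 probabilities

# Abridged, illustrative usage with the replacement names from the example in section 2.2:
probs = candidate_probabilities(
    "in the late 1980s harry began working with duran duran on their live shows, "
    "before [MASK] was hired to record the album liberty with producer john.",
    "harry", "john")
```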

This third case is the strangest one, because the word none would logically not appear in the sentence. Tests were made with the original pronoun as the third option instead, but the results ended up being very similar albeit slightly worse, so the word none was kept. These cases where there is no true answer are the hardest ones for both models.


We experimented with two models.


Model 1 After the probabilities for each word are extracted, the rest is treated as a classification problem. An array is created with the probabilities of the 2 names and none ([name 1, name 2, none]), where each one represents the probability of a class in multi-class classification. This array is passed through a softmax function to adjust it to probabilities between 0 and 1, and then the log loss is calculated. A block diagram of this model can be seen in figure 1.

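A minimal sketch of this classification step, assuming the `candidate_probabilities` helper and imports from the sketch above; the gold-class encoding (0 = name A, 1 = name B, 2 = neither) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def model1_loss_and_prediction(masked_text, name_1, name_2, gold_class):
    """Model 1 sketch: softmax over [p(name 1), p(name 2), p(none)] and log loss."""
    raw = candidate_probabilities(masked_text, name_1, name_2)  # from the earlier sketch
    class_probs = F.softmax(raw, dim=-1)           # re-normalise over the 3 classes
    loss = -torch.log(class_probs[gold_class])     # per-example multi-class log loss
    return loss, class_probs.argmax().item()       # 0: name A, 1: name B, 2: neither
```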

Model 2 This model repeats the steps of model 1 but for two different texts. These texts are mostly the same, except the replacement names name 1 and name 2 have been switched (as explained in section 2.2). It calculates the probabilities for each candidate word for each text and then takes the average of both. It then applies the softmax and calculates the loss with the average probability of each class across both texts. A block diagram of this model can be seen in figure 2.

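And a corresponding sketch for model 2, again assuming the helper and imports above; the only difference is that the probabilities from the original and the name-swapped text are averaged before the softmax.

```python
def model2_loss_and_prediction(text, text_swapped, name_1, name_2, gold_class):
    """Model 2 sketch: average masked-LM probabilities over both name orderings."""
    probs_orig = candidate_probabilities(text, name_1, name_2)
    # In the swapped text, name 2 occupies position A and name 1 position B,
    # so its probabilities are read back in swapped order.
    probs_swap = candidate_probabilities(text_swapped, name_2, name_1)
    avg = (probs_orig + probs_swap) / 2
    class_probs = torch.softmax(avg, dim=-1)
    return -torch.log(class_probs[gold_class]), class_probs.argmax().item()
```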


2.2 Name Replacement

The task contains names of individuals who are featured in Wikipedia, and some of these names are uncommon in the English language. As part of the pre-processing for both models, these names are replaced. They are replaced with common English names of the respective gender. If the pronoun is female, one of two common English female names is chosen, and the same is done for male pronouns. In order to replace them in the text, the following set of rules is followed (a code sketch of these rules appears after the list).


  1. The names mentioned on the A and B columns are replaced.


  2. Any other instances of the full name as it appears on the A/B columns are replaced.


  3. If the name in the A/B column contains a first name and a last name, instances of the first name are also replaced, unless both entities share a first name or the first name of one is contained within the other.


  4. Both the name and the text are converted to lowercase.

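A rough sketch of how these rules can be implemented; it is an interpretation of the list above, not the authors' released code, and the default replacement names are illustrative (female names would be used when the pronoun is female).

```python
def replace_names(text, name_a, name_b, repl_a="john", repl_b="harry"):
    """Apply the four name-replacement rules (sketch / interpretation)."""
    # Rule 4: the names and the text are converted to lowercase.
    text, name_a, name_b = text.lower(), name_a.lower(), name_b.lower()

    # Rules 1-2: replace every occurrence of the full A and B names.
    text = text.replace(name_a, repl_a).replace(name_b, repl_b)

    # Rule 3: also replace bare first names, unless both entities share a
    # first name or the first name of one is contained within the other.
    first_a, first_b = name_a.split()[0], name_b.split()[0]
    if first_a not in first_b and first_b not in first_a:
        if " " in name_a:
            text = text.replace(first_a, repl_a)
        if " " in name_b:
            text = text.replace(first_b, repl_b)
    return text
```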

This name replacement has two major benefits. First, the more common male and female names work better with BERT because they appear more often in the corpus on which it is trained. Secondly, although the WordPiece encoding splits certain words, the tokenizer can be configured so that our chosen names are never split; they remain single tokens (not multiple word pieces), which suits the way the model is implemented.


Both models (1 and 2 presented in the above section) use BERT for Masked LM prediction where the mask always covers a pronoun, and because the pronoun is a single token (not split into word pieces), it’s more useful to compare the masked pronoun to both names, which are also both single tokens (not multiple word pieces).

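A quick way to check this property (a sketch, not taken from the paper) is to verify that the candidate replacement names, the pronouns and the word none each map to a single WordPiece token in BERT's vocabulary:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
for word in ["john", "harry", "he", "she", "her", "his", "none"]:
    pieces = tokenizer.tokenize(word)
    assert len(pieces) == 1, f"'{word}' splits into {pieces}"  # all stay single tokens
```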

Because the chosen names are very common in the English language, BERT's previous training might contain biases towards one name or the other. This can be detrimental in a model that has to compare between only 3 options. The alternative is the approach of model 2. In model 2, two texts are created. Both texts are basically the same, except that the names chosen as replacement names 1 and 2 are switched. So, as figure 3 shows, we get one text with each name in each position.


For example, let's say we get the text:

”In the late 1980s Jones began working with Duran Duran on their live shows and then in the studio producing a B side single “This Is How A Road Gets Made”, before being hired to record the album Liberty with producer Chris Kimsey.”,


A is Jones and B is *Chris Kimsey*. For the name replacement, let's say we choose two common English names like John and Harry. The new text produced for model 1 (figure 1) would be something like:


”in the late 1980s harry began working with duran duran on their live shows and then in the studio producing a b side single “this is how a road gets made”, before being hired to record the album liberty with producer john.”

And for model 2 (figure 2), the same text would be used for the top side, while the bottom side would have harry and john in the opposite positions.


3 Experimental Framework

3.1 Task details

The objective of the task is that of a classification problem, where the output for every entry is the probability of the pronoun referring to name A, name B, or neither.


3.2 Data

The GAP dataset (Webster et al., 2018) created by Google AI Language was the dataset used for this task. This dataset consists of 8908 co-reference labeled pairs sampled from Wikipedia, and it is split evenly between male and female representation. Each entry of the dataset consists of a short text, a pronoun that is present in the text together with its offset, and two different names (name A and name B) also present in the text. The pronoun refers to one of these two names and, in some cases, to neither of them. The GAP dataset does not contain any neutral pronouns such as it or they.

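For reference, a sketch of loading one of the GAP files with pandas; the file name and column names follow the published GAP TSV format and should be treated as assumptions if a local copy differs.

```python
import pandas as pd

gap = pd.read_csv("gap-development.tsv", sep="\t")
print(len(gap), "entries")
# Columns include: Text, Pronoun, Pronoun-offset, A, A-coref, B, B-coref
print(gap[["Pronoun", "A", "A-coref", "B", "B-coref"]].head())
```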

For the two different stages of the competition, different datasets were used.


  • For Stage 1 the data used for the submission is the same as the development set available in the GAP repository. The dataset used for training is the combination of the GAP validation and GAP testing sets from the repository.


  • For Stage 2 the data used for submission was only available through Kaggle, and the correct labels have yet to be released, so we can only analyze the final log loss of each of the models. This testing set has a total of 12359 rows, with 6499 male pronouns and 5860 female ones. For training, a combination of the GAP development, testing and validation sets was used. And, as with all the GAP data, it is evenly distributed between genders.


The distributions of all the datasets are shown in table 1. It can be seen that in all cases, the None option has the least support by a large margin. This, added to the fact that the model naturally is better suited to identifying names rather than the absence of them, had a negative effect on the results.


3.3 Training details

For the BERT pre-trained weights, several models were tested. BERT base is the one that produced the best results. BERT large had great results in a lot of other implementations, but in this model it produced worse results while consuming far more resources and requiring a longer training time. During the experiments the model had an overfitting problem, so the learning rate was tuned and a warm-up percentage was introduced. As table 2 shows, the optimal learning rate was 3e-5 with a 20% warm-up. The sequence length is set at 256, which fits almost every text without issues. Texts that are too long are truncated depending on the offsets of each of the elements, so as not to eliminate any of the names or the pronoun.

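A sketch of the fine-tuning setup implied by these hyper-parameters (learning rate 3e-5, 20% warm-up, maximum sequence length 256), assuming the `BertForMaskedLM` instance from the earlier sketch; the epoch count, batch size and training-set size are illustrative assumptions.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# BERT's own weights stay frozen (section 2.1); only the MLM head is updated.
for p in model.bert.parameters():
    p.requires_grad = False

num_epochs, batch_size, num_train_examples = 3, 16, 4908   # assumed values
total_steps = (num_train_examples // batch_size) * num_epochs

optimizer = torch.optim.AdamW(model.cls.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.2 * total_steps),   # 20% warm-up
    num_training_steps=total_steps,
)
MAX_LEN = 256  # longer texts are truncated around the name/pronoun offsets
```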

The training was performed on a server with an Intel Dual Core processor and Nvidia Titan X GPUs, with approximately 32GB of memory. The run time varies a lot depending on the model. The average run time on the stage 1 dataset for model 1 is from 1 to 2 hours, while model 2 has a run time of about 4 hours. For the stage 2 training set, the duration was 4 hours 37 minutes for model 1 and 8 hours 42 minutes for model 2. The final list of hyperparameters is in table 3.


4 Results

Tables 4 and 5 report results for models 1 and 2, described in section 2.1, for stage 1 of the competition. Both models have similar overall results. Also, both models show problems with the None class, model 2 especially. We believe this is because our model is based on guessing the correct name, so guessing none is not as well suited to it. Also, the training set contains far fewer of these examples, therefore making it even harder to train for them.


4.1 Advantages of the Masked LM Model

Besides the masked LM, other BERT implementations were experimented with for the task. First, a text multi-class classification model (figure 4), where the [CLS] tag is placed at the beginning of every sentence, the text is passed through a pre-trained BERT, and the output at this tag is passed through a feed-forward neural network.


Second, a multiple-choice question answering model (figure 5), where the same text with the [CLS] tag is passed through BERT with each of the different answers, and the outputs at these tags are passed through a feed-forward neural network.

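For comparison, the two alternative heads can be instantiated directly from the standard transformers classes; this is a hedged sketch, since the authors' exact architecture around these heads is not specified here.

```python
from transformers import BertForSequenceClassification, BertForMultipleChoice

# Figure 4 style: 3-way classification from the [CLS] representation
# (classes: name A, name B, neither).
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Figure 5 style: multiple-choice scoring, one candidate answer per choice.
multiple_choice = BertForMultipleChoice.from_pretrained("bert-base-uncased")
```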


These two models, which were specifically designed for other tasks, had similar accuracy to the masked LM but suffered greatly on the log loss, which was the competition's metric. This is because in a lot of examples the difference between the probabilities of one class and another was minimal. This made for a model where each choice had low confidence, and therefore the loss increased considerably.

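A tiny numeric illustration of this effect (the values are made up): two predictions with the same argmax, but the low-confidence one incurs a much larger log loss.

```python
import math

confident = [0.90, 0.07, 0.03]   # correct class 0, high confidence
hesitant  = [0.40, 0.35, 0.25]   # correct class 0, low confidence
print(-math.log(confident[0]))   # ~0.105
print(-math.log(hesitant[0]))    # ~0.916
```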

4.2 Name Replacement Results

As table 2.2 shows, name replacement considerably improved the model's results. This is in part because the names chosen as replacements are more common in BERT's training corpora. Also, 43% of the names across the whole GAP dataset are made up of multiple words, so replacing these with a single name makes it easier for the model to identify their place in the text.



4.3 Competition results

In the official competition on Kaggle we placed 46th, with the second model having a loss of around 0.301. As the results in table 8 show, the results of stage 2 were better than those of stage 1. And the second model, which had performed worse in the first stage, was better in stage 2.



5 Conclusions

We have proved that pre-trained BERT is useful for co-reference resolution. Additionally, we have shown that our simple 'Name Replacement' technique was effective in reducing the impact of name frequency or popularity on the final decision.


The main limitation of our technique is that it requires knowing the gender from the names, so it only makes sense for entities which have a defined gender. Our proposed model had great results when predicting the correct name but had trouble with the none option.


As a future improvement, it is important to analyze the characteristics of the examples where none of the names are correct and how the model could be trained to identify them better, especially because they are fewer in the dataset. Further improvements could be made in terms of fine-tuning the weights of the actual BERT model.


Acknowledgements

This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER, EU) and contract PCIN-2017-079 (AEI/MINECO).
