Minimal Usage | Gensim-FastText: Training Word Vectors and Effectively Handling the OOV (Out-of-Vocabulary) Problem

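GloVe, word2vec, and fastText are currently the three most widely used approaches to word vectors. The original training procedure for each of them used to be fairly tedious, so here I list the quick ways of training them that I have ended up using myself.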

For word2vec, see: python | Training word2vec with gensim, plus related functions and features

For GloVe, see: Minimal Usage | Training and using word vectors with Glove-python

github:

mattzheng/gensim-fast2vec

Because this is the FastText that lives inside gensim, fasttext itself also needs to be installed; see:

https://github.com/facebookresearch/fastText/tree/master/python

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

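Contents

- 2 fasttext training
  - 2.1 Main training function
  - 2.2 Saving and loading the model
  - 2.3 Updating the corpus online
  - 2.4 Training with the C++ version of fasttext
- 3 Using fasttext
  - 3.1 Getting word vectors
  - 3.2 The vocabulary
  - 3.3 The same similarity queries as word2vec
  - 3.4 Finding the words most similar to a word
  - 3.5 fasttext's built-in OOV support
  - 3.5 How to get fasttext's n-gram vectors
- 4 fasttext compared with word2vec
- References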

2 fasttext training

2.1 Main training function

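from gensim.models import FastText

# Toy corpus: each sentence is a list of tokens
sentences = [["你", "是", "誰"], ["我", "是", "中國人"]]

# gensim 3.x argument names; gensim 4.x renames size -> vector_size and iter -> epochs
model = FastText(sentences, size=4, window=3, min_count=1, iter=10, min_n=3, max_n=6, word_ngrams=0)
model['你']     # get a word vector (deprecated shortcut for model.wv['你'])
model.wv['你']  # get a word vector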

The signature of the FastText class is:

class gensim.models.fasttext.FastText(sentences=None, corpus_file=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000, callbacks=())

The main parameters mean the following. The names below follow the fastText command-line options; where the gensim signature above uses a different name, it is given in parentheses.

General parameters:
- model: Training architecture. Allowed values: cbow, skipgram (Default cbow). (gensim: sg, where 0 = CBOW and 1 = skip-gram)
- size: Size of embeddings to be learnt (Default 100)
- alpha: Initial learning rate (Default 0.025)
- window: Context window size (Default 5)
- min_count: Ignore words with number of occurrences below this (Default 5)
- loss: Training objective. Allowed values: ns, hs, softmax (Default ns). (gensim: hs and negative)
- sample: Threshold for downsampling higher-frequency words (Default 0.001)
- negative: Number of negative words to sample, for ns (Default 5)
- iter: Number of epochs (Default 5)
- sorted_vocab: Sort vocab by descending frequency (Default 1)
- threads: Number of threads to use (Default 12). (gensim: workers, default 3)
fasttext-specific parameters:
- min_n: min length of char ngrams (Default 3)
- max_n: max length of char ngrams (Default 6)
- bucket: number of buckets used for hashing ngrams (Default 2000000)
Additional parameter:
- word_ngrams ({1,0}, optional)
  - If 1, enriches word vectors with subword (n-gram) information. If 0, this is equivalent to Word2Vec.
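To make min_n, max_n and bucket more concrete, here is a minimal sketch of how a word can be split into character n-grams and how each n-gram could then be mapped to one of `bucket` slots. The n-gram loop follows the compute_ngrams function quoted later in this post. The helper names char_ngrams and bucket_of are hypothetical, and the CRC32 hash is only a stand-in for illustration, not the hash function that fastText or gensim actually uses.

```python
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of '<' + word + '>' with lengths in [min_n, max_n]."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams


def bucket_of(ngram, bucket=2000000):
    """Map an n-gram to a slot in [0, bucket). CRC32 is a stand-in hash (assumption)."""
    return zlib.crc32(ngram.encode("utf-8")) % bucket


ngrams = char_ngrams("中國人", min_n=3, max_n=6)
print(ngrams)
print({g: bucket_of(g) for g in ngrams})
print(char_ngrams("你", min_n=3, max_n=6))
```

With min_n=3 a single-character word such as 你 still yields one n-gram, <你>; with min_n=4 it would have none. A smaller bucket saves memory but makes it more likely that different n-grams share a slot.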

2.2 Saving and loading the model

# Save and load the model (fname is the path of the model file)
model.save(fname)
model = FastText.load(fname)

Since this is the fasttext that lives inside gensim, there is also another option:

fasttext_model.wv.save_word2vec_format('temp/test_fasttext.txt', binary=False)
fasttext_model.wv.save_word2vec_format('temp/test_fasttext.bin', binary=True)

That is, save the fasttext word vectors in word2vec format so that other tools can use them. The text file looks like this:

5 4
是 -0.119938 0.042054504 -0.02282253 -0.10101332
中國人 0.080497965 0.103521846 -0.13045108 -0.01050107
你 -0.0788643 -0.082788676 -0.14035964 0.09101376
我 -0.14597991 0.035916027 -0.120259814 -0.06904249
誰 -0.0021443982 -0.0736454 -0.067576885 -0.025535036

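Thinking about it again, though: if fasttext is saved in word2vec format, can fasttext load it back in?

I am not sure, but I did not find a load_word2vec_format function on FastText or on gensim.models.keyedvectors.FastTextKeyedVectors, so the export appears to be one-way only:

fasttext -> word2vec

Calling FastText.load(fname) on such a file fails, because the file is plain text that starts with the header line 5 4 rather than a pickled model:

UnpicklingError: invalid load key, '5'.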
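The exported text file can still be read back as plain word vectors; gensim's KeyedVectors.load_word2vec_format reads this layout. What the file does not contain is the n-gram matrix, so a model restored this way has no OOV support. The sketch below is a tiny pure-Python reader for the text layout shown above, only to show what the file holds. The function name read_word2vec_text is hypothetical, and the sample uses two of the rows printed earlier.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then one 'word v1 ... vdim' line per word."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(count):
        parts = handle.readline().rstrip("\n").split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError("expected %d values for %r" % (dim, word))
        vectors[word] = values
    return vectors


# Two rows copied from the exported file above, with a matching header
sample = io.StringIO(
    "2 4\n"
    "是 -0.119938 0.042054504 -0.02282253 -0.10101332\n"
    "你 -0.0788643 -0.082788676 -0.14035964 0.09101376\n"
)
vectors = read_word2vec_text(sample)
print(sorted(vectors))
print(vectors["你"])
```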

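2.3 Updating the corpus online

# Update a trained fasttext model online with new sentences
from gensim.models import FastText
sentences_1 = [["cat", "say", "meow"], ["dog", "say", "woof"]]
sentences_2 = [["dude", "say", "wazzup!"]]

model = FastText(min_count=1)
model.build_vocab(sentences_1)
# model.iter is the epoch count set at construction (named model.epochs in gensim 4.x)
model.train(sentences_1, total_examples=model.corpus_count, epochs=model.iter)

# Add the new words to the vocabulary, then continue training on the new sentences
model.build_vocab(sentences_2, update=True)
model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)

The update is done by calling build_vocab with update=True before training on the new sentences.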

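2.4 Training with the C++ version of fasttext

# Use the C++ version of fasttext through the gensim wrapper
# (gensim.models.wrappers exists in gensim 3.x and was removed in gensim 4.0)
from gensim.models.wrappers.fasttext import FastText as FT_wrapper

# Set FastText home to the path to the FastText executable
ft_home = '/home/chinmaya/GSOC/Gensim/fastText/fasttext'

# Train the model; lee_train_file is the path to a plain-text training corpus
model_wrapper = FT_wrapper.train(ft_home, lee_train_file)

print(model_wrapper)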

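3 Using fasttext

3.1 Getting word vectors

model['你']              # get a word vector (deprecated shortcut)
model.wv['你']           # get a word vector
model.wv.word_vec('你')  # get a word vector

These are three ways to get the vector of a single word.

There are also several related attributes: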

sentences = [["你", "是", "誰"], ["我", "是", "中國人"]]
fasttext_model = FastText(sentences, size=4, window=3, min_count=1, iter=10, min_n=3, max_n=6, word_ngrams=0)

fasttext_model.wv.syn0_vocab         # word vector matrix, shape (5, 4)
fasttext_model.wv.vectors_vocab      # word vector matrix, shape (5, 4); vectors_vocab == syn0_vocab != vectors
fasttext_model.wv.vectors            # word vector matrix, shape (5, 4)
fasttext_model.wv.vectors_ngrams     # matrix of vectors for the words' n-grams, shape (10, 4)
fasttext_model.wv.syn0_ngrams        # matrix of vectors for the words' n-grams, shape (10, 4)
fasttext_model.wv.num_ngram_vectors  # number of n-gram vectors
fasttext_model.wv.min_n              # minimum n-gram length

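vectors_ngrams and syn0_ngrams are identical: both are the matrix of n-gram vectors (the syn0_* names are the older aliases of the vectors_* attributes). I am not sure exactly what each of these matrices stands for...

fasttext_model.wv.syn0_ngrams only returns a bare matrix. Its rows are the fasttext vectors of the following n-grams:

['<中國', '中國人', '國人>', '<中國人', '中國人>', '<中國人>', '<你>', '<我>', '<是>', '<誰>']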

3.2 The vocabulary

One way is fasttext_model.wv.vocab, which is a dict; the other is fasttext_model.wv.index2word, which is a list:

fasttext_model.wv.vocab
fasttext_model.wv.index2word
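As far as I understand the gensim 3.x layout, index2word[i] is the word stored in row i of the vector matrix, and each vocab entry records that row index together with the word's count. The sketch below imitates this relationship with plain Python objects. VocabEntry is a hypothetical stand-in, not a gensim class, and the order of equally frequent words here is arbitrary and need not match what gensim produces.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects stored in wv.vocab
VocabEntry = namedtuple("VocabEntry", ["index", "count"])

# Toy corpus from section 2.1
sentences = [["你", "是", "誰"], ["我", "是", "中國人"]]

counts = {}
for sentence in sentences:
    for token in sentence:
        counts[token] = counts.get(token, 0) + 1

# List form: words sorted by descending frequency (ties keep first-seen order here)
index2word = sorted(counts, key=lambda w: -counts[w])
# Dict form: word -> (row index, count)
vocab = {w: VocabEntry(index=i, count=counts[w]) for i, w in enumerate(index2word)}

print(index2word)
print(vocab["是"])
```

The two structures are inverses of each other: index2word[vocab[w].index] gives back w for every word.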

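3.3 The same similarity queries as word2vec

These include:

model.wv.most_similar(positive=['你', '是'], negative=['中國人'])
model.wv.most_similar_cosmul(positive=['你', '是'], negative=['中國人'])

These are analogy queries; most_similar_cosmul uses a multiplicative combination objective to find the closest words.

model.wv.doesnt_match("你 真的 是".split())  # find the word that does not fit

This finds the word that does not belong with the others.

model.wv.similarity('你', '是')  # similarity of two words
model.wv.n_similarity(['cat', 'say'], ['dog', 'say'])  # similarity of two sets of words

similarity gives the similarity between two words; n_similarity gives the similarity between two sets of words.

# !pip3 install pyemd
model.wv.wmdistance(['cat', 'say'], ['dog', 'say'])  # WMD between two token lists

This computes the Word Mover's Distance (WMD) between two token lists from their word vectors.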
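To show what these two numbers measure, here is a small pure-Python sketch using made-up 2-dimensional vectors. It assumes that similarity is the cosine of the two word vectors and that n_similarity is the cosine between the mean vectors of the two word sets, which is my understanding of the gensim behaviour. The toy vectors and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def n_similarity(table, words1, words2):
    """Cosine between the mean vectors of two word sets (assumed definition)."""
    return cosine(mean_vector([table[w] for w in words1]),
                  mean_vector([table[w] for w in words2]))


# Made-up vectors, for illustration only
toy = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "say": [1.0, 1.0]}

print(cosine(toy["cat"], toy["dog"]))
print(n_similarity(toy, ["cat", "say"], ["dog", "say"]))
```

In this toy setup cat and dog are orthogonal, yet the two sets score about 0.8 because they share the word say.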

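3.4 Finding the words most similar to a word

model.wv.most_similar("滋潤")
model.wv.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

Note that most_similar can also be given a vector directly and will look up the words closest to it. The library offers a couple of similar spellings as well:

model.wv.similar_by_vector(model.wv['你好'])  # takes a vector
model.wv.similar_by_word('你好')              # takes a word

There is also an enhanced way of finding similar words:

model.wv.most_similar_cosmul(positive=['蘋果手機'], negative=['手機'], topn=10)

The official explanation is: Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation.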
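The difference between the additive and the multiplicative combination can be sketched with toy vectors. This is only an illustration under my own assumptions: the additive score is taken as the cosine between a candidate and the mean of the positive and negated negative unit vectors, and the multiplicative score shifts each cosine into [0, 1], multiplies the positive terms and divides by the negative terms plus a small epsilon. The function names and the vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cos(u, v):
    """Cosine similarity of two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def cosadd_score(candidate, positive, negative):
    # Additive form: cosine with the mean of +positive and -negative unit vectors
    terms = [unit(p) for p in positive] + [[-x for x in unit(n)] for n in negative]
    mean = [sum(col) / len(terms) for col in zip(*terms)]
    return cos(candidate, mean)


def cosmul_score(candidate, positive, negative, eps=1e-6):
    # Multiplicative form: shift each cosine into [0, 1], then multiply and divide
    pos = 1.0
    for p in positive:
        pos *= (1 + cos(candidate, p)) / 2
    neg = 1.0
    for n in negative:
        neg *= (1 + cos(candidate, n)) / 2
    return pos / (neg + eps)


# Made-up query: two positive vectors and one negative vector
positive = [[1.0, 0.0], [0.0, 1.0]]
negative = [[-1.0, 0.0]]
cand_a = [1.0, 1.0]   # close to both positives, far from the negative
cand_b = [-1.0, 0.0]  # identical to the negative vector

for name, cand in [("a", cand_a), ("b", cand_b)]:
    print(name, cosadd_score(cand, positive, negative), cosmul_score(cand, positive, negative))
```

Both scores rank candidate a above candidate b here; in the multiplicative form a candidate that points exactly away from one positive term gets a score of zero regardless of its other terms.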

3.5 fasttext's built-in OOV support

fasttext can fill in vectors for words that are outside the vocabulary, which is very handy. A top-1% Kaggle solution, "About my 0.9872 single model", mentions using fasttext to deal with the OOV problem.

The original wording:

Fixed misspellings by finding word vector neighborhoods. Fasttext tool can create vectors for out-of-dictionary words which is really nice. I trained my own fasttext vectors on Wikipedia comments corpus and used them to do this. I also used those vectors as embeddings but results were not as good as with regular fasttext vectors.

Example:

model.wv.most_similar("萌萌哒")

[('萌哒', 0.8866026401519775),
 ('桃江', 0.7472578287124634),
 ('比榮欣', 0.69571453332901),
 ('活潑可愛', 0.680438756942749),
 ('小可愛', 0.6803416013717651),
 ('可愛', 0.6769561767578125),
 ('萌', 0.6448146104812622),
 ('卡通', 0.6299867630004883),
 ('漂亮可愛', 0.6273207664489746),
 ('極漂亮', 0.620937705039978)]

For the OOV problem you can also build your own scheme on top of GloVe or other vectors.

3.5 How to get fasttext's n-gram vectors

Added 2018-11-11: a look at how fasttext handles OOV words internally, based on fasttext_wrapper.py:

from gensim.models.utils_any2vec import _save_word2vec_format, _load_word2vec_format, _compute_ngrams, _ft_hash

def compute_ngrams(word, min_n, max_n):
    BOW, EOW = ('<', '>')  # Used by FastText to attach to all words as prefix and suffix
    extended_word = BOW + word + EOW
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return ngrams

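    # Method of FastTextKeyedVectors (excerpt from the gensim source; it relies on numpy as np and on the class attributes)
    def word_vec(self, word, use_norm=False):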
        if word in self.vocab:
            return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
        else:
            # from gensim.models.fasttext import compute_ngrams
            word_vec = np.zeros(self.vectors_ngrams.shape[1], dtype=np.float32)
            ngrams = _compute_ngrams(word, self.min_n, self.max_n)
            if use_norm:
                ngram_weights = self.vectors_ngrams_norm
            else:
                ngram_weights = self.vectors_ngrams
            ngrams_found = 0
            for ngram in ngrams:
                ngram_hash = _ft_hash(ngram) % self.bucket
                if ngram_hash in self.hash2index:
                    word_vec += ngram_weights[self.hash2index[ngram_hash]]
                    ngrams_found += 1
            if word_vec.any():
                return word_vec / max(1, ngrams_found)
            else:  # No ngrams of the word are present in self.ngrams
                raise KeyError('all ngrams for word %s absent from model' % word)

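The code above is how fasttext computes the vector of an OOV word. The rough steps are:

1 Compute the word's n-grams with the _compute_ngrams function

2 Match them against the n-grams stored in the model

3 Average the vectors of the n-grams that matched; that average is the output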
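The three steps can be sketched without gensim. In this simplified version the n-gram vectors live in a plain dict keyed by the n-gram string, so the hashing and hash2index lookup from the excerpt above are left out. The names char_ngrams and oov_vector and the toy vectors are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Step 1: character n-grams of '<' + word + '>' with lengths in [min_n, max_n]."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams


def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Steps 2 and 3: keep the n-grams that are known, then average their vectors."""
    dim = len(next(iter(ngram_vectors.values())))
    total = [0.0] * dim
    found = 0
    for ngram in char_ngrams(word, min_n, max_n):
        if ngram in ngram_vectors:  # dict lookup stands in for the hash2index lookup
            total = [t + x for t, x in zip(total, ngram_vectors[ngram])]
            found += 1
    if found == 0:
        raise KeyError("all ngrams for word %s absent from model" % word)
    return [t / found for t in total]


# Made-up n-gram vectors, for illustration only
toy_ngrams = {"<中國": [1.0, 0.0], "中國人": [0.0, 1.0], "國人>": [1.0, 1.0]}

print(oov_vector("中國", toy_ngrams))
print(oov_vector("大中國人", toy_ngrams))
```

The word 中國 only shares the n-gram <中國 with the table, so it gets that vector back unchanged; 大中國人 matches both 中國人 and 國人> and gets their average. A word with no matching n-gram raises KeyError, as in the excerpt above.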

from gensim.models import FastText
from gensim.models.utils_any2vec import _compute_ngrams, _ft_hash

sentences = [["你", "是", "誰"], ["我", "是", "中國人"]]
fasttext_model = FastText(sentences, size=4, window=3, min_count=1, iter=10, min_n=3, max_n=6, word_ngrams=0)

ngrams = _compute_ngrams('吃了嗎', min_n=3, max_n=6)
# ngrams == ['<吃了', '吃了嗎', '了嗎>', '<吃了嗎', '吃了嗎>', '<吃了嗎>']

I adapted this a little so that the n-gram vectors stored inside fasttext can be extracted.

def FastTextNgramsVector(fasttext_model):
    """Collect {n-gram: vector} for every n-gram of every in-vocabulary word."""
    fasttext_word_list = fasttext_model.wv.vocab.keys()
    NgramsVector = {}
    ngram_weights = fasttext_model.wv.vectors_ngrams  # shape (10, 4) for the toy model
    for word in fasttext_word_list:
        ngrams = _compute_ngrams(word, min_n=fasttext_model.wv.min_n, max_n=fasttext_model.wv.max_n)
        for ngram in ngrams:
            # Same lookup as word_vec: hash the n-gram, then map the hash to a matrix row
            ngram_hash = _ft_hash(ngram) % fasttext_model.wv.bucket
            if ngram_hash in fasttext_model.wv.hash2index:
                NgramsVector[ngram] = ngram_weights[fasttext_model.wv.hash2index[ngram_hash]]
    return NgramsVector

FastTextNgramsVector(fasttext_model)

The final result:

{'<中國': array([ 0.15037228,  0.23413078, -0.09328791,  0.09616131], dtype=float32),
 '<中國人': array([ 0.22894476,  0.01658264,  0.09593856, -0.09224218], dtype=float32),
 '<中國人>': array([ 0.24443054,  0.12408283, -0.109778  ,  0.14463967], dtype=float32),
 '<你>': array([-0.10611233, -0.18498571, -0.24031653,  0.08941776], dtype=float32),
 '<我>': array([-0.14418595, -0.11722667, -0.00421342, -0.22331873], dtype=float32),
 '<是>': array([-0.198387  , -0.02605324,  0.20429775, -0.10319293], dtype=float32),
 '<誰>': array([ 0.0370588 , -0.17663571,  0.04465277,  0.09987918], dtype=float32),
 '中國人': array([ 0.18819457,  0.19730332, -0.2074779 , -0.23047261], dtype=float32),
 '中國人>': array([ 0.09325046,  0.16731283, -0.24085586,  0.12507215], dtype=float32),
 '國人>': array([-0.1650848 ,  0.18907125, -0.20082659, -0.03944619], dtype=float32)}

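One difference to watch out for: an n-gram such as <中國 carries the < boundary symbol (and > marks the end of a word), so these keys are not the same strings as the plain words.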

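4 fasttext compared with word2vec

The example notebook Comparison of FastText and Word2Vec gives an official comparison, within gensim, of fasttext and word2vec in terms of performance and of semantic and syntactic accuracy.

Reference post: https://rare-technologies.com/fasttext-and-gensim-word-embeddings/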


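Conclusions:

FastText models with n-grams do significantly better on syntactic tasks, because syntactic questions are tied to the morphology of the words;
Gensim word2vec and the fastText model without n-grams do slightly better on semantic tasks, presumably because the words in semantic questions are standalone words and unrelated to their character n-grams;
In general, the performance of the models seems to get closer as the corpus size increases. However, this may be because the dimensionality of the models is held constant at 100, and a larger dimensionality for large corpora might bring larger performance gains.
The semantic accuracy of all models increases significantly as the corpus size increases.
However, the gain in syntactic accuracy from a larger corpus is smaller for the n-gram FastText model (in both relative and absolute terms). This may indicate that the advantage gained by incorporating morphological information is less significant with larger corpora (the corpora used in the original paper seem to indicate this too).
The original fastText is written in C++ while the gensim version is written in Python, so the C++ version still runs somewhat faster.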

References

1. facebookresearch/fastText

2. Example: Using FastText via Gensim

3. Example: Comparison of FastText and Word2Vec

4. Official documentation: models.fasttext – FastText model

5. FastText and Gensim word embeddings
