
CS224N notes_chapter2_word2vec

Lecture 2: word2vec

1. Word meaning

Meaning: the idea that is represented by a word, phrase, writing, art, etc.

How do we have usable meaning in a computer?

Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.

Problems with a taxonomy:

  • missing nuances: e.g., proficient fits an expert better than good does, but a taxonomy treats them as plain synonyms
  • missing new words
  • subjective
  • requires human labor to create and adapt
  • Hard to compute accurate word similarity

Problems with a discrete representation: a one-hot vector needs one dimension per vocabulary word, so the vectors are huge and sparse,

[0,0,0,...,1,...,0]

and one-hot vectors do not capture any relation or similarity between words.
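
A minimal numpy sketch (the five-word vocabulary is made up) illustrating the problem: the dot product of any two distinct one-hot vectors is 0, so "hotel" looks exactly as unrelated to "motel" as to any other word.

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets a one-hot vector of length |V|.
vocab = ["hotel", "motel", "banana", "the", "zebra"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("hotel") @ one_hot("motel"))   # 0.0 -> no similarity signal at all
print(one_hot("hotel") @ one_hot("hotel"))   # 1.0 -> only identical words "match"
```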

Distributional similarity: you can get a lot of value by representing a word by means of its neighbors.

Next, we want to use vectors to represent words.

distributional: understand word meaning from its context.

distributed: use dense vectors to represent the meaning of words.

2. Word2vec intro

Basic idea of learning Neural Network word embeddings

We define a model that predicts between a center word $w_t$ and its context words in terms of word vectors:

$$p(context \mid w_t)$$

which has a loss function like

$$J = 1 - p(w_{-t} \mid w_t)$$

where $w_{-t}$ denotes the words around $w_t$, i.e., the neighbors of $w_t$ excluding $w_t$ itself.

Main idea of word2vec: Predict between every word and its context words.

Two algorithms.

  1. Skip-grams(SG)

    Predict context words given the target (position independent)

    … turning into banking crises as …

    banking: center word

    turning: $p(w_{t-2} \mid w_t)$

    For each word $t=1,\dots,T$, we predict the surrounding words within a window of “radius” $m$ around it:

    $$J'(\theta)=\prod_{t=1}^T \prod_{-m\leq j \leq m,\, j\neq 0} P(w_{t+j}\mid w_t;\theta)$$

    $$J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{-m\leq j \leq m,\, j\neq 0} \log P(w_{t+j}\mid w_t;\theta)$$

    hyperparameter: window size m

    we use $p(w_{t+j}\mid w_t)= \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}$, where $v_c$ is the center-word vector and $u_o$ is the outside (context) word vector,

    the dot product is larger when two words are more similar, and the softmax maps the scores to a probability distribution (see the numpy sketch after this list).

  2. Continuous Bag of Words (CBOW)

    Predict target word from bag-of-words context.
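
As mentioned in the skip-gram item above, here is a small numpy sketch of the softmax probability $p(o\mid c)$ and the resulting negative log-likelihood of one window. The vocabulary size, embedding dimension, the matrices `U` / `V_center`, and the word indices are arbitrary toy values, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                   # toy vocabulary size, embedding dim
U = rng.normal(scale=0.1, size=(V, d))         # outside-word vectors u_w (rows)
V_center = rng.normal(scale=0.1, size=(V, d))  # center-word vectors v_c (rows)

def p_outside_given_center(c):
    """Softmax over dot products: p(o | c) for every word o in the vocabulary."""
    scores = U @ V_center[c]                    # u_w^T v_c for all w
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    return exp_scores / exp_scores.sum()

# Negative log-likelihood of one window: center word c, outside words o_1..o_k.
center, outside = 3, [1, 2, 4, 5]
probs = p_outside_given_center(center)
window_loss = -np.sum(np.log(probs[outside]))
print(window_loss)
```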

3. Research highlight

omitted

4. Word2vec objective function gradients

All parameters in the model:

$$\theta=\begin{bmatrix} v_{a} \\ \vdots \\ v_{zebra} \\ u_{a} \\ \vdots \\ u_{zebra} \end{bmatrix}$$
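
A tiny sketch (sizes are arbitrary) of what this parameter vector looks like concretely: with one $d$-dimensional center vector $v_w$ and one outside vector $u_w$ per word, $\theta$ has $2dV$ entries.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, d = 5, 3                                    # toy sizes
V_center = rng.normal(scale=0.1, size=(vocab_size, d))  # v_a ... v_zebra
U = rng.normal(scale=0.1, size=(vocab_size, d))         # u_a ... u_zebra
theta = np.concatenate([V_center.ravel(), U.ravel()])   # stack everything
print(theta.shape)                                      # (2 * d * vocab_size,) = (30,)
```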

We optimize these parameters by training the model, using gradient descent.

$$\begin{aligned} &\frac{\partial}{\partial v_c}\left(\log \exp(u_o^T v_c)-\log \sum_{w=1}^V \exp(u_w^T v_c)\right) \\ =\;& u_o - \frac{\sum_{x=1}^{V}\exp(u_x^T v_c)\,u_x}{\sum_{w=1}^V \exp(u_w^T v_c)} \\ =\;& u_o - \sum_{x=1}^{V}p(x\mid c)\,u_x \end{aligned}$$
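
A quick numerical check of this gradient, using the same kind of toy setup as above (all names and sizes are illustrative). Note the sign: the derivation differentiates $\log p(o\mid c)$, so the gradient of the loss $-\log p(o\mid c)$ is its negative.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 10, 4
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
v_c = rng.normal(scale=0.1, size=d)      # one center vector
o = 2                                    # index of the observed outside word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss(v_c):
    # -log p(o | c) = -(u_o^T v_c - log sum_w exp(u_w^T v_c))
    scores = U @ v_c
    return -(scores[o] - np.log(np.sum(np.exp(scores))))

# Analytic gradient of the loss w.r.t. v_c: -(u_o - sum_x p(x|c) u_x)
p = softmax(U @ v_c)
grad_analytic = -(U[o] - p @ U)

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = np.array([
    (loss(v_c + eps * np.eye(d)[i]) - loss(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny (~1e-9)
```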

5. Optimization refresher

We compute the gradient at the current point, then move in the direction of the negative gradient.

$$\theta_j^{new}=\theta_j^{old} - \alpha\frac{\partial}{\partial \theta_j^{old}}J(\theta)$$

$\alpha$: step size.

In matrix notation for all parameters:

$$\theta^{new}=\theta^{old} - \alpha\nabla_\theta J(\theta)$$

Stochastic Gradient Descent:

  • a full (global) gradient update over the whole corpus takes far too much time, so instead we update the parameters after each window (see the sketch below)
  • mini-batches are also a good idea
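
A minimal sketch of one SGD step on the naive-softmax loss for a single (center, outside) pair, applying $\theta^{new}=\theta^{old}-\alpha\nabla_\theta J$. The matrices, step size, and word indices are toy values; a real implementation would loop over windows sampled from the corpus.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d = 10, 4
U = rng.normal(scale=0.1, size=(vocab_size, d))         # outside vectors u_w
V_center = rng.normal(scale=0.1, size=(vocab_size, d))  # center vectors v_w
alpha = 0.05                                            # step size

def sgd_step(U, V_center, c, o, alpha):
    """One in-place SGD update for center word c and observed outside word o."""
    p = np.exp(U @ V_center[c])
    p /= p.sum()                        # p(x | c) for every word x
    grad_vc = -(U[o] - p @ U)           # d(-log p(o|c)) / d v_c
    grad_U = np.outer(p, V_center[c])   # d(-log p(o|c)) / d u_x, for every x
    grad_U[o] -= V_center[c]
    V_center[c] -= alpha * grad_vc      # update the one center vector ...
    U -= alpha * grad_U                 # ... and (naively) all outside vectors

sgd_step(U, V_center, c=3, o=1, alpha=alpha)
```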

6. Assignment 1 notes

7. Usefulness of word2vec
