
CS224N notes_chapter2_word2vec

Lecture 2: word2vec

1. Word meaning

Meaning: the idea that is represented by a word, phrase, writing, art, etc.

How do we have usable meaning in a computer?

Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.

Problems with a taxonomy:

  • missing nuances: e.g., proficient fits an expert better than good does, but a taxonomy treats them as plain synonyms
  • missing new words
  • subjective
  • requires human labor to create and adapt
  • Hard to compute accurate word similarity

Problems with a discrete representation: a one-hot vector needs one dimension per vocabulary word, so the vectors are huge and sparse,

[0,0,0,...,1,...,0]

and one-hot vectors do not capture any relation or similarity between words.
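
A minimal numpy sketch (the five-word vocabulary is made up) illustrating the problem: the dot product of any two distinct one-hot vectors is 0, so "hotel" looks exactly as unrelated to "motel" as to any other word.

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets a one-hot vector of length |V|.
vocab = ["hotel", "motel", "banana", "the", "zebra"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("hotel") @ one_hot("motel"))   # 0.0 -> no similarity signal at all
print(one_hot("hotel") @ one_hot("hotel"))   # 1.0 -> only identical words "match"
```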

Distributional similarity: you can get a lot of value by representing a word by means of its neighbors.

Next, we want to use vectors to represent words.

distributional: understand word meaning from its context.

distributed: use dense vectors to represent the meaning of words.

2. Word2vec intro

Basic idea of learning Neural Network word embeddings

We define a model that predicts between a center word $w_t$ and its context words in terms of word vectors:

$$p(context \mid w_t)$$

which has a loss function like

$$J = 1 - p(w_{-t} \mid w_t)$$

where $w_{-t}$ denotes the words around $w_t$, i.e., the neighbors of $w_t$ excluding $w_t$ itself.

Main idea of word2vec: Predict between every word and its context words.

Two algorithms.

  1. Skip-grams(SG)

    Predict context words given the target (position independent)

    … turning into banking crises as …

    banking: center word

    turning: $p(w_{t-2} \mid w_t)$

    For each word $t=1,\dots,T$, we predict the surrounding words within a window of “radius” $m$ around it:

    $$J'(\theta)=\prod_{t=1}^T \prod_{-m\leq j \leq m,\, j\neq 0} P(w_{t+j}\mid w_t;\theta)$$

    $$J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{-m\leq j \leq m,\, j\neq 0} \log P(w_{t+j}\mid w_t;\theta)$$

    hyperparameter: window size m

    we use $p(w_{t+j}\mid w_t)= \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}$, where $v_c$ is the center-word vector and $u_o$ is the outside (context) word vector,

    the dot product is larger when two words are more similar, and the softmax maps the scores to a probability distribution (see the numpy sketch after this list).

  2. Continuous Bag of Words (CBOW)

    Predict target word from bag-of-words context.
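
As mentioned in the skip-gram item above, here is a small numpy sketch of the softmax probability $p(o\mid c)$ and the resulting negative log-likelihood of one window. The vocabulary size, embedding dimension, the matrices `U` / `V_center`, and the word indices are arbitrary toy values, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                   # toy vocabulary size, embedding dim
U = rng.normal(scale=0.1, size=(V, d))         # outside-word vectors u_w (rows)
V_center = rng.normal(scale=0.1, size=(V, d))  # center-word vectors v_c (rows)

def p_outside_given_center(c):
    """Softmax over dot products: p(o | c) for every word o in the vocabulary."""
    scores = U @ V_center[c]                    # u_w^T v_c for all w
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    return exp_scores / exp_scores.sum()

# Negative log-likelihood of one window: center word c, outside words o_1..o_k.
center, outside = 3, [1, 2, 4, 5]
probs = p_outside_given_center(center)
window_loss = -np.sum(np.log(probs[outside]))
print(window_loss)
```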

3. Research highlight

omitted

4. Word2vec objective function gradients

All parameters in the model:

$$\theta=\begin{bmatrix} v_{a} \\ \vdots \\ v_{zebra} \\ u_{a} \\ \vdots \\ u_{zebra} \end{bmatrix}$$
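
A tiny sketch (sizes are arbitrary) of what this parameter vector looks like concretely: with one $d$-dimensional center vector $v_w$ and one outside vector $u_w$ per word, $\theta$ has $2dV$ entries.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, d = 5, 3                                    # toy sizes
V_center = rng.normal(scale=0.1, size=(vocab_size, d))  # v_a ... v_zebra
U = rng.normal(scale=0.1, size=(vocab_size, d))         # u_a ... u_zebra
theta = np.concatenate([V_center.ravel(), U.ravel()])   # stack everything
print(theta.shape)                                      # (2 * d * vocab_size,) = (30,)
```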

We optimize these parameters by training the model, using gradient descent.

$$\begin{aligned} &\frac{\partial}{\partial v_c}\left(\log \exp(u_o^T v_c)-\log \sum_{w=1}^V \exp(u_w^T v_c)\right) \\ =\;& u_o - \frac{\sum_{x=1}^{V}\exp(u_x^T v_c)\,u_x}{\sum_{w=1}^V \exp(u_w^T v_c)} \\ =\;& u_o - \sum_{x=1}^{V}p(x\mid c)\,u_x \end{aligned}$$
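
A quick numerical check of this gradient, using the same kind of toy setup as above (all names and sizes are illustrative). Note the sign: the derivation differentiates $\log p(o\mid c)$, so the gradient of the loss $-\log p(o\mid c)$ is its negative.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 10, 4
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
v_c = rng.normal(scale=0.1, size=d)      # one center vector
o = 2                                    # index of the observed outside word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss(v_c):
    # -log p(o | c) = -(u_o^T v_c - log sum_w exp(u_w^T v_c))
    scores = U @ v_c
    return -(scores[o] - np.log(np.sum(np.exp(scores))))

# Analytic gradient of the loss w.r.t. v_c: -(u_o - sum_x p(x|c) u_x)
p = softmax(U @ v_c)
grad_analytic = -(U[o] - p @ U)

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = np.array([
    (loss(v_c + eps * np.eye(d)[i]) - loss(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny (~1e-9)
```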

5. Optimization refresher

We compute the gradient at the current point, then move in the direction of the negative gradient.

$$\theta_j^{new}=\theta_j^{old} - \alpha\frac{\partial}{\partial \theta_j^{old}}J(\theta)$$

$\alpha$: step size.

In matrix notation for all parameters:

$$\theta^{new}=\theta^{old} - \alpha\nabla_\theta J(\theta)$$

Stochastic Gradient Descent:

  • a full (global) gradient update over the whole corpus takes far too much time, so instead we update the parameters after each window (see the sketch below)
  • mini-batches are also a good idea
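
A minimal sketch of one SGD step on the naive-softmax loss for a single (center, outside) pair, applying $\theta^{new}=\theta^{old}-\alpha\nabla_\theta J$. The matrices, step size, and word indices are toy values; a real implementation would loop over windows sampled from the corpus.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d = 10, 4
U = rng.normal(scale=0.1, size=(vocab_size, d))         # outside vectors u_w
V_center = rng.normal(scale=0.1, size=(vocab_size, d))  # center vectors v_w
alpha = 0.05                                            # step size

def sgd_step(U, V_center, c, o, alpha):
    """One in-place SGD update for center word c and observed outside word o."""
    p = np.exp(U @ V_center[c])
    p /= p.sum()                        # p(x | c) for every word x
    grad_vc = -(U[o] - p @ U)           # d(-log p(o|c)) / d v_c
    grad_U = np.outer(p, V_center[c])   # d(-log p(o|c)) / d u_x, for every x
    grad_U[o] -= V_center[c]
    V_center[c] -= alpha * grad_vc      # update the one center vector ...
    U -= alpha * grad_U                 # ... and (naively) all outside vectors

sgd_step(U, V_center, c=3, o=1, alpha=alpha)
```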

6. Assignment 1 notes

7. Usefulness of word2vec
