http://blog.csdn.net/pipisorry/article/details/51525308
Implementation issues of Gibbs sampling
This post explains how to do document classification (clustering) with Gibbs sampling; for a more complex implementation, see how Gibbs sampling draws the LDA topic distribution [主題模型TopicModel:隐含狄利克雷分布LDA].
Articles introducing Gibbs sampling mostly stop at a lengthy description of the idea, e.g. 随機采樣和随機模拟:吉布斯采樣Gibbs Sampling (the why), without explaining how Gibbs sampling is actually implemented (the how).
That is: concretely, how do we sample from the following formula?
$$z_i^{(t+1)} \sim P\left(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}\right)$$
How do we handle continuous parameters in the model?
How do we obtain the expected value of the quantity we are ultimately interested in, rather than merely doing T steps of a random walk?
Below we show how to build a Gibbs sampler for Naïve Bayes [機率圖模型:貝葉斯網絡與樸素貝葉斯網絡]. This involves two big questions: how do we exploit conjugate priors, and how do we actually sample from the conditional distribution in equation (14)?
A Gibbs sampler for the Naive Bayes model
Problem
We classify documents (both unsupervised and supervised) by Gibbs sampling under a Naive Bayes framework. The features are the words of a document, and the quantity to predict is a doc-level class label (e.g. a sentiment label) taking the value 0 or 1.
We first derive the sampler on unlabeled data; the simplification for labeled data is described later.
Following Pedersen [T. Pedersen. Knowledge lean word sense disambiguation. In AAAI/IAAI, page 814, 1997., T. Pedersen. Learning Probabilistic Models of Word Sense Disambiguation. PhD thesis, Southern Methodist University, 1998. http://arxiv.org/abs/0707.3972.], we’re going to describe the Gibbs sampler in a completely unsupervised setting where no labels at all are provided as training data.
Model representation
The plate diagram corresponding to the Naive Bayes model:
Figure 4-1
The variables are summarized in the following table:
1. Each document has a label L(j), the document's class. θ0 and θ1 correspond to Lj: if Lj = 1, the relevant parameter is θ1.
3. In this model, the P(Z) of Gibbs sampling is the joint probability of generating the entire data set, i.e. the joint probability of generating all N documents, including the probability that the hyperparameter γ generates the specific π and θ. This yields the expression and the corresponding colors in the figure above.
Given the documents, we want to choose each document's label L so as to maximize the following probability:
The document generation process
Generating the label of document j
Generating the bag of words of document j
Priors
Where does π come from?
Hyperparameters: parameters of a prior, which is itself used to pick the parameters of the model.
Our generative story is going to assume that before this whole process began, we also picked π randomly. Specifically, we'll assume that π is sampled from a Beta distribution with parameters γπ1 and γπ0.
In Figure 4 we represent these two hyperparameters as a single two-dimensional vector γπ = ⟨γπ1, γπ0⟩. When γπ1 = γπ0 = 1, Beta(γπ1, γπ0) is just a uniform distribution, which means that any value for π is equally likely. For this reason we call Beta(1, 1) an "uninformed prior".
Where do θ0 and θ1 come from?
Let γθ be a V-dimensional vector where the value of every dimension equals 1. If θ0 is sampled from Dirichlet(γθ), every probability distribution over words will be equally likely. Similarly, we'll assume θ1 is sampled from Dirichlet(γθ).
Note: θ0 is the distribution over words for documents with label 0; θ1 is the distribution over words for documents with label 1. θ0 and θ1 are sampled separately; there is no assumption that they are related to each other at all.
2.3 模型的狀态空間和初始化
狀态空間
樸素貝葉斯模型中狀态空間的變量定義
• one scalar-valued variable π (文檔j的label為1的機率)
• two vector-valued variables, θ 0 and θ 1
• binary label variables L, one for each of the N documents
We also have one vector variable Wj for each of the N documents, but these are observed variables, i.e. their values are already known (which is why Wjk is shaded in Figure 4).
Initialization
Pick a value π by sampling from the Beta(γπ1, γπ0) distribution. This draws the probability that a document's label is 1, and thereby fixes the Bernoulli distribution (π, 1 − π) over each document's label.
Then, for each j, flip a coin with success probability π, and assign label Lj(0) — that is, the label of document j at the 0th iteration — based on the outcome of the coin flip. In other words, sample each document's label from the Bernoulli distribution obtained in the previous step.
Similarly, initialize θ0 and θ1 by sampling from Dirichlet(γθ).
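These three initialization steps can be sketched in plain Python. This is a minimal illustration, not the paper's code; the sizes and hyperparameter values are assumptions, `random.betavariate` draws π, and the Dirichlet draw is implemented by normalizing `random.gammavariate` samples:

```python
import random

random.seed(0)

V = 5                     # vocabulary size (assumed)
N_DOCS = 10               # number of documents (assumed)
GAMMA_PI = (1.0, 1.0)     # (gamma_pi1, gamma_pi0): uninformed Beta prior
GAMMA_THETA = [1.0] * V   # uninformed Dirichlet prior

def sample_dirichlet(alpha):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(a, 1) draws."""
    ys = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

# 1. pi ~ Beta(gamma_pi1, gamma_pi0)
pi = random.betavariate(*GAMMA_PI)

# 2. for each document j, flip a coin with success probability pi
labels = [1 if random.random() < pi else 0 for _ in range(N_DOCS)]

# 3. theta_0, theta_1 ~ Dirichlet(gamma_theta)
theta = [sample_dirichlet(GAMMA_THETA), sample_dirichlet(GAMMA_THETA)]
```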
Deriving the joint distribution
For each iteration t = 1 . . . T of sampling, we update every variable defining the state space by sampling from its conditional distribution given the other variables, as described in equation (14).
The procedure:
• We will define the joint distribution of all the variables, corresponding to the numerator in (14).
• We simplify our expression for the joint distribution.
• We use our final expression of the joint distribution to define how to sample from the conditional distribution in (14).
• We give the final form of the sampler as pseudocode.
Expressing and simplifying the joint distribution
The model's joint distribution over the whole document collection is
$$P(C, L, \pi, \theta_0, \theta_1 ; \gamma_{\pi}, \gamma_{\theta})$$
Note: the terms to the right of the semicolon are the parameters of the joint distribution; that is, the variables to the left of the semicolon are conditioned on those hyperparameters.
By the graphical model, the joint distribution factorizes as:
Factor 1:
This is the red part of Figure 4-1: drawing a Bernoulli distribution from a Beta prior. The Bernoulli has a single parameter π, and we can drop the normalization term (we want the whole joint probability, so agonizing over normalization here is pointless), which gives:
$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) \propto \pi^{\gamma_{\pi 1}-1} (1-\pi)^{\gamma_{\pi 0}-1}$$
Factor 2:
The green part of Figure 4-1: here L is an entire vector, with C0 entries equal to 0 and C1 entries equal to 1. Repeated Bernoulli trials give a binomial, so:
$$P(L \mid \pi) = \pi^{C_1} (1-\pi)^{C_0}$$
Factor 3:
The word probability distributions, the blue part of Figure 4-1: the θ for class 0 and the θ for class 1 are both sampled from a Dirichlet with parameter γθ. Note that these are all V-dimensional vectors, one dimension per word. The Dirichlet PDF gives the expression below; in fact there are two copies of it, since θ0 and θ1 are sampled by the same formula:
$$P(\theta_x \mid \gamma_{\theta}) \propto \prod_{i=1}^{V} \theta_{x,i}^{\gamma_{\theta i}-1}$$
Factor 4:
P (C 0 |θ 0 , L) and P (C 1 |θ 1 , L): the probabilities of generating the contents of the bags of words in each of the two document classes.
The purple part of Figure 4-1: first we need the probability of generating all the words Wn of a single document n. Suppose some document has θ = (0.2, 0.5, 0.3), meaning word1 is generated with probability 0.2 and word2 with probability 0.5. If this document contains word1 twice, word2 three times, and word3 twice, its generation probability is (0.2·0.2)·(0.5·0.5·0.5)·(0.3·0.3). By the same logic, the joint probability of one whole document is as follows:
let θ = θ_{L_n}:
Wni is the count of word i in Wn.
Documents are mutually independent, and documents within the same class can be merged: the probability above is for a single document, and multiplying these probabilities over all documents yields the purple part of Figure 4-1:
Note: x can be 0 or 1, so Cx is C0 or C1. When x = 0, n ∈ Cx ranges over all documents of class 0; for each word i in such a document, appearing Wni times, we take θ0,i to the power Wni, and the product of all these terms is the purple part. Equation 27 is the result after collecting terms: NCx(i) is the count of word i over all documents with class label x, e.g. NC0(i) is the total number of times word i appears in all class-0 documents, and similarly for NC1(i).
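The worked example above, plus the regrouping that leads to equation 27, can be checked numerically (the second document's counts below are invented for illustration):

```python
import math

# Worked example from the text: theta = (0.2, 0.5, 0.3),
# a document with word counts (2, 3, 2).
theta = [0.2, 0.5, 0.3]
doc_counts = [2, 3, 2]

# (0.2*0.2) * (0.5*0.5*0.5) * (0.3*0.3) = 0.00045
p_doc = math.prod(t ** c for t, c in zip(theta, doc_counts))

# Regrouping check: the product over class-x documents of prod_i theta_i^{W_ni}
# equals prod_i theta_i^{N_Cx(i)}, where N_Cx(i) pools counts over the class.
docs = [[2, 3, 2], [1, 0, 4]]              # two class-x documents (invented)
per_doc = math.prod(
    math.prod(t ** c for t, c in zip(theta, d)) for d in docs)
pooled = [sum(col) for col in zip(*docs)]  # N_Cx(i) for each word i
grouped = math.prod(t ** n for t, n in zip(theta, pooled))
```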
The joint distribution and why these priors were chosen
Using equations 19 and 21:
Using equations 24 and 25:
If we use the words of all documents (i.e. equations 24 and 27):
The posterior in equation 30 is an unnormalized Beta distribution with parameters C1 + γπ1 and C0 + γπ0, and equation 32 is an unnormalized Dirichlet distribution with parameter vector NCx(i) + γθi for 1 ≤ i ≤ V.
In other words, the prior and the posterior have the same form: the Beta distribution is the conjugate prior of the binomial (and Bernoulli) distribution, and the Dirichlet distribution is the conjugate prior of the multinomial.
The hyperparameters act just like observed evidence: they are pseudocounts.
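The pseudocount view can be verified directly: multiplying the Beta prior density by the Bernoulli/binomial likelihood yields, up to a constant, the Beta density with the observed counts added to the hyperparameters. A small numerical check, with invented hyperparameters and counts:

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Hyperparameters (pseudocounts) and observed label counts (invented numbers).
g1, g0 = 2.0, 2.0    # gamma_pi1, gamma_pi0
C1, C0 = 7, 3        # documents labeled 1 and 0

def unnorm_posterior(x):
    # prior * likelihood: x^(g1-1)(1-x)^(g0-1) * x^C1 (1-x)^C0
    return x ** (g1 - 1) * (1 - x) ** (g0 - 1) * x ** C1 * (1 - x) ** C0

# Conjugacy: the posterior is Beta(C1 + g1, C0 + g0). The ratio of the
# unnormalized posterior to that Beta density must be constant in x.
xs = [0.2, 0.5, 0.8]
ratios = [unnorm_posterior(x) / beta_pdf(x, C1 + g1, C0 + g0) for x in xs]
```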
With these definitions, the joint distribution over the whole document collection can be written as:
Integrating out the hidden variable π
Why: for convenience, we can integrate over the hidden variable π and thereby eliminate it, reducing the model's effective number of parameters. This has the effect of taking all possible values of π into account in our sampler, without representing it as an explicit variable that must be sampled at every iteration. Intuitively, "integrating out" a variable applies precisely the same principle as computing the marginal probability of a discrete distribution. As a result, π is still "there" conceptually, in terms of our understanding of the model, but we don't need to manipulate it explicitly as a parameter.
Note: "integrating out" simply means
$$P(C, L, \theta_0, \theta_1 ; \gamma_{\pi}, \gamma_{\theta}) = \int P(C, L, \pi, \theta_0, \theta_1 ; \gamma_{\pi}, \gamma_{\theta}) \, d\pi$$
The marginal of the joint distribution is then:
Considering only the integral term:
The integral term at the end of equation 38 is an unnormalized Beta distribution with parameters C1 + γπ1 and C0 + γπ0, and the integral of Beta(C1 + γπ1, C0 + γπ0) is
$$\int_0^1 \pi^{C_1+\gamma_{\pi 1}-1} (1-\pi)^{C_0+\gamma_{\pi 0}-1} \, d\pi = \frac{\Gamma(C_1+\gamma_{\pi 1}) \, \Gamma(C_0+\gamma_{\pi 0})}{\Gamma(C_1+\gamma_{\pi 1}+C_0+\gamma_{\pi 0})}$$
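This Gamma-function identity is easy to sanity-check numerically; the values of `a` and `b` below are arbitrary stand-ins for C1 + γπ1 and C0 + γπ0:

```python
import math

a, b = 3.0, 5.0   # stand-ins for C1 + gamma_pi1 and C0 + gamma_pi0

# Midpoint Riemann sum of the unnormalized Beta integrand over [0, 1].
n = 200000
s = sum(((k + 0.5) / n) ** (a - 1) * (1 - (k + 0.5) / n) ** (b - 1)
        for k in range(n)) / n

# Closed form: Gamma(a) Gamma(b) / Gamma(a + b)
beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
```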
Let N = C0 + C1.
Equation 38 then becomes:
The joint distribution over the whole document collection (as a three-factor expression) is:
where N = C0 + C1.
Building the Gibbs sampler
Gibbs sampling gives each Zi a new value by drawing from its conditional probability given all the other variables.
For example, to compute Lj(t+1), we need its conditional distribution given everything else in the current state.
Note: There’s no superscript on the bags of words C because they’re fully observed and don’t change from iteration to iteration.
To compute a new θ0, we likewise need its conditional distribution.
Intuitively, before each iteration t begins we have the following current information:
the word counts of each document, the number of documents labeled 0, the number labeled 1, each document's current label, the current values of θ0 and θ1, and so on.
Sampling rules
Sampling a label: When we want to sample the new label for document j, we temporarily remove all information (i.e. word counts and label information) about this document from the collection. Then we look at the conditional probability that Lj = 0 given all the remaining information, and the conditional probability that Lj = 1 given the same information, and we sample the new label Lj(t+1) by choosing randomly according to the relative weight of those two conditional probabilities.
Sampling θ: sampling the new values of θ operates according to the same principle.
2.5.1 Sampling document labels
Define the conditional probability:
L (−j) are all the document labels except L j , and C (−j) is the set of all documents except W j .
The numerator is the full joint probability distribution, and the denominator is the same expression with Wj's information removed, so we only need to consider the three factors of equation 40.
In fact, all we have to do is figure out what changes once Wj is removed.
Factor 1
Since Factor 1 depends only on the hyperparameters, it is identical in numerator and denominator and can be ignored; we therefore only consider Factors 2 and 3 of equation 40.
Factor 2
Computing the denominator of Factor 2 in equation 42 depends on what Lj was in the previous iteration, but in every case the corpus size drops from N to N − 1 and exactly one of the class counts decreases by 1. For example, if Lj = 0 then C0(−j) = C0 − 1 while C1 is unchanged. Letting x be the class for which Cx(−j) = Cx − 1, Factor 2 of equation 42 can be rewritten as:
Since Γ(a + 1) = aΓ(a) for all a, Factor 2 of equation 42 simplifies to:
Factor 3
As with Factor 2, one class's term is always unchanged: one of θ0 or θ1 appears identically in the numerator and denominator of Factor 3 in equation 42 and cancels.
Combining
For x ∈ {0, 1}, the final combined conditional distribution for sampling a document's label is:
Equation 49 shows how a document's label is chosen:
Factor 1 of equation 49: the probability that Lj = x considering only the distribution of the other labels.
Factor 2 of equation 49 is like a word-distribution "fitting room": an indication of how well the words in Wj "fit" each of the two distributions.
In the first factor, only Cx actually varies, so if C0 is larger, P(Lj(t+1) = 0) is a bit larger and the next Lj leans toward class 0; conversely it leans toward class 1. The second factor asks, under the current θ parameters, whether the document's likelihood favors class 0 or class 1.
The procedure for sampling from the conditional distribution of equation 42 is as follows.
Note: step 3 normalizes the probability distribution over the two labels.
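One label-resampling step can be sketched as follows. Per the derivation above, the first factor is Cx(−j) + γπx over N − 1 + γπ1 + γπ0 (the denominator is the same for both labels and cancels after normalization anyway); all the data values here are invented, and the variable names are our own:

```python
import math
import random

random.seed(1)

V = 3
g_pi1, g_pi0 = 1.0, 1.0
theta = [[0.5, 0.3, 0.2],   # theta_0
         [0.1, 0.2, 0.7]]   # theta_1

docs = [[3, 1, 0], [0, 1, 4], [2, 2, 1]]   # word-count vectors (invented)
labels = [0, 1, 0]

def resample_label(j):
    # Step 1: temporarily remove document j's information from the counts.
    counts = [labels.count(0), labels.count(1)]
    counts[labels[j]] -= 1                 # counts[x] is now C_x^(-j)
    n_minus = len(docs) - 1                # N - 1
    weights = []
    for x in (0, 1):
        # Factor 1: label prior with document j removed.
        prior = (counts[x] + (g_pi1 if x == 1 else g_pi0)) / (
            n_minus + g_pi1 + g_pi0)
        # Factor 2: how well the words W_j "fit" theta_x.
        lik = math.prod(theta[x][i] ** docs[j][i] for i in range(V))
        weights.append(prior * lik)
    # Step 3: normalize over the two labels, then draw the new label.
    total = sum(weights)
    probs = [w / total for w in weights]
    labels[j] = 1 if random.random() < probs[1] else 0
    return probs

probs = resample_label(0)
```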
Handling labeled data
To use labeled documents, just don't sample Lj for those documents! Always keep Lj equal to the observed label.
The documents will effectively serve as “ground truth” evidence for the distributions that created them. Since we never sample for their labels, they will always contribute to the counts in (49) and (51) and will never be subtracted out.
2.5.2 Sampling θ
Since the distributions of θ0 and θ1 are estimated independently, we drop the subscript on θ here.
Clearly, since we used conjugate priors, this posterior, like the prior, works out to be a Dirichlet distribution. We actually derived the full expression, but we don't need it here. All we need to do to sample a new distribution is make another draw from a Dirichlet, this time with parameters NCx(i) + γθi for each i in V.
Define the V-dimensional vector t such that ti = NCx(i) + γθi (the subscript i denotes the i-th element of t); the new θ is then sampled as θ ∼ Dirichlet(t).
Implementing sampling from a Dirichlet distribution
Goal: sample a random vector a = ⟨a1, . . . , aV⟩ from the V-dimensional Dirichlet distribution with parameters ⟨α1, . . . , αV⟩.
The fastest implementation is to draw V independent samples y1, . . . , yV from Gamma distributions, each with density
$$\mathrm{Gamma}(y_v; \alpha_v, 1) = \frac{y_v^{\alpha_v - 1} e^{-y_v}}{\Gamma(\alpha_v)}$$
and then set
$$a_v = \frac{y_v}{\sum_{j=1}^{V} y_j}$$
(i.e. normalize the Gamma samples).
[http://en.wikipedia.org/wiki/Dirichlet_distribution]
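A sketch of that recipe in plain Python, applied to the θ update with parameters NCx(i) + γθi (the counts below are invented; `random.gammavariate(a, 1.0)` supplies the Gamma(a, 1) draws):

```python
import random

random.seed(2)

def sample_dirichlet(alpha):
    """Draw a = (a_1,...,a_V) ~ Dirichlet(alpha): draw V independent
    Gamma(alpha_v, 1) variates y_v, then normalize a_v = y_v / sum(y)."""
    ys = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(ys)
    return [y / total for y in ys]

# Resampling theta_x: Dirichlet parameters are N_Cx(i) + gamma_theta_i.
N_Cx = [4, 0, 7]           # per-word counts in class-x documents (invented)
gamma_theta = [1.0] * 3    # uninformed prior
t = [n + g for n, g in zip(N_Cx, gamma_theta)]
theta_x = sample_dirichlet(t)
```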
The complete Gibbs sampling framework for the Naive Bayes model
Model initialization
γπ = ⟨1, 1⟩: uninformed prior (a uniform distribution)
γθ = ⟨1, ..., 1⟩: let γθ be a V-dimensional vector in which every dimension equals 1; an uninformed prior
Pick a value π by sampling from the Beta(γπ1, γπ0) distribution. This draws the probability that a document's label is 1, and thereby fixes the Bernoulli distribution (π, 1 − π) over each document's label.
Then, for each j, flip a coin with success probability π, and assign label Lj(0) — that is, the label of document j at the 0th iteration — based on the outcome of the coin flip. In other words, sample each document's label from the Bernoulli distribution obtained in the previous step.
Similarly, initialize θ0 and θ1 by sampling from Dirichlet(γθ).
Model iteration
2.5.1 The sampling formula for the label of document j
2.5.2 In the θ-sampling step, ti = NCx(i) + γθi (the subscript i denotes the i-th element of the V-dimensional vector t).
Note (lz): for parallel computation, essentially only three variables need special care: the class counts, the per-class word counts, and θ. Also, step 3 of the algorithm appears to contain a typo: the "not" should probably be removed.
Note: as soon as a new label for L j is assigned, this changes the counts that will affect the labeling of the subsequent documents. This is, in fact, the whole principle behind a Gibbs sampler!
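Putting the pieces together, the sampler's main loop might look like the following sketch. This is an illustration of the scheme above on a tiny invented corpus, not the paper's reference code; all names and data are our own, and labels are resampled sequentially so that each new Lj immediately changes the counts seen by the next document:

```python
import math
import random

random.seed(3)

V = 3
docs = [[4, 1, 0], [3, 2, 0], [0, 1, 5], [1, 0, 4]]  # invented word counts
g_pi = (1.0, 1.0)       # (gamma_pi1, gamma_pi0)
g_theta = [1.0] * V

def sample_dirichlet(alpha):
    ys = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(ys)
    return [y / s for y in ys]

# --- Initialization ---
pi = random.betavariate(*g_pi)
labels = [1 if random.random() < pi else 0 for _ in docs]
theta = [sample_dirichlet(g_theta), sample_dirichlet(g_theta)]

# --- Iteration ---
samples = []
T, B = 200, 50          # total iterations and burn-in
for t in range(T):
    # (a) resample every document's label sequentially
    for j, w in enumerate(docs):
        counts = [labels.count(0), labels.count(1)]
        counts[labels[j]] -= 1              # remove document j's information
        weights = []
        for x in (0, 1):
            prior = counts[x] + (g_pi[0] if x == 1 else g_pi[1])
            lik = math.prod(theta[x][i] ** w[i] for i in range(V))
            weights.append(prior * lik)     # shared denominator cancels
        z = sum(weights)
        labels[j] = 1 if random.random() < weights[1] / z else 0
    # (b) resample theta_0 and theta_1 from their Dirichlet posteriors
    for x in (0, 1):
        n_cx = [sum(w[i] for w, l in zip(docs, labels) if l == x)
                for i in range(V)]
        theta[x] = sample_dirichlet([n + g for n, g in zip(n_cx, g_theta)])
    if t >= B:                              # keep only post-burn-in samples
        samples.append(list(labels))

# Posterior mean of each document's label over the kept samples.
means = [sum(s[j] for s in samples) / len(samples) for j in range(len(docs))]
```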
Generating values from the Gibbs sampler
Both the initialization and every sampling iteration of the Gibbs sampler produce a value for each variable (for iterations t = 1, 2, . . . , T). In theory, the approximated value for any variable Zi can simply be obtained by calculating:
$$\bar{Z}_i = \frac{1}{T} \sum_{t=1}^{T} z_i^{(t)}$$
As we know, the Gibbs iterations only settle into the stationary distribution after convergence, so the sum in equation 59 generally does not start at 1 but runs from B + 1 through T; sampling results with t < B are discarded as burn-in.
In this context, Jordan Boyd-Graber (personal communication) also recommends looking at Neal’s [15] discussion of likelihood as a metric of convergence.
PS
2.6 Optional: A Note on Integrating out Continuous Parameters
In Section 3 we discuss how to actually obtain values from a Gibbs sampler, as opposed to merely watching it walk around the state space. (Which might be entertaining, but wasn’t really the point.) Our discussion includes convergence and burn-in, auto-correlation and lag, and other practical issues.
In Section 4 we conclude with pointers to other things you might find it useful to read, as well as an invitation to tell us how we could make this document more accurate or more useful.
Parallelization with synchronous graph computation
(lz: the following is my own take and not necessarily correct.)
Split the document data into M shards, distributed across M worker nodes.
Superstep 1: every node runs the model-initialization part of the Naive Bayes Gibbs sampler, obtaining initial values of the shared variables (the counts and θ).
Supersteps 2+:
Before a superstep starts, the master node uses the increment information sent by the other nodes to compute the new values of the shared variables and distributes them to all worker nodes (the global model).
Each node keeps its own local information; having received the other nodes' increments and the master's update, it reruns the iteration part of the Naive Bayes Gibbs sampler and computes new local values (the local model).
After a node has iterated over all of its data, it computes its increments to the shared variables and sends them to the master when the superstep ends.
Samples from all nodes are discarded during the burn-in phase; once in the convergence phase, the sampled label of every document is recorded at each iteration.
The whole process runs for many iterations, and each document's final label is whichever label the mean of its recorded label samples leans toward.
Another possible synchronous graph-parallel approach:
Analogous to the GraphX-based implementation of Spark MLlib LDA: treat document-to-word pairs as edges with term frequency as edge data, build the corpus into a graph, and turn every operation on each word of each document into an operation on an edge of the graph; processing the edge RDD is the most common pattern in GraphX.
[Spark MLlib LDA 基于GraphX實作原理及源碼分析]
The GraphX-based Gibbs Sampling LDA defines a bipartite document-word graph whose vertex attributes are the topic-count vectors of documents or words, and whose edge attributes are the new topics generated by the Gibbs sampler in each round. Every iteration samples new topics and uses the mapReduceTriplets function to accumulate the corresponding topic counts for documents and words. This appears to be the Pregel style of processing; LDA has been implemented on Pregel.
[基于GraphX實作的Gibbs Sampling LDA]
[Collapsed Gibbs Sampling for LDA]
[LDA中Gibbs采樣算法和并行化]
from: http://blog.csdn.net/pipisorry/article/details/51525308
ref: Philip Resnik, Eric Hardisty: Gibbs Sampling for the Uninitiated
Reading Note : Gibbs Sampling for the Uninitiated