
Bag-of-words model

Bag of words, also called the "word bag" model: in information retrieval, the bag-of-words model assumes that a text's word order, grammar, and syntax can be ignored, treating the text simply as a collection of words, or a combination of words. Every word in the text occurs independently, without depending on whether any other word occurs; put differently, the author is assumed to choose the word at any position independently, uninfluenced by the preceding sentences.
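To make the assumption concrete, here is a minimal Python sketch (the whitespace tokenization is my simplification, not part of the original post): a document is reduced to an unordered multiset of word counts, so any reordering of the same words yields the same representation.

from collections import Counter

def bag_of_words(text):
    # Tokenize naively on whitespace and keep only the counts;
    # word order, grammar, and syntax are all discarded.
    return Counter(text.lower().split())

print(bag_of_words("John likes movies"))
# Counter({'john': 1, 'likes': 1, 'movies': 1})
print(bag_of_words("John likes movies") == bag_of_words("movies John likes"))
# True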

Although this assumption simplifies natural language and makes it easy to model, it is unreasonable in some situations; personalized news recommendation is one case where a bag-of-words model runs into trouble. Suppose user A is very interested in the phrase "Nanjing drunk-driving accident". Because bag of words discards order and syntax, the model only learns that user A is interested in "Nanjing", "drunk", "driving", and "accident", and may therefore recommend news related to "Nanjing", "bus", and "accident", which is clearly unreasonable.

One remedy is to extract the whole phrase with the SCPCD method, or to use a higher-order (order 2 and above) statistical language model, such as a bigram or trigram model, to retain word order. This amounts to a bag of bigrams or a bag of trigrams, and it alleviates the problem to some extent.

In short, whether the bag-of-words model is applicable depends on the actual situation: it should not be used wherever word order, grammar, and syntax cannot be ignored.

Reference: http://en.wikipedia.org/wiki/Bag_of_words_model

Reposted from: http://blog.csdn.net/pennyliang/article/details/4325664

Example:

The following example models a text document using bag-of-words.

Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.
      
(2) John also likes to watch football games.
      

Based on these two text documents, a list is constructed as follows:

[
    "John",
    "likes",
    "to",
    "watch",
    "movies",
    "Mary",
    "too",
    "also",
    "football",
    "games"
]      
For the example above, we can construct the following two lists to record the term frequencies of all the distinct words:

(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

Each entry of the lists is the count of the corresponding word in the vocabulary list above (this is also known as the histogram representation). For example, in the first list (which represents document 1), the first two entries are "1, 2". The first entry corresponds to "John", the first word in the vocabulary, and its value is "1" because "John" appears in the first document 1 time. Similarly, the second entry corresponds to "likes", the second word in the vocabulary, and its value is "2" because "likes" appears in the first document 2 times. This list (or vector) representation does not preserve the order of the words in the original sentences, which is precisely the main feature of the bag-of-words model. This kind of representation has several successful applications, for example email filtering.
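The following Python sketch reproduces the vocabulary and the two frequency vectors above (the regex tokenizer and the first-occurrence vocabulary ordering are my assumptions; the example itself does not specify them):

from collections import Counter
import re

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Keep alphabetic words only, dropping punctuation (an assumed tokenizer).
    return re.findall(r"[A-Za-z]+", text)

# Build the vocabulary in order of first occurrence, as in the list above.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

# One term-frequency vector per document (the histogram representation).
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[w] for w in vocab])

print(vocab)
# ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too', 'also', 'football', 'games']
print(vectors)
# [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]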
      

N-gram model

The bag-of-words model is an orderless document representation: only the counts of words matter. For instance, in the example above, "John likes to watch movies. Mary likes movies too", the bag-of-words representation will not reveal that a person's name is always followed by the verb "likes" in this text. As an alternative, the n-gram model can be used to store this spatial information within the text. Applied to the same example, a bigram model parses the text into the following units and stores the term frequency of each unit as before.
[
    "John likes",
    "likes to",
    "to watch",
    "watch movies",
    "Mary likes",
    "likes movies",
    "movies too",
]      
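
A short Python sketch of the same bigram parsing (splitting on sentence punctuation is my assumption, since the list above does not pair words across the sentence boundary):

from collections import Counter
import re

def bigrams(text):
    # Pair adjacent words within each sentence; bigrams do not
    # cross the sentence boundary, matching the list above.
    grams = []
    for sentence in re.split(r"[.!?]", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        grams.extend(f"{a} {b}" for a, b in zip(words, words[1:]))
    return grams

counts = Counter(bigrams("John likes to watch movies. Mary likes movies too."))
print(list(counts))
# ['John likes', 'likes to', 'to watch', 'watch movies',
#  'Mary likes', 'likes movies', 'movies too']

Each unit is then counted exactly like a single word in the plain bag-of-words case, so local word order is preserved while global order is still discarded.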
