
Bag-of-words model

Bag of words, also called the "word bag" model: in information retrieval, the bag-of-words model assumes that a text's word order, grammar, and syntax can be ignored, treating the text simply as a collection of words, or a combination of words. Every word in the text occurs independently, without depending on whether any other word occurs; put differently, the author is assumed to choose the word at any position independently, uninfluenced by the preceding sentences.
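To make the assumption concrete, here is a minimal Python sketch (the whitespace tokenization is my simplification, not part of the original post): a document is reduced to an unordered multiset of word counts, so any reordering of the same words yields the same representation.

from collections import Counter

def bag_of_words(text):
    # Tokenize naively on whitespace and keep only the counts;
    # word order, grammar, and syntax are all discarded.
    return Counter(text.lower().split())

print(bag_of_words("John likes movies"))
# Counter({'john': 1, 'likes': 1, 'movies': 1})
print(bag_of_words("John likes movies") == bag_of_words("movies John likes"))
# True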

Although this assumption simplifies natural language and makes it easy to model, it is unreasonable in some situations; personalized news recommendation is one case where a bag-of-words model runs into trouble. Suppose user A is very interested in the phrase "Nanjing drunk-driving accident". Because bag of words discards order and syntax, the model only learns that user A is interested in "Nanjing", "drunk", "driving", and "accident", and may therefore recommend news related to "Nanjing", "bus", and "accident", which is clearly unreasonable.

One remedy is to extract the whole phrase with the SCPCD method, or to use a higher-order (order 2 and above) statistical language model, such as a bigram or trigram model, to retain word order. This amounts to a bag of bigrams or a bag of trigrams, and it alleviates the problem to some extent.

In short, whether the bag-of-words model is applicable depends on the actual situation: it should not be used wherever word order, grammar, and syntax cannot be ignored.

Reference: http://en.wikipedia.org/wiki/Bag_of_words_model

Reposted from: http://blog.csdn.net/pennyliang/article/details/4325664

Example:

The following example models a text document using bag-of-words.

Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.
      
(2) John also likes to watch football games.
      

Based on these two text documents, a list is constructed as follows:

[
    "John",
    "likes",
    "to",
    "watch",
    "movies",
    "Mary",
    "too",
    "also",
    "football",
    "games"
]      
For the example above, we can construct the following two lists to record the term frequencies of all the distinct words:

(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

Each entry of the lists is the count of the corresponding word in the vocabulary list above (this is also known as the histogram representation). For example, in the first list (which represents document 1), the first two entries are "1, 2". The first entry corresponds to "John", the first word in the vocabulary, and its value is "1" because "John" appears in the first document 1 time. Similarly, the second entry corresponds to "likes", the second word in the vocabulary, and its value is "2" because "likes" appears in the first document 2 times. This list (or vector) representation does not preserve the order of the words in the original sentences, which is precisely the main feature of the bag-of-words model. This kind of representation has several successful applications, for example email filtering.
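The following Python sketch reproduces the vocabulary and the two frequency vectors above (the regex tokenizer and the first-occurrence vocabulary ordering are my assumptions; the example itself does not specify them):

from collections import Counter
import re

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Keep alphabetic words only, dropping punctuation (an assumed tokenizer).
    return re.findall(r"[A-Za-z]+", text)

# Build the vocabulary in order of first occurrence, as in the list above.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

# One term-frequency vector per document (the histogram representation).
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[w] for w in vocab])

print(vocab)
# ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too', 'also', 'football', 'games']
print(vectors)
# [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]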
      

N-gram model

The bag-of-words model is an orderless document representation: only the counts of words matter. For instance, in the example above, "John likes to watch movies. Mary likes movies too", the bag-of-words representation will not reveal that a person's name is always followed by the verb "likes" in this text. As an alternative, the n-gram model can be used to store this spatial information within the text. Applied to the same example, a bigram model parses the text into the following units and stores the term frequency of each unit as before.
[
    "John likes",
    "likes to",
    "to watch",
    "watch movies",
    "Mary likes",
    "likes movies",
    "movies too",
]      
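
A short Python sketch of the same bigram parsing (splitting on sentence punctuation is my assumption, since the list above does not pair words across the sentence boundary):

from collections import Counter
import re

def bigrams(text):
    # Pair adjacent words within each sentence; bigrams do not
    # cross the sentence boundary, matching the list above.
    grams = []
    for sentence in re.split(r"[.!?]", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        grams.extend(f"{a} {b}" for a, b in zip(words, words[1:]))
    return grams

counts = Counter(bigrams("John likes to watch movies. Mary likes movies too."))
print(list(counts))
# ['John likes', 'likes to', 'to watch', 'watch movies',
#  'Mary likes', 'likes movies', 'movies too']

Each unit is then counted exactly like a single word in the plain bag-of-words case, so local word order is preserved while global order is still discarded.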
