Programming Collective Intelligence Notes: Making Recommendations

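Recommendation is now a very common technology. When you shop online, Amazon recommends products you might be interested in, and movie and music sites recommend films or songs you might like. These notes look at how such recommendations are implemented.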

Collaborative filtering

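In everyday life, the simplest way to get a recommendation is to ask friends; you probably know that certain friends have good taste or share your interests. This doesn't always work, though, because what your friends know is limited. Everyone has struggled at some point to decide where to eat or which product is worth buying.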

That is when a collaborative filtering algorithm becomes useful. A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours.

In effect this widens your circle of friends: the more people there are, the more information is available.

Collecting preferences

The first thing you need is a way to represent different people and their preferences.

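As described above, a collaborative filtering algorithm has to find, among many people, those whose interests are close to yours. The first step is therefore to decide how to represent each person and their interests so that the data can be processed later.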

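The usual approach is to treat each person as a vector, with each item of interest as one dimension of that vector. Every preference has to be quantified, otherwise no computation is possible. For example, mark something you like a lot as 5 and something you find average as 3.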

In Python, a dictionary is a convenient way to represent such vectors:

critics = {
    'lisa rose': {'lady in the water': 2.5, 'snakes on a plane': 3.5,
                  'just my luck': 3.0, 'superman returns': 3.5,
                  'you, me and dupree': 2.5, 'the night listener': 3.0},
    'gene seymour': {'lady in the water': 3.0, 'snakes on a plane': 3.5,
                     'just my luck': 1.5, 'superman returns': 5.0,
                     'the night listener': 3.0, 'you, me and dupree': 3.5},
}

This records how much Lisa and Gene like each movie, on a scale from 1 to 5.

Finding similar users

Above, the users to be fed into collaborative filtering were represented as vectors. The next question is how to find similar users among them.

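Since users are represented as vectors, finding similar users amounts to computing distances between vectors and picking the vectors that are closest to each other.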

I'll show you two systems for calculating similarity scores: Euclidean distance and Pearson correlation.

Euclidean distance

Euclidean distance is the straight-line distance between two points, which is easy to understand.

>>> from math import sqrt
>>> sqrt(pow(5-4,2)+pow(4-1,2))
3.1622776601683795

The code above computes the distance between the points (5, 4) and (4, 1).

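However, you need a function that gives higher values for people who are similar. Adding 1 to the distance (to avoid division by zero) and inverting it gives a score between 0 and 1, where 1 means identical preferences: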

>>> 1/(1+sqrt(pow(5-4,2)+pow(4-1,2)))
0.2402530733520421
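The same idea can be applied to whole preference dictionaries rather than a single pair of points. The following is a minimal sketch, assuming the nested-dictionary layout of critics shown earlier; the function name sim_distance and the choice to return 0 for critics with no movies in common are assumptions made here for illustration.

```python
from math import sqrt

# Two critics' ratings, in the same nested-dictionary layout as above.
critics = {
    'lisa rose': {'lady in the water': 2.5, 'snakes on a plane': 3.5,
                  'just my luck': 3.0, 'superman returns': 3.5,
                  'you, me and dupree': 2.5, 'the night listener': 3.0},
    'gene seymour': {'lady in the water': 3.0, 'snakes on a plane': 3.5,
                     'just my luck': 1.5, 'superman returns': 5.0,
                     'the night listener': 3.0, 'you, me and dupree': 3.5},
}


def sim_distance(prefs, person1, person2):
    """Euclidean-distance similarity in (0, 1]; 0.0 if nothing is shared."""
    # Only movies rated by both people can be compared.
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    # Invert the distance so that closer people get higher scores.
    return 1 / (1 + sqrt(sum_of_squares))


print(sim_distance(critics, 'lisa rose', 'gene seymour'))
```

Comparing a critic with themselves gives a distance of 0 and therefore a similarity of exactly 1.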

Pearson correlation

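Euclidean distance is simple, but it has a problem: ratings are subjective and everyone uses a different standard. Some people consistently rate high and others low, and an absolute distance cannot account for that.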

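The Pearson correlation coefficient measures how well two sets of ratings fit a straight line. If one user's ratings rise and fall together with another's, the two vectors are considered similar even when one user rates consistently higher or on a wider scale.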

Take the vectors (1, 2) and (4, 8): by Euclidean distance they are far apart, but their Pearson correlation is 1, i.e. perfectly similar.
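To make the contrast with Euclidean distance concrete, here is a small sketch of the Pearson correlation for two equal-length lists of ratings. The function name pearson, the list-based interface, and the choice to return 0 when one list is constant are assumptions for this example; the strict and generous lists are made-up data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists, in [-1, 1]."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (the numerator).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = sqrt(spread_x * spread_y)
    if denominator == 0:
        return 0.0  # one list is constant, so the correlation is undefined
    return co_deviation / denominator


# The example from the text: far apart in distance, perfectly correlated.
print(pearson([1, 2], [4, 8]))

# Hypothetical strict and generous critics who agree on the ordering.
strict = [1.0, 2.0, 3.0]
generous = [3.0, 4.0, 5.0]
print(pearson(strict, generous))
```

Both calls give a correlation of 1, because in each pair the second list is a straight-line function of the first.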

There are many other functions such as the Jaccard coefficient or Manhattan distance that you can use as your similarity function.
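As a rough illustration of those two alternatives, the sketch below turns the Manhattan distance into a similarity by inverting it (mirroring the Euclidean case, which is an assumption of this example rather than something the notes prescribe) and computes the Jaccard coefficient on the sets of items two people have rated. The function names are made up for the example.

```python
def sim_manhattan(xs, ys):
    """Similarity from Manhattan distance: 1 / (1 + sum of absolute differences)."""
    distance = sum(abs(x - y) for x, y in zip(xs, ys))
    return 1 / (1 + distance)


def sim_jaccard(items_a, items_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


# Ratings for the same two movies by two people.
print(sim_manhattan([2.5, 3.5], [3.0, 3.5]))

# Sets of movies each person has rated; two of four distinct movies overlap.
print(sim_jaccard({'night', 'lady', 'luck'}, {'lady', 'luck', 'snakes'}))
```

Note that the Jaccard coefficient ignores the rating values entirely and only looks at which items were rated, so it suits data such as purchases or bookmarks better than 1-to-5 scores.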

Ranking the critics

Now that you have functions for comparing two people, you can create a function that scores everyone against a given person and finds the closest matches.
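One possible shape for such a function is sketched below. The names top_matches and sim_distance, the parameter n, and the three people in prefs are all illustrative assumptions; any similarity function with the same signature could be passed in instead.

```python
from math import sqrt


def sim_distance(prefs, person1, person2):
    """Euclidean-distance similarity in (0, 1]; 0.0 if nothing is shared."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    total = sum((prefs[person1][i] - prefs[person2][i]) ** 2 for i in shared)
    return 1 / (1 + sqrt(total))


def top_matches(prefs, person, n=5, similarity=sim_distance):
    """Return up to n (score, other_person) pairs, most similar first."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]


# Hypothetical ratings, only for demonstrating the ranking.
prefs = {
    'alice': {'m1': 4.0, 'm2': 2.0},
    'bob': {'m1': 4.0, 'm2': 3.0},
    'carol': {'m1': 1.0, 'm2': 5.0},
}

print(top_matches(prefs, 'alice', n=2))
```

Here bob differs from alice by one point on a single movie, so he ranks ahead of carol, whose ratings run in the opposite direction.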

Recommending items

Finding a good critic to read is great, but what I really want is a movie recommendation right now.

By computing distances between vectors we can already find the users closest to a given user, but the goal is to recommend movies. So what comes next?

Suppose the following five similar users have rated three movies: Night, Lady and Luck. Here is how to turn their ratings into recommendations:

Critic          Similarity  Night  S.xNight  Lady  S.xLady  Luck  S.xLuck
Rose            0.99        3.0    2.97      2.5   2.48     3.0   2.97
Seymour         0.38        3.0    1.14      3.0   1.14     1.5   0.57
Puig            0.89        4.5    4.02      -     -        3.0   2.68
LaSalle         0.92        3.0    2.77      3.0   2.77     2.0   1.85
Matthews        0.66        3.0    1.99      3.0   1.99     -     -
Total           -           -      12.89     -     8.38     -     8.07
Sim. Sum        -           -      3.84      -     2.95     -     3.18
Total/Sim. Sum  -           -      3.35      -     2.83     -     2.53

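First, multiply each movie rating by the critic's similarity to get a weighted rating, e.g. Similarity * Night = S.xNight. This way, the ratings of more similar users carry more weight.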

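Summing the weighted ratings of all users for a movie gives its total. Using the total directly as the basis for recommendation would favor movies reviewed by more users, so the total is divided by the sum of the similarities of the critics who reviewed that movie, giving Total/Sim. Sum, which is used as the recommendation score.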

Not only do you get a ranked list of movies, but you also get a guess at what my rating for each movie would be.
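The weighted-average calculation can be sketched as follows, using the similarity scores and ratings from the table above (a critic who did not rate a movie simply has no entry for it). The function name recommend and the dictionary layout are assumptions for this example.

```python
# Similarity of each critic to the target user, as given in the table.
similarity = {'rose': 0.99, 'seymour': 0.38, 'puig': 0.89,
              'lasalle': 0.92, 'matthews': 0.66}

# Each critic's ratings; unrated movies are left out.
ratings = {
    'rose': {'night': 3.0, 'lady': 2.5, 'luck': 3.0},
    'seymour': {'night': 3.0, 'lady': 3.0, 'luck': 1.5},
    'puig': {'night': 4.5, 'luck': 3.0},
    'lasalle': {'night': 3.0, 'lady': 3.0, 'luck': 2.0},
    'matthews': {'night': 3.0, 'lady': 3.0},
}


def recommend(similarity, ratings):
    """Return (predicted_rating, movie) pairs, best first."""
    totals = {}    # sum of similarity * rating per movie
    sim_sums = {}  # sum of similarities of the critics who rated the movie
    for critic, sim in similarity.items():
        if sim <= 0:
            continue  # ignore critics with no positive similarity
        for movie, rating in ratings[critic].items():
            totals[movie] = totals.get(movie, 0.0) + sim * rating
            sim_sums[movie] = sim_sums.get(movie, 0.0) + sim
    ranked = [(totals[m] / sim_sums[m], m) for m in totals]
    ranked.sort(reverse=True)
    return ranked


for score, movie in recommend(similarity, ratings):
    print(movie, round(score, 2))
```

Rounded to two decimals, the scores come out as 3.35 for night, 2.83 for lady and 2.53 for luck.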

That completes a recommendation system. The users and movies can be replaced with any other kinds of objects to build all sorts of recommenders.

Item-based filtering

The way the recommendation engine has been implemented so far requires the use of all the rankings from every user in order to create a dataset. This will probably work well for a few thousand people or items, but a very large site like Amazon has millions of customers and products. Comparing a user with every other user and then comparing every product each user has rated can be very slow.

The method described above is fine for small datasets, but on a large dataset such as Amazon's it becomes slow, because every request requires computing the similarity between the user and every other user. This approach is called user-based collaborative filtering. An alternative is known as item-based collaborative filtering. In cases with very large datasets, item-based collaborative filtering can give better results, and it allows many of the calculations to be performed in advance so that a user needing recommendations can get them more quickly.

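Assume the recommender always recommends items to users. The method above first finds the set of users similar to the given user, then recommends based on the items those users like.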

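Items, however, also have similarities among themselves. If the set of items most similar to each item is computed in advance, then recommending to a user only requires looking up the items similar to the ones that user already likes.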

The rationale is that comparisons between items will not change as often as comparisons between users.

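A user's interests may keep changing, so relationships between users keep shifting, whereas relationships between things are relatively stable. The relationship between two movies, for example, is fairly objective.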

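So how is similarity between items computed? Earlier we computed similarity between users; we can transpose that matrix, treating each item as a vector with users' ratings as its dimensions, and compute item similarity the same way.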
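Transposing the nested dictionary is short enough to show in full. This is a sketch assuming the person-to-movie layout used for critics earlier; the function name transform_prefs is a name chosen for this example.

```python
critics = {
    'lisa rose': {'lady in the water': 2.5, 'just my luck': 3.0},
    'gene seymour': {'lady in the water': 3.0, 'just my luck': 1.5},
}


def transform_prefs(prefs):
    """Flip {person: {item: rating}} into {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            # Create the inner dictionary the first time an item is seen.
            result.setdefault(item, {})[person] = rating
    return result


movies = transform_prefs(critics)
print(movies['just my luck'])
```

After the flip, each movie is a vector whose dimensions are the people who rated it, so any similarity function written for people can be reused on movies unchanged.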

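With this method, item similarities fluctuate frequently at first, when there are few user ratings, but become stable once the number of ratings reaches a certain scale. Item similarity can also be computed in other ways; for movies, for example, by comparing synopses or reviews.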

Once the item similarities are available, how are recommendations made?

Suppose a user has rated Snakes, Superman and Dupree. How do we recommend new movies based on those ratings?

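The table below lists the similarity of each rated movie to the movies the user has not yet seen, assuming those are only Night, Lady and Luck.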

Movie       Rating  Night  R.xNight  Lady   R.xLady  Luck   R.xLuck
Snakes      4.5     0.182  0.818     0.222  0.999    0.105  0.474
Superman    4.0     0.103  0.412     0.091  0.363    0.065  0.258
Dupree      1.0     0.148  0.148     0.4    0.4      0.182  0.182
Total       -       0.433  1.378     0.713  1.764    0.352  0.914
Normalized  -       -      3.183     -      2.473    -      2.598

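The calculation is as follows:

R.xNight = Rating * Night (the user's rating of a movie multiplied by that movie's similarity to Night)

Normalized = total of the R.x column / total of the matching similarity column (for Night, 1.378 / 0.433, roughly 3.18)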
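The calculation can be reproduced with a few lines of code. The sketch below uses the ratings and similarity values listed in the table; the function name recommend_items and the dictionary layout are assumptions for this example, and the precomputed similarity table is simply typed in rather than derived.

```python
# The user's own ratings.
user_ratings = {'snakes': 4.5, 'superman': 4.0, 'dupree': 1.0}

# Precomputed similarity of each rated movie to the unseen movies.
item_similarity = {
    'snakes': {'night': 0.182, 'lady': 0.222, 'luck': 0.105},
    'superman': {'night': 0.103, 'lady': 0.091, 'luck': 0.065},
    'dupree': {'night': 0.148, 'lady': 0.4, 'luck': 0.182},
}


def recommend_items(user_ratings, item_similarity):
    """Return (predicted_rating, movie) pairs for unseen movies, best first."""
    weighted = {}    # sum of rating * similarity per candidate movie
    sim_totals = {}  # sum of similarities per candidate movie
    for rated_item, rating in user_ratings.items():
        for candidate, sim in item_similarity[rated_item].items():
            if candidate in user_ratings:
                continue  # never recommend something already rated
            weighted[candidate] = weighted.get(candidate, 0.0) + sim * rating
            sim_totals[candidate] = sim_totals.get(candidate, 0.0) + sim
    ranked = [(weighted[c] / sim_totals[c], c)
              for c in weighted if sim_totals[c] > 0]
    ranked.sort(reverse=True)
    return ranked


for score, movie in recommend_items(user_ratings, item_similarity):
    print(movie, round(score, 3))
```

With these inputs Night ranks first at roughly 3.18, followed by Luck at roughly 2.60 and Lady at roughly 2.47. No other user's ratings are consulted at recommendation time; only the precomputed item similarities are used.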

User-based or item-based filtering?

Item-based filtering is significantly faster than user-based when getting a list of recommendations for a large dataset, but it does have the additional overhead of maintaining the item similarity table.

Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets.

This article is excerpted from cnblogs (博客园); original publication date: 2011-07-04
