The scale step when designing a web crawler

2020-10-22 04:35:00

The so-called scale step is where you deal with the odd corner cases.

For example:

How to handle updates or failures? A file or page crawled now will be stale after a while, or its owner may modify it later, or the fetch may simply fail. What should we do then?

Answer: update or retry on an exponential schedule, for example in week 1, 2, 4, 8, ...
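The exponential schedule can be sketched roughly as follows. This is only an illustration: the function name `next_crawl_times` and the choice to measure every delay from the first crawl are my assumptions, not part of the original design.

```python
from datetime import datetime, timedelta


def next_crawl_times(first_crawl, attempts, base=timedelta(weeks=1)):
    """Return the times of the next `attempts` re-crawls or retries.

    Assumption: attempt i happens base * 2**i after the first crawl,
    i.e. in week 1, 2, 4, 8, ...
    """
    return [first_crawl + base * (2 ** i) for i in range(attempts)]


schedule = next_crawl_times(datetime(2020, 10, 22), attempts=4)
for when in schedule:
    print(when.date())
```

The same function covers both cases: a page that keeps changing can be reset to the start of the schedule, while a page that keeps failing simply moves further out.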

How to handle dead cycles? For example, sina.com has so many internal links that the crawler never gets out of sina.com.

Answer: use a quota! In the task table, give sina.com at most 10% of the tasks and nothing beyond that.
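A minimal sketch of the quota idea, assuming an in-memory task table with a fixed capacity. The class `TaskTable` and its parameters are hypothetical names for illustration; a real crawler would keep the task table in a database.

```python
from collections import Counter
from urllib.parse import urlparse


class TaskTable:
    """Toy in-memory task table with a per-domain quota (illustrative only)."""

    def __init__(self, capacity, max_percent=10):
        self.capacity = capacity
        # Integer arithmetic avoids floating-point rounding surprises.
        self.per_domain_limit = capacity * max_percent // 100
        self.tasks = []
        self.counts = Counter()

    def add(self, url):
        """Queue the URL unless the table or its domain's quota is full."""
        domain = urlparse(url).netloc
        if len(self.tasks) >= self.capacity:
            return False
        if self.counts[domain] >= self.per_domain_limit:
            return False
        self.tasks.append(url)
        self.counts[domain] += 1
        return True


table = TaskTable(capacity=20)  # 10% of 20 means 2 slots per domain
accepted = [table.add(f"https://sina.com/page{i}") for i in range(3)]
print(accepted)
```

The third sina.com URL is rejected, so other domains still get their share of the table.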

How to design a Typeahead (autocomplete search):

Automatically suggest the keywords that are searched most often.

This needs two services: a query service and a data collection service.

Query service: looks up keywords in the database table.

Data collection service: counts how many times each keyword is searched.

Follow up:

When we look up keywords in the database table, we use a statement like this:

SELECT * FROM hit_status WHERE keyword LIKE '${key}%' ORDER BY hit_count DESC LIMIT 10;

But we also notice that the LIKE operation is expensive!

So how do we reduce this cost? Use a Trie!
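As a rough sketch of the Trie idea, assuming keywords and hit counts are loaded into memory. The class and method names here are my own, and this version walks the whole subtree under the prefix on every query.

```python
import heapq


class TrieNode:
    def __init__(self):
        self.children = {}
        self.hit_count = 0  # greater than 0 only if a keyword ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, keyword, hit_count):
        """Insert a keyword together with its hit count."""
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.hit_count = hit_count

    def top_k(self, prefix, k=10):
        """Return up to k keywords starting with prefix, most hits first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:  # iterative walk over the subtree under the prefix
            current, word = stack.pop()
            if current.hit_count > 0:
                found.append((current.hit_count, word))
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return [word for _, word in heapq.nlargest(k, found)]


trie = Trie()
for keyword, hits in [("amazon", 50), ("amazing", 30), ("apple", 90), ("banana", 70)]:
    trie.add(keyword, hits)
print(trie.top_k("am", 2))
```

A common refinement is to store the top 10 keywords directly on every node, so a query only has to walk down the prefix. That trades memory for lookup speed.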

For the data collection service, what kind of storage would you choose?

Use BigTable. But the data is changing rapidly all the time, so how do we update this big table?

We do not update the table in real time. Instead, at some later point we build a new table and replace the old one with it.
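The build-then-replace idea can be sketched in memory like this. It is only an analogy for the table swap: `HitTable` and `search_log` are hypothetical names, and a plain dict stands in for the real table.

```python
from collections import Counter


class HitTable:
    """Serves reads from a snapshot that is replaced as a whole."""

    def __init__(self):
        self._snapshot = {}

    def hit_count(self, keyword):
        return self._snapshot.get(keyword, 0)

    def rebuild(self, search_log):
        """Build a fresh table from the log, then swap it in."""
        new_snapshot = dict(Counter(search_log))
        # Readers see either the old table or the new one, never a
        # half-updated mix, because the swap is one reference assignment.
        self._snapshot = new_snapshot


table = HitTable()
before = table.hit_count("amazon")
table.rebuild(["amazon", "apple", "amazon"])
after = table.hit_count("amazon")
print(before, after)
```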

How to reduce response time?

Cache the results.
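A minimal sketch of result caching using Python's `functools.lru_cache`. The `HIT_COUNTS` dict is made-up data standing in for the real backend, and `backend_calls` exists only to show when the backend is actually hit.

```python
from functools import lru_cache

HIT_COUNTS = {"amazon": 50, "amazing": 30, "apple": 90}  # made-up data
backend_calls = 0


@lru_cache(maxsize=1024)
def suggest(prefix, k=10):
    """Return up to k suggestions for prefix; repeated calls hit the cache."""
    global backend_calls
    backend_calls += 1  # counts real lookups, not cached ones
    matches = [w for w in HIT_COUNTS if w.startswith(prefix)]
    matches.sort(key=lambda w: HIT_COUNTS[w], reverse=True)
    return tuple(matches[:k])  # a tuple, so callers cannot mutate the cached value


first = suggest("am")
second = suggest("am")  # served from the cache
print(first, backend_calls)
```

Cached suggestions go stale once the underlying table is replaced, so `suggest.cache_clear()` would need to be called at that point.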

What if the trie gets too large for one machine?

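In that case we split the query service: for example, the first service only handles keywords starting with a, the second only those starting with b, and so on.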
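Routing by first letter could look roughly like this. The shard names and the fallback shard for non-letter prefixes are assumptions made for the example.

```python
import string

# Hypothetical shard names: one query service per starting letter.
SHARDS = {letter: f"query-service-{letter}" for letter in string.ascii_lowercase}
FALLBACK_SHARD = "query-service-other"  # digits, punctuation, non-Latin text


def shard_for(prefix):
    """Pick the query service responsible for this prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    return SHARDS.get(prefix[0].lower(), FALLBACK_SHARD)


print(shard_for("amazon"))
print(shard_for("Banana"))
print(shard_for("360"))
```

Since some starting letters are far more common than others, the shards will not carry equal load. Splitting a busy letter further by its second character is one possible follow-up.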

How to reduce the size of the log file?

What is the log file table?

So you can imagine that storing 20B Amazon entries takes up a lot of space.

Solution:
