Scraping Web-Crawler Learning Materials

2022-11-24 11:51:20

If anything here is inappropriate, please contact me and I will remove it promptly.

For this scrape I used requests and XPath, because a task this small does not need a heavyweight framework.

import requests
from lxml import etree

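Approach:

1. The goal is to download the crawler tutorials.

2. Analyze the pages and their structure, then use XPath to extract the download URLs.

3. Download each one in a loop.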
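Step 2 can be tried offline before touching the live site. The following is a minimal sketch, assuming a simplified HTML fragment: `SAMPLE_HTML`, the repository names `RepoA` and `RepoB`, and the helper `extract_repo_links` are all hypothetical and only stand in for the real organization page, whose markup is more complex and may change.

```python
from lxml import etree

# Hypothetical, simplified HTML standing in for the organization page.
# The real GitHub markup is more complex and may differ.
SAMPLE_HTML = """
<div id="org-repositories">
  <ul>
    <li><div class="d-inline-block mb-1"><h3><a href="/Python3WebSpider/RepoA">RepoA</a></h3></div></li>
    <li><div class="d-inline-block mb-1"><h3><a href="/Python3WebSpider/RepoB">RepoB</a></h3></div></li>
  </ul>
</div>
"""


def extract_repo_links(html):
    """Return absolute repository URLs found in the given HTML (illustrative helper)."""
    selector = etree.HTML(html)
    # Same expression as in the post: select the href attribute of every
    # link inside a repository entry of the list.
    hrefs = selector.xpath(
        '//div[@id="org-repositories"]//ul/li'
        '/div[@class="d-inline-block mb-1"]//a/@href'
    )
    # The hrefs are site-relative, so prepend the host.
    return ['https://github.com' + h for h in hrefs]


links = extract_repo_links(SAMPLE_HTML)
print(links)
```

Testing the expression against a saved or hand-written fragment like this makes it easier to tell whether an empty result comes from the XPath itself or from the page having changed.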

The code is as follows:

class github():
  def __init__(self):
    # Organization page that lists the tutorial repositories
    self.allowed_domains = 'https://github.com/Python3WebSpider'
    self.headers = {
      'User-Agent': '***** replace with your own'
    }

  def spider_pipline(self):
    # Fetch the organization page and collect the repository links.
    # The XPath expressions depend on GitHub's page markup at the time of writing.
    response1 = requests.get(self.allowed_domains, headers=self.headers, timeout=5)
    selector = etree.HTML(response1.text)
    main_hrefs = selector.xpath('//div[@id="org-repositories"]//ul/li/div[@class="d-inline-block mb-1"]//a/@href')
    for start_href in main_hrefs:
      # Open each repository page and look for its zip download link
      repo_url = 'https://github.com' + start_href
      response2 = requests.get(repo_url, headers=self.headers, timeout=5)
      selector2 = etree.HTML(response2.text)
      zip_hrefs = selector2.xpath('//main[@id="js-repo-pjax-container"]//div[@class="get-repo-modal-options"]/div[@class="mt-2"]/a[2]/@href')
      for item in zip_hrefs:
        item_new = 'https://github.com' + item
        # Download the archive with the same headers and a timeout
        r = requests.get(item_new, headers=self.headers, timeout=5)
        # Drop the 18-character '/Python3WebSpider/' prefix and turn the rest into a file name
        filename = item[18:].replace('/', '-')
        with open(filename, "wb") as git_zip:
          git_zip.write(r.content)
        print('done-' + filename)

if __name__ == '__main__':
  git = github()
  git.spider_pipline()
  print('all done')
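The slice `item[18:]` in the code works because `'/Python3WebSpider/'` is exactly 18 characters long, so it silently breaks for any other organization name. A minimal sketch of a less brittle variant follows; the helper name `archive_filename`, the constant `ORG_PREFIX`, and the sample path with `RepoA` are assumptions made for illustration and are not part of the original script.

```python
# Prefix that the original code removes with the hard-coded slice item[18:].
ORG_PREFIX = '/Python3WebSpider/'


def archive_filename(href, prefix=ORG_PREFIX):
    """Turn a site-relative download path into a flat local file name (illustrative helper)."""
    # Remove the organization prefix only if it is actually present,
    # instead of assuming a fixed number of characters.
    if href.startswith(prefix):
        href = href[len(prefix):]
    # Replace the remaining path separators so the result is a single file name.
    return href.strip('/').replace('/', '-')


name = archive_filename('/Python3WebSpider/RepoA/archive/master.zip')
print(name)
```

For paths under the same organization this should give the same result as the original slice, while a path with a different prefix is still flattened into a usable file name rather than being cut at an arbitrary position.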

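Finally, I suggest giving the author of these GitHub repositories a star. He is someone I admire, and his book is very good. I recommend buying it to study from, since it helps you build a structured body of knowledge.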
