
「Data Collection」Lab 3

I. Assignment ①

  • Requirement: pick a website (for example, the China Weather Network) and crawl all of its images, once with a

    single-threaded

    crawler and once with a multi-threaded

    one. (The number of images to crawl is limited to the last three digits of the student ID.)
  • Output: print each downloaded URL to the console, save the downloaded images in the images subfolder, and provide screenshots.

(1) Single-threaded crawling

Gitee link: 作业3_1_1

1. Parsing the page

1.1 Page navigation
  • Select some of the headlines on the home page and locate the links they navigate to, as shown below
    (screenshot)
  • Build a regular expression to extract the link information

    link = re.findall('a href="(http://.*?)"', resp.text)

1.2 Image links
    (screenshot)
  • imgurl = re.findall('src="(.*?)"', data)

2. Fetching the page source with getHTMLText(url)

def getHTMLText(url):
    # pretend to be a normal browser via the User-Agent header
    headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()                 # raise an exception for non-2xx responses
        resp.encoding = resp.apparent_encoding  # guess the real encoding from the page content
        return resp.text
    except Exception as err:
        print(err)
        return ""  # return an empty string on failure instead of the exception object
           

3. Extracting image links and downloading the images

def craw(html):
    reg = r'src="(.*?)"'
    img_list = re.findall(reg, html)
    global count  # running image counter
    for imgurl in img_list:
        print(count, imgurl)
        # download the image with requests
        try:
            if count > 140:  # stop once the required number of images has been reached
                return 0
            response = requests.get(imgurl)
            file_path = 'D:/PyCharm/InternetWorm/weather/weather/img/' + '第' + str(count) + '张图片' + '.jpg'
            with open(file_path, 'wb') as f:  # image data is binary, so open the file in 'wb' mode
                f.write(response.content)
                print('success')
        except Exception as err:
            print(err)
        count += 1
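
The main routine that ties getHTMLText() and craw() together is not shown above; a minimal sketch follows (the start URL and the link regex from section 1.1 are reused, the driver loop itself is my assumption):

import re
import requests  # already required by getHTMLText() and craw() above

count = 1  # global image counter read and updated by craw()

if __name__ == '__main__':
    html = getHTMLText('http://www.weather.com.cn/')
    craw(html)                                        # images on the home page itself
    link = re.findall('a href="(http://.*?)"', html)  # sub-page links (section 1.1)
    for sub_url in link:
        if count > 140:                               # same limit as inside craw()
            break
        craw(getHTMLText(sub_url))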
           

4. Results

  • Console output
    (screenshot)
  • Local folder
    (screenshot)

(2) Multi-threaded crawling

Gitee link: 作业3_1_2

1. Parsing the page

  • Page parsing is the same as in the single-threaded version.

2. Main function

# main
threads = []
imageSpider(page,link)
for t in threads:
    t.join()
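
The main block relies on headers, count, page and link having been prepared during page parsing; a minimal setup sketch is given below (the start URL and the choice of five sub-pages are my assumptions, the variable names come from imageSpider() below):

# sketch: module-level setup assumed by the main block and by imageSpider() below
import re
import threading  # used by imageSpider() to start download threads
import time       # used for timing inside imageSpider()

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
count = 0  # number of images handed to download threads so far

resp = requests.get('http://www.weather.com.cn/', headers=headers, timeout=30)
resp.raise_for_status()
resp.encoding = resp.apparent_encoding
link = re.findall('a href="(http://.*?)"', resp.text)  # sub-page links to crawl
page = min(5, len(link))                               # number of sub-pages to visit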
           

3. Fetching image information

imageSpider(page,link)

def imageSpider(page,link_list):
    global threads
    global count
    for i in range(page):
        try:
            start = time.perf_counter()
            urls = []
            url = link_list[i]
            # print(url)
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            resp.encoding = resp.apparent_encoding
            print(resp.text)
            reg = r'src="(.*?)"'
            img_list = re.findall(reg, resp.text)
            # print(img_list)
            for imgurl in img_list:
                try:
                    if count >= 140:
                        end = time.perf_counter()
                        print('finished in', end - start, 's')
                        return 0
                    elif imgurl not in urls:
                        print(imgurl)
                        count += 1
                        # start a download thread for this image
                        T = threading.Thread(target=download, args=(imgurl,count))
                        T.setDaemon(False)
                        T.start()
                        threads.append(T)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)
           

4. Downloading images locally

download(url, count)

def download(url, count):
    try:
        response = requests.get(url)
        file_path = 'D:/PyCharm/InternetWorm/weather/weather/img_thread/' + '第' + str(count) + '张图片' + '.jpg'
        with open(file_path, 'wb') as f:  # image data is binary, so write in 'wb' mode
            f.write(response.content)
            print('success')
        print("downloaded " + str(count) + '.jpg')
    except Exception as err:
        print(err)
           

5. Results

  • (screenshot)

(3) Reflections

  • Unlike previous assignments, page navigation here is not driven by specific parameter values; instead, the crawled links themselves are followed.
  • Became proficient in using regular expressions.

II. Assignment ②

  • Requirement: reproduce Assignment ① using the scrapy framework.
  • Output: same as Assignment ①.
Gitee link: 作业3_2

1. Create a scrapy project

scrapy startproject weather
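
The command generates the standard Scrapy project skeleton; the files edited in the following steps live inside it (MySpider.py is not generated automatically, it is added by hand in step 5):

weather/
    scrapy.cfg               # deployment/configuration entry point
    weather/
        __init__.py
        items.py             # item definition (step 3)
        middlewares.py
        pipelines.py         # item pipeline (step 4)
        settings.py          # project settings (step 2)
        spiders/
            __init__.py
            MySpider.py      # the spider written in step 5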

2. Edit settings.py

BOT_NAME = 'weather'
SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'
ITEM_PIPELINES = {'weather.pipelines.WeatherPipeline': 300,}
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
           

3. Define the item class in items.py

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    imgurl = scrapy.Field()
           

4. Write the item pipeline class in pipelines.py

import sqlite3


class WeatherPipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("img.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table img (wId varchar(4),"
                                    "wimgUrl varchar(128),"
                                    "constraint pk_movies primary key (wId,"
                                    "wimgUrl));")
            except:
                # the table already exists; clear rows left over from previous runs
                self.cursor.execute("delete from img")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count-1, "项信息")

    def process_item(self, item, spider):
        try:
            print(item["imgurl"])
            if self.opened:
                self.cursor.execute("insert into img (wId,wimgUrl) "
                                    "values(?,?)",
                                    (self.count,item['imgurl']))
                self.count += 1
        except Exception as err:
            print(err)
        return item
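
To confirm that the pipeline actually stored the records, img.db can be queried directly; a minimal check (table and column names taken from the create statement above) could look like this:

import sqlite3

con = sqlite3.connect("img.db")
cursor = con.cursor()
cursor.execute("select count(*) from img")  # total number of stored image URLs
print("rows:", cursor.fetchone()[0])
for wId, wimgUrl in cursor.execute("select wId, wimgUrl from img limit 5"):
    print(wId, wimgUrl)                     # preview the first few records
con.close()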
           

5. Write the Scrapy spider MySpider.py

import re
import urllib.request

import scrapy
from bs4 import UnicodeDammit
from weather.items import WeatherItem


class MySpider(scrapy.Spider):
    # subclass of scrapy.Spider
    name = "weather"
    source_url = "http://www.weather.com.cn/"
    page = 0
    count = 1

    def start_requests(self):
        url = MySpider.source_url
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        try:
            try:
                dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
                data = dammit.unicode_markup
                # print(data)
            except Exception as err:
                print(err)
            imgurl = re.findall('src="(.*?)"',data)
            MySpider.page += 1
            print("MySpider.page:", MySpider.page)
            for url in imgurl:
                if url.endswith('.jpg') or url.endswith('.JPG') or \
                        url.endswith('.png') or url.endswith('.PNG')or \
                        url.endswith('.gif') or url.endswith('.GIF'):
                    item = WeatherItem()
                    item['imgurl'] = url
                else:
                    continue
                yield item
                try:
                    if MySpider.count > 140:
                        return 0
                    imagename = 'D:/PyCharm/InternetWorm/weather/weather/images/'+ '第' + str(MySpider.count) + '张图片' + '.jpg'
                    urllib.request.urlretrieve(str(url), filename=imagename)
                    print('success')
                    MySpider.count += 1
                except Exception as err:
                    print(err)

            link = re.findall('a href="(http://.*?)"', data)
            for i in range(5):
                link_ = link[i]
                url = response.urljoin(link_)
                yield scrapy.Request(url=url, callback=self.parse)

        except Exception as err:
            print(err)
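
The spider is started from the project root; since name = "weather" above, either the scrapy crawl weather command or a small launcher script (run.py is a file name I am assuming here) can be used:

# run.py -- launch the "weather" spider from a script, equivalent to `scrapy crawl weather`
from scrapy import cmdline

cmdline.execute("scrapy crawl weather".split())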
           

6. Results

  • (screenshot)
  • Database screenshot
    (screenshot)
  • (screenshot)

7. Reflections

  • Image URLs with non-image suffixes such as .js are filtered out, so only .jpg/.png/.gif links are kept.
  • Gradually became familiar with the scrapy framework and with database operations.

III. Assignment ③

  • Requirement: crawl Douban movie data using scrapy and xpath, store the content in a database, and save the cover images under the imgs directory.
  • Candidate site: https://movie.douban.com/top250
  • Output:
No. | Title | Director | Actors | Quote | Rating | Cover
1 | 肖申克的救赎 | 弗兰克·德拉邦特 | 蒂姆·罗宾斯 | 希望让人自由 | 9.7 | ./imgs/xsk.jpg
2 | ...

1.2 Page structure

2. Write items.py

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    director = scrapy.Field()
    actor = scrapy.Field()
    profile = scrapy.Field()
    score = scrapy.Field()
    imgurl = scrapy.Field()
           

3. Write pipelines.py

import sqlite3
import urllib.request


class MoviePipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table movies (mId varchar(4),"
                                    "mName varchar(256),mDirector varchar(64),"
                                    "mActor varchar(64),mProfile varchar(256),"
                                    "mScore varchar(8),mimgUrl varchar(128),"
                                    "constraint pk_movies primary key (mId,"
                                    "mName));")
            except:
                # the table already exists; clear rows left over from previous runs
                self.cursor.execute("delete from movies")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count, "项信息")

    def process_item(self, item, spider):
        try:
            print(item["id"])
            print(item["name"])
            print(item["director"])
            print(item["actor"])
            print(item["profile"])
            print(item["score"])
            print(item["imgurl"])
            print()
            if self.opened:
                self.cursor.execute("insert into movies (mId,mName,mDirector,"
                                    "mActor,mProfile,mScore,mimgUrl) "
                                    "values(?,?,?,?,?,?,?)",
                                    (item['id'], item['name'],
                                     item['director'],item['actor'],
                                     item['profile'],item['score'],
                                     item['imgurl'],))
                self.count += 1
        except Exception as err:
            print(err)
        try:
            url = item["imgurl"]
            imagename = 'D:/PyCharm/InternetWorm/movie/movie/imgs/' + '第' + str(self.count) + '张图片' + '.jpg'
            urllib.request.urlretrieve(str(url), filename=imagename)
            print('success')
        except Exception as err:
            print(err)
        return item
           

4. Write the spider MySpider.py

4.1 Extracting information with xpath

selector = scrapy.Selector(text=data)
movies = selector.xpath('//div[@class="info"]')
name = movies.xpath('div[@class="hd"]/a/span[position()=1]/text()').extract()
bd = movies.xpath('div[@class="bd"]/p/text()').extract()
director = re.findall('导演: (.*?) ',str(bd))
actor = re.findall('主演: (.*?) ',str(bd))
profile = movies.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
score = movies.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
img = selector.xpath('//div[@class="item"]')
id = img.xpath('div/em/text()').extract()
imgurl = img.xpath('div/a/img/@src').extract()
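
The loop that packs these parallel lists into items is not shown; one possible sketch inside parse() (the per-index pairing and the length guards are my assumptions) is:

# sketch: build one MovieItem per film from the parallel lists extracted above (inside parse())
for i in range(len(name)):
    item = MovieItem()
    item['id'] = id[i]
    item['name'] = name[i]
    item['director'] = director[i] if i < len(director) else ''
    item['actor'] = actor[i] if i < len(actor) else ''
    item['profile'] = profile[i] if i < len(profile) else ''  # a few films have no quote line
    item['score'] = score[i]
    item['imgurl'] = imgurl[i]
    yield item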
           

4.2 start_requests(self)

def start_requests(self):
    while MySpider.page < 5:
        MySpider.page += 1
        print("MySpider.page:", MySpider.page)
        url = MySpider.source_url + '?start=' + str((MySpider.page - 1) * 25)
        yield scrapy.Request(url=url, callback=self.parse)
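
Assuming source_url is the candidate site above, this yields the five Top-250 list pages ?start=0, ?start=25, ?start=50, ?start=75 and ?start=100, i.e. 25 × 5 = 125 entries in total.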
           

5. Results

  • (screenshot)
  • (screenshot)
  • (screenshot)

6. Reflections

  • With page = 5, 125 records should have been fetched, but the console reported only 122; this discrepancy has not been resolved yet.
  • Became familiar with extracting information using xpath.