I. Assignment ①
- Requirement: pick a website (for example, the China Weather Network) and crawl all of its images, using a single-threaded and a multi-threaded approach respectively. (The number of images crawled is limited to the last three digits of the student ID.)
- Output: print the URL of each downloaded image to the console, save the downloaded images in an images subfolder, and provide screenshots.
(1) Single-threaded crawling
Gitee link: 作业3_1_1
1. Parsing the page
1.1 Page navigation
- Select a few headlines on the homepage and find the links they point to (screenshot omitted), then construct a regular expression to capture those links:

```python
link = re.findall('a href="(http://.*?)"', resp.text)
```
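To see what this pattern captures, here is a quick check on a made-up anchor tag (the URL below is hypothetical, not taken from the site):

```python
import re

sample = '<a href="http://www.weather.com.cn/weather/101280601.shtml">城市</a>'
print(re.findall('a href="(http://.*?)"', sample))
# ['http://www.weather.com.cn/weather/101280601.shtml']
```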
1.2 Image links
- Extract the image URLs from the page source with a second regular expression:

```python
imgurl = re.findall('src="(.*?)"', data)
```
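Note that `src="(.*?)"` matches every `src` attribute, scripts included; a small helper that keeps only image-looking URLs (a sketch of the same filtering idea the Scrapy version applies later):

```python
import re

def image_urls(html):
    # drop .js and other non-image src values, comparing case-insensitively
    return [u for u in re.findall(r'src="(.*?)"', html)
            if u.lower().endswith(('.jpg', '.png', '.gif'))]
```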
2. Fetching the page source: getHTMLText(url)
```python
import requests

def getHTMLText(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()                  # raise on HTTP errors
        resp.encoding = resp.apparent_encoding   # guess the page's real encoding
        return resp.text
    except Exception as err:
        print(err)
        return ''   # return an empty string instead of the exception object, so callers always get text
```
3. Extracting image links and downloading them locally
```python
import re
import requests

def craw(html):
    reg = r'src="(.*?)"'
    img_list = re.findall(reg, html)
    global count   # running image counter
    for imgurl in img_list:
        print(count, imgurl)
        # download the image with requests
        try:
            if count > 140:   # 140 = last three digits of the student ID
                return
            response = requests.get(imgurl)
            file_path = 'D:/PyCharm/InternetWorm/weather/weather/img/' + '第' + str(count) + '张图片' + '.jpg'
            with open(file_path, 'wb') as f:   # image data is binary, so write in 'wb' mode
                f.write(response.content)
            print('success')
        except Exception as err:
            print(err)
        count += 1
```
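The report does not show the driver that ties getHTMLText and craw together; below is a minimal sketch of one, assuming a global `count` starting at 1 (this structure is an assumption, not the author's exact code):

```python
import re

count = 1  # shared image counter used by craw()

if __name__ == '__main__':
    html = getHTMLText('http://www.weather.com.cn/')
    craw(html)  # images on the homepage itself
    for url in re.findall('a href="(http://.*?)"', html):
        if count > 140:
            break
        craw(getHTMLText(url))  # images on each linked page
```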
4. Results
- Console output (screenshot omitted)
- Local folder (screenshot omitted)
(2) Multi-threaded crawling
Gitee link: 作业3_1_2
1. Page parsing
- Same as in the single-threaded crawl.
2. Main routine

```python
# main — `page` and `link` come from parsing the homepage with the same
# regex as the single-threaded crawl (that setup is not shown in the report)
threads = []
imageSpider(page, link)
for t in threads:   # wait for all download threads to finish
    t.join()
```
3. Extracting image info: imageSpider(page, link)
```python
import re
import threading
import time
import requests

def imageSpider(page, link_list):
    global threads
    global count
    for i in range(page):
        try:
            start = time.perf_counter()
            urls = []   # image URLs already seen on this page
            url = link_list[i]
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            resp.encoding = resp.apparent_encoding
            reg = r'src="(.*?)"'
            img_list = re.findall(reg, resp.text)
            for imgurl in img_list:
                try:
                    if count >= 140:   # 140 = last three digits of the student ID
                        end = time.perf_counter()
                        print('final is in ', end - start)
                        return
                    elif imgurl not in urls:
                        urls.append(imgurl)   # remember it so duplicates are skipped
                        print(imgurl)
                        count += 1
                        # start one non-daemon thread per image download
                        T = threading.Thread(target=download, args=(imgurl, count))
                        T.daemon = False
                        T.start()
                        threads.append(T)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)
```
4. Downloading images locally: download(url, count)
```python
def download(url, count):
    try:
        response = requests.get(url)
        file_path = 'D:/PyCharm/InternetWorm/weather/weather/img_thread/' + '第' + str(count) + '张图片' + '.jpg'
        with open(file_path, 'wb') as f:   # image data is binary, so write in 'wb' mode
            f.write(response.content)
        print('success')
        print("downloaded " + str(count) + '.jpg')
    except Exception as err:
        print(err)
```
5. Results
- Console output and local folder (screenshots omitted)
(3) Reflections
- Unlike previous assignments, page navigation here is not driven by stepping through specific parameter values, but by crawling the links themselves.
- Gained fluency with regular expressions.
II. Assignment ②
- Requirement: reproduce Assignment ① using the Scrapy framework.
- Output: same as Assignment ①.
Gitee link: 作业3_2
1. Create a Scrapy project

```
scrapy startproject weather
```
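For orientation, `startproject` generates Scrapy's standard scaffolding; the files edited below all live in the inner `weather` package:

```
weather/
    scrapy.cfg
    weather/
        __init__.py
        items.py        # item definitions (step 3)
        middlewares.py
        pipelines.py    # data handling (step 4)
        settings.py     # project settings (step 2)
        spiders/        # MySpider.py goes here (step 5)
```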
2. Edit settings.py

```python
BOT_NAME = 'weather'

SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'

# register the item pipeline; 300 is its priority (lower values run first)
ITEM_PIPELINES = {'weather.pipelines.WeatherPipeline': 300,}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```
3. Write the item class in items.py

```python
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()       # sequence number
    imgurl = scrapy.Field()   # image URL
```
4. Write the data-handling class in pipelines.py

```python
import sqlite3

class WeatherPipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("img.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table img (wId varchar(4),"
                                    "wimgUrl varchar(128),"
                                    "constraint pk_img primary key (wId,"
                                    "wimgUrl));")
            except Exception:
                # table already exists: clear it instead
                self.cursor.execute("delete from img")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count - 1, "项信息")   # "crawled N records in total"

    def process_item(self, item, spider):
        try:
            print(item["imgurl"])
            if self.opened:
                self.cursor.execute("insert into img (wId,wimgUrl) "
                                    "values(?,?)",
                                    (self.count, item['imgurl']))
                self.count += 1
        except Exception as err:
            print(err)
        return item
```
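A quick way to confirm what landed in img.db after a run (a minimal sketch; table and column names as created above):

```python
import sqlite3

con = sqlite3.connect("img.db")
for wId, wimgUrl in con.execute("select wId, wimgUrl from img limit 5"):
    print(wId, wimgUrl)
con.close()
```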
5. Write the Scrapy spider MySpider.py
```python
import re
import urllib.request

import scrapy
from bs4 import UnicodeDammit

from weather.items import WeatherItem

class MySpider(scrapy.Spider):
    # inherits from scrapy.Spider
    name = "weather"
    source_url = "http://www.weather.com.cn/"
    page = 0
    count = 1

    def start_requests(self):
        url = MySpider.source_url
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        try:
            try:
                # let UnicodeDammit work out the page encoding
                dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
                data = dammit.unicode_markup
                # print(data)
            except Exception as err:
                print(err)
            imgurl = re.findall('src="(.*?)"', data)
            MySpider.page += 1
            print("MySpider.page:", MySpider.page)
            for url in imgurl:
                # keep only real image files; this filters out .js and similar
                if url.endswith('.jpg') or url.endswith('.JPG') or \
                        url.endswith('.png') or url.endswith('.PNG') or \
                        url.endswith('.gif') or url.endswith('.GIF'):
                    item = WeatherItem()
                    item['imgurl'] = url
                else:
                    continue
                yield item
                try:
                    if MySpider.count > 140:   # 140 = last three digits of the student ID
                        return
                    imagename = 'D:/PyCharm/InternetWorm/weather/weather/images/' + '第' + str(MySpider.count) + '张图片' + '.jpg'
                    urllib.request.urlretrieve(str(url), filename=imagename)
                    print('success')
                    MySpider.count += 1
                except Exception as err:
                    print(err)
            # follow the first five outbound links and parse them the same way
            link = re.findall('a href="(http://.*?)"', data)
            for i in range(5):
                link_ = link[i]
                url = response.urljoin(link_)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)
```
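With `name = "weather"`, the spider is started from the project root in the usual way:

```
scrapy crawl weather
```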
6. Results
- Console output, database contents, and local folder (screenshots omitted)
7. Reflections
- Some matched imgurl values end in suffixes such as .js; these are filtered out so that only real images are downloaded.
- Became increasingly familiar with the Scrapy framework and with database operations.
III. Assignment ③
- Requirement: crawl Douban movie data using Scrapy and XPath, store the content in a database, and save the cover images under the imgs directory.
- Candidate site: https://movie.douban.com/top250
- Output:

| No. | Title | Director | Starring | Quote | Rating | Cover |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 肖申克的救赎 | 弗兰克·德拉邦特 | 蒂姆·罗宾斯 | 希望让人自由 | 9.7 | ./imgs/xsk.jpg |
| 2 | ... | | | | | |
1. Page structure (screenshot omitted)
2. Write items.py

```python
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()        # rank
    name = scrapy.Field()      # title
    director = scrapy.Field()
    actor = scrapy.Field()
    profile = scrapy.Field()   # one-line quote
    score = scrapy.Field()
    imgurl = scrapy.Field()    # cover image URL
```
3. Write pipelines.py
```python
import sqlite3
import urllib.request

class MoviePipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table movies (mId varchar(4),"
                                    "mName varchar(256),mDirector varchar(64),"
                                    "mActor varchar(64),mProfile varchar(256),"
                                    "mScore varchar(8),mimgUrl varchar(128),"
                                    "constraint pk_movies primary key (mId,"
                                    "mName));")
            except Exception:
                # table already exists: clear it instead
                self.cursor.execute("delete from movies")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count - 1, "项信息")   # count starts at 1, so subtract 1

    def process_item(self, item, spider):
        try:
            print(item["id"])
            print(item["name"])
            print(item["director"])
            print(item["actor"])
            print(item["profile"])
            print(item["score"])
            print(item["imgurl"])
            print()
            if self.opened:
                self.cursor.execute("insert into movies (mId,mName,mDirector,"
                                    "mActor,mProfile,mScore,mimgUrl) "
                                    "values(?,?,?,?,?,?,?)",
                                    (item['id'], item['name'],
                                     item['director'], item['actor'],
                                     item['profile'], item['score'],
                                     item['imgurl'],))
                self.count += 1
        except Exception as err:
            print(err)
        try:
            # also download the cover image locally
            url = item["imgurl"]
            imagename = 'D:/PyCharm/InternetWorm/movie/movie/imgs/' + '第' + str(self.count) + '张图片' + '.jpg'
            urllib.request.urlretrieve(str(url), filename=imagename)
            print('success')
        except Exception as err:
            print(err)
        return item
```
4. Write MySpider.py
4.1 Extracting information with XPath

```python
selector = scrapy.Selector(text=data)
# each movie's text block sits in a div.info
movies = selector.xpath('//div[@class="info"]')
name = movies.xpath('div[@class="hd"]/a/span[position()=1]/text()').extract()
bd = movies.xpath('div[@class="bd"]/p/text()').extract()
director = re.findall('导演: (.*?) ', str(bd))
actor = re.findall('主演: (.*?) ', str(bd))
profile = movies.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
score = movies.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
# rank and cover URL live in the enclosing div.item structure
img = selector.xpath('//div[@class="item"]')
id = img.xpath('div/em/text()').extract()
imgurl = img.xpath('div/a/img/@src').extract()
```
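The excerpt stops at the extracted lists; below is a sketch of how they might be zipped into MovieItem objects inside parse (this pairing is an assumption, not the author's exact code; note the lists can differ in length when a movie lacks a quote, which would misalign fields):

```python
for i in range(len(name)):
    item = MovieItem()
    item['id'] = id[i]
    item['name'] = name[i]
    # guard against shorter regex-derived lists
    item['director'] = director[i] if i < len(director) else ''
    item['actor'] = actor[i] if i < len(actor) else ''
    item['profile'] = profile[i] if i < len(profile) else ''
    item['score'] = score[i]
    item['imgurl'] = imgurl[i]
    yield item
```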
4.2 start_requests(self)

```python
def start_requests(self):
    # the Top 250 list is paginated 25 titles per page via the start parameter
    while MySpider.page < 5:
        MySpider.page += 1
        print("MySpider.page:", MySpider.page)
        url = MySpider.source_url + '?start=' + str((MySpider.page - 1) * 25)
        yield scrapy.Request(url=url, callback=self.parse)
```
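For reference, the five request URLs this loop produces (assuming source_url is the Top 250 address above):

```python
# offsets 0, 25, 50, 75, 100 → 5 pages × 25 movies = 125 entries
for page in range(1, 6):
    print('https://movie.douban.com/top250' + '?start=' + str((page - 1) * 25))
```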
5. Results
- Console output, database contents, and downloaded covers (screenshots omitted)
6. Reflections
- With page = 5 the spider should collect 125 records, but the console reported only 122; this discrepancy remains unresolved. (A plausible but unverified cause: a few Top 250 entries lack a quote line or a "主演:" field, so extraction misaligns for those movies and their inserts fail.)
- Became familiar with XPath-based information extraction.