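Enter the keyword 复仇者 (Avengers) and scrape the information for every movie on the returned result pages.
The task here is to scrape the movie information from three pages of results.
The fields are:
'movie_id': movie_id, the movie's id
'ranking': ranking, the ranking of a movie that has not been released yet (a record has either this or 'rank')
'rank': rank, the ranking of a movie that has already been released
'AttitudeCount': attitudeCount, the number of people who want to see it
'Usercount': usercount, the number of people who rated it
'movietitle': movietitle, the movie's title
'isrelease': isrelease, whether it has been released (true or false)
'endDate': endDate, the date the theatrical run ends (movies not yet released do not have this field)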
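Analysis up front:
① Task: enter an arbitrary keyword and scrape the information for every movie in the results.
② Task analysis: movies fall into three kinds: not yet released, now showing, and already released. The detail-page responses may therefore come in different formats, which have to be told apart.
③ Concrete steps:
Preliminary step: find one movie of each kind and compare the three pages, in preparation for what follows.
Step 1: enter the keyword.
Step 2: analyze the page we land on and use the packet-capture tool to find the request that holds the links to the movie detail pages.
Step 3: parse the page and scrape the link to each detail page.
Step 4: scrape the detail pages, using the preparation from the preliminary step.
Step 5: store the results in MongoDB.
④ Functions to write:
1. parse_movie(html)  # scrape each movie's information (three cases to handle: released, now showing, and not yet released; the page structure differs)
4. get_movie_index(url)  # get the response of one movie list page
5. parse_index(url)  # parse the page and extract its links
6. index  # loop over the pages; there are three index pages to scrape
⑤ Store to MongoDB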
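Tools I used
Browser: Chrome
Packet-capture tool: Fiddler
Now for the actual scraping walkthrough.
1. Open the home page, enter '复仇者', and let it jump to the first index page. Look in Fiddler, find the request for the first page, and inspect it.
The response wraps a JSON string, so we first pull that JSON string out with a regular expression: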
pattern=re.compile('var result.*?({.*?});',re.S)
json_string=re.search(pattern,html)
json_html=json.loads(json_string.group(1))
json_html is now a Python dict, so we can read whatever we need from it with the dict's get method.
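To make the extraction step concrete, here is a minimal sketch run against a made-up response body. The `sample` string and its fields are assumptions for illustration, not a real Mtime response.

```python
import json
import re

# Hypothetical body, shaped like the "var result = {...};" wrapper described above.
sample = 'var result = { "value": { "movieTitle": "Example", "isRelease": true } };var x = 1;'

pattern = re.compile(r'var result.*?({.*?});', re.S)
match = re.search(pattern, sample)

# group(1) is only the {...} capture; group() would also contain
# the "var result = " prefix, which json.loads cannot parse.
data = json.loads(match.group(1))
title = data.get('value', {}).get('movieTitle')
print(title)
```

The lazy `.*?` inside the capture stops at the first `}` that is directly followed by `;`, so nested objects are still captured whole as long as `};` only occurs at the end of the payload.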
# Fetch the raw response text of one index page
def one_page_index(url):
    headers = {
        'Host': 'service.library.mtime.com',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'Accept': '*/*',
        'Referer': 'http://movie.mtime.com/217497/'
    }  # request headers
    try:
        html = requests.get(url, headers=headers, verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
        return None
# Parse the text fetched above and extract what we need (movieUrl and movieId)
def parse_index(url):
    b = []
    html = one_page_index(url)
    if not html:
        print('no html')
        return None
    pattern = re.compile('var result.*?({.*?});', re.S)
    json_string = re.search(pattern, html)
    if not json_string:
        print('no json_string')
        return None
    # group(1) is the {...} capture; group() would include "var result" and fail to parse
    json_html = json.loads(json_string.group(1))
    if json_html and 'value' in json_html:
        value = json_html.get('value')
    else:
        print('no json_html')
        return None
    if value and 'movieResult' in value:
        movieResult = value.get('movieResult')
    else:
        print('no value')
        return None
    if movieResult and 'moreMovies' in movieResult:
        moreMovies = movieResult.get('moreMovies')
    else:
        print('no movieResult')
        return None
    for i in moreMovies:
        movieurl = i.get('movieUrl')
        movieid = i.get('movieId')
        a = [movieurl, movieid]
        b.append(a)
    return b
The code above uses some exception handling and conditional checks; I think all of them are necessary.
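The repeated key checks could also be folded into one small helper. The following is only a sketch: `dig` is a hypothetical helper name, and `payload` is a made-up dict that mirrors the value → movieResult → moreMovies nesting rather than real response data.

```python
def dig(data, *keys):
    """Follow keys through nested dicts; return None as soon as one is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data

# Made-up payload with the same nesting as the index response.
payload = {'value': {'movieResult': {'moreMovies': [
    {'movieUrl': 'http://movie.mtime.com/218090/', 'movieId': 218090},
]}}}

movies = dig(payload, 'value', 'movieResult', 'moreMovies') or []
pairs = [[m.get('movieUrl'), m.get('movieId')] for m in movies]
print(pairs)
```

The trade-off is that a single helper no longer reports which level was missing, which the separate print calls do.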
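With the code above we have what we need from the index page. Next we fetch, from each movie's detail page, the information we ultimately want to scrape.
Click any movie to open its detail page.
A long stretch of code follows. The parse_movie function handles the three page structures (not yet released, now showing, and already released).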
def get_movie(url):
    headers = {
        'Host': 'service.library.mtime.com',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'Accept': '*/*',
        'Referer': 'http://movie.mtime.com/217497/'
    }
    try:
        html = requests.get(url, headers=headers, verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
        return None
def parse_movie(url):
    # Locals rather than globals, so values from one movie cannot leak into the next
    ranking = None
    endDate = None
    rank = None
    html = get_movie(url)
    if not html:
        print('no html')
        return None
    pattern = re.compile('var result.*?({.*?});', re.S)
    json_string = re.search(pattern, html)
    if not json_string:
        print('no json_string')
        return None
    json_html = json.loads(json_string.group(1))  # parse only the {...} capture
    if json_html and 'value' in json_html:
        value = json_html.get('value')
    else:
        print('no value')
        return None
    # The code below pulls the fields we need out of the dict; it feels a bit complicated and verbose.
    if value and 'boxOffice' in value:
        boxoffice = value.get('boxOffice') or {}
        endDate = boxoffice.get('EndDate')
        rank = boxoffice.get('Rank')
    elif value and 'hotValue' in value:
        hotvalue = value.get('hotValue') or {}
        ranking = hotvalue.get('Ranking')
    if value and 'isRelease' in value:
        isrelease = value.get('isRelease')
    else:
        print('no isrelease')
        return None
    if value and 'movieRating' in value:
        movierating = value.get('movieRating') or {}
        attitudeCount = movierating.get('AttitudeCount')
        movie_id = movierating.get('MovieId')
        usercount = movierating.get('Usercount')
    else:
        print('no movierating')
        return None
    if value and 'movieTitle' in value:
        movietitle = value.get('movieTitle')
    else:
        print('no movietitle')
        return None
    # Fields shared by all three cases
    movie_massage = {
        'movie_id': movie_id,
        'AttitudeCount': attitudeCount,
        'Usercount': usercount,
        'movietitle': movietitle,
        'isrelease': isrelease
    }
    if endDate:
        # Released movies have endDate (and rank) but no ranking
        movie_massage['endDate'] = endDate
        movie_massage['rank'] = rank
    elif ranking:
        # Upcoming movies have ranking but no endDate
        movie_massage['ranking'] = ranking
    # Otherwise: some released movies have neither endDate nor ranking,
    # so only the shared fields are kept
    return movie_massage
It is fairly complicated because several cases have to be handled.
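One way to see the three cases side by side is to build the shared fields first and then add the case-specific ones. This is a compact sketch, not the code used in this post: `build_record` is a hypothetical name, the sample dicts are invented, and unlike the function above it also looks at hotValue when boxOffice is present but has no EndDate.

```python
def build_record(value):
    """Build one record from the 'value' dict of a detail response."""
    rating = value.get('movieRating') or {}
    record = {
        'movie_id': rating.get('MovieId'),
        'AttitudeCount': rating.get('AttitudeCount'),
        'Usercount': rating.get('Usercount'),
        'movietitle': value.get('movieTitle'),
        'isrelease': value.get('isRelease'),
    }
    box_office = value.get('boxOffice') or {}
    hot_value = value.get('hotValue') or {}
    if box_office.get('EndDate'):
        # Released: has an end date and a box-office rank
        record['endDate'] = box_office['EndDate']
        record['rank'] = box_office.get('Rank')
    elif hot_value.get('Ranking'):
        # Not yet released: has a popularity ranking instead
        record['ranking'] = hot_value['Ranking']
    return record

# Invented sample data for the three cases.
common = {'movieTitle': 'A', 'movieRating': {'MovieId': 1, 'AttitudeCount': 10, 'Usercount': 5}}
released = dict(common, isRelease=True, boxOffice={'EndDate': '2018-05-01', 'Rank': 3})
upcoming = dict(common, isRelease=False, hotValue={'Ranking': 7})
plain = dict(common, isRelease=True)

for sample in (released, upcoming, plain):
    print(sorted(build_record(sample)))
```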
After the steps above we can scrape one index page and one detail page. What we need, though, is several index pages and many detail pages, so we have to analyze the AJAX request URLs, find what they share and where they differ, and build a general URL pattern from that.
First, the index page URLs:
http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fsearch.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D0&t=2018412183554830&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85&Ajax_CallBackArgument1=0&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=1
http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fsearch.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D2%26i%3D0%26c%3D295&t=20184121972310957&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=2
http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fsearch.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D3%26i%3D0%26c%3D295&t=20184121975663865&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=3
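Comparing the URLs above, the two top-level parameters that change on every request are t and Ajax_CallBackArgument4 (the page number); t is the current time, so the rest is easy to handle. The page number also appears as p inside the encoded Ajax_RequestUrl, and the first-page capture has Ajax_CallBackArgument1=0 rather than 1; the code below keeps both fixed at the values captured for page 2.
Straight to the code: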
def make_index_url(x):
    one = 'http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&'
    two = 'Ajax_RequestUrl=http%3A%2F%2F'
    three = 'search.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D2%26i%3D0%26c%3D295'
    # four builds the t parameter from the current time
    four = r'&t=%s' % d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    five = r'&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85'
    six = r'&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=%s' % x
    # The URL is built by plain string concatenation; crude, but it works
    url = one + two + three + four + five + six
    return url
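For comparison, the same URL can be assembled with urllib.parse.urlencode, so the keyword does not have to be percent-encoded by hand. This is a sketch under assumptions: `make_index_url_encoded` is a hypothetical name, the constants (295, the fixed "3282" suffix) are simply copied from the captures above, and, unlike the function above, it lets the p value inside Ajax_RequestUrl follow the page number. Whether the server cares about that value is not something the captures tell us.

```python
import datetime
from urllib.parse import urlencode, parse_qs, urlsplit

def make_index_url_encoded(keyword, page, now=None):
    """Build the search API URL for one result page (hypothetical variant)."""
    now = now or datetime.datetime.now()
    # The page the browser was on when it issued the AJAX call
    request_url = 'http://search.mtime.com/search/?' + urlencode(
        {'q': keyword, 't': 1, 'p': page, 'i': 0, 'c': 295})
    params = [
        ('Ajax_CallBack', 'true'),
        ('Ajax_CallBackType', 'Mtime.Channel.Services'),
        ('Ajax_CallBackMethod', 'GetSearchResult'),
        ('Ajax_CrossDomain', '1'),
        ('Ajax_RequestUrl', request_url),  # encoded a second time by urlencode
        ('t', now.strftime('%Y%m%d%H%M%S3282')),
        ('Ajax_CallBackArgument0', keyword),
        ('Ajax_CallBackArgument1', 1),
        ('Ajax_CallBackArgument2', 295),
        ('Ajax_CallBackArgument3', 0),
        ('Ajax_CallBackArgument4', page),
    ]
    return 'http://service.channel.mtime.com/Search.api?' + urlencode(params)

url = make_index_url_encoded('复仇者', 2, datetime.datetime(2018, 4, 12, 19, 7, 23))
print(url)
```

Encoding the inner URL twice reproduces the %25E5... sequences seen in the captured Ajax_RequestUrl.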
Next comes building the detail-page URL.
http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F218090%2F&t=20184121851354464&Ajax_CallBackArgument0=218090
http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F22411%2F&t=20184121914842461&Ajax_CallBackArgument0=22411
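Only three parameters change for the detail page:
Ajax_RequestUrl: this is simply the movie's own link
t: the time, same as above
Ajax_CallBackArgument0: this is the movieId, which we already scraped earlier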
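The comparison can also be done mechanically instead of by eye. A small sketch, using the two detail URLs captured above; `changed_params` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlsplit

def changed_params(url_a, url_b):
    """Return the names of query parameters whose values differ between two URLs."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# The shared prefix of the two captured detail URLs
base = ('http://service.library.mtime.com/Movie.api?Ajax_CallBack=true'
        '&Ajax_CallBackType=Mtime.Library.Services'
        '&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1')
url_a = base + ('&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F218090%2F'
                '&t=20184121851354464&Ajax_CallBackArgument0=218090')
url_b = base + ('&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F22411%2F'
                '&t=20184121914842461&Ajax_CallBackArgument0=22411')

print(changed_params(url_a, url_b))
```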
Again, straight to the code:
def make_detail_url(x):
    # x is one [movieurl, movieid] pair returned by parse_index
    one = 'http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&'
    two = 'Ajax_RequestUrl=%s' % x[0]
    three = '&t=%s' % d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    four = '&Ajax_CallBackArgument0=%s' % x[1]
    url = one + two + three + four
    return url
At this point the project is basically done; what remains is some wrap-up work:
iterating over the pages and storing the results in MongoDB.
if __name__ == '__main__':
    massage_list = []
    for f in range(1, 4):  # the three index pages: 1, 2, 3
        url = make_index_url(f)
        b = parse_index(url)
        if not b:
            continue
        for i in b:
            url = make_detail_url(i)
            get_massage = parse_movie(url)
            if isinstance(get_massage, dict):
                massage_list.append(get_massage)
    if massage_list:  # insert_many raises on an empty list
        collection.insert_many(massage_list)
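The loop can be tried without the network or a database by passing the three steps in as functions. In this sketch `crawl` is a hypothetical name and the `fake_*` functions are stand-ins invented for the demonstration; in real use they would be replaced by the functions above, with the page number first turned into an index URL.

```python
def crawl(pages, parse_index, make_detail_url, parse_movie):
    """Collect one record per movie across all index pages."""
    records = []
    for page in pages:
        # parse_index may return None when a page fails; treat that as empty
        for pair in parse_index(page) or []:
            record = parse_movie(make_detail_url(pair))
            if isinstance(record, dict):
                records.append(record)
    return records

# Stand-ins so the sketch runs offline; page 2 simulates a failed index request.
def fake_parse_index(page):
    return None if page == 2 else [['http://movie.mtime.com/%d/' % page, page]]

def fake_make_detail_url(pair):
    return 'detail:%s' % pair[1]

def fake_parse_movie(url):
    return {'movie_id': int(url.split(':')[1])}

records = crawl(range(1, 4), fake_parse_index, fake_make_detail_url, fake_parse_movie)
print(records)
```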
The complete code
# coding: utf-8
import requests
import json, re
import datetime as d
import pymongo

client = pymongo.MongoClient()
db = client.pythonSpider
collection = db.maoyan
def one_page_index(url):
    headers = {
        'Host': 'service.library.mtime.com',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'Accept': '*/*',
        'Referer': 'http://movie.mtime.com/217497/'
    }
    try:
        html = requests.get(url, headers=headers, verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
        return None
def parse_index(url):
    b = []
    html = one_page_index(url)
    if not html:
        print('no html')
        return None
    pattern = re.compile('var result.*?({.*?});', re.S)
    json_string = re.search(pattern, html)
    if not json_string:
        print('no json_string')
        return None
    # group(1) is the {...} capture; group() would include "var result" and fail to parse
    json_html = json.loads(json_string.group(1))
    if json_html and 'value' in json_html:
        value = json_html.get('value')
    else:
        print('no json_html')
        return None
    if value and 'movieResult' in value:
        movieResult = value.get('movieResult')
    else:
        print('no value')
        return None
    if movieResult and 'moreMovies' in movieResult:
        moreMovies = movieResult.get('moreMovies')
    else:
        print('no movieResult')
        return None
    for i in moreMovies:
        movieurl = i.get('movieUrl')
        movieid = i.get('movieId')
        a = [movieurl, movieid]
        b.append(a)
    return b
def get_movie(url):
    headers = {
        'Host': 'service.library.mtime.com',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'Accept': '*/*',
        'Referer': 'http://movie.mtime.com/217497/'
    }
    try:
        html = requests.get(url, headers=headers, verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
        return None
def parse_movie(url):
    # Locals rather than globals, so values from one movie cannot leak into the next
    ranking = None
    endDate = None
    rank = None
    html = get_movie(url)
    if not html:
        print('no html')
        return None
    pattern = re.compile('var result.*?({.*?});', re.S)
    json_string = re.search(pattern, html)
    if not json_string:
        print('no json_string')
        return None
    json_html = json.loads(json_string.group(1))  # parse only the {...} capture
    if json_html and 'value' in json_html:
        value = json_html.get('value')
    else:
        print('no value')
        return None
    if value and 'boxOffice' in value:
        boxoffice = value.get('boxOffice') or {}
        endDate = boxoffice.get('EndDate')
        rank = boxoffice.get('Rank')
    elif value and 'hotValue' in value:
        hotvalue = value.get('hotValue') or {}
        ranking = hotvalue.get('Ranking')
    if value and 'isRelease' in value:
        isrelease = value.get('isRelease')
    else:
        print('no isrelease')
        return None
    if value and 'movieRating' in value:
        movierating = value.get('movieRating') or {}
        attitudeCount = movierating.get('AttitudeCount')
        movie_id = movierating.get('MovieId')
        usercount = movierating.get('Usercount')
    else:
        print('no movierating')
        return None
    if value and 'movieTitle' in value:
        movietitle = value.get('movieTitle')
    else:
        print('no movietitle')
        return None
    # Fields shared by all three cases
    movie_massage = {
        'movie_id': movie_id,
        'AttitudeCount': attitudeCount,
        'Usercount': usercount,
        'movietitle': movietitle,
        'isrelease': isrelease
    }
    if endDate:
        # Released movies have endDate (and rank) but no ranking
        movie_massage['endDate'] = endDate
        movie_massage['rank'] = rank
    elif ranking:
        # Upcoming movies have ranking but no endDate
        movie_massage['ranking'] = ranking
    return movie_massage
def make_detail_url(x):
    # x is one [movieurl, movieid] pair returned by parse_index
    one = 'http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&'
    two = 'Ajax_RequestUrl=%s' % x[0]
    three = '&t=%s' % d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    four = '&Ajax_CallBackArgument0=%s' % x[1]
    url = one + two + three + four
    return url
def make_index_url(x):
    one = 'http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&'
    two = 'Ajax_RequestUrl=http%3A%2F%2F'
    three = 'search.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D2%26i%3D0%26c%3D295'
    four = r'&t=%s' % d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    five = r'&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85'
    six = r'&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=%s' % x
    url = one + two + three + four + five + six
    return url
if __name__ == '__main__':
    massage_list = []
    for f in range(1, 4):  # the three index pages: 1, 2, 3
        url = make_index_url(f)
        b = parse_index(url)
        if not b:
            continue
        for i in b:
            url = make_detail_url(i)
            get_massage = parse_movie(url)
            if isinstance(get_massage, dict):
                massage_list.append(get_massage)
    if massage_list:  # insert_many raises on an empty list
        collection.insert_many(massage_list)
Finally, feel free to discuss this with me. Leave a comment!