Scraping Maoyan Movies: Enter Any Keyword and Scrape Information for All Results

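Enter the keyword 复仇者 (Avengers) and scrape the information for every movie on the returned result pages.

Our task is then to scrape the information for the movies on all three result pages.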

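The information includes:

'movie_id': movie_id, the movie's ID

'ranking': ranking, the ranking of a movie that has not been released yet (a movie has either ranking or rank)

'rank': rank, the ranking of a movie that has already been released

'AttitudeCount': attitudeCount, the number of people who want to see it

'Usercount': usercount, the number of people who rated it

'movietitle': movietitle, the movie's title

'isrelease': isrelease, whether it has been released (true or false)

'endDate': endDate, the date its run ends (movies not yet released do not have this field)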
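As a rough sketch, the record described by the list above could be written down as a TypedDict. The class name MovieRecord and the value types are my assumptions (the site may well return some of these as strings), and the example values are made up.

```python
from typing import Optional, TypedDict


class MovieRecord(TypedDict, total=False):
    """Hypothetical schema for one scraped movie; field names come from the list above."""
    movie_id: int
    ranking: Optional[int]   # only for movies not yet released
    rank: Optional[int]      # only for movies already released
    AttitudeCount: int
    Usercount: int
    movietitle: str
    isrelease: bool
    endDate: Optional[str]   # absent for movies not yet released


# Made-up values for a movie that has not been released yet.
example: MovieRecord = {
    "movie_id": 1,
    "ranking": 3,
    "AttitudeCount": 100,
    "Usercount": 0,
    "movietitle": "Example",
    "isrelease": False,
}
```

Because total=False is set, a record may leave out the fields that do not apply to it, which matches the "either ranking or rank" rule.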

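Preliminary analysis:

① Task: enter any keyword and scrape the information for every movie in the results.

② Task analysis: movies fall into three kinds: not yet released, now showing, and already released. The responses to the detail-page requests may therefore come in different formats, which need to be told apart.

③ Steps:

The first thing to do is find one movie of each kind and compare the three pages, in preparation for the later steps.

1. Enter the keyword.

2. Analyze the page you land on and use a packet-capture tool to find the link that holds the links to the movie detail pages.

3. Parse the page and scrape the link to each detail page.

4. Scrape the detail pages, using the preparation from the first step above.

5. Store the results in MongoDB.

④ Functions to write:

1. parse_movie(html)  # scrape the information for each movie (three cases to consider: already released, now showing, and not yet released; the page structures differ)

2. get_movie_index(url)  # get the response for one movie list page

3. parse_index(url)  # parse the page and extract the links on it

4. index  # loop over the pages; there are three index pages to scrape

⑤ Store to MongoDB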

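Tools I used

Browser: Chrome

Packet-capture tool: Fiddler

Now for the actual scraping process

1. Open the home page, enter '复仇者', and you land on the first index page. In Fiddler, find the link for the first page and click it to inspect the response.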


The response contains a JSON string, so first we use a regular expression to pull the JSON string out.

pattern=re.compile('var result.*?({.*?});',re.S)
json_string=re.search(pattern,html)
json_html=json.loads(json_string.group(1))

json_html is a Python dict, so we can use the dict's get method to pull out the information we want.
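Here is a small self-contained sketch of that extraction step, run on a made-up response body in the same "var result = {...};" shape (the real response has many more fields). It also shows why group(1) is needed rather than group().

```python
import json
import re

# Made-up response body; only the overall shape mirrors the real one.
sample = 'var result = { "value": { "movieTitle": "Example", "isRelease": true } };var x = 1;'

pattern = re.compile(r'var result.*?({.*?});', re.S)
match = pattern.search(sample)

# group(0) is the whole match, including the "var result = " prefix,
# so only group(1) (the braces and what is inside them) is valid JSON.
data = json.loads(match.group(1))
title = data.get('value', {}).get('movieTitle')
print(title)
```

One caveat: the lazy match stops at the first "};" it sees, so a string value that itself contains "};" would cut the JSON short.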

# Fetch the raw text of one index (search result) page
def one_page_index(url):
    # Request headers; Host has to match the host of the index API
    headers={
        'Host':'service.channel.mtime.com',
        'Connection':'keep-alive',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'Accept':'*/*',
        'Referer':'http://movie.mtime.com/217497/'
    }
    try:
        html=requests.get(url,headers=headers,verify=False)
        # A Response object is truthy only when the status code is below 400
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
        return None
# Parse the text fetched above and extract what we need (movieurl and movieid)
def parse_index(url):
    b=[]
    html=one_page_index(url)
    if not html:
        print('no html')
        return None
    pattern=re.compile('var result.*?({.*?});')
    json_string=re.search(pattern,html)
    if not json_string:
        print('no json_string')
        return None
    # group(1) is the JSON object only; group() would include the "var result" prefix
    json_html=json.loads(json_string.group(1))
    if json_html and 'value' in json_html:
        value=json_html.get('value')
    else:
        print('no json_html')
        return None
    if value and 'movieResult' in value:
        movieResult=value.get('movieResult')
    else:
        print('no value')
        return None
    if movieResult and 'moreMovies' in movieResult:
        moreMovies=movieResult.get('moreMovies')
    else:
        print('no movieResult')
        return None
    # Collect one [movieurl, movieid] pair per movie
    for i in moreMovies:
        movieurl=i.get('movieUrl')
        movieid=i.get('movieId')
        a=[movieurl,movieid]
        b.append(a)

    return b

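The code above uses exception handling and conditional checks. I think they are all necessary, because any level of the response may be missing.

With the code above we have what we need from the index pages. Next we get the information we ultimately want from each movie's detail page.

Click any movie to open its detail page.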
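If the repeated "if key in dict" checks feel heavy, one possible alternative is a small helper that walks the nested keys and falls back to a default. The helper name dig is hypothetical (it is not part of the code in this post), and the sample dict is made up to mirror the value -> movieResult -> moreMovies path.

```python
from functools import reduce


def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    def step(current, key):
        # Stop descending as soon as we hit something that is not a dict.
        return current.get(key) if isinstance(current, dict) else None

    found = reduce(step, keys, data)
    return default if found is None else found


# Made-up structure with the same nesting as the index response.
sample = {'value': {'movieResult': {'moreMovies': [{'movieId': 1, 'movieUrl': 'u'}]}}}
movies = dig(sample, 'value', 'movieResult', 'moreMovies', default=[])
missing = dig(sample, 'value', 'noSuchKey', 'moreMovies', default=[])
```

The trade-off is that you lose the per-level print messages, so it is harder to see which level was missing.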


Next comes a large chunk of code. The parse_movie function handles three page structures (not yet released, now showing, and already released).

# Fetch the raw text of one movie detail response
def get_movie(url):
    headers={
        'Host':'service.library.mtime.com',
        'Connection':'keep-alive',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
        'Accept':'*/*',
        'Referer':'http://movie.mtime.com/217497/'
    }
    try:
        html=requests.get(url,headers=headers,verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
        return None

def parse_movie(url):
    # Local defaults; fields that a page lacks stay None
    ranking=None
    endDate=None
    rank=None
    html=get_movie(url)
    if not html:
        print('no html')
        return None
    pattern=re.compile('var result.*?({.*?});',re.S)
    json_string=re.search(pattern,html)
    if not json_string:
        print('no json_string')
        return None
    # group(1) is the JSON object captured by the regex
    json_html=json.loads(json_string.group(1))
    if json_html and 'value' in json_html:
        value=json_html.get('value')
    else:
        print('no value')
        return None
    # Pull the fields we need out of the dict; this is a bit long-winded
    if value and 'boxOffice' in value:
        boxoffice=value.get('boxOffice')
        endDate=boxoffice.get('EndDate')
        rank=boxoffice.get('Rank')
    elif value and 'hotValue' in value:
        hotvalue=value.get('hotValue')
        ranking=hotvalue.get('Ranking')
    if value and 'isRelease' in value:
        isrelease=value.get('isRelease')
    else:
        print('no isrelease')
        return None
    if value and 'movieRating' in value:
        movierating=value.get('movieRating')
        attitudeCount=movierating.get('AttitudeCount')
        movie_id=movierating.get('MovieId')
        usercount=movierating.get('Usercount')
    else:
        print('no movierating')
        return None
    if value and 'movieTitle' in value:
        movietitle=value.get('movieTitle')
    else:
        print('no movietitle')
        return None
    # Fields shared by all three cases
    movie_massage={
        'movie_id':movie_id,
        'AttitudeCount':attitudeCount,
        'Usercount':usercount,
        'movietitle':movietitle,
        'isrelease':isrelease
    }
    if endDate:
        # Released movies have endDate and rank but no ranking
        movie_massage['endDate']=endDate
        movie_massage['rank']=rank
    elif ranking:
        # Movies not yet released have ranking but no endDate
        movie_massage['ranking']=ranking
    # Some released movies have neither, so only the shared fields remain
    return movie_massage

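It is fairly complex because several cases have to be handled.

After the steps above we can scrape one index page and one detail page. What we need, though, is many index pages and many detail pages, so we have to analyze the Ajax request links, find what they share and where they differ, and build a general URL pattern.

First, the index page links: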
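One way to tame the case handling, sketched here under my own assumptions, is to read every field with get, put them all in one dict, and then drop whatever came back as None. The function name build_record is hypothetical and the input dicts are made up; the key names are the ones parse_movie reads. Unlike parse_movie, this sketch reads boxOffice and hotValue independently instead of using elif.

```python
def build_record(value):
    """Build one movie record from the 'value' dict, omitting absent fields."""
    rating = value.get('movieRating') or {}
    box_office = value.get('boxOffice') or {}
    hot_value = value.get('hotValue') or {}
    record = {
        'movie_id': rating.get('MovieId'),
        'AttitudeCount': rating.get('AttitudeCount'),
        'Usercount': rating.get('Usercount'),
        'movietitle': value.get('movieTitle'),
        'isrelease': value.get('isRelease'),
        'endDate': box_office.get('EndDate'),
        'rank': box_office.get('Rank'),
        'ranking': hot_value.get('Ranking'),
    }
    # Drop keys whose value is None so each case keeps only its own fields.
    return {key: val for key, val in record.items() if val is not None}


# Made-up inputs for the "released" and "not yet released" cases.
released = build_record({
    'isRelease': True, 'movieTitle': 'A',
    'movieRating': {'MovieId': 1, 'AttitudeCount': 5, 'Usercount': 9},
    'boxOffice': {'EndDate': '2018-05-01', 'Rank': 2},
})
upcoming = build_record({
    'isRelease': False, 'movieTitle': 'B',
    'movieRating': {'MovieId': 2, 'AttitudeCount': 7, 'Usercount': 0},
    'hotValue': {'Ranking': 4},
})
```

Filtering on "is not None" rather than truthiness keeps legitimate zero values such as a Usercount of 0.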

http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fsearch.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D0&t=2018412183554830&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85&Ajax_CallBackArgument1=0&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=1

http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fsearch.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D2%26i%3D0%26c%3D295&t=20184121972310957&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=2

http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fsearch.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D3%26i%3D0%26c%3D295&t=20184121975663865&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=3

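Comparing the links for pages 2 and 3, only t, Ajax_CallBackArgument4, and the page number p embedded in Ajax_RequestUrl differ. t is based on the current time and Ajax_CallBackArgument4 is the page number; the code below varies just these two and keeps Ajax_RequestUrl fixed, which makes the rest easy.

Here is the code: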
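Rather than comparing long links by eye, a few lines of standard-library code can list the query parameters that differ. This is only a sketch: diff_query is a name I made up, and the two URLs are shortened stand-ins for the page 2 and page 3 links (same t and Ajax_CallBackArgument4 values, other parameters trimmed).

```python
from urllib.parse import parse_qs, urlsplit


def diff_query(url_a, url_b):
    """Return {param: (values_in_a, values_in_b)} for query parameters that differ."""
    query_a = parse_qs(urlsplit(url_a).query)
    query_b = parse_qs(urlsplit(url_b).query)
    keys = sorted(set(query_a) | set(query_b))
    return {k: (query_a.get(k), query_b.get(k))
            for k in keys if query_a.get(k) != query_b.get(k)}


# Shortened stand-ins for the captured links.
page2 = ('http://service.channel.mtime.com/Search.api?Ajax_CallBack=true'
         '&t=20184121972310957&Ajax_CallBackArgument4=2')
page3 = ('http://service.channel.mtime.com/Search.api?Ajax_CallBack=true'
         '&t=20184121975663865&Ajax_CallBackArgument4=3')
changed = diff_query(page2, page3)
```

Running it on the full captured links should work the same way, since parse_qs also decodes the percent-encoded Ajax_RequestUrl value.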

def make_index_url(x):
    one='http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&'
    two='Ajax_RequestUrl=http%3A%2F%2F'
    three='search.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D2%26i%3D0%26c%3D295'
    # four builds the t parameter from the current time
    four=r'&t=%s'% d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    five=r'&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85'
    # x is the page number
    six=r'&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=%s'%x
    # The URL is built by plain string concatenation, which is crude but works
    url = one +two +three+four+five+six
    return url
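A less brute-force option is to let urllib.parse.urlencode do the escaping. The sketch below is not the code this post ran: build_index_url is a hypothetical name, it lets the embedded p follow the page number (make_index_url hard-codes p=2), and the value 295 is simply copied from the captured links.

```python
import datetime
from urllib.parse import quote, urlencode

BASE = 'http://service.channel.mtime.com/Search.api'


def build_index_url(keyword, page, now=None):
    """Assemble the search URL from a parameter dict instead of string pieces."""
    now = now or datetime.datetime.now()
    # Inner page URL; urlencode below percent-encodes it a second time.
    request_url = ('http://search.mtime.com/search/?q=%s&t=1&p=%d&i=0&c=295'
                   % (quote(keyword), page))
    params = {
        'Ajax_CallBack': 'true',
        'Ajax_CallBackType': 'Mtime.Channel.Services',
        'Ajax_CallBackMethod': 'GetSearchResult',
        'Ajax_CrossDomain': 1,
        'Ajax_RequestUrl': request_url,
        't': now.strftime('%Y%m%d%H%M%S3282'),  # same format as make_index_url
        'Ajax_CallBackArgument0': keyword,
        'Ajax_CallBackArgument1': 1,
        'Ajax_CallBackArgument2': 295,  # copied from the captured links
        'Ajax_CallBackArgument3': 0,
        'Ajax_CallBackArgument4': page,
    }
    return BASE + '?' + urlencode(params)


url = build_index_url('复仇者', 2, now=datetime.datetime(2018, 4, 12, 19, 7, 23))
```

With this version the keyword is an argument, which fits the goal of searching for any keyword rather than only 复仇者.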

Next is the URL for the detail pages.

http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F218090%2F&t=20184121851354464&Ajax_CallBackArgument0=218090

http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http%3A%2F%2Fmovie.mtime.com%2F22411%2F&t=20184121914842461&Ajax_CallBackArgument0=22411

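Only three parameters change for the detail page:

Ajax_RequestUrl: this is simply the movie's link

t: the time, as above

Ajax_CallBackArgument0: this is the movieid, which we already scraped

Again, straight to the code: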
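A side note on t: the captured values do not look zero-padded. For example, 20184121851354464 reads naturally as 2018, 4, 12, 18, 51, 35 followed by the extra digits 4464. That reading is only my inference from the samples, not documented behavior, and I have not checked whether the server validates t at all. Under that assumption, a value in the same style could be produced like this (mtime_style_t is a hypothetical name, and using a random suffix is also an assumption):

```python
import datetime
import random


def mtime_style_t(now, suffix):
    """Join unpadded date/time fields, then append a caller-supplied digit suffix."""
    fields = (now.year, now.month, now.day, now.hour, now.minute, now.second)
    return ''.join(str(field) for field in fields) + str(suffix)


# Reproduces the captured value if it was taken at 2018-04-12 18:51:35.
sample_t = mtime_style_t(datetime.datetime(2018, 4, 12, 18, 51, 35), 4464)

# For live use, a random suffix is one possible choice (an assumption).
live_t = mtime_style_t(datetime.datetime.now(), random.randint(0, 99999))
```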

def make_detail_url(x):
    # x is one [movieurl, movieid] pair returned by parse_index
    one='http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&'
    two='Ajax_RequestUrl=%s'%x[0]
    three='&t=%s'%d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    four='&Ajax_CallBackArgument0=%s'%x[1]
    url=one+two+three+four
    return url
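In the captured links the Ajax_RequestUrl value is percent-encoded, while make_detail_url passes the movie link through as it is. If the raw form ever causes trouble, the link can be encoded first with urllib.parse.quote. This is a sketch; encode_request_url is a hypothetical helper, not part of the code in this post.

```python
from urllib.parse import quote


def encode_request_url(movie_url):
    """Percent-encode a movie page URL for use as the Ajax_RequestUrl value."""
    # safe='' makes quote escape '/' as well, matching the captured links.
    return quote(movie_url, safe='')


encoded = encode_request_url('http://movie.mtime.com/218090/')
print(encoded)  # http%3A%2F%2Fmovie.mtime.com%2F218090%2F
```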

At this point the project is basically done; what remains is some finishing work,

namely looping over the pages and storing the results in MongoDB.

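if __name__=='__main__':
    massage_list=[]
    # The search results span three index pages, numbered 1 to 3
    for f in range(1,4):
        url=make_index_url(f)
        b=parse_index(url)
        # parse_index returns None when a page could not be parsed
        if not b:
            continue
        for i in b:
            url=make_detail_url(i)
            get_massage=parse_movie(url)
            if isinstance(get_massage,dict):
                massage_list.append(get_massage)
    # insert_many rejects an empty list, so only insert when there is data
    if massage_list:
        collection.insert_many(massage_list)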
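If the same movie could show up on more than one result page (I have not verified whether it does), the list may be worth de-duplicating before it goes into MongoDB. A minimal sketch, with dedupe_by_id as a hypothetical helper and made-up records:

```python
def dedupe_by_id(records, key='movie_id'):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        identifier = record.get(key)
        if identifier in seen:
            continue
        seen.add(identifier)
        unique.append(record)
    return unique


# Made-up records: the same movie appears twice.
records = [
    {'movie_id': 1, 'movietitle': 'A'},
    {'movie_id': 2, 'movietitle': 'B'},
    {'movie_id': 1, 'movietitle': 'A'},
]
unique = dedupe_by_id(records)
```

In the main block this would be applied to massage_list just before the insert.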

Complete code

# coding: utf-8
import requests
import json,re
import  datetime as d
import pymongo

client=pymongo.MongoClient()
db=client.pythonSpider
collection=db.maoyan

def one_page_index(url):
    headers={
     'Host':'service.channel.mtime.com',
    'Connection':'keep-alive',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
    'Accept':'*/*',
    'Referer':'http://movie.mtime.com/217497/'
      }
    try:
        html=requests.get(url,headers=headers,verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)
def parse_index(url):
    b=[]
    html=one_page_index(url)
    if not html:
        print('no html')
        return None
    pattern=re.compile('var result.*?({.*?});')
    json_string=re.search(pattern,html)
    if not json_string:
        print('no json_string')
        return None
    json_html=json.loads(json_string.group(1))
    if json_html and 'value' in json_html:
        value=json_html.get('value')
    else:
        print('no json_html')
        return None
    if value and 'movieResult' in value:
        movieResult=value.get('movieResult')
    else:
           print('no value')
           return None
    if movieResult and 'moreMovies' in movieResult:
        moreMovies=movieResult.get('moreMovies')
    else:
        print('no movieResult')
        return None
    for i in moreMovies:
        movieurl=i.get('movieUrl')
        movieid=i.get('movieId')
        a=[movieurl,movieid]
        b.append(a)

    return b
def get_movie(url):
    headers={
     'Host':'service.library.mtime.com',
    'Connection':'keep-alive',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
    'Accept':'*/*',
    'Referer':'http://movie.mtime.com/217497/'
      }
    try:
        html=requests.get(url,headers=headers,verify=False)
        if html:
            return html.text
        return None
    except Exception as e:
        print(e)

def parse_movie(url):
    ranking=None
    endDate=None
    rank=None
    html=get_movie(url)
    if not html:
        print('no html')
        return None
    pattern=re.compile('var result.*?({.*?});',re.S)
    json_string=re.search(pattern,html)
    if not json_string:
        print('no json_string')
        return None
    json_html=json.loads(json_string.group(1))
    if json_html and 'value' in json_html:
        value=json_html.get('value')
    else:
        print('no value')
        return None

    if value and 'boxOffice' in value:
        boxoffice=value.get('boxOffice')
        endDate=boxoffice.get('EndDate')
        rank=boxoffice.get('Rank')
    elif value and 'hotValue' in value:
        hotvalue=value.get('hotValue')
        ranking=hotvalue.get('Ranking')
    if value and 'isRelease' in value:
        isrelease=value.get('isRelease')
    else:
        print('no isrelease')
        return None
    if value and 'movieRating' in value:
        movierating=value.get('movieRating')
        attitudeCount=movierating.get('AttitudeCount')
        movie_id=movierating.get('MovieId')
        usercount=movierating.get('Usercount')
    else:
        print('no movierating')
        return None
    if value and 'movieTitle' in value:
        movietitle=value.get('movieTitle')
    else:
        print('no movietitle')
        return None
    movie_massage={
        'movie_id':movie_id,
        'AttitudeCount':attitudeCount,
        'Usercount':usercount,
        'movietitle':movietitle,
        'isrelease':isrelease
    }
    if endDate:
        movie_massage['endDate']=endDate
        movie_massage['rank']=rank
    elif ranking:
        movie_massage['ranking']=ranking

    return movie_massage

def make_detail_url(x):
    one='http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&'
    two='Ajax_RequestUrl=%s'%x[0]
    three='&t=%s'%d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    four='&Ajax_CallBackArgument0=%s'%x[1]
    url=one+two+three+four
    return url

def make_index_url(x):
    one='http://service.channel.mtime.com/Search.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Channel.Services&Ajax_CallBackMethod=GetSearchResult&Ajax_CrossDomain=1&'
    two='Ajax_RequestUrl=http%3A%2F%2F'
    three='search.mtime.com%2Fsearch%2F%3Fq%3D%25E5%25A4%258D%25E4%25BB%2587%25E8%2580%2585%26t%3D1%26p%3D2%26i%3D0%26c%3D295'
    four=r'&t=%s'% d.datetime.now().strftime("%Y%m%d%H%M%S3282")
    five=r'&Ajax_CallBackArgument0=%E5%A4%8D%E4%BB%87%E8%80%85'
    six=r'&Ajax_CallBackArgument1=1&Ajax_CallBackArgument2=295&Ajax_CallBackArgument3=0&Ajax_CallBackArgument4=%s'%x
    url = one +two +three+four+five+six
    return url

if __name__=='__main__':
    massage_list=[]
    for f in range(1,4):
        url=make_index_url(f)
        b=parse_index(url)
        if not b:
            continue
        for i in b:
            url=make_detail_url(i)
            get_massage=parse_movie(url)
            if isinstance(get_massage,dict):
                massage_list.append(get_massage)
    if massage_list:
        collection.insert_many(massage_list)

Finally, feel free to get in touch. Leave a comment!
