从零实现一个高性能网络爬虫（一）网络请求分析及代码实现

摘要

从零实现一个高性能网络爬虫系列教程第一篇，后续会有关于url去重、如何反爬虫、如何提高抓取效率、分布式爬虫系列文章。

以我写的一个知乎爬虫为Demo讲解,github地址 (https://github.com/wycm/zhihu-crawler) ,有兴趣的朋友可以star下。

网络请求的分析是写网络爬虫非常关键且重要的一个步骤。这篇文章以知乎网站为例，从网络请求分析到代码(java)实现。

目的

获取某个知乎用户的所有关注用户的个人资料

请求分析

就目前的大部分网页来说，网页上能看到的数据大多都是直接在网站后台生成好数据(有的网页是在网站前端通过js代码处理后显示,如数据混淆、加密等)直接在前台显示。
虽然很多网站采用了ajax异步加载，但是归根结底它还是一个http请求。只要能够分析出对应数据的请求来源，那么就很容易的拿到你想要的数据了。以下步骤讲解如何分析http请求。

以我的知乎账户为例，获取我的所有关注用户资料。首先打开我的关注列表，可以看到主面板就是我的关注用户列表，

我一共关注233个用户，现在目的是就是要获取这233个用户的个人资料信息。打开F12->NetWork,勾选上Preserve log和Disable cache（如下图）。

从零实现一个高性能网络爬虫（一）网络请求分析及代码实现
下拉滚动条，点击下一页获取对应请求(在翻页的过程会有很多无关的请求),待页面加载完成后，在请求列表中右键->Save as HAR with content，这个文件是把当前请求(request)列表保存为

json格式文本，保存后使用chrome打开这个文件，搜索(Ctrl+F)页面出现的关键字，要注意这里中文采用了unicode编码，我这里直接搜索5032(李博杰的关注者数，见下图)。这一步骤的目的是获取我们想要数据(关注用户的个人资料)的请求来源。

从零实现一个高性能网络爬虫（一）网络请求分析及代码实现
由步骤2搜索得出，关注用户的资料数据来自以下请求(如下图),url解码后为 https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=20&limit=20 (url1)，从这里可以看出关注列表的数据并不是从(url2)同步加载而来的，而是直接通过ajax异步请求url1来获得关注用户数据，然后通过js代码填充数据。这里要注意用红色矩形圈住的 authorization request header，在代码实现的时候必须加上这个header。这个数据并不是动态改变的，通过步骤2的方式可以发现它是来自一个js文件。该步骤注意的是，我写该文章的时候是2017-04-27，随着时间推移，知乎可能会更新相关api接口的url，也就是说通过步骤2得出的url有可能并不是我上面的url1，但是具体分析的方法还是通用的。

从零实现一个高性能网络爬虫（一）网络请求分析及代码实现

多测试几次可以得出以上url1的参数含义如下

`参数名`	`类型`	`必填`	`值`	`说明`
`include`	`String`	`是`	`data[*]answer_count,articles_count`	`需要返回的字段（这个值可以改根据需要增加一些字段）`
`offset`	`int`	`是`		`偏移量（通过调整这个值可以获取到一个用户的所有关注用户资料）`
`limit`	`int`	`是`	`20`	`返回用户数（最大20，超过20无效）`

关于如何测试请求，我常用的以下三种方式
- 原生chrome浏览器。可以做一些简单的GET请求测试，这种方式有很大的局限性，不能编辑http header。如果直接(未登录知乎)通过浏览器访问url1，会得到401的response code。因为它没有带上 authorization request header。所以这种方式能测试一些简单且没有特殊request header的GET请求。
  
  从零实现一个高性能网络爬虫（一）网络请求分析及代码实现
- chrome插件Postman。一个很强大的http请求测试工具，可以直接编辑request header(包括cookies)。如果可以FQ的话，强烈推荐。GET、POST、PUT等都是支持的，几乎可以发送任意类型的http请求,测试的url1如下图。通过修改它参数的值，来看服务器响应数据的变化来确定参数含义
  
  从零实现一个高性能网络爬虫（一）网络请求分析及代码实现
- intellij idea ultimate版自带的工具。打开方式 Tools->Test RESTful Web Service。也是可以直接编辑http header(包括cookies)请求发送，GET、POST、PUT等请求方式也都是支持的。
response是一段json格式的数据，中文是采用的unicode编码，解码后数据内容如下图。

代码实现

代码采用的Java HttpClient4.x，关于HttpClient4.x的使用我这里不过多讲解，要注意的是HttpClient4.x和3.x API有很大的差异。

1 package com.cnblogs.wycm;
 2 
 3 import com.alibaba.fastjson.JSON;
 4 import com.alibaba.fastjson.JSONObject;
 5 import org.apache.http.client.methods.CloseableHttpResponse;
 6 import org.apache.http.client.methods.HttpGet;
 7 import org.apache.http.impl.client.CloseableHttpClient;
 8 import org.apache.http.impl.client.HttpClients;
 9 import org.apache.http.util.EntityUtils;
10 
11 import java.io.IOException;
12 
13 /**
14  * 获取wo-yan-chen-mo关注的所有知乎用户信息
15  * 只是把用户资料打印出来，没有具体解析（关于解析出详细数据可以采用正则表达式、json库、jsonpath等方式）
16  */
17 public class Demo {
18     public static void main(String[] args) throws IOException {
19         //创建http客户端
20         CloseableHttpClient httpClient = HttpClients.createDefault();
21 
22         String url = "https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";
23 
24         //创建http request(GET)
25         HttpGet request = new HttpGet(url);
26 
27         //设置http request header
28         request.setHeader("authorization", "oauth c3cef7c66a1843f8b3a9e6a1e3160e20");
29         //执行http请求
30         CloseableHttpResponse response = httpClient.execute(request);
31         //打印response
32         String responseStr = EntityUtils.toString(response.getEntity());
33         System.out.println(responseStr);
34 
35         String nextPageUrl = getNextPageUrl(responseStr);
36         boolean isEnd = getIsEnd(responseStr);
37 
38         while (!isEnd && nextPageUrl != null){
39             //创建http request(GET)
40             request = new HttpGet(nextPageUrl);
41 
42             //设置http request header
43             request.setHeader("authorization", "oauth c3cef7c66a1843f8b3a9e6a1e3160e20");
44             response = httpClient.execute(request);
45             //打印response
46             responseStr = EntityUtils.toString(response.getEntity());
47             System.out.println(responseStr);
48             nextPageUrl = getNextPageUrl(responseStr);
49             isEnd = getIsEnd(responseStr);
50         }
51     }
52 
53     /**
54      * 获取next url
55      * @param responseStr
56      * @return
57      */
58     private static String getNextPageUrl(String responseStr){
59         JSONObject jsonObject = (JSONObject) JSON.parse(responseStr);
60         jsonObject = (JSONObject) jsonObject.get("paging");
61         return jsonObject.get("next").toString();
62     }
63 
64     /**
65      * 获取is_end
66      * @param responseStr
67      * @return
68      */
69     private static boolean getIsEnd(String responseStr){
70         JSONObject jsonObject = (JSONObject) JSON.parse(responseStr);
71         jsonObject = (JSONObject) jsonObject.get("paging");
72         return (boolean) jsonObject.get("is_end");
73     }
74 }

maven依赖

1 　　<dependency>
 2       <groupId>org.apache.httpcomponents</groupId>
 3       <artifactId>httpclient</artifactId>
 4       <version>4.5</version>
 5     </dependency>
 6 
 7     <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
 8     <dependency>
 9       <groupId>com.alibaba</groupId>
10       <artifactId>fastjson</artifactId>
11       <version>1.2.31</version>
12     </dependency>

一个程序员日常分享，包括但不限于爬虫、Java后端技术，欢迎关注。

从零实现一个高性能网络爬虫（一）网络请求分析及代码实现

摘要

目的

请求分析

代码实现

继续阅读

Java小案例——随机数猜测随机数猜测

nginx location中斜线的位置的重要性

sort()函数到底是怎样进行数字排序的

27 Best Free Eclipse Plug-ins for Java Developer to be ProductiveCode Quality PluginsText Editor PluginsDependency ManagementVersion Control Integration PluginsFramework Development Continuous Integration Related PluginsOther Utility Plugins

Java String.format方法的简单使用

neo4j之cypher使用文档

GitHub连夜封杀！这份阿里 10W 字内部 Java 字面试手册到底有多强？

spark/scala关于【资源文件】加载方法概述外部文件加载方案测试资源文件打包入jar包中小结

mybatis_入门程序Mybatis入门

AOP编程_Android优雅权限框架(1)概念基础，2021金三银四前言正文大纲正文

Effective Java 8:通用程序设计

OOM三种类型

工厂模式-三种类型

【递归】高效率求2的n次幂

win10本地scala和spark安装安装scala安装spark

scala (3) Function 和 Method