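Not long ago I read an article titled "I used a crawler to 'steal' one million Zhihu users in a day, just to prove PHP is the best language in the world". That article logs in by copying the cookie straight into the code. My goal here is not to crawl data; I only want to walk through the basic process of simulating a login in Java. An earlier article of mine, "Android in Practice: One-Click Timetable Import for Super Timetable (超級課程表)", already falls under simulated login. I have also recently seen many people on Zhihu asking how Super Timetable is implemented, and at heart it is simulated login too. Once you understand this article, fetching data that sits behind a login should stop being a worry. This article builds on two earlier ones: "Automatic Cookie Management with OkHttp on Android" for keeping cookies, and "A Complete Guide to the Jsoup Library" for parsing. To keep things simple I use plain Java SE rather than Android. If you port this to Android, probably the only change needed is moving the network requests onto a worker thread.
First open the Zhihu home page in Chrome and click Log in. You will see the screen below.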
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsICNwkTMzYDNxEDMxgDM1EDMy8CX0Vmbu4GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.jpg)
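Press F12 in Chrome to open the developer tools, switch to the Network tab, and check Preserve log. Be sure to check it: without it the log is cleared when the page navigates after login, and you will not see the request.
With everything ready, enter your account and password in the input boxes and click Log in. After a successful login, a request named email appears in the list.
Click that email entry. At the bottom you will see that the request submitted 4 parameters, and at the top you will see that the request URL is http://www.zhihu.com/login/email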
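You may be surprised to find that Zhihu sends the password in plain text. The submitted parameters are easy to read: email is the account, password is the password, and remember_me says whether to remember the login, for which true is fine. There is also an _xsrf parameter. My rough guess was that it is an anti-crawler measure, although the name suggests a cross-site request forgery token. Either way, we have to grab its value from the page source before submitting. The value sits in a hidden field of the form.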
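To make the hidden-field step concrete, here is a minimal sketch that pulls the token out of a page with a plain regular expression. The markup in `main` is a hypothetical stand-in (I am assuming the field is named `_xsrf` and that `name` comes before `value`), and the class name `XsrfExtractor` is mine. The real code later in this article uses Jsoup instead, which is more robust against attribute order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XsrfExtractor {
    // Captures the value attribute of an <input> whose name is _xsrf.
    // Assumption: name appears before value, as in the sample markup below.
    private static final Pattern XSRF = Pattern.compile(
            "<input[^>]*name=\"_xsrf\"[^>]*value=\"([^\"]*)\"");

    /** Returns the token, or null when no matching hidden field is found. */
    static String extract(String html) {
        Matcher m = XSRF.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup, not copied from the real page
        String html = "<form><input type=\"hidden\" name=\"_xsrf\" value=\"abc123\"/></form>";
        System.out.println(extract(html));
    }
}
```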
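With all that in place you happily simulate the login in code, only to get back a captcha error. It turns out we also need to submit a captcha. Its parameter name is captcha, and the captcha image lives at: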
http://www.zhihu.com/captcha.gif?r=timestamp
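Building that address is a one-liner. The sketch below assumes the `r` value is a millisecond timestamp, which is what the code later in this article passes; the class name `CaptchaUrl` is only for illustration.

```java
class CaptchaUrl {
    static final String BASE = "http://www.zhihu.com/captcha.gif?r=";

    /** Appends the given millisecond timestamp as the r parameter. */
    static String build(long timestampMillis) {
        return BASE + timestampMillis;
    }

    public static void main(String[] args) {
        System.out.println(build(System.currentTimeMillis()));
    }
}
```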
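That gives us the following summary.
- Request URL
http://www.zhihu.com/login/email
- Request parameters
_xsrf value of the hidden field extracted from the form
captcha the captcha text
email email address
password password
remember_me remember me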
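The summary above boils down to a form-encoded POST body. As a rough sketch using only the standard library, the hypothetical `LoginForm` class below assembles the five fields and URL-encodes them. The real code later lets OkHttp build the body, so this is only meant to show what ends up on the wire. The sample values are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class LoginForm {
    /** Collects the five login fields in a stable order. */
    static Map<String, String> fields(String xsrf, String captcha, String email, String password) {
        Map<String, String> form = new LinkedHashMap<String, String>();
        form.put("_xsrf", xsrf);
        form.put("captcha", captcha);
        form.put("email", email);
        form.put("password", password);
        form.put("remember_me", "true");
        return form;
    }

    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String encode(Map<String, String> form) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : form.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample values
        System.out.println(encode(fields("tok", "abcd", "user@example.com", "p w")));
    }
}
```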
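One question remains: how do we get the captcha value? The answer is manual input. Save the captcha image locally, read it by eye, type it in, and then log in.
Network requests use OkHttp, parsing uses Jsoup, and we will also need Gson. Add all three as Maven dependencies: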
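The manual step can be as simple as reading one line from the console. This is a minimal sketch; `CaptchaPrompt` and `readCaptcha` are names I made up, and the method takes an `InputStream` only so it can be tried without a keyboard.

```java
import java.io.InputStream;
import java.util.Scanner;

class CaptchaPrompt {
    /** Reads one line from the stream and strips surrounding whitespace. */
    static String readCaptcha(InputStream in) {
        Scanner scanner = new Scanner(in, "UTF-8");
        return scanner.hasNextLine() ? scanner.nextLine().trim() : "";
    }

    public static void main(String[] args) {
        System.out.println("Open the saved captcha image, then type the characters you see:");
        String captcha = readCaptcha(System.in);
        System.out.println("captcha = " + captcha);
    }
}
```

The returned string is what would be passed as the captcha parameter of the login request.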
<dependencies>
<dependency>
<groupId>com.squareup.okhttp</groupId>
<artifactId>okhttp</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.8.3</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.3.1</version>
</dependency>
</dependencies>
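Before writing any code we need to think about how to keep the login state, in other words how to persist cookies. We want to log in once and go straight to collecting data on later runs, so the cookies have to be persisted. I adapted an Android class from the earlier article so that it works on plain Java. As you can see, instead of saving to SharedPreferences it now saves to a file in JSON form, which is why Gson is needed.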
package cn.edu.zafu.zhihu;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.reflect.TypeToken;
import java.io.*;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
/**
* User:lizhangqu([email protected])
* Date:2015-07-18
* Time: 16:54
*/
public class PersistentCookieStore implements CookieStore {
private static final Gson gson= new GsonBuilder().setPrettyPrinting().create();
private static final String LOG_TAG = "PersistentCookieStore";
private static final String COOKIE_PREFS = "CookiePrefsFile";
private static final String COOKIE_NAME_PREFIX = "cookie_";
private final HashMap<String, ConcurrentHashMap<String, HttpCookie>> cookies;
private Map<String,String> cookiePrefs=new HashMap<String, String>();
/**
* Construct a persistent cookie store.
*
*/
public PersistentCookieStore() {
String cookieJson = readFile("cookie.json");
Map<String,String> fromJson = gson.fromJson(cookieJson,new TypeToken<Map<String, String>>() {}.getType());
if(fromJson!=null){
System.out.println(fromJson);
cookiePrefs=fromJson;
}
cookies = new HashMap<String, ConcurrentHashMap<String, HttpCookie>>();
// Load any previously stored cookies into the store
for(Map.Entry<String, ?> entry : cookiePrefs.entrySet()) {
if (((String)entry.getValue()) != null && !((String)entry.getValue()).startsWith(COOKIE_NAME_PREFIX)) {
String[] cookieNames = split((String) entry.getValue(), ",");
for (String name : cookieNames) {
String encodedCookie = cookiePrefs.get(COOKIE_NAME_PREFIX + name);
if (encodedCookie != null) {
HttpCookie decodedCookie = decodeCookie(encodedCookie);
if (decodedCookie != null) {
if(!cookies.containsKey(entry.getKey()))
cookies.put(entry.getKey(), new ConcurrentHashMap<String, HttpCookie>());
cookies.get(entry.getKey()).put(name, decodedCookie);
}
}
}
}
}
}
public void add(URI uri, HttpCookie cookie) {
String name = getCookieToken(uri, cookie);
// Save cookie into local store, or remove if expired
if (!cookie.hasExpired()) {
if(!cookies.containsKey(uri.getHost()))
cookies.put(uri.getHost(), new ConcurrentHashMap<String, HttpCookie>());
cookies.get(uri.getHost()).put(name, cookie);
} else {
// Look up by host, the same key used when the cookie was stored above
if(cookies.containsKey(uri.getHost()))
cookies.get(uri.getHost()).remove(name);
}
cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));
cookiePrefs.put(COOKIE_NAME_PREFIX + name, encodeCookie(new SerializableHttpCookie(cookie)));
String json=gson.toJson(cookiePrefs);
saveFile(json.getBytes(), "cookie.json");
}
protected String getCookieToken(URI uri, HttpCookie cookie) {
return cookie.getName() + cookie.getDomain();
}
public List<HttpCookie> get(URI uri) {
ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
if(cookies.containsKey(uri.getHost()))
ret.addAll(cookies.get(uri.getHost()).values());
return ret;
}
public boolean removeAll() {
cookiePrefs.clear();
cookies.clear();
return true;
}
public boolean remove(URI uri, HttpCookie cookie) {
String name = getCookieToken(uri, cookie);
if(cookies.containsKey(uri.getHost()) && cookies.get(uri.getHost()).containsKey(name)) {
cookies.get(uri.getHost()).remove(name);
if(cookiePrefs.containsKey(COOKIE_NAME_PREFIX + name)) {
cookiePrefs.remove(COOKIE_NAME_PREFIX + name);
}
cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));
return true;
} else {
return false;
}
}
public List<HttpCookie> getCookies() {
ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
for (String key : cookies.keySet())
ret.addAll(cookies.get(key).values());
return ret;
}
public List<URI> getURIs() {
ArrayList<URI> ret = new ArrayList<URI>();
for (String key : cookies.keySet())
try {
ret.add(new URI(key));
} catch (URISyntaxException e) {
e.printStackTrace();
}
return ret;
}
/**
* Serializes Cookie object into String
*
* @param cookie cookie to be encoded, can be null
* @return cookie encoded as String
*/
protected String encodeCookie(SerializableHttpCookie cookie) {
if (cookie == null)
return null;
ByteArrayOutputStream os = new ByteArrayOutputStream();
try {
ObjectOutputStream outputStream = new ObjectOutputStream(os);
outputStream.writeObject(cookie);
} catch (IOException e) {
System.out.println("IOException in encodeCookie"+ e);
return null;
}
return byteArrayToHexString(os.toByteArray());
}
/**
* Returns cookie decoded from cookie string
*
* @param cookieString string of cookie as returned from http request
* @return decoded cookie or null if exception occurred
*/
protected HttpCookie decodeCookie(String cookieString) {
byte[] bytes = hexStringToByteArray(cookieString);
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
HttpCookie cookie = null;
try {
ObjectInputStream objectInputStream = new ObjectInputStream(byteArrayInputStream);
cookie = ((SerializableHttpCookie) objectInputStream.readObject()).getCookie();
} catch (IOException e) {
System.out.println("IOException in decodeCookie"+e);
} catch (ClassNotFoundException e) {
System.out.println("ClassNotFoundException in decodeCookie"+e);
}
return cookie;
}
/**
* Using some super basic byte array <-> hex conversions so we don't have to rely on any
* large Base64 libraries. Can be overridden if you like!
*
* @param bytes byte array to be converted
* @return string containing hex values
*/
protected String byteArrayToHexString(byte[] bytes) {
StringBuilder sb = new StringBuilder(bytes.length * 2);
for (byte element : bytes) {
// Mask to 0..255 so negative bytes do not produce sign-extended hex
int v = element & 0xff;
if (v < 16) {
sb.append('0');
}
sb.append(Integer.toHexString(v));
}
return sb.toString().toUpperCase(Locale.US);
}
/**
* Converts hex values from strings to byte array
*
* @param hexString string of hex-encoded values
* @return decoded byte array
*/
protected byte[] hexStringToByteArray(String hexString) {
int len = hexString.length();
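byte[] data = new byte[len / 2];
// Each pair of hex digits becomes one byte: high nibble shifted left by 4, plus low nibble
for (int i = 0; i < len; i += 2) {
data[i / 2] = (byte) ((Character.digit(hexString.charAt(i), 16) << 4) + Character.digit(hexString.charAt(i + 1), 16));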
}
return data;
}
public static String join(CharSequence delimiter, Iterable tokens) {
StringBuilder sb = new StringBuilder();
boolean firstTime = true;
for (Object token: tokens) {
if (firstTime) {
firstTime = false;
} else {
sb.append(delimiter);
}
sb.append(token);
}
return sb.toString();
}
public static String[] split(String text, String expression) {
if (text.length() == 0) {
return new String[]{};
} else {
// A negative limit keeps trailing empty strings in the result
return text.split(expression, -1);
}
}
public static void saveFile(byte[] bfile, String fileName) {
BufferedOutputStream bos = null;
FileOutputStream fos = null;
File file = null;
try {
file = new File(fileName);
fos = new FileOutputStream(file);
bos = new BufferedOutputStream(fos);
bos.write(bfile);
} catch (Exception e) {
e.printStackTrace();
} finally {
if (bos != null) {
try {
bos.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
if (fos != null) {
try {
fos.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
}
}
public static String readFile(String fileName) {
BufferedInputStream bis = null;
FileInputStream fis = null;
File file = null;
try {
file = new File(fileName);
fis = new FileInputStream(file);
bis = new BufferedInputStream(fis);
int available = bis.available();
byte[] bytes=new byte[available];
bis.read(bytes);
String str=new String(bytes);
return str;
} catch (Exception e) {
e.printStackTrace();
} finally {
if (bis != null) {
try {
bis.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
if (fis != null) {
try {
fis.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
}
return "";
}
}
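The store above persists each cookie by serializing it and writing the bytes as a hex string. The round trip can be tried in isolation with the sketch below. It uses a plain `String` as a stand-in for `SerializableHttpCookie` (which comes from the earlier article and is not shown here), and the class name `CookieHexRoundTrip` is mine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Locale;

class CookieHexRoundTrip {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            int v = b & 0xff;            // treat the byte as unsigned
            if (v < 16) {
                sb.append('0');          // pad to two hex digits
            }
            sb.append(Integer.toHexString(v));
        }
        return sb.toString().toUpperCase(Locale.US);
    }

    static byte[] fromHex(String hex) {
        byte[] data = new byte[hex.length() / 2];
        for (int i = 0; i < hex.length(); i += 2) {
            data[i / 2] = (byte) ((Character.digit(hex.charAt(i), 16) << 4)
                    + Character.digit(hex.charAt(i + 1), 16));
        }
        return data;
    }

    /** Serializes the value with Java serialization and returns it as hex. */
    static String encode(Serializable value) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            ObjectOutputStream out = new ObjectOutputStream(os);
            out.writeObject(value);
            out.close();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return toHex(os.toByteArray());
    }

    /** Reverses encode(): hex back to bytes, bytes back to an object. */
    static Object decode(String hex) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(fromHex(hex)));
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String hex = encode("sid=abc");   // stand-in for a serializable cookie wrapper
        System.out.println(hex);
        System.out.println(decode(hex));
    }
}
```

Hex doubles the stored size compared with the raw bytes, but it keeps the JSON file plain ASCII and avoids pulling in a Base64 dependency, which is the trade-off the original comment mentions.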
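Next, create an OkHttp client and set its cookie handler to the class we just wrote.
private static OkHttpClient client = new OkHttpClient();
static {
// Route every cookie through the file-backed store so the login survives restarts
client.setCookieHandler(new CookieManager(new PersistentCookieStore(), CookiePolicy.ACCEPT_ALL));
}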
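If it is unclear what the `CookieManager` does for the client, the sketch below exercises it directly with the JDK's default in-memory store instead of `PersistentCookieStore`. It feeds in one `Set-Cookie` header and then asks which `Cookie` header a later request to the same host would carry. The host `www.example.com` and the cookie `sid=abc` are made-up values, and `CookieManagerDemo` is a name I chose.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CookieManagerDemo {
    /** Stores one Set-Cookie header, then returns the Cookie values for a later request. */
    static List<String> cookiesFor(String setCookie, String firstUrl, String laterUrl) {
        // A null store makes CookieManager fall back to its in-memory store
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        try {
            Map<String, List<String>> responseHeaders = new HashMap<String, List<String>>();
            responseHeaders.put("Set-Cookie", Collections.singletonList(setCookie));
            manager.put(URI.create(firstUrl), responseHeaders);

            Map<String, List<String>> requestHeaders =
                    manager.get(URI.create(laterUrl), new HashMap<String, List<String>>());
            List<String> cookie = requestHeaders.get("Cookie");
            return cookie == null ? Collections.<String>emptyList() : cookie;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up host and cookie, for illustration only
        System.out.println(cookiesFor("sid=abc; Path=/",
                "http://www.example.com/", "http://www.example.com/settings"));
    }
}
```

Swapping the `null` for a `PersistentCookieStore` instance is what turns this into the file-backed behavior used in the article.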
Now we can fetch _xsrf and the captcha. The captcha is saved in the project root as a file named code.png.
private static String xsrf;
public static void getCode() throws IOException{
Request request = new Request.Builder()
.url("http://www.zhihu.com/")
.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
.build();
Response response = client.newCall(request).execute();
String result = response.body().string();
Document parse = Jsoup.parse(result);
System.out.println(parse + "");
// The first hidden input on the page is assumed to be the _xsrf field
result = parse.select("input[type=hidden]").get(0).attr("value")
.trim();
xsrf=result;
System.out.println("_xsrf:" + result);
String codeUrl = "http://www.zhihu.com/captcha.gif?r=";
codeUrl += System.currentTimeMillis();
System.out.println("codeUrl:" + codeUrl);
Request getcode = new Request.Builder()
.url(codeUrl)
.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
.build();
Response code = client.newCall(getcode).execute();
byte[] bytes = code.body().bytes();
saveCode(bytes, "code.png");
}
public static void saveCode(byte[] bfile, String fileName) {
BufferedOutputStream bos = null;
FileOutputStream fos = null;
File file = null;
try {
file = new File(fileName);
fos = new FileOutputStream(file);
bos = new BufferedOutputStream(fos);
bos.write(bfile);
} catch (Exception e) {
e.printStackTrace();
} finally {
if (bos != null) {
try {
bos.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
if (fos != null) {
try {
fos.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
}
}
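On Java 7 or later, the stream handling in `saveCode` can probably be shortened with `java.nio.file.Files`. This is a sketch of that alternative, not the code from the downloadable project; `CaptchaFile`, `save`, and `load` are hypothetical names, and unlike `saveCode` it rethrows I/O failures instead of only printing them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CaptchaFile {
    /** Writes the bytes to fileName, replacing any existing file, and returns the path. */
    static Path save(byte[] bytes, String fileName) {
        try {
            // Files.write creates the file or truncates an existing one by default
            return Files.write(Paths.get(fileName), bytes);
        } catch (IOException e) {
            throw new IllegalStateException("could not save " + fileName, e);
        }
    }

    /** Reads the whole file back, mainly to verify what was written. */
    static byte[] load(String fileName) {
        try {
            return Files.readAllBytes(Paths.get(fileName));
        } catch (IOException e) {
            throw new IllegalStateException("could not read " + fileName, e);
        }
    }

    public static void main(String[] args) {
        // Three placeholder bytes stand in for the downloaded image
        Path written = save(new byte[]{71, 73, 70}, "code.png");
        System.out.println("saved " + load("code.png").length + " bytes to " + written);
    }
}
```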
Then submit the parameters we fetched, together with the account and password, to log in.
public static void login(String randCode,String email,String password) throws IOException{
RequestBody formBody = new FormEncodingBuilder()
.add("_xsrf", xsrf)
.add("captcha", randCode)
.add("email", email)
.add("password", password)
.add("remember_me", "true")
.build();
Request login = new Request.Builder()
.url("http://www.zhihu.com/login/email")
.post(formBody)
.addHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
.build();
Response execute = client.newCall(login).execute();
System.out.println(decode(execute.body().string()));
}
public static String decode(String unicodeStr) {
if (unicodeStr == null) {
return null;
}
StringBuffer retBuf = new StringBuffer();
int maxLoop = unicodeStr.length();
for (int i = 0; i < maxLoop; i++) {
if (unicodeStr.charAt(i) == '\\') {
// A full escape needs six characters: backslash, 'u', and four hex digits
if ((i < maxLoop - 5)
&& ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr
.charAt(i + 1) == 'U')))
try {
retBuf.append((char) Integer.parseInt(
unicodeStr.substring(i + 2, i + 6), 16));
i += 5;
} catch (NumberFormatException localNumberFormatException) {
retBuf.append(unicodeStr.charAt(i));
}
else
retBuf.append(unicodeStr.charAt(i));
} else {
retBuf.append(unicodeStr.charAt(i));
}
}
return retBuf.toString();
}
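To see what `decode` is for, here is a compact regex-based variant that should behave the same on well-formed input: it turns `\uXXXX` escapes in the response text back into readable characters. The class name `UnicodeUnescape` and the sample JSON in `main` are made up for illustration and are not the actual response format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // A literal backslash, then u or U, then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\[uU]([0-9a-fA-F]{4})");

    /** Replaces every backslash-u escape with the character it encodes. */
    static String decode(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and '\' in the replacement
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical response body with escaped Chinese text
        System.out.println(decode("{\"msg\": \"\\u767b\\u9646\\u6210\\u529f\"}"));
    }
}
```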
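When the printed response reports success, the login has worked.
After that you can fetch whatever information you want. Here is a simple example that fetches the nicknames of vczh's (輪子哥) followers. Pagination is left for you to handle.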
public static void getFollowers() throws IOException{
Request request = new Request.Builder()
// Note: this path is the followees page; use .../followers for the follower list
.url("http://www.zhihu.com/people/zord-vczh/followees")
.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
.build();
Response response = client.newCall(request).execute();
String result=response.body().string();
Document parse = Jsoup.parse(result);
Elements select = parse.select("div.zm-profile-card");
StringBuilder builder=new StringBuilder();
for (int i=0;i<select.size();i++){
Element element = select.get(i);
String name=element.select("h2").text();
System.out.println(name+"");
builder.append(name);
builder.append("\n");
}
}
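The output is the list of nicknames that were fetched. Of course, once you are logged in you can fetch any information your account can see.
Finally, the source code, as an IntelliJ Maven project: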
http://download.csdn.net/detail/sbsujjbcy/8984375