网络爬虫之JSOUP

JSOUP中文文档：http://www.open-open.com/jsoup/

推荐博客：http://www.cnblogs.com/jycboy/p/jsoupdoc.html

从一个URL加载一个Document

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

使用DOM方法来遍历一个文档

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

查找元素

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods

元素数据

attr(String key)获取属性attr(String key, String value)设置属性
attributes()获取所有属性
id(), className() and classNames()
text()获取文本内容text(String value) 设置文本内容
html()获取元素内HTMLhtml(String value)设置元素内的HTML内容

Elements links = doc.select("a[href]"); //带有href属性的a元素
Elements pngs = doc.select("img[src$=.png]");//扩展名为.png的图片
Element masthead = doc.select("div.masthead").first(); //class等于masthead的div标签
Elements resultLinks = doc.select("h3.r > a"); //在h3元素之后的a元素

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OnlineProductDownload {

	public void demo1() {
		// 从优酷下载包含视频结果的html
		Document doc = getHtmlByName("冷然之天秤");
		try {
			//选择第一个结果
			Element element = doc.select("div.s_inform").first();
			// 获取播放源，不是优酷本站的返回（优酷可能会跳转到其他网站，如：爱奇艺）
			Element playSource = element.select("div.pos_area span").first();
			if (playSource.text() != "优酷") {
				return;
			}

			//每个li一集（海贼王：第一集、第二集）
			Elements li_Elements = element.select("ul.clearfix li");
			for (Element li : li_Elements) {
				// 获取第几集
				Element span = li.getElementsByTag("span").first();
				String text = span.text();
				// 获取每集详情的url（并不是视频的真实url）
				Element a = li.getElementsByTag("a").first();
				String href = a.attr("href");

				// 根据href获取详情网页文本
				doc = getHtmlTextByUrl(href);
				//查找实际结果标签
				Element sourceLi = doc.select("ul.fn-share-code li").first();
				// 根据doc获取播放插件. 如：
				// <iframe src='http://player.youku.com/embed/XMzUwNjM1OTA0MA=='></iframe>
				Element input = sourceLi.getElementsByTag("input").first();
				String value = input.attr("value");
				System.out.println(text + value);
			}
			// http://v.youku.com/v_show/id_XMzUwNjM1OTA0MA==.html
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	// 功能：使用优酷的搜库来搜索视频，并根据名称获取网页文本 如：海贼王
	public Document getHtmlByName(String name) {
		if (name.isEmpty() || name.length() <= 0) {
			return null;
		}
		// 拼接URL 优酷搜索形式：http://www.soku.com/search_video/q_海贼王
		String url = "http://www.soku.com/search_video/q_";
		try {
			//url编码：%E6%B5%B7%E8%B4%BC%E7%8E%8B=海贼王
			String encodeStr = URLEncoder.encode(name, "utf-8");
			url = url + encodeStr;
		} catch (UnsupportedEncodingException e1) {
			e1.printStackTrace();
		}
		// 通过url获取html
		try {
			Document doc = Jsoup.connect(url).get();
			return doc;
		} catch (IOException e) {
			e.printStackTrace();
		}
		return null;
	}

	// 根据url从网络获取网页文本
	public Document getHtmlTextByUrl(String url) {
		Document doc = null;
		try {
			//拼接成完整的路径
			String str = "http:";
			str = str + url;
			doc = Jsoup.connect(str).get();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return doc;
	}

}

网络爬虫之JSOUP

继续阅读

关于Gradle配置的小结

Java小案例——随机数猜测随机数猜测

nginx location中斜线的位置的重要性

27 Best Free Eclipse Plug-ins for Java Developer to be ProductiveCode Quality PluginsText Editor PluginsDependency ManagementVersion Control Integration PluginsFramework Development Continuous Integration Related PluginsOther Utility Plugins

Java String.format方法的简单使用

neo4j之cypher使用文档

GitHub连夜封杀！这份阿里 10W 字内部 Java 字面试手册到底有多强？

spark/scala关于【资源文件】加载方法概述外部文件加载方案测试资源文件打包入jar包中小结

mybatis_入门程序Mybatis入门

AOP编程_Android优雅权限框架(1)概念基础，2021金三银四前言正文大纲正文

Effective Java 8:通用程序设计

OOM三种类型

工厂模式-三种类型

【递归】高效率求2的n次幂

win10本地scala和spark安装安装scala安装spark

scala (3) Function 和 Method