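A WebMagic-based crawler
I. Introduction to WebMagic
WebMagic is a simple, flexible crawler framework. With WebMagic you can quickly develop an efficient, easy-to-maintain crawler.
Features:
1. A simple API that is quick to pick up
2. A modular structure that is easy to extend
3. Multithreading and distributed support
- Project site: http://webmagic.io/
- Chinese documentation: http://webmagic.io/docs/zh/
II. Example code
- 1) Maven dependencies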
<!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-core -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.3</version>
</dependency>
- 2) Java code
The example crawls http://blog.csdn.net/zhengyong15984285623 and extracts the title and creation time of every article published by that user.
package crawl.webmagic.csdn;

import org.apache.commons.collections.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

import java.util.List;

/**
 * Page processor that controls the CSDN crawl.
 */
public class CsdnProcessor implements PageProcessor {

    public static Logger logger = LoggerFactory.getLogger(CsdnProcessor.class);

    // NOTE: the numeric arguments are missing from the original listing.
    // 1000 (ms between requests) and 3 (retries) are assumed placeholder values.
    private Site site = Site.me().setSleepTime(1000).setCycleRetryTimes(3);

    /**
     * CSDN URI suffix (the blog user name).
     */
    public static final String CSDN_URI = "zhengyong15984285623";

    @Override
    public void process(Page page) {
        // Pagination links of the article list
        List<String> pagination = page.getHtml().links()
                .regex("/" + CSDN_URI + "/article/list/\\d*").all();
        page.addTargetRequests(pagination);
        if (CollectionUtils.isNotEmpty(pagination)) {
            // Article list page: only collect each article URL and add it to the crawl queue
            List<String> titleList = page.getHtml()
                    .xpath("//div[@id='article_list']/div[@class='list_item']").all();
            for (String titleHtml : titleList) {
                page.addTargetRequests(new Html(titleHtml).links()
                        .regex("/" + CSDN_URI + "/article/details/\\d*").all());
            }
            // A list page yields no result fields, so skip the pipelines
            page.setSkip(true);
        } else {
            // CSDN article detail page
            String title = page.getHtml()
                    .xpath("//div[@class='article_title']/h1/span/a/text()").toString();
            String createTime = page.getHtml()
                    .xpath("//div[@class='article_r']/span[@class='link_postdate']/text()").toString();
            page.putField("title", title);
            page.putField("createTime", createTime);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
package crawl.webmagic.csdn;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

/**
 * Crawls CSDN with WebMagic.
 */
public class CsdnMain {

    public static Logger logger = LoggerFactory.getLogger(CsdnMain.class);

    public static void main(String[] args) {
        Spider.create(new CsdnProcessor())
                // Start crawling from this URL
                .addUrl("http://blog.csdn.net/" + CsdnProcessor.CSDN_URI)
                // Optionally set a Scheduler that manages the URL queue in files
                // .setScheduler(new FileCacheQueueScheduler("/Users/zhengyong/queue"))
                // Add a Pipeline that prints the results to the console
                .addPipeline(new ConsolePipeline())
                // Run 5 threads concurrently
                .thread(5)
                // Start the crawler
                .run();
    }
}
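The `process` method above relies on two URL patterns: one for the pagination links of the article list and one for article detail pages. The following is a minimal sketch of how those patterns separate links, written with plain `java.util.regex` so it runs without WebMagic. The class name `CsdnUrlPatterns`, the `select` helper, and the sample article IDs are assumptions made for illustration. The sketch also uses `\\d+` rather than the `\\d*` of the listing, so that a path with no trailing number is not matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of WebMagic: mirrors the two URL patterns used in process(). */
public class CsdnUrlPatterns {

    public static final String CSDN_URI = "zhengyong15984285623";

    // Pagination links of the article list, e.g. /<user>/article/list/2
    public static final Pattern LIST =
            Pattern.compile("/" + CSDN_URI + "/article/list/\\d+");

    // Article detail pages, e.g. /<user>/article/details/<id>
    public static final Pattern DETAILS =
            Pattern.compile("/" + CSDN_URI + "/article/details/\\d+");

    /** Returns the matched part of every link in which the pattern is found. */
    public static List<String> select(Pattern pattern, List<String> links) {
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            Matcher m = pattern.matcher(link);
            if (m.find()) {
                matched.add(m.group());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Sample links; the article IDs are made up for this sketch.
        List<String> links = Arrays.asList(
                "http://blog.csdn.net/" + CSDN_URI + "/article/list/2",
                "http://blog.csdn.net/" + CSDN_URI + "/article/details/1001",
                "http://blog.csdn.net/someone_else/article/details/7");

        System.out.println("list pages:   " + select(LIST, links));
        System.out.println("detail pages: " + select(DETAILS, links));
    }
}
```

Links belonging to other users match neither pattern, which is what keeps the crawl confined to one blog.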
- 3) Output
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: google guava Joiner 示列
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: elastic-job 构建
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Google Guava Cache 示列
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CountDownLatch使用
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ElasticSearch-学习笔记
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: kafka 版本
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: git配置用户名和邮箱
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: java身份证格式强校验
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: jdk动态代理例子
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: spring ioc 源码解析(二)
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 图解https
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: spring ioc 源码解析(一)
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ClassPathXmlApplicationContext 源码解析
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ThreadPoolExecutor源码分析
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: JVM底层又是如何实现synchronized
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: jstat查看内存
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ThreadPoolExecutor 分析
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: java中volatile关键字的含义
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mysql 行锁
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 设计模式
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mysql 优化
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mysql 事务级别
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: https客户端证书.p12maven打包后tomcat启动不正确
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: java NIO详解
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: jvm学习
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: redis持久化机制
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: redis配置说明
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: dobbo配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mac 下安装wget
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: logback.xml文件配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: linux 启动或停止jar shell脚本
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mac下安装redis
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: google guava 测试
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: spring batch + spring boot 配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: google工具包
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: redis命令学习笔记
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mybatis 批量插入oracle与mysql
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: linux项目发布常见命令
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: http方式调用webservice
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: http post请求
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mac 配置maven环境变量
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 本地git上传至远程git仓库
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: http get请求
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: maven 提交oracle jar 包
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mybatis foreach 熟悉
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: maven 利用axis2插件配置webservice
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 使用 git 进行项目同步开发步骤
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: MQ开启密码访问平台服务步骤
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: mac 文件显示与隐藏
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: jvm 内存回收finalize如何在垃圾清除前工作原理
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: maven项目修改jsp代码不用打包就能运行
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 观察者模式
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: postgre循环插入模拟数据
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 事件通知
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: maven批处理命令
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: java反射
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CAS单点登录配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: maven学习笔记
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2 表单标签属性
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: cas单点登录
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2 数据标签
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2注解
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2 控制标签
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 集合操作
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 异步线程池
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 字符串大写字母转下划线
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: OGNL 中的集合操作
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2中的数据校验文件配置方法
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: nginx 配置说明
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 使用BlockingQueue创建生产者消费者模式
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: spring mvc 线程安全问题说明
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: [解决异常] spring batch 报错 ORA-: 无法连续访问此事务处理
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: linux操作oracle命令
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: memcache查看数据命令
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Python学习指南
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2结果类型
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2拦截器
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2输出国际信息
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Action访问Servlet API
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2拦截器配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2标签
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 日志
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 快速排序算法
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 培训总结
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: struts2常量配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之 分配任务给一个"组的成员"GroupTask
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之"事件处理Event"
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之"分支聚合Join-Fork"
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之"自定义活动Custom"
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CheckBox
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之Decision流程决策(判断活动执行方向)
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: RadioGroup
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: EditText
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: TextView
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 绝对定位布局管理器
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 布局管理器的嵌套
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Button
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 相对布局管理器
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 表格布局管理器
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Activity 初步
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 线性布局管理器
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 框架布局管理器
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Android项目文件夹解析
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: AndroidManifest.xml解析
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: main.xml解析
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之根据流程变量分配任务Task
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流之状态活动State
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流的流转Transition
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流管理方法扩展
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流管理
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流执行变量Variable
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: -jbpm工作流实现
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 向字符串对象中追加replaceAll方法
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 获取登录用户IP地址
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 根据tree文件菜单的path,拼接文件夹路径
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: EasyUI的from表单,根据皮肤变换 改变表单颜色
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 初试.net使用ajax调用后台方法
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: myelipse 版本破解
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: .net 如何连接mysql
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 集合按某个属性排序
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: EasyUI+Struts2整合KindEditor
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: kaptcha验证码使用配置
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: 定时器-----每天定时删除临时文件
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: html5绘图铺满整个屏幕
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: canvas像素化video
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: .net 当GridView编辑状态获取新值时,往往获取的是修改前的值。
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: HTML5 之 notifications
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Html5 之 Google地图
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 动画之animation
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 过渡之transition
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 D 转换
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 D 转换
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 border-radius 属性
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: html5视频播放
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 border-image 属性
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: CSS3 box-shadow 属性
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: html5之绘图板
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: html5中canvas基本应用
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: html5之canvas动画
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: easyui扁平Json生成树形菜单
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: AutoCompleteTextView
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: easyui动态创建一个dialog
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Dialog
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ScrollView
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ListView
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: DatePicker
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ImageView
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: TimePicker
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: ImageButton
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title: Spinner
createTime: -- :
III. Notes
WebMagic uses log4j for logging. An example log4j.xml configuration:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration>
    <appender name="CONSOLE" class="org.apache.log4j.ConsoleAppender">
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern"
                   value="%d - %c -%-4r [%t] %-5p %x - %m%n" />
        </layout>
        <!-- Restrict the output levels -->
        <filter class="org.apache.log4j.varia.LevelRangeFilter">
            <param name="LevelMax" value="ERROR"/>
            <param name="LevelMin" value="TRACE"/>
        </filter>
    </appender>
    <appender name="FILE" class="org.apache.log4j.FileAppender">
        <param name="File" value="/Users/zhengyong/log/crawl.log"/>
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern"
                   value="%d - %c -%-4r [%t] %-5p %x - %m%n" />
        </layout>
    </appender>
    <root>
        <priority value="info" />
        <appender-ref ref="CONSOLE" />
        <appender-ref ref="FILE" />
    </root>
</log4j:configuration>
Load the configuration in the Java main method:
import org.apache.log4j.xml.DOMConfigurator;
DOMConfigurator.configure("/Users/zhengyong/crawl/log4j.xml"); // load the log4j XML configuration file
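The path passed to `DOMConfigurator.configure` above is an absolute path on one machine. As a rough sketch of one way to make it configurable, the helper below picks the path from the first command-line argument, then from a system property, and finally falls back to a default. The class name `Log4jPathResolver` and the property name `crawl.log4j.path` are hypothetical, and the actual log4j call is left as a comment so the sketch runs without log4j on the classpath.

```java
import java.io.File;

/** Hypothetical helper: decides which log4j XML file to load. */
public class Log4jPathResolver {

    // Hypothetical system property name, chosen for this sketch only.
    public static final String PROPERTY = "crawl.log4j.path";

    /** First command-line argument wins, then the system property, then the fallback. */
    public static String resolve(String[] args, String fallback) {
        if (args != null && args.length > 0 && !args[0].isEmpty()) {
            return args[0];
        }
        String fromProperty = System.getProperty(PROPERTY);
        if (fromProperty != null && !fromProperty.isEmpty()) {
            return fromProperty;
        }
        return fallback;
    }

    public static void main(String[] args) {
        String path = resolve(args, "/Users/zhengyong/crawl/log4j.xml");
        if (new File(path).isFile()) {
            System.out.println("Loading log4j configuration from " + path);
            // With log4j 1.x on the classpath, the call would go here:
            // DOMConfigurator.configure(path);
        } else {
            System.out.println("log4j configuration not found: " + path);
        }
    }
}
```

It could be run as `java -Dcrawl.log4j.path=/some/dir/log4j.xml Log4jPathResolver`, or with the path as the first argument.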