A WebMagic-Based Crawler

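I. Introduction to WebMagic

WebMagic is a simple, flexible crawler framework. With WebMagic you can quickly build a crawler that is efficient and easy to maintain.

Features:

1. A simple API that is quick to learn

2. A modular structure that is easy to extend

3. Support for multithreading and distributed crawling

  • Project site: http://webmagic.io/
  • Chinese API documentation: http://webmagic.io/docs/zh/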

II. Sample Code

  • 1) Maven dependencies
<!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-core -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.3</version>
</dependency>

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.3</version>
</dependency>
           
  • 2) Java code

Goal: crawl the title and creation time of every article published by the user at http://blog.csdn.net/zhengyong15984285623


package crawl.webmagic.csdn;

import org.apache.commons.collections.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

import java.util.List;

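/**
 * Page processor for the CSDN crawl
 */
public class CsdnProcessor implements PageProcessor {

    public static Logger       logger   = LoggerFactory.getLogger(CsdnProcessor.class);

    // Illustrative values (assumed): wait 1000 ms between requests, retry a failed page up to 3 times
    private Site               site     = Site.me().setSleepTime(1000).setCycleRetryTimes(3);

    /**
     * CSDN URI suffix (the blog owner's user name)
     */
    public static final String CSDN_URI = "zhengyong15984285623";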

    @Override
    public void process(Page page) {
        List<String> pagination = page.getHtml().links().regex("/" + CSDN_URI + "/article/list/\\d*").all();
        page.addTargetRequests(pagination);
        // On an article list page, only collect each article's URL and add it to the crawl queue
        if (CollectionUtils.isNotEmpty(pagination)) {
            List<String> titleList = page.getHtml().xpath("//div[@id='article_list']/div[@class='list_item']").all();
            for (String titleHtml : titleList) {
                page.addTargetRequests(new Html(titleHtml).links().regex("/" + CSDN_URI
                                                                         + "/article/details/\\d*").all());
            }
            // List pages carry no result fields, so keep them out of the pipelines
            page.setSkip(true);
        } else { // An individual CSDN article page
            String title = page.getHtml().xpath("//div[@class='article_title']/h1/span/a/text()").toString();
            String createTime = page.getHtml().xpath("//div[@class='article_r']/span[@class='link_postdate']/text()").toString();
            page.putField("title", title);
            page.putField("createTime", createTime);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

}
           
package crawl.webmagic.csdn;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

/**
 * Crawls CSDN with WebMagic
 */
public class CsdnMain {

    public static Logger logger = LoggerFactory.getLogger(CsdnMain.class);

    public static void main(String[] args) {

        Spider.create(new CsdnProcessor())
              // Start crawling from this URL
              .addUrl("http://blog.csdn.net/" + CsdnProcessor.CSDN_URI)
              // Optionally set a Scheduler that manages the URL queue in files
              // .setScheduler(new FileCacheQueueScheduler("/Users/zhengyong/queue"))
              // Add a Pipeline that prints the results to the console
              .addPipeline(new ConsolePipeline())
              // Run 5 threads concurrently
              .thread(5)
              // Start the crawler
              .run();
    }
}
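
The two link patterns used in CsdnProcessor can be tried in isolation with plain java.util.regex, which may help when adapting them to another blog. The following is a minimal sketch rather than part of the crawler: the class name CsdnLinkPatterns, the helper extract and the sample links (including their numeric ids) are hypothetical, and it assumes that only the matched portion of each link is kept, which is how the regex selector appears to behave when the expression has no capturing group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsdnLinkPatterns {

    static final String CSDN_URI = "zhengyong15984285623";

    // Same expressions as in CsdnProcessor: list pages and article detail pages
    static final Pattern LIST_PAGE = Pattern.compile("/" + CSDN_URI + "/article/list/\\d*");
    static final Pattern DETAIL_PAGE = Pattern.compile("/" + CSDN_URI + "/article/details/\\d*");

    /** Returns the matched portion of every link that contains the pattern; other links are skipped. */
    static List<String> extract(Pattern pattern, List<String> links) {
        List<String> matches = new ArrayList<>();
        for (String link : links) {
            Matcher m = pattern.matcher(link);
            if (m.find()) {
                matches.add(m.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical links; the numeric ids are made up for illustration
        List<String> links = Arrays.asList(
                "http://blog.csdn.net/zhengyong15984285623/article/list/2",
                "http://blog.csdn.net/zhengyong15984285623/article/details/12345",
                "http://blog.csdn.net/other_user/article/details/67890");

        System.out.println("list pages:   " + extract(LIST_PAGE, links));
        System.out.println("detail pages: " + extract(DETAIL_PAGE, links));
    }
}
```

Links that belong to another user are dropped by both patterns, so the crawl stays within the one blog. The extracted values are relative paths; in the crawler they are handed to page.addTargetRequests.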
           
  • 3) Output
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   google guava Joiner 示列 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   elastic-job 构建 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   Google Guava Cache 示列 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CountDownLatch使用 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   ElasticSearch-学习笔记 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   kafka  版本 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   git配置用户名和邮箱 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   java身份证格式强校验 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   jdk动态代理例子 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   spring ioc 源码解析(二) 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   图解https 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   spring ioc 源码解析(一) 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   ClassPathXmlApplicationContext 源码解析 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   ThreadPoolExecutor源码分析 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   JVM底层又是如何实现synchronized 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   jstat查看内存 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   ThreadPoolExecutor 分析 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   java中volatile关键字的含义 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mysql 行锁 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   设计模式 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mysql 优化 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mysql 事务级别 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   https客户端证书.p12maven打包后tomcat启动不正确 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   java NIO详解 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   jvm学习 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   redis持久化机制 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   redis配置说明 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   dobbo配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mac 下安装wget 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   logback.xml文件配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   linux 启动或停止jar shell脚本 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mac下安装redis 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   google guava 测试 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   spring batch + spring boot 配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   google工具包 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   redis命令学习笔记 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mybatis 批量插入oracle与mysql 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   linux项目发布常见命令 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   http方式调用webservice 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   http post请求 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mac 配置maven环境变量 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   本地git上传至远程git仓库 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   http get请求 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   maven 提交oracle jar 包 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mybatis foreach 熟悉 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   maven 利用axis2插件配置webservice 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   使用 git 进行项目同步开发步骤 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   MQ开启密码访问平台服务步骤 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   mac 文件显示与隐藏 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   jvm 内存回收finalize如何在垃圾清除前工作原理 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   maven项目修改jsp代码不用打包就能运行 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   观察者模式 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   postgre循环插入模拟数据 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   事件通知 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   maven批处理命令 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   java反射 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CAS单点登录配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   maven学习笔记 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2 表单标签属性 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   cas单点登录 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2 数据标签 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2注解 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2 控制标签 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   集合操作 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   异步线程池 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   字符串大写字母转下划线 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   OGNL 中的集合操作 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2中的数据校验文件配置方法 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   nginx 配置说明 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   使用BlockingQueue创建生产者消费者模式 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   spring mvc 线程安全问题说明 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   [解决异常] spring batch 报错 ORA-: 无法连续访问此事务处理 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   linux操作oracle命令 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   memcache查看数据命令 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   Python学习指南 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2结果类型 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2拦截器 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2输出国际信息 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   Action访问Servlet API 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2拦截器配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2标签 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   日志 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   快速排序算法 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   培训总结 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   struts2常量配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之 分配任务给一个"组的成员"GroupTask 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之"事件处理Event" 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之"分支聚合Join-Fork" 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之"自定义活动Custom" 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    CheckBox 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之Decision流程决策(判断活动执行方向) 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    RadioGroup 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    EditText 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    TextView 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    绝对定位布局管理器 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    布局管理器的嵌套 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    Button 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    相对布局管理器 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    表格布局管理器 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   Activity 初步 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    线性布局管理器 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    框架布局管理器 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   Android项目文件夹解析 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   AndroidManifest.xml解析 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   main.xml解析 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之根据流程变量分配任务Task 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流之状态活动State 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流的流转Transition 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流管理方法扩展 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流管理 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流执行变量Variable 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   -jbpm工作流实现 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   向字符串对象中追加replaceAll方法 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   获取登录用户IP地址 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   根据tree文件菜单的path,拼接文件夹路径 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   EasyUI的from表单,根据皮肤变换 改变表单颜色 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   初试.net使用ajax调用后台方法 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   myelipse 版本破解 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   .net 如何连接mysql 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   集合按某个属性排序 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   EasyUI+Struts2整合KindEditor 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   kaptcha验证码使用配置 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   定时器-----每天定时删除临时文件 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   html5绘图铺满整个屏幕 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   canvas像素化video 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   .net 当GridView编辑状态获取新值时,往往获取的是修改前的值。 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   HTML5 之 notifications 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   Html5 之 Google地图 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 动画之animation 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 过渡之transition 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 D 转换 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 D 转换 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 border-radius 属性 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   html5视频播放 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 border-image 属性 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   CSS3 box-shadow 属性 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   html5之绘图板 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   html5中canvas基本应用 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   html5之canvas动画 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   easyui扁平Json生成树形菜单 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    AutoCompleteTextView 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:   easyui动态创建一个dialog 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    Dialog 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    ScrollView 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    ListView 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    DatePicker 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    ImageView 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    TimePicker 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    ImageButton 
createTime: -- :
get page: http://blog.csdn.net/zhengyong15984285623/article/details/
title:    Spinner 
createTime: -- :
           

III. Notes

WebMagic prints its logs through log4j. A sample log4j.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

<log4j:configuration>

    <appender name="CONSOLE" class="org.apache.log4j.ConsoleAppender">
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern"
                   value="%d - %c -%-4r [%t] %-5p %x - %m%n" />
        </layout>

        <!-- Restrict the range of output levels -->
        <filter class="org.apache.log4j.varia.LevelRangeFilter">
            <param name="LevelMax" value="ERROR"/>
            <param name="LevelMin" value="TRACE"/>
        </filter>
    </appender>

    <appender name="FILE" class="org.apache.log4j.FileAppender">
        <param name="File" value="/Users/zhengyong/log/crawl.log"/>
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern"
                   value="%d - %c -%-4r [%t] %-5p %x - %m%n" />
        </layout>
    </appender>

    <root>
        <priority value="info" />
        <appender-ref ref="CONSOLE" />
        <appender-ref ref="FILE" />
    </root>


</log4j:configuration>
           

Load the configuration in the Java main method:

import org.apache.log4j.xml.DOMConfigurator;

DOMConfigurator.configure("/Users/zhengyong/crawl/log4j.xml"); // load the log4j.xml file