自定义分片策略解决大量小文件问题

@(Hadoop)

默认的TextInputFormat

应该都知道默认的TextInputFormat是一行行的读取文件内容，这对于一个或几个超大型的文件来说并没有什么问题，但是在实验读取大量小文件的时候，性能及其低下。

实验过程

分别有5个文件夹，每个文件夹下有不同数量（1-2千个）的小文件（10+k大小），总量大概有8k+个文件，使用CLI命令上传到HDFS就花费了一个多小时。

环境为本地电脑安装的伪分布式Hadoop集群，机器配置为四核I7的CPU，16G的RAM。

编写简单的worldCount程序，一切默认，放到集群上跑的时候出现以下情况：

1.启动的mapper总数量为8k+个！而一个节点能同时运行的mapper数量为4

2.整个map过程及其缓慢，50%跑了2h

2.CPU总用率高达80%，整个机器开始发出呲呲呲的声音

可见大量的小文件对mapreduce程序性能的影响有多大。

问题的根本所在

HDFS上的文件是按block来存储的。

如果一个文件很大，超出了一个block的设定，那么它就会被划分为多个block存储，mapreduce程序读取时，每个block都会对应输入一个mapper，所以大文件，默认的分片策略是可以hold住的。

但是如果是很多小文件的话，每个小文件存储的时候都会是一个block，即使它很小，远远达不到block大小（默认128M），HDFS还是会将其存储在一个block中，那么问题就来了，默认的分片策略在读取这些小文件的时候，每个block都会产生一个mapper，所以就有了上面程序中出现了8k+个mapper的情况。

解决方案

既然知道了问题所在，那么就可以指定对应的解决方案，无非就是从两点入手：

1.默认每个小文件对应一个block，那么可以采取压缩等手段将多个小文件进行合并存储，以达到每个block存储的内容都是足够大的。

2.修改mapreduce默认的分片策略，使得读取文件进行分片的时候让每个block可以对应多个小文件，而不再是仅仅一个小文件。

自定义的分片策略

关于Hadoop的InputFormat类，网上有很多详细介绍的文章，默认的分片策略使用的就是其TextInputFormat子类，这里将介绍另外一个子类：CombineFileInputFormat。

顾名思义，CombineFileInputFormat是用来将输入的小文件进行合并，然后输入到一个mapper中的策略。

这是一个抽象类，只实现了InputFomat接口的getSplit方法。（P.S.所有的分片策略都要继承InputFormat，并实现getSplit和createRecordReader两个方法）

既然我们需要用到CombineFileInputFormat，但他留了一个接口方法让我们实现，那么就可以自定义一个MyInputFormat类继承自CombineFileInputFormat，重写createRecordReader。

关于InputFormat接口的两个方法：

1.getSplit是从HDFS上读取文件，并形成逻辑的分片，在本文中，这个分片会包含多个小文件。

2.createRecordReader会创建一个RecordReader对象，用来读取getSplit产生的分片，mapper中的键值对就是这个RecordReader输出的。

之前讨论到的自定义MyInputFormat类实现分片策略，但是分片之后如何读取分片内的数据是createRecordReader方法创建的RecordReader对象决定的。

所以自定义分片策略的关键在于两点：

1.MyInputFormat类自定义分片策略

2.MyRecordReader类自定义读取分片内的数据

MyInputFormat

新建MyInputFormat类继承CombineFileInputFormat：

/**
 * 自定义的分片类,继承CombineFileInputFormat可以将多个小文件进行合并,泛型参数为输入map函数中的key,value类型
 */
public class MyInputFormat extends CombineFileInputFormat<Text, Text> {

    /**
     * 重写次方法,直接放回false,对所有文件都不进行切割,保持完整
     */
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    /**
     * 重写此方法,返回的CombineFileRecordReader为处理每个分片的recordReader,在构造函数中设置自定义的RecordReader对象
     */
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException {
        return new CombineFileRecordReader<Text, Text>((CombineFileSplit) inputSplit, taskAttemptContext, PaodingRecordReader.class);
    }
}

关于createRecordReader的细节将在MyRecordReader中详细讨论。

MyRecordReader

新建MyRecordReader类继承RecordReader：

/**
 * 自定义的RecordReader类,用来处理CombineFileInputSplit返回的每个分片
 * 泛型的参数为输入mapper时候的键值对
 */
public class MyRecordReader extends RecordReader<Text, Text> {

}

继承RecordReader需要实现RecordReader的6个抽象方法。

1.initialize

/**
* 初始化RecordReader的一些设置
* */
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

}

该方法会在InputFormat的createRecordReader创建一个RecordReader的时候调用，MyInputFormat中的代码：

@Override
    public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException {
        return new CombineFileRecordReader<Text, Text>((CombineFileSplit) inputSplit, taskAttemptContext, MyRecordReader.class);
    }

CombineFileRecordReader的构造函数有三个参数：

1.inputSplit：当前输入的分片

2.taskAttemptContext：当前输入的系统环境

3.MyRecordReader.class：自定义的RecordReader

这里只关心第三个参数是如何在CombineFileRecordReader中构造的，进入CombineFileRecordReader源码（只看注释部分即可）：

public CombineFileRecordReader(CombineFileSplit split, TaskAttemptContext context, Class<? extends RecordReader<K, V>> rrClass) throws IOException {
        this.split = split;
        this.context = context;
        this.idx = 0;
        this.curReader = null;
        this.progress = 0L;

        try {
            //在构造函数中，会对传进来的RecordReader子类进行检查，其构造函数必须满足constructorSignature的中的类型
            this.rrConstructor = rrClass.getDeclaredConstructor(constructorSignature);
            this.rrConstructor.setAccessible(true);
        } catch (Exception var5) {
            //不满足构造函数的要求则抛出异常
            throw new RuntimeException(rrClass.getName() + " does not have valid constructor", var5);
        }
        //满足要求之后就会进行RecordReader对象的初始化
        this.initNextRecordReader();
    }

constructorSignature的定义是这样子的：

static final Class[] constructorSignature = new Class[]{CombineFileSplit.class, TaskAttemptContext.class, Integer.class};

意思是，MyRecordReader的构造函数必须要有三个CombineFileSplit，TaskAttemptContext和Integer类型的参数。

构造函数检查通过之后就进行RecordReader对象的初始化阶段

initNextRecordReader的代码：

protected boolean initNextRecordReader() throws IOException {
        //如果curReader不为null，那么说明不是第一次初始化，该对象已经实例化过了，直接进行处理之后，继续执行下一步操作
        if(this.curReader != null) {
            this.curReader.close();
            this.curReader = null;
            if(this.idx > 0) {
                this.progress += this.split.getLength(this.idx - 1);
            }
        }
        //无论RecordReader对象有没有被初始化最终都会进行下列步骤的处理
        if(this.idx == this.split.getNumPaths()) {
            return false;
        } else {
            this.context.progress();

            try {
                Configuration e = this.context.getConfiguration();
                e.set("mapreduce.map.input.file", this.split.getPath(this.idx).toString());
                e.setLong("mapreduce.map.input.start", this.split.getOffset(this.idx));
                e.setLong("mapreduce.map.input.length", this.split.getLength(this.idx));
                //在这里就会调用MyRecordReader的构造函数！将当前的处理的分片，系统环境和当前处理的文件索引传入，并设置MyRecordReader为当前处理文件的RecordReader！
                this.curReader = (RecordReader)this.rrConstructor.newInstance(new Object[]{this.split, this.context, Integer.valueOf(this.idx)});
                if(this.idx > 0) {
                    //这里会调用MyRecordReader中的initialize方法
                    this.curReader.initialize(this.split, this.context);
                }
            } catch (Exception var2) {
                throw new RuntimeException(var2);
            }
            //处理结束后，当前处理的文件索引进行自增，以处理分片中的下一个文件
            ++this.idx;
            return true;
        }
    }

从源码上来看，每读取分片中的一个新文件，都会实例化一个MyRecordReader对象进行操作，对MyRecordReader对象的初始化工作可以在initialize和构造函数中进行，但是initialize方法需要当前处理的文件索引大于0时才调用，并且，MyRecordReader构造函数中多了一个记录当前处理的文件索引的参数index，所以我们需要使用MyRecordReader的构造函数来做一些事情：

/**
     * 构造函数必须的三个参数,自定义的InputFormat类每次读取新的分片时,都会实例化自定义的RecordReader类对象来对其进行读取
     *
     * @param combineFileSplit   当前读取的分片
     * @param taskAttemptContext 系统上下文环境
     * @param index              当前分片中处理的文件索引
     */
    public MyRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext taskAttemptContext, Integer index) {
        this.combineFileSplit = combineFileSplit;
        this.conf = taskAttemptContext.getConfiguration();
        this.currentIndex = index;
    }

2.nextKeyValue

由于我们在MyInputFormat的createRecordReader方法中返回的是CombineFileRecordReader对象，而不是直接返回MyRecordReader，mapper中获得键值对的方式就是调用当前处理的RecordReader对象的nextKeyValue方式，也就是CombineFileRecordReader。

CombineFileRecordReader中的nextKeyValue方法代码：

public boolean nextKeyValue() throws IOException, InterruptedException {
    do {
        //如果当前的RecordReader可以使用，那么就执行当前RecordReader的nextKeyValue
        if(this.curReader != null && this.curReader.nextKeyValue()) {
            return true;
        }
    //对分片中下一个要处理的文件进行RecordReader的初始化工作
    } while(this.initNextRecordReader());

    return false;
}

如果当前的RecordReader可以使用，那么就执行当前RecordReader的，而RecordReader的初始化工作刚刚已经通过CombineFileRecordReader的源码看到了，当前的RecordReader就是自定义的MyRecordReader。

所以mapper中获得键值对的逻辑主要是在MyRecordReader的nextKeyValue中实现的：

/**
     * 返回true就取出key和value,返回false就结束循环表示没有文件内容可读取了
     * 之后在父类的nextKeyValue方法中进行index的前移
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        //没被读取过的文件才进行读取
        if (!this.isReaded) {
            //根据当前的文件索引从当前分片中找到对应的文件路径
            Path path = this.combineFileSplit.getPath(this.currentIndex);
            //获取父目录名即为类别名
            this.currentKey.set(path.getParent().getName());
            //从当前分片中获得当前文件的长度
            byte[] content = new byte[(int) this.combineFileSplit.getLength(this.currentIndex)];
            try {
                //读取该文件内容
                FileSystem fs = path.getFileSystem(this.conf);
                this.inputStream = fs.open(path);
                this.inputStream.readFully(content);
            } catch (Exception ignored) {
            } finally {
                assert inputStream != null;
                inputStream.close();
            }
            this.currentValue.set(content);
            this.isReaded = true;
            return true;
        }
        return false;
    }

3.getCurrentKey和getCurrentValue

CombineFileRecordReader中的getCurrentKey和getCurrentValue：

public K getCurrentKey() throws IOException, InterruptedException {
        //直接调用当期RecordReader的getCurrentKey
        return this.curReader.getCurrentKey();
    }

public V getCurrentValue() throws IOException, InterruptedException {
        //直接调用当期RecordReader的getCurrentValue
        return this.curReader.getCurrentValue();
    }

自定义的MyRecordReader中：

/**
     * 返回当前key的方法
     */
    @Override
        public Text getCurrentKey() throws IOException, InterruptedException {
        return this.currentKey;
    }

    /**
     * 返回当前value的方法
     */
    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return this.currentValue;
    }

this.currentKey和this.currentValue已经在上面的nextKeyValue方法中得到了。

4.getProgress

返回当前分片的处理进度

/**
 * 返回当前的处理进度
 */
@Override
public float getProgress() throws IOException, InterruptedException {
    //获得当前分片中的总文件数
    int splitFileNum = this.combineFileSplit.getPaths().length;
    if (this.currentIndex >= 0 && this.currentIndex < splitFileNum) {
        //当前处理的文件索引除以文件总数得到处理的进度
        this.currentProgress = (float) this.currentIndex / splitFileNum;
        return this.currentProgress;
    }
    return this.currentProgress;
}

5.close

进行一些收尾工作

@Override
public void close() throws IOException {
    if (this.inputStream != null) {
        this.inputStream.close();
    }
}

完整的MyRecordReader类代码：

/**
 * Created by xiaohei on 16/2/29.
 * 自定义的RecordReader类,用来处理CombineFileInputSplit返回的每个分片
 */
public class PaodingRecordReader extends RecordReader<Text, Text> {

    private CombineFileSplit combineFileSplit;//当前处理的分片
    private Configuration conf;//系统信息
    private int currentIndex;//当前处理到第几个分片

    private Text currentKey = new Text();//当前key
    private Text currentValue = new Text();//当前value
    private boolean isReaded = false;//是否已经读取过了该分片
    private float currentProgress = 0;//当前读取进度

    private FSDataInputStream inputStream;//HDFS文件流读取

    /**
     * 构造函数必须的三个参数,自定义的InputFormat类每次读取新的分片时,都会实例化自定义的RecordReader类对象来对其进行读取
     *
     * @param combineFileSplit   当前读取的分片
     * @param taskAttemptContext 系统上下文环境
     * @param index              当前分片中处理的文件索引
     */
    public PaodingRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext taskAttemptContext, Integer index) {
        this.combineFileSplit = combineFileSplit;
        this.conf = taskAttemptContext.getConfiguration();
        this.currentIndex = index;
    }

    /**
     * 初始化RecordReader的一些设置
     * */
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

    }


    /**
     * 返回true就取出key和value,之后index前移,返回false就结束循环表示没有文件内容可读取了
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        //没被读取过的文件才进行读取
        if (!this.isReaded) {
            //根据当前的文件索引从当前分片中找到对应的文件路径
            Path path = this.combineFileSplit.getPath(this.currentIndex);
            //获取父目录名即为类别名
            this.currentKey.set(path.getParent().getName());
            //从当前分片中获得当前文件的长度
            byte[] content = new byte[(int) this.combineFileSplit.getLength(this.currentIndex)];
            try {
                //读取该文件内容
                FileSystem fs = path.getFileSystem(this.conf);
                this.inputStream = fs.open(path);
                this.inputStream.readFully(content);
            } catch (Exception ignored) {
            } finally {
                assert inputStream != null;
                inputStream.close();
            }
            this.currentValue.set(content);
            this.isReaded = true;
            return true;
        }
        return false;
    }

    /**
     * 返回当前key的方法
     */
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return this.currentKey;
    }

    /**
     * 返回当前value的方法
     */
    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return this.currentValue;
    }

    /**
     * 返回当前的处理进度
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        //获得当前分片中的总文件数
        int splitFileNum = this.combineFileSplit.getPaths().length;
        if (this.currentIndex >= 0 && this.currentIndex < splitFileNum) {
            //当前处理的文件索引除以文件总数得到处理的进度
            this.currentProgress = (float) this.currentIndex / splitFileNum;
            return this.currentProgress;
        }
        return this.currentProgress;
    }

    @Override
    public void close() throws IOException {
        if (this.inputStream != null) {
            this.inputStream.close();
        }
    }
}

在Job中设置自定义的分片策略

job.setsetInputFormatClass(MyInputFormat.class);

job会根据MyInputFormat的getSplit方法对数据进行分片，根据createRecordReader创建对分片的处理类。

自定义分片的实际场景使用Demo

Github源码地址

作者：

@小黑

自定义分片策略解决大量小文件问题自定义分片策略解决大量小文件问题

自定义分片策略解决大量小文件问题

默认的TextInputFormat

自定义的分片策略

MyInputFormat

MyRecordReader

在Job中设置自定义的分片策略

自定义分片的实际场景使用Demo

继续阅读

Windows下Cygwin环境的Hadoop安装（3）- 运行hadoop中的wordcount实例遇到的问题和解决方法

MapReduce运行Wordcount时一直卡在INFO mapreduce.Job: Running job，web查看一直处于accepted阶段

ubuntu hadoop2.6.1，terminal下运行wordcount

MapReduce(一)：入门级程序wordcount及其分析

hadoop操作遇到的问题问题一：输出文件已存在

Hadoop之运行wordcount

jdk1.7+Eclipse+Maven3.5+Hadoop2.7.3构建hadoop项目

Eclipse运行WordCount（详细版）相关连接Eclipse运行WordCount

BMP文件结构及图像每行字节计算方法

磁盘结构及在Linux中的命名

hadoop 用MR实现join操作

Centos7 下 Hadoop 2.6.4 分布式集群环境搭建摘要集群准备安装JDK 安装 Hadoop 2.6.4 部署 slaver1-slaver4 启动 hadoop 集群成功了

MapReduce的几个企业级经典面试案例MapReduce的几个企业级经典面试案例

ubuntu14.04下安装hbse1.0.1.1

User Defined Hadoop DataType

Ambari介绍和架构原理