Hadoop InputFormat

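In a MapReduce job, the Mapper's output types are the same as the Reducer's input types, because both describe the intermediate data; the Reducer's output is the final result we want. The InputFormat, along with the key and value types, therefore has to be configured appropriately for the job to run correctly. The key/value relationships across a job can be summarized as follows: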

map: <k1, v1> -> list(k2, v2)
combine: <k2, list(v2)> -> list(k2, v2)
reduce: <k2, list(v2)> -> list(k3, v3)
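
To make the type flow concrete, here is a minimal in-memory word-count sketch in plain Java. It does not use the Hadoop API; the class KeyValueFlowSketch and its methods are hypothetical names chosen for illustration, and the offset calculation assumes single-byte characters with '\n' line endings. A combiner would have the same input shape as reduce but would emit (k2, v2) pairs again.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the key/value flow; this is NOT Hadoop API code.
class KeyValueFlowSketch {

    // map: <k1, v1> -> list(k2, v2)
    // k1 = byte offset of the line (unused here), v1 = the line,
    // k2 = a word, v2 = the count 1.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Grouping step done by the framework between map and reduce:
    // collect every v2 under its k2.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: <k2, list(v2)> -> list(k3, v3); here k3 = word, v3 = total count.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new AbstractMap.SimpleEntry<>(word, sum);
    }

    // Runs map, shuffle and reduce over the given lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            intermediate.addAll(map(offset, line));
            offset += line.length() + 1; // assumption: 1 byte per char plus '\n'
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

Here k1/v1 are (Long, String), k2/v2 are (String, Integer), and k3/v3 are (String, Integer), which mirrors the three lines above.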

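Having to declare the intermediate and final data types explicitly may look odd. The reason is that Java generics are limited: because of type erasure, type information is not always available at run time, so Hadoop has to be told the types explicitly.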
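
Type erasure can be observed with a few lines of plain Java. The sketch below is only an illustration (TypeErasureSketch is a hypothetical name): two lists with different element types share the same runtime class, so a framework cannot recover the element type by inspecting the object.

```java
import java.util.ArrayList;
import java.util.List;

class TypeErasureSketch {

    // Returns true when both lists share one runtime class, which shows
    // that the element type parameter is not kept at run time.
    static boolean sameRuntimeClass() {
        List<String> texts = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return texts.getClass() == numbers.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass()); // prints true
    }
}
```

This is why a job configuration passes class objects (for example through calls such as setMapOutputKeyClass and setOutputValueClass) rather than relying on the generic parameters of the Mapper and Reducer.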

The InputFormat class hierarchy is shown in the following figure:

[Figure: Hadoop InputFormat class hierarchy]

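One more point: the default of a single reducer is convenient for beginners, but in real applications the number of reducers should be set to a larger value, otherwise the job runs very inefficiently. The number of reducers is bounded by the reduce capacity of the cluster, which is the number of nodes multiplied by the number of reduce slots per node. The per-node slot count is set by mapred.tasktracker.reduce.tasks.maximum.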

The following sections describe some commonly used InputFormats and how to use them.

FileInputFormat

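FileInputFormat is the base class of all InputFormats that use files as their data source. It provides two things: a definition of which files are included in a job's input, and an implementation that generates splits for the input files. Input files are split automatically, and the split size depends on mapred.min.split.size and mapred.max.split.size (set in mapred-site.xml) and on the block size. The split size is given by the following formula: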

split size = max(minimumSize, min(maximumSize, blockSize))
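
As a quick check of the formula, here is a minimal sketch in plain Java. The class and method names are hypothetical, and the sizes are illustrative values rather than settings read from a real cluster.

```java
class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    // Direct transcription of: max(minimumSize, min(maximumSize, blockSize))
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64 * MB; // assumed block size for this example

        // Small minimum, unbounded maximum: the block size wins.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, block) / MB);        // prints 64

        // Minimum larger than the block: splits grow to the minimum.
        System.out.println(computeSplitSize(128 * MB, Long.MAX_VALUE, block) / MB); // prints 128

        // Maximum smaller than the block: splits shrink to the maximum.
        System.out.println(computeSplitSize(1, 32 * MB, block) / MB);               // prints 32
    }
}
```

The second case is also the mechanism behind the "set the minimum size larger than the file" trick for avoiding splits: once the minimum exceeds the file size, the whole file fits in one split.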

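To prevent a file from being split, use either of the following two approaches; the second is recommended.

1) Set the minimum split size to a value larger than the file size.

2) Subclass FileInputFormat and override the isSplitable method to return false: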

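import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// TextInputFormat variant (old mapred API) that never splits a file,
// so each input file is processed whole by a single mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}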

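CombineFileInputFormat

CombineFileInputFormat is designed for jobs whose input consists of a large number of small files; it packs multiple files into a single split so that each mapper has more data to process.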

TextInputFormat

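(LongWritable, Text: the byte offset of the line, the contents of the line)

The default InputFormat. The key is the byte offset of the line within the source file, and the value is the contents of the line, excluding any line terminator (such as a newline or carriage return). For example, a file containing the following text: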

On the top of the Crumpetty Tree

The Quangle Wangle sat,

But his face you could not see,

On account of his Beaver Hat.

is presented as the following key-value pairs:

<0, On the top of the Crumpetty Tree>

<33, The Quangle Wangle sat,>

<57, But his face you could not see,>

<89, On account of his Beaver Hat.>
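
The offsets in the example can be reproduced with a short plain-Java sketch. It only mimics how the keys relate to the text and is not how TextInputFormat is implemented; LineOffsetSketch is a hypothetical name, and the sketch assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsetSketch {

    static final String POEM =
            "On the top of the Crumpetty Tree\n"
          + "The Quangle Wangle sat,\n"
          + "But his face you could not see,\n"
          + "On account of his Beaver Hat.\n";

    // Maps each line to the byte offset at which it starts in the content.
    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance by the line's byte length plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> e : records(POEM).entrySet()) {
            System.out.println("<" + e.getKey() + ", " + e.getValue() + ">");
        }
    }
}
```

Note that the keys are byte offsets, not line numbers: the second key is 33 because the first line is 32 bytes long plus one newline byte.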

KeyValueTextInputFormat

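Use this format when each line of the file is a key-value pair separated by a delimiter, such as a tab. For example, the output produced by Hadoop's default OutputFormat consists of one tab-separated key-value pair per line.

The separator can be specified through the key.value.separator.in.input.line property; the default is a tab character. In the example below, → represents a tab character. The following input: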

line1→On the top of the Crumpetty Tree

line2→The Quangle Wangle sat,

line3→But his face you could not see,

line4→On account of his Beaver Hat.

is presented as the following key-value pairs:

<line1, On the top of the Crumpetty Tree>

<line2, The Quangle Wangle sat,>

<line3, But his face you could not see,>

<line4, On account of his Beaver Hat.>
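
The splitting rule can be sketched in plain Java as follows. This is an illustration rather than Hadoop's implementation: KeyValueLineSketch and splitLine are hypothetical names, the split happens at the first occurrence of the separator, and treating a line without a separator as a key with an empty value is an assumption made for this sketch.

```java
import java.util.Arrays;

class KeyValueLineSketch {

    // Splits a line into {key, value} at the FIRST occurrence of the separator.
    // Assumption: a line without any separator becomes {line, ""}.
    static String[] splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitLine("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        // prints <line1, On the top of the Crumpetty Tree>

        // Later separators stay inside the value.
        System.out.println(Arrays.toString(splitLine("a,b,c", ','))); // prints [a, b,c]
    }
}
```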

NLineInputFormat

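Splits the source file by number of lines. N is the number of input lines each mapper receives, and it can be set with mapred.line.input.format.linespermap. For example, given the following input: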

On the top of the Crumpetty Tree

The Quangle Wangle sat,

But his face you could not see,

On account of his Beaver Hat.

if N is 2, one mapper receives the first two lines as key-value pairs:

<0, On the top of the Crumpetty Tree>

<33, The Quangle Wangle sat,>

and another mapper receives the last two lines:

<57, But his face you could not see,>

<89, On account of his Beaver Hat.>
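
The grouping can be sketched in plain Java as follows. This only imitates the result for the example and is not the NLineInputFormat implementation; NLineSketch and splitByLines are hypothetical names, and the sketch assumes UTF-8 text with '\n' line endings. The keys remain byte offsets, as with TextInputFormat.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class NLineSketch {

    static final String POEM =
            "On the top of the Crumpetty Tree\n"
          + "The Quangle Wangle sat,\n"
          + "But his face you could not see,\n"
          + "On account of his Beaver Hat.\n";

    // Groups "<offset, line>" records into chunks of linesPerSplit lines;
    // each inner list stands for the input of one mapper.
    static List<List<String>> splitByLines(String content, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            current.add("<" + offset + ", " + line + ">");
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
            if (current.size() == linesPerSplit) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // last split may hold fewer than N lines
        }
        return splits;
    }

    public static void main(String[] args) {
        List<List<String>> splits = splitByLines(POEM, 2);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("mapper " + i + ": " + splits.get(i));
        }
    }
}
```

With N set to 2 the four-line input yields two groups of two records each; with N set to 3 the sketch yields one group of three and a final group of one.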
