Hadoop's InputFormat and OutputFormat

2023-03-16 12:25:13

1. InputFormat overview:

(1) The InputFormat class: InputFormat describes and controls the data input of a MapReduce job.

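(2) InputSplit: represents the data assigned to a single map task. An InputSplit does not store the data itself; it stores the length of the split and an array of the locations where the data resides. How InputSplits are generated is determined by the InputFormat. The InputFormat's getSplits method produces the split information, which has two parts: InputSplit metadata and the raw InputSplit information. The metadata is used by the JobTracker to build the data structures for task locality; the raw information is used by each map task during initialization to find the data it has to process.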
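The idea that a split is only a description of the data (position, length, hosts) can be sketched without Hadoop on the classpath. The block below is a simplified model, not Hadoop's code: SplitSketch, its fields, the path and the host names are all hypothetical, and the real FileInputFormat applies further rules (block boundaries, minimum and maximum split sizes) that are left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a file-based InputSplit (hypothetical class, not a Hadoop type):
// it records where the data lives (path, offset, length, hosts), never the data itself.
final class SplitSketch {
    final String path;
    final long start;
    final long length;
    final String[] hosts;

    SplitSketch(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    // Cuts a file of fileLength bytes into consecutive splits of at most splitSize bytes.
    static List<SplitSketch> getSplits(String path, long fileLength, long splitSize, String[] hosts) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SplitSketch> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new SplitSketch(path, offset, length, hosts));
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] hosts = {"node1", "node2"}; // made-up host names
        for (SplitSketch s : getSplits("/data/input.txt", 300, 128, hosts)) {
            System.out.println(s.path + " start=" + s.start + " length=" + s.length
                    + " hosts=" + Arrays.toString(s.hosts));
        }
    }
}
```

With a 300-byte file and a 128-byte split size, this model yields three splits of 128, 128 and 44 bytes, one per map task.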

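(3) The data a map task processes has already been divided up by the InputFormat, which partitions the data set into InputSplits. The map task passes its split to the InputFormat, which calls getRecordReader to create a RecordReader (these method names belong to the old org.apache.hadoop.mapred API). The RecordReader then creates key and value objects with createKey and createValue and fills them on each call to next, yielding the <key,value> pairs that the map function consumes.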
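The createKey / createValue / next cycle can be modelled in plain Java as follows. This is a hedged sketch of the calling pattern only: LineReaderSketch, Offset and Line are hypothetical stand-ins (for a line-oriented RecordReader, LongWritable and Text), and the offsets are character positions in a string rather than byte positions in a file.

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free model of the old-API RecordReader contract (hypothetical class):
// createKey()/createValue() allocate reusable holders, next() refills them per record.
final class LineReaderSketch {
    // Mutable holders standing in for LongWritable and Text.
    static final class Offset { long value; }
    static final class Line { String value; }

    private final String data;
    private int pos = 0;

    LineReaderSketch(String data) {
        this.data = data;
    }

    Offset createKey() { return new Offset(); }

    Line createValue() { return new Line(); }

    // Stores the next line's start offset in key and its text in value;
    // returns false once the input is exhausted.
    boolean next(Offset key, Line value) {
        if (pos >= data.length()) {
            return false;
        }
        int end = data.indexOf('\n', pos);
        if (end < 0) {
            end = data.length();
        }
        key.value = pos;
        value.value = data.substring(pos, end);
        pos = end + 1;
        return true;
    }

    // Drives the reader the way a map task's loop would, formatting each pair as text.
    static List<String> readAll(String data) {
        LineReaderSketch reader = new LineReaderSketch(data);
        Offset key = reader.createKey();
        Line value = reader.createValue();
        List<String> records = new ArrayList<>();
        while (reader.next(key, value)) {
            records.add("<" + key.value + "," + value.value + ">");
        }
        return records;
    }

    public static void main(String[] args) {
        readAll("hello world\nfoo bar\n").forEach(System.out::println);
    }
}
```

Note that the same key and value objects are reused for every record, which is why the loop copies their contents out before calling next again.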

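(4) Hadoop ships with several predefined ways of converting different kinds of input data into <key,value> pairs that a map can process (you can also define your own). All of them are InputFormat implementations: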

*DBInputFormat

*DelegatingInputFormat

*FileInputFormat, with the subclasses CombineFileInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat and TextInputFormat.
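To make the difference between two of these formats concrete, the sketch below shows how the same line of text becomes a different <key,value> pair under TextInputFormat-style and KeyValueTextInputFormat-style rules. It is an illustration in plain Java, not Hadoop code: PairSketch and its method names are hypothetical, and the pairs are returned as two-element string arrays for simplicity.

```java
// Hadoop-free illustration (hypothetical class) of how two text formats derive a pair from one line.
final class PairSketch {
    // TextInputFormat style: the key is the line's offset in the file, the value is the whole line.
    static String[] textStyle(long offset, String line) {
        return new String[] {Long.toString(offset), line};
    }

    // KeyValueTextInputFormat style: split at the first separator (a tab by default);
    // if the line has no separator, the whole line is the key and the value is empty.
    static String[] keyValueStyle(String line, char separator) {
        int i = line.indexOf(separator);
        if (i < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, i), line.substring(i + 1)};
    }

    public static void main(String[] args) {
        String line = "user42\tclicked\thome"; // made-up sample record
        String[] text = textStyle(0L, line);
        String[] keyValue = keyValueStyle(line, '\t');
        System.out.println("TextInputFormat style:         <" + text[0] + "," + text[1] + ">");
        System.out.println("KeyValueTextInputFormat style: <" + keyValue[0] + "," + keyValue[1] + ">");
    }
}
```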

2. OutputFormat overview:

(1) The OutputFormat class: OutputFormat describes and controls the data output of a MapReduce job.

(2) What the MapReduce framework relies on OutputFormat to do:

*Validate the output specification of the job, e.g. check that the output directory doesn't already exist.

*Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
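Both duties can be mimicked on the local file system in a few lines. The following is a minimal sketch under stated assumptions, not Hadoop's implementation: OutputSketch and LineRecordWriter are hypothetical names, java.nio.file stands in for Hadoop's FileSystem, and the tab-separated line layout imitates what TextOutputFormat writes by default.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hadoop-free model (hypothetical class) of the two OutputFormat duties, on the local file system.
final class OutputSketch {
    // Duty 1: reject the job up front if the output directory already exists.
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    // Duty 2: provide a writer that stores each pair as "key<TAB>value" on its own line.
    static final class LineRecordWriter implements AutoCloseable {
        private final BufferedWriter out;

        LineRecordWriter(Path file) throws IOException {
            Files.createDirectories(file.getParent());
            out = Files.newBufferedWriter(file);
        }

        void write(String key, String value) throws IOException {
            out.write(key + "\t" + value);
            out.newLine();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("job").resolve("out");
        checkOutputSpecs(outputDir); // passes: the directory does not exist yet
        Path part = outputDir.resolve("part-00000");
        try (LineRecordWriter writer = new LineRecordWriter(part)) {
            writer.write("hadoop", "2");
            writer.write("mapreduce", "1");
        }
        System.out.println(Files.readAllLines(part));
    }
}
```

Calling checkOutputSpecs a second time on the same directory throws, which mirrors the "output directory already exists" failure that stops a job from overwriting earlier results.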
