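1. Overview

When transferring data, there are several ways to bulk load data into an HBase cluster, such as batch writes through the HBase API, bulk imports with Sqoop, or bulk imports with MapReduce. When the data volume is very large, these approaches can take a long time or consume a lot of HBase cluster resources (such as disk I/O and HBase handlers). In this post I will share how to use HBase BulkLoad to write massive amounts of data into an HBase cluster in bulk.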

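2. Content

Before using BulkLoad, let's first look at how HBase stores data. HBase uses HDFS as its underlying storage medium. Each HBase table corresponds to a directory on HDFS named after the table (if no namespace is used, the table lives under the default directory). The table directory contains several directories named after Regions, each column family within a Region directory is also stored as a directory, and each column family directory holds the actual data in the form of HFiles. The path format is as follows: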

/hbase/data/default/<tbl_name>/<region_id>/<cf>/<hfile_id>
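
To make the layout concrete, here is a minimal sketch that assembles such a path from its parts. The class and method names (HFilePathSketch, hfilePath) are hypothetical helpers for illustration only and are not part of HBase; the sample values are taken from the RegionServer log line shown later in this post.

```java
// Illustrative only: HFilePathSketch and hfilePath are hypothetical names, not HBase APIs.
final class HFilePathSketch {

    // Joins the components in the order <root>/data/<namespace>/<table>/<region>/<cf>/<hfile>.
    static String hfilePath(String rootDir, String namespace, String table,
                            String regionId, String columnFamily, String hfileId) {
        return String.join("/", rootDir, "data", namespace, table, regionId, columnFamily, hfileId);
    }

    public static void main(String[] args) {
        // Sample values copied from the import log in section 2.4.2.
        System.out.println(hfilePath("/hbase", "default", "person",
                "2d7483d4abd6d20acdf16533a3fdf18f", "cf",
                "d72c8846327d42e2a00780ac2facf95b_SeqId_4_"));
    }
}
```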

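2.1 Implementation Principle

Because HBase stores its data on HDFS in HFile format, we can use MapReduce to generate data files in HFile format directly, and then have the RegionServers move the HFile data files into the corresponding Regions. The flow is shown in the figure below: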

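2.2 Generating HFiles

HFiles can be generated with MapReduce. Prepare the data source and upload it to HDFS, then read the data source from HDFS in the program, apply custom encapsulation, assemble the RowKey, and write the encapsulated data back to a specified HDFS directory in the form of HFiles. The implementation code is as follows: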

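import java.io.IOException;
import java.net.URI;
import java.util.Date;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Read the data source from HDFS and generate HFiles.
 * 
 * @author smartloli.
 *
 *         Created by Aug 19, 2018
 */
public class GemeratorHFile2 {
    static class HFileImportMapper2 extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        // Column family name; it must already exist in the target table.
        protected final String CF_KQ = "cf";

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("line : " + line);
            String[] datas = line.split(" ");
            if (datas.length < 3) {
                // Skip malformed lines instead of failing the task.
                return;
            }
            // RowKey = current timestamp + "_" + second field.
            String row = new Date().getTime() + "_" + datas[1];
            ImmutableBytesWritable rowkey = new ImmutableBytesWritable(Bytes.toBytes(row));
            // The second field is used as the column qualifier, the third as the cell value.
            KeyValue kv = new KeyValue(Bytes.toBytes(row), Bytes.toBytes(CF_KQ), Bytes.toBytes(datas[1]), Bytes.toBytes(datas[2]));
            context.write(rowkey, kv);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("<Usage>Please input hbase-site.xml path.</Usage>");
            return;
        }
        Configuration conf = new Configuration();
        conf.addResource(new Path(args[0]));
        conf.set("hbase.fs.tmp.dir", "partitions_" + UUID.randomUUID());
        String tableName = "person";
        String input = "hdfs://nna:9000/tmp/person.txt";
        String output = "hdfs://nna:9000/tmp/pres";
        System.out.println("table : " + tableName);
        try {
            // Remove any previous output. The FileSystem is not closed here because
            // FileSystem.get() returns a cached, shared instance that the job still uses.
            FileSystem fs = FileSystem.get(URI.create(output), conf);
            fs.delete(new Path(output), true);
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        TableName tn = TableName.valueOf(tableName);
        // try-with-resources closes the connection, table and region locator when the job ends.
        try (Connection conn = ConnectionFactory.createConnection(conf);
                Table table = conn.getTable(tn);
                RegionLocator locator = conn.getRegionLocator(tn)) {
            Job job = Job.getInstance(conf);
            job.setJobName("Generate HFile");

            job.setJarByClass(GemeratorHFile2.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(HFileImportMapper2.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            FileInputFormat.setInputPaths(job, input);
            FileOutputFormat.setOutputPath(job, new Path(output));

            // Configures the reducer, partitioner and output format so that the
            // generated HFiles line up with the table's current Region boundaries.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}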

Prepare the following data source at /tmp/person.txt on HDFS:

1 smartloli 100
2 smartloli2 101
3 smartloli3 102
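
To see how one of these lines turns into an HBase cell, the following minimal sketch mirrors the mapper's parsing logic in plain Java, without any Hadoop or HBase dependencies. PersonLineSketch and parse are hypothetical names used only for illustration. One assumption differs from the mapper: the timestamp is passed in as a parameter rather than read from new Date().getTime(), so that the result is reproducible.

```java
// Illustrative only: PersonLineSketch is a hypothetical helper, not part of the job above.
final class PersonLineSketch {
    final String rowKey;
    final String qualifier;
    final String value;

    private PersonLineSketch(String rowKey, String qualifier, String value) {
        this.rowKey = rowKey;
        this.qualifier = qualifier;
        this.value = value;
    }

    // Splits a line on single spaces, as the mapper does, and derives the cell parts:
    // row key = <timestamp>_<second field>, qualifier = second field, value = third field.
    static PersonLineSketch parse(String line, long timestampMillis) {
        String[] datas = line.split(" ");
        if (datas.length < 3) {
            throw new IllegalArgumentException("Expected 3 space-separated fields: " + line);
        }
        return new PersonLineSketch(timestampMillis + "_" + datas[1], datas[1], datas[2]);
    }

    public static void main(String[] args) {
        PersonLineSketch cell = parse("1 smartloli 100", 1534667434969L);
        // prints: 1534667434969_smartloli -> cf:smartloli = 100
        System.out.println(cell.rowKey + " -> cf:" + cell.qualifier + " = " + cell.value);
    }
}
```

Note that the first field of each line is not used, and that the same field serves as both part of the row key and the column qualifier.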

Then compile and package the GemeratorHFile2 code above into a JAR, upload it to the Hadoop cluster, and run it with the following command:

hadoop jar GemeratorHFile2.jar /data/soft/new/apps/hbaseapp/hbase-site.xml

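If a class-not-found exception appears while running the command, the HBase dependency JARs may not have been loaded locally. Configure the following environment variable for the current user: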

export HADOOP_CLASSPATH=$HBASE_HOME/lib/*:classpath

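Then run the source command so that the configuration takes effect immediately.

2.3 Execution Preview

After the job is submitted successfully, the Linux console prints the job's progress. You can also check the progress on the YARN resource monitoring page. The result is shown below:

Wait for the job to finish. Once it completes, the corresponding HFile data files are generated under the HDFS output path, as shown in the figure below: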

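2.4 Importing into HBase with BulkLoad

Next, use BulkLoad to import the generated HFiles into the HBase cluster. There are two ways to do this: one is to write code that performs the import, and the other is to use the HBase command.

2.4.1 Importing with Code

The import is implemented with the LoadIncrementalHFiles class. The code is as follows: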

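import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

/**
* Use BulkLoad to import HFiles from HDFS into HBase.
* 
* @author smartloli.
*
* Created by Aug 19, 2018
*/
public class BulkLoad2HBase {

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("<Usage>Please input hbase-site.xml path.</Usage>");
            return;
        }
        String output = "hdfs://cluster1/tmp/pres";
        Configuration conf = new Configuration();
        conf.addResource(new Path(args[0]));
        TableName tableName = TableName.valueOf("person");
        // Obtain the table through a Connection (instead of the deprecated HTable constructor) and close everything afterwards.
        try (Connection conn = ConnectionFactory.createConnection(conf);
                Admin admin = conn.getAdmin();
                Table table = conn.getTable(tableName);
                RegionLocator locator = conn.getRegionLocator(tableName)) {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path(output), admin, table, locator);
        }
    }

}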

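Running the code above produces the following result:

2.4.2 Importing with the HBase Command

First migrate the generated HFiles to the target cluster (that is, the HDFS on which the HBase cluster runs), then import them with the HBase command:

# First migrate the HFiles with distcp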
hadoop distcp -Dmapreduce.job.queuename=queue_1024_01 -update -skipcrccheck -m 10 /tmp/pres hdfs://nns:9000/tmp/pres

# Import the data with bulkload
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/pres person

Finally, we can check the import log on the relevant RegionServer node. A successful import is logged as follows:

2018-08-19 16:30:34,969 INFO  [B.defaultRpcServer.handler=7,queue=1,port=16020] regionserver.HStore: Successfully loaded store file hdfs://cluster1/tmp/pres/cf/7b455535f660444695589edf509935e9 into store cf (new location: hdfs://cluster1/hbase/data/default/person/2d7483d4abd6d20acdf16533a3fdf18f/cf/d72c8846327d42e2a00780ac2facf95b_SeqId_4_)

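2.5 Verification

After importing the data with BulkLoad, you can log in to the HBase cluster and use HBase Shell to check whether the import succeeded. A preview of the result is shown below:

3. Summary

To demonstrate the process hands-on, this post split generating the HFiles and importing them into the HBase cluster with BulkLoad into separate steps. In practice, the two steps can be merged into one, so that the HFiles are generated and imported automatically. If a RpcRetryingCaller exception occurs during execution, check the log on the corresponding RegionServer node, which records the detailed cause of the exception.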
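
One possible way to chain the two steps is sketched below in plain Java. This is only an illustration of the control flow, not a full implementation: BulkLoadPipelineSketch is a hypothetical class, and the two parameters are stand-ins. In a real program, generateHFiles would wrap job.waitForCompletion(true) from section 2.2 and bulkLoad would wrap the doBulkLoad call from section 2.4.1.

```java
import java.util.function.BooleanSupplier;

// Illustrative only: a hypothetical wrapper showing the order of the two steps.
final class BulkLoadPipelineSketch {

    // Runs the HFile generation step first and triggers the bulk load only if it succeeded.
    // Returns true when both steps ran, false when generation failed and the load was skipped.
    static boolean run(BooleanSupplier generateHFiles, Runnable bulkLoad) {
        if (!generateHFiles.getAsBoolean()) {
            return false;
        }
        bulkLoad.run();
        return true;
    }

    public static void main(String[] args) {
        // Stand-in steps; replace them with the real MapReduce job and bulk load calls.
        boolean loaded = run(
                () -> { System.out.println("generate HFiles"); return true; },
                () -> System.out.println("bulk load HFiles"));
        System.out.println("loaded = " + loaded);
    }
}
```

Skipping the load when the job fails avoids importing a partially written output directory.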

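4. Closing Remarks

That's all for this post. If you run into any problems while studying this topic, you can join the group to discuss them or send me an email, and I will do my best to answer. Let's encourage each other!

In addition, the blogger has published a book, 《Hadoop大資料挖掘從入門到進階實戰》. If you are interested, you can buy it through the purchase link in the announcement board. Thank you for your support.

Contact:

Email: [email protected]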

Twitter：https://twitter.com/smartloli

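QQ group (Hadoop - 交流社群1): 424769183

QQ group (Kafka并不難學): 825943084

Note: when requesting to join a group, please state your reason (name + company/school) to make review easier for the administrators. Thank you!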

HBase BulkLoad in Practice: Bulk Writing Data

Author: 哥不是小蘿莉

Source: http://www.cnblogs.com/smartloli/

Please credit the source when reposting. Thank you for your cooperation!