Hadoop Hive

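I. Hive:

1. Definition: Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users who are familiar with SQL query the data. The language also lets developers who are familiar with MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.

Hive is a Hadoop-based data warehouse tool that maps structured data files to database tables and provides simple SQL querying by translating SQL statements into MapReduce jobs. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse.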
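To make the "SQL becomes a MapReduce job" idea concrete, here is a minimal plain-Java sketch of the two phases behind a query such as `select country, count(*) from tab_ip_ext group by country`. It is only an illustration under simplifying assumptions, not the plan Hive actually generates: the class name `GroupByCountSketch`, its methods, and the sample rows are all made up for this example, and everything runs in one process rather than on a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: GroupByCountSketch is a hypothetical name, not a Hive class.
class GroupByCountSketch {

    // Map phase: parse one comma-delimited row (id,name,ip,country)
    // and emit the pair (country, 1).
    static Map.Entry<String, Long> map(String row) {
        String[] fields = row.split(",");
        return new SimpleEntry<>(fields[3], 1L);
    }

    // Shuffle + reduce phase: group the emitted pairs by key and sum the values.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    // Runs both phases over an in-memory list of rows.
    static Map<String, Long> run(List<String> rows) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String row : rows) {
            pairs.add(map(row));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Made-up sample rows in the tab_ip_ext layout.
        List<String> rows = Arrays.asList(
                "1,alice,10.0.0.1,CN",
                "2,bob,10.0.0.2,US",
                "3,carol,10.0.0.3,CN");
        System.out.println(run(rows)); // prints {CN=2, US=1}
    }
}
```

With Hive, the single SQL statement replaces all of this hand-written code, which is the low learning cost described above.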

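2. Features:

(1) Hive was released by Facebook, mainly so that engineers who do not know Java can use SQL to drive a Hadoop cluster for multidimensional analysis of distributed data. You can even operate Hive directly through a web interface.

(2) At its core, Hive translates its own SQL dialect, HQL, into MapReduce code and hands that code to the Hadoop cluster for execution. In other words, Hive itself is standalone, single-machine software; the distributed work is done by Hadoop.

(3) Because business requirements are met by writing SQL, Hive is much simpler and more flexible than programming MapReduce directly, and it adapts easily to changing business needs and scenarios.

(4) Hive is used at almost every company that works with big data.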

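II. Table creation statements:

1. Managed tables, also called internal tables (a file loaded from HDFS is moved into the table directory; dropping a managed table deletes both the metadata and the table data):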

CREATE TABLE tbl_t1(id int, userid bigint,
name string, referrer_url string,
ip string comment 'IP Address of the User')
comment 'This is the page view table'
row format delimited
fields terminated by '\t';

2. External tables (the HDFS files do not need to be moved into the table directory; dropping an external table deletes only the metadata):

CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
ip STRING,
country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/hive';

3. CREATE TABLE AS SELECT (useful for creating temporary tables that hold intermediate results):

CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip, country new_country
FROM tab_ip_ext;

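4. Creating a partitioned table (how to choose between an ordinary table and a partitioned table: use a partitioned table when large volumes of data keep being added):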

create table book (id bigint, name string) partitioned by (pubdate string) row format delimited fields terminated by '\t';

III. Loading data:

1. Loading data from the local file system:

hive> load data local inpath '/home/hadoop/test.log' into table tbl_t1;

2. Loading data that is already on HDFS:

hive> load data inpath '/tt.log' into table tbl_t1;

3. Loading data into a partitioned table:

hive> load data local inpath './book.txt' overwrite into table book partition (pubdate='2016-08-22');

4. Loading data with the hadoop command:

hadoop fs -put book1.txt /usr/hive/test.db/book/

IV. Partitioned tables:

hive> create table tab_u_part(id int, name string, ip string, country string)
    > partitioned by (month string)
    > row format delimited fields terminated by ',';

hive> load data local inpath '/home/hadoop/u.txt' overwrite into table tab_u_part
    > partition(month='201608');

hive> select * from tab_u_part;

hive> select * from tab_u_part where month='201608';

hive> select count(*) from tab_u_part where month='201608';
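The reason a filter such as `where month='201608'` is cheap is that Hive stores each partition in its own subdirectory named `<column>=<value>` under the table directory, so only the matching directories have to be read. The following plain-Java sketch illustrates that layout. The class `PartitionPathSketch`, its method names, the table directory `/user/hive/warehouse/tab_u_part`, and the `201607` partition are assumptions for this example (the real directory depends on the warehouse configuration), and the sketch ignores the escaping Hive applies to special characters in partition values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: PartitionPathSketch is a hypothetical helper, not a Hive API.
class PartitionPathSketch {

    // Builds "<tableDir>/<col1>=<val1>/<col2>=<val2>..." in partition-column order.
    static String partitionDir(String tableDir, Map<String, String> spec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : spec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    // Keeps only the directories that contain the path segment "<column>=<value>",
    // which mimics skipping partitions that cannot match the WHERE clause.
    static List<String> prune(List<String> partitionDirs, String column, String value) {
        String wanted = column + "=" + value;
        List<String> kept = new ArrayList<>();
        for (String dir : partitionDirs) {
            if (Arrays.asList(dir.split("/")).contains(wanted)) {
                kept.add(dir);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String tableDir = "/user/hive/warehouse/tab_u_part"; // assumed location
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("month", "201608");
        System.out.println(partitionDir(tableDir, spec));

        List<String> all = Arrays.asList(
                tableDir + "/month=201607",
                tableDir + "/month=201608");
        System.out.println(prune(all, "month", "201608"));
    }
}
```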

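V. Writing SELECT results to HDFS or to the local file system (in both cases the target path is a directory, and Hive overwrites its contents with the result files):

1. Local:

insert overwrite local directory '/home/hivetemp/test.txt' select * from tab_u_part where month='201608';

2. HDFS: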

insert overwrite directory '/hiveout.txt' select * from tab_u_part where month='201608';

VI. Client shell:

hive -S -e 'select country,count(*) from dbName.tab_ext group by country' > /home/hadoop/hivetemp/e.txt

With this execution mechanism, we can use a scripting language (bash shell, Python) to run HQL statements in batches.
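As a rough sketch of such batch execution, the snippet below builds one `hive -S -e` command line per month and, optionally, runs it with the output redirected to a file. It is written in Java only to stay consistent with the UDF code in this article; a bash or Python script would follow the same pattern. The class `HiveBatchSketch`, its methods, and the list of months are assumptions for illustration, and `run` only works on a machine where the `hive` executable is on the PATH.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: HiveBatchSketch is a hypothetical helper class.
class HiveBatchSketch {

    // Builds the argument vector for: hive -S -e "<hql>"
    static List<String> command(String hql) {
        return Arrays.asList("hive", "-S", "-e", hql);
    }

    // One count query per month partition of tab_u_part.
    static List<List<String>> monthlyCounts(List<String> months) {
        List<List<String>> commands = new ArrayList<>();
        for (String month : months) {
            commands.add(command(
                    "select count(*) from tab_u_part where month='" + month + "'"));
        }
        return commands;
    }

    // Runs one command and writes its standard output to the given file.
    // Requires the hive CLI to be installed; returns the process exit code.
    static int run(List<String> cmd, File out) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .redirectOutput(out)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Dry run: print the commands instead of executing them.
        for (List<String> cmd : monthlyCounts(Arrays.asList("201607", "201608"))) {
            System.out.println(String.join(" ", cmd));
        }
    }
}
```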

VII. User-defined functions (UDF):

1. Sample data:

13500000001 212 123

13600000001 855 836

13700000001 123 563

13800000001 853 120

13900000001 785 563

create table tab_flow(phone string,uflow int,dflow int)
row format delimited fields terminated by ' ';

load data local inpath '/home/f.data' into table tab_flow;

The goal is to turn the data above into:

13500000001 shanghai 212 123

13600000001 beijing 855 836

13700000001 tianjin 123 563

13800000001 chongqing 853 120

13900000001 shenzhen 785 563

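2. The Hive user-defined function is as follows:

package cn.hive;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PhoneNbToArea extends UDF {

    // Maps the first four digits of a phone number to an area name
    private static HashMap<String, String> areaMap = new HashMap<String, String>();

    static {
        areaMap.put("1350", "shanghai");
        areaMap.put("1360", "beijing");
        areaMap.put("1370", "tianjin");
        areaMap.put("1380", "chongqing");
        areaMap.put("1390", "shenzhen");
    }

    // evaluate() must be declared public so that Hive can call it
    public String evaluate(String pnb) {
        if (pnb == null) {
            return null; // pass NULL column values through instead of throwing
        }
        // Look up the 4-digit prefix; numbers shorter than 4 characters fall through to "other"
        String area = pnb.length() < 4 ? null : areaMap.get(pnb.substring(0, 4));
        return pnb + " " + (area == null ? "other" : area);
    }
}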

3. Add the jar that contains the user-defined function to Hive, and create a function bound to the class in that jar:

hive> add jar /home/hadoop/myudf.jar;

hive> CREATE TEMPORARY FUNCTION getArea AS 'cn.hive.PhoneNbToArea';

hive> select getArea(phone),uflow,dflow from tab_flow;
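Before packaging the jar, it can help to check the lookup logic without a Hive installation. The sketch below copies the prefix-to-area mapping into a plain class with no Hive dependency so that it can be run directly. `PhoneAreaLookup` and `lookup` are hypothetical names introduced for this example, and the null and short-input handling is an assumption about the desired behaviour rather than something Hive requires.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Hive-free copy of the UDF's lookup logic for local testing.
class PhoneAreaLookup {

    private static final Map<String, String> AREA_MAP = new HashMap<>();

    static {
        AREA_MAP.put("1350", "shanghai");
        AREA_MAP.put("1360", "beijing");
        AREA_MAP.put("1370", "tianjin");
        AREA_MAP.put("1380", "chongqing");
        AREA_MAP.put("1390", "shenzhen");
    }

    // Returns "<phone> <area>", using "other" when the 4-digit prefix is unknown.
    static String lookup(String pnb) {
        if (pnb == null) {
            return null; // assumed behaviour: NULL in, NULL out
        }
        String area = pnb.length() < 4 ? null : AREA_MAP.get(pnb.substring(0, 4));
        return pnb + " " + (area == null ? "other" : area);
    }

    public static void main(String[] args) {
        System.out.println(lookup("13500000001")); // prints 13500000001 shanghai
        System.out.println(lookup("13100000001")); // prints 13100000001 other
    }
}
```

Note that the function returns the phone number and the area together as a single string, so `getArea(phone)` yields one column in the query result.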

This article is reposted from lzf05303774's 51CTO blog. Original link: http://blog.51cto.com/8757576/1839302. To republish it, please contact the original author.
