Hive基本操作

文章目錄

- - DDL
  - - 1.資料庫操作
    - - 1.1建庫
      - 1.2切換庫
      - 1.3查詢正在使用的庫
      - 1.4查詢庫清單
      - 1.5查詢庫的較長的描述資訊
      - 1.6删除資料庫
    - 2.表操作
    - - 2.1建表語句
      - 2.1.1建立一個内部表
        
        2.1.2建立一個外部表
        
        2.1.3建立一個分區表
        
        2.1.4建立一個分桶表
        
        2.1.5建立一個複制表
        
        2.1.6ctas建表
      - 2.2檢視表的詳細資訊
      - 2.3删除表
      - 2.4檢視表清單
      - 2.5修改表
      - 2.5.1修改表名
        
        2.5.2修改表的列資訊
        
        2.5.3修改表的分區資訊( 針對分區表)
  - DML
  - - 1.資料導入
    - - 1.1load 資料加載
      - 1.1.1從本地加載資料
        
        1.1.2從hdfs加載資料
      - 1.2資料插入
      - 1.2.1單條資料插入
        
        1.2.2單重資料插入
        
        1.2.3多重資料插入
      - 1.3分區表的資料加載
      - 1.3.1靜态分區加載
        
        1.3.2動态分區加載
      - 1.4案例補充
      - 1.5分桶表的資料加載
  - 查詢操作
  - - 1.join
    - - 1.1内連接配接
      - 1.2外連接配接
      - 1.2.1左外連接配接
        
        1.2.2右外連接配接
        
        1.2.3全外連接配接
      - 1.3半連接配接
  - 2.group by 分組
  - 3.where 和 having
  - 4.order by
  - - 4.1order by 全局排序
    - 4.2sort by 局部排序
    - 4.3distribute by 指定字段
  - 5.hive的資料導出
  - - 5.1單重導出
    - 5.2多重導出

資料定義語言（data-define-lauguage ）

create database if not exists dbname;
if not exists 防止報錯

use dbname;

select current_database();

show databases;
show databases like "*test";

desc database dbname;

drop database if exists dbname;  預設隻能删除空資料庫中沒有表的
drop database dbname cascade;  級聯删除
drop database dbname;=== drop database dbname restrict;

案例

95002,劉晨,女,19,IS

建立資料庫

create database if not exists test_hive;
use test_hive;

create table if not exists stu_managed(sid int,name string,sex string,age int,dept string) row format delimited fields terminated by ",";

create external table if not exists stu_external(sid int,name string,sex string,age int,dept string) row format delimited fields terminated by ",";

分區字段   dept 
create table if not exists stu_ptn_dept(sid int,name string,sex string,age int) partitioned by (dept string) row format delimited fields terminated by "," location "/user/hivedata/stu_ptn";

分區字段 sex
create table if not exists stu_ptn_sex(sid int,name string,age int,dept string) partitioned by (sex string) row format delimited fields terminated by ",";

分桶字段  age    分桶個數  3
create table if not exists stu_buk(sid int,name string,sex string,age int,dept string) clustered by (age) sorted by(sid) into 3 buckets row format delimited fields terminated by ",";

僅僅複制表結構，不複制表資料和表屬性
like 
create table tbname like tbname1;
create external table stu_copy like stu_buk;
注意：表屬性取決于自己建表時候是否指定(external)的

create table tbname as select ....
将一個sql的查詢結果存儲在一個表中
create table tbname as select * from stu_managed;

desc tbname;
desc extended tbname;檢視表的擴充資訊
desc extended stu_managed;
desc formatted tbname; 格式化顯示表的詳細資訊
desc formatted stu_managed;

drop tbname;

show tables;
show tables in dbname;  指定資料庫下的所有表
show tables in test_test_hive;
show tables like "stu*";

中繼資料庫
alter table tbname rename to tbname_new;
alter table stu_copy01 rename to stu_copy;

添加列

alter table tbname add columns(col type);
alter table stu_test add columns(family string);

修改列

alter table tbname change old_col new_col type;
修改列名
alter table stu_test change family fa string;
修改列的類型
隻支援小->大	不支援大->小
int--> string 
alter table stu_test change sid sid string;
string ---> int 
alter table stu_test change sid sid int;

替換列

alter table tbname replace columns (col type);
alter table stu_test replace columns (name string);

stu_ptn_dept 分區字段，分區表相當于對原始表，字段分成兩部分存儲。一部分建表字段，另一部分是分區字段，分區字段中分區字段代表的是分區标志。

添加分區，指定一個分區字段對應的值

alter table tbname add partition(需要添加的分區);
alter table stu_ptn_dept add partition(dept="CS");  這個分區存儲都是dept 為CS的資料
alter table tbname add partition(需要添加的分區) partition(分區) partition(分區);
alter table stu_ptn_dept add partition(dept="IS") partition(dept="MA");

删除分區

alter table tbname drop partition(分區名);
alter table stu_ptn_dept drop partition(dept="CS");

修改分區

修改分區的存儲路徑 
    1.添加分區的時候需要指定一個新的路徑
    預設分區路徑在表路徑下
    alter table stu_ptn_dept add partition(dept="CS") location "/user/ptn/data/dept";
    2.修改已經添加過的分區的路徑 set location 
    alter table stu_ptn_dept partition(dept="CS") set location "/user/hivedata/stu_ptn"; 
    表名後 partition 指定需要操作的分區

查詢某一個表的分區

show partitions tbname; 查詢某一個表的所有分區

清空表

truncate table tbname;
清空表資料保留表結構,不适用外部表

檢視詳細建表語句

show create table table_name;
show create table stu_managed;

資料管理語言(data-manager-lauguage)

local 代表從本地加載資料，資料存儲在linux hive所在節點的本地的，不加local代表資料從hdfs加載的

load data local inpath "/home/hadoop/hive_data/student.txt" into table stu_managed;
相當于将本地檔案複制、上傳到hive表所在的路徑下

load data inpath "/hive_data" into table stu_managed;
相當于将hdfs的資料移動到 hive表所在的路徑下
hive資料加載的本質就是将檔案傳入hive的表所在的hdfs的路徑下
    
如果手動将一個資料直接上傳到hive的表路徑下hive是否可以解析到？
hadoop fs -put student.txt /user/hive/warehouse/test_hive.db/stu_managed/stu_sj
可以識别的   
換句話說hive 本質是可以解析表路徑下的所有資料的，隻要将資料放在表路徑下就都可以解析到。hive相當于hdfs的一個路徑的使用者。

效率很低
insert into table tbname values();
insert into table stu_managed values(1,"zs","f",45,"CS");

執行一次插入操作，插入多條資料，将一個表的查詢結果插入到另一個表中
insert into table tbname select .....
insert into table stu_external select * from stu_managed;

對原始資料表掃描一次，可以插入到多個表中或一個表的多個分區中
需求：
将stu_managed   age <=18    stu01   age <=19  stu02  剩下 stu03 
from tbname 
insert into table tbname01 select ... where .....
insert into table tbname02 select .... where .... 

from stu_managed 
insert into table stu01 select * where age <= 18 
insert into table stu02 select * where age>18 and age<=19 
insert into table stu03 select * where age>19;
多重>單重>單條

stu_ptn_dept dept String hdfs://test_hive/user/hivedata/stu_ptn

stu_ptn_sex sex string 預設路徑

加載資料的時候手動靜态指定分區值,手動添加一個分區
alter table stu_ptn_dept add partition(dept="CS");

load方式

load data local inpath "/home/hadoop/hive_data/student.txt" into table stu_ptn_dept partition(dept="CS");
注意： 使用load方式在進行資料加載的時候不會進行資料過濾,一定要十分确定的情況下才是用這種方式

insert方式

從一個非分區表中查詢資料然後插入到分區表中是用來資料過濾的
單重資料插入 
insert into stu_ptn_dept partition(dept="CS") select sid,name,sex,age from stu_managed01 where dept="CS";
多重 
from stu_managed01 
insert into stu_ptn_dept partition(dept="IS") select sid,name,sex,age where dept="IS" 
insert into stu_ptn_dept partition(dept="MA") select sid,name,sex,age where dept="MA"; 


from stu_managed01 
    insert into stu_ptn_sex partition(sex="man") select sid,name,age,dept where sex='男' 
    insert into stu_ptn_sex partition(sex="women") select sid,name,age,dept where sex='女';

靜态分區加載資料缺陷：必須足夠了解資料知道每一個分區名，不适用分區很多的時候。是以需要動态分區，根據資料自動生成分區不需要自己制定分區名。動态分區不支援 load方式，隻能使用insert方式。

單重插入

insert into table tbname partition(分區字段) select ...需要查詢分區字段分區字段放在最後

insert into table stu_ptn_dept partition(dept) select sid,name,sex,age,dept 
from stu_managed01;

hive中預設的動态分區是關閉的要進行更改
set hive.exec.dynamic.partition.mode=nonstrict;

日期  /year=2019/month=12/day=
多級分區 分區字段有多個 >=2
partitioned by (dept string,age int)
多級分區    
    字段靠前  進階分區
    字段靠後  低級分區
字段靠前的  類似  一級目錄  字段靠後的  類似 子目錄
目錄結構
/stu_ptn/dept="CS"/age=17
/stu_ptn/dept="CS"/age=18
/stu_ptn/dept="CS"/age=19

/stu_ptn/dept="IS"/age=19
/stu_ptn/dept="IS"/age=20
建表：
create table if not exists stu_ptn_dj(sid int,name string,sex string) partitioned by (dept string,age int) row format delimited fields terminated by ",";

資料插入：
靜态 
insert into table stu_ptn_dj partition(dept="CS",age=18) select sid,name,sex from stu_managed01 where dept="CS" and age=18;

一動一靜
靜态
insert into table stu_ptn_dj partition(dept="MA",age) select sid,name,sex,age from stu_managed01 where dept="MA"; 
動态
insert into table stu_ptn_dj partition(dept,age) select sid,name,sex,dept,age from stu_managed01;
注意： 分區表的操作一定是指定操作的分區

stu_buk    age   3   sort sid 1 
将資料按照分桶規則（分區算法） 分到不同的桶中
分桶規則：
    分桶  string 
    分桶字段 .hash % 分桶個數
    分桶   int 
    分桶字段 % 分桶個數 
分桶表不支援load的方式的
load data local inpath "/home/hadoop/hive_data/student.txt" into table stu_buk;  不支援這種方式 

隻能用insert方式 
insert into table stu_buk select * from stu_managed01;
日志：
Cannot run job locally: Number of reducers (= 3) is more than 1
底層啟動  分桶個數對應的  reducetask的個數
一個分桶--- mr一個分區
桶1  age%3=0 
95005,劉剛,男,18,MA
95008,李娜,女,18,CS
95009,夢圓圓,女,18,MA
95011,包小柏,男,18,MA
95013,馮偉,男,21,CS
95015,王君,男,18,MA
95017,王風娟,女,18,IS
95020,趙錢,男,21,IS

桶2   age%3=1 
95002,劉晨,女,19,IS
95003,王敏,女,22,MA
95004,張立,男,19,IS
95007,易思玲,女,19,MA
95010,孔小濤,男,19,CS
95014,王小麗,女,19,CS
95018,王一,女,19,IS
95019,邢小麗,女,19,IS

桶3  age%3 =2 
95001   李勇    男      20      CS
95006   孫慶    男      23      CS
95012   孫花    女      20      CS
95021   周二    男      17      MA
95022   鄭明    男      20      MA

select * from stu_buk;查詢是全表的所有桶
查詢某一個桶的文法
select * from tbname tablesample (bucket x out of y)
y  桶簇個數   一個或多個桶組成的一簇
    将桶分了   多個桶簇  平分 
    y=3   一個桶簇 ==  一個桶 
    提取某一個整桶資料   桶簇== 桶 
x 取的桶簇編号
    1開始的  順序遞增的
    
select * from stu_buk tablesample(bucket 1 out of 3);  取第一個桶簇

join where group by order by having limit

文法順序：select … from …join … on … where …group by …having…order by …limit …

注意：
    1）hive支援等值連接配接，不支援非等值連接配接
    select * a join b on a.id=b.id;
    2）hive有多個連接配接條件的時候支援and連接配接，不支援or連接配接
    select * a join b on a.id=b.id and a.name=b.name;
    select * a join b on a.id=b.id or a.name=b.name;  不支援 
    3）hive支援多表連接配接

建表

create table if not exists a(id int,name string) row format delimited fields terminated by "\t";

加載資料

load data local inpath "/home/hadoop/hive_data/a" into table a;

create table if not exists b(id int,age int) row format delimited fields terminated by "\t";

load data local inpath "/home/hadoop/hive_data/b" into table b;

inner join |join 
兩個或多個表的   交集  公共的部分
兩個或多個表   都有的關聯鍵
select * from a join b on a.id=b.id;

a.id    a.name  b.id    b.age
1       zs      1       12
2       ls      2       23
3       ww      3       43

left outer join|left join 
以join左側的表為主表的，主表中有幾條資料，最終關聯出來就幾條資料，右表填充，右表能關聯上，則關聯，不能關聯上補位null
select * from a left join b on a.id=b.id;
a.id    a.name  b.id    b.age
1       zs      1       12
2       ls      2       23
3       ww      3       43
5       xh      NULL    NULL

right outer join | right join 
主表為右表，右表資料不變，左表補充，有就補沒有 null
select * from a right join b on a.id=b.id;
a.id    a.name  b.id    b.age
1       zs      1       12
2       ls      2       23
3       ww      3       43
NULL    NULL    4       11

full outer join | full join 
左表和右表的關聯鍵的并集如果左表有就添加左表沒有就 null 右表有添加沒有就 null 
select * from a full join b on a.id=b.id;
1       zs      1       12
2       ls      2       23
3       ww      3       43
NULL    NULL    4       11
5       xh      NULL    NULL

semi join、left semi join 
mysql中:a表中在b中包含的id的資料 
select * from a where id in (select id from b);

mysql中查詢包含不包含  有 in/exists 文法查詢  
在hive中1 版本中不支援這種文法,2版本支援這種文法但是效率極低這個時候需要有一個替代方案
1）内連接配接
取a表的資料
select a.* from a join b on a.id=b.id;
a.id    a.name
1       zs
2       ls
3       ww
2)left semi join  in/exists左半連接配接
求兩個表的内連接配接輸出左表資料,左表在右表的關聯鍵的資料
select * from a left semi join b on a.id=b.id;
a.id    a.name
1       zs
2       ls
3       ww

将group by 後面的字段相同的分到一組中 
案例：求每一個部門的最大年齡分組字段：部門dept，求max 
select 
dept d,max(age) max_age 
from stu_managed01 
group by dept;
sql 語句  ： 1)select    from   2)group by  3)select後面的字段  4）order by 
注意：
1）group by是在select之前執行的所有不可以使用select後面字段的别名
2）group by 使用的時候會限制select後面的字段，可以使用聚合函數max、min、avg、count、sum

是用來過濾的 where 聚合函數 having  
where過濾的是聚合之前的資料為聚合函數準備資料 
案例：求每個部門中年齡大于19的人數。分組字段:部門dept,求count使用聚合函數，對資料做過濾 age >19
select 
dept,count(*) 
from stu_managed01 
where age>19 
group by dept;

having 執行順序在聚合函數之後是對聚合結果做的過濾
案例：求部門中總人數大于7的所有部門。分組字段dept,求count，過濾having count>7 
select 
dept,count(*) totalcount 
from stu_managed01 
group by dept 
having totalcount>7;

需求：
每個部門中年齡>=19的人數多餘4個人的部門。分組字段dept,求 count過濾age>=19 where 過濾 count >4

select 
dept,count(*) totalcount 
from stu_managed01 
where age >=19 
group by dept 
having totalcount>4;

where -> group by -> select ->  having

是用來指定排序的

對所有資料按照指定字段進行排序的
select * from stu_managed01 order by age,sid;
無論資料量多大都進行全局排序

針對每一個reducetask進行排序，每一個reducetask的内部結果有序的   全局是無序的，當隻有一個reducetask的時候 效果=== order by 。 

select * from stu_managed01 sort by age;
預設情況下reducetask是1個  

In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
  256000000
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>  1009
In order to set a constant number of reducers:
    設定reducetask的個數  預設-1 
  set mapreduce.job.reduces=<number>

set mapreduce.job.reduces=3;
select * from stu_managed01 sort by age;
問題：
如何配置設定資料到每一個reducetask的？
字段.hash % 3   字段 每一行随機選
結果：
reducetask0 
95008   李娜    女      18      CS
95011   包小柏  男      18      MA
95004   張立    男      19      IS
95014   王小麗  女      19      CS
95019   邢小麗  女      19      IS
95001   李勇    男      20      CS
95012   孫花    女      20      CS
95020   趙錢    男      21      IS
95003   王敏    女      22      MA

reducetask1
95009   夢圓圓  女      18      MA
95017   王風娟  女      18      IS
95005   劉剛    男      18      MA
95018   王一    女      19      IS
95007   易思玲  女      19      MA
95010   孔小濤  男      19      CS
95022   鄭明    男      20      MA
95013   馮偉    男      21      CS

reducetask2
95021   周二    男      17      MA
95015   王君    男      18      MA
95002   劉晨    女      19      IS
95006   孫慶    男      23      CS

sort by 無法進行每一個reducetask的資料配置設定

指定每一個reducetask配置設定字段的
分區|分桶
按照指定的字段将資料分到不同的reducetask中，分的算法是字段.hash % reducetasks。使用的時候distribute by 指定資料配置設定 + sort by  指定排序

select * from stu_managed01 distribute by age sort by age desc;
這裡的 distribute by的字段 和 sort by的字段  随意指定
reducetask0   age %3 =0 
reducetask1   age %3 =1
reducetask2   age %3 =2 

4）cluster by 字段   
按照指定字段配置設定資料并按照指定字段進行升序排序
= distribute by 字段 + sort by 字段    distribute by 和 sort by 的字段為同一個字段升序

select * from stu_managed01 cluster by age;

select * from stu_managed01 distribute by age sort by age;

注意： distribute by + sort by 功能  > cluster by

insert overwrite local directory "/home/hadoop/hive_data/ag19" select * from stu_managed01 where age>19;

對同一個表的不同查詢結果  導出到不同的路徑下
from tbname 
insert overwrite local directory "" select ... where ...
insert overwrite local directory "" select ... where ...

from stu_managed01 
insert overwrite directory "/data/age18" select * where age=18 
insert overwrite directory "/data/age19" select * where age=19;

Hive基本操作

文章目錄

繼續閱讀

luogu1078 文化之旅

Hadoop離線_Hive的基本操作

Hive中内部表、外部表、分區、分桶以及SQL的執行順序

Hive中的内部表外部表及分區表

Hive---外部分區表的建立

Hive學習筆記 3 Hive的資料模型：内部表、分區表、外部表、桶表、視圖

HiveQL(二):分區表

Hive的分區表入門

Hive的分區表

Hive（二）--分區分桶，内部表外部表

大資料高頻面試題之Hive的小檔案合并

世界因大資料而改變

hive sql通過具體位址解析出行政區劃(省＞市＞區＞縣＞鄉＞鎮＞村)

Hive最全常見錯誤及解決方案hive --service metastore &

《Hive權威指南》第八章：HiveQL索引8 HiveQL：索引

HiveQl語句應用執行個體：WordCount具體步驟如下：