使用者在建表的時候可以自定義 SerDe 或者使用自帶的 SerDe。如果沒有指定 ROW FORMAT 或者 ROW FORMAT DELIMITED,将會使用自帶的 SerDe。在建表的時候,使用者還需要為表指定列,使用者在指定表的列的同時也會指定自定義的 SerDe,Hive 通過 SerDe 确定表的具體的列的資料。
如果檔案資料是純文字,可以使用 STORED AS TEXTFILE。如果資料需要壓縮,使用 STORED AS SEQUENCE 。
有分區的表可以在建立的時候使用 PARTITIONED BY 語句。一個表可以擁有一個或者多個分區,每一個分區單獨存在一個目錄下。而且,表和分區都可以對某個列進行 CLUSTERED BY 操作,将若幹個列放入一個桶(bucket)中。也可以利用SORT BY 對資料進行排序。這樣可以為特定應用提高性能。
表名和列名不區分大小寫,SerDe 和屬性名區分大小寫。表和列的注釋是字元串
注:
SerDe是Serialize/Deserilize的簡稱,目的是用于序列化和反序列化
STORED AS TEXTFILE:預設格式,資料不做壓縮,磁盤開銷大,資料解析開銷大。 可結合Gzip、Bzip2使用(系統自動檢查,執行查詢時自動解壓),但使用這種方式,hive不會對資料進行切分, 進而無法對資料進行并行操作
STORED AS SEQUENCE:Hadoop API提供的一種二進制檔案支援,其具有使用友善、可分割、可壓縮的特點。SequenceFile支援三種壓縮選擇:NONE,RECORD,BLOCK。Record壓縮率低,一般建議使用BLOCK壓縮。
2.1.2文法
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] [AS select_statement] CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
LIKE existing_table_name
[LOCATION hdfs_path]
data_type
: primitive_type
| array_type
| map_type
| struct_type
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| STRING
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
row_format
: DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE
| RCFILE (Note: only available starting with 0.6.0)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
2.1.3基本示例
1、如果一個表已經存在,可以使用if not exists
2、create table user(id int,cont string) row format delimited fields terminated by '\005' stored as textfile; terminated by:關于來源的文本資料的字段間隔符
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
如:建表
CREATE TABLE c02_clickstat_fatdt1
(yyyymmdd string,
id INT,
ip string,
country string,
cookie_id string,
page_id string ,
clickstat_url_id int,
query_string string,
refer string
)PARTITIONED BY(dt STRING)
row format delimited fields terminated by '\005' stored as textfile;
裝載資料:
LOAD DATA INPATH '/user/admin/SqlldrDat/CnClickstat/20131101/19/clickstat_gp_fatdt0/0' OVERWRITE INTO TABLE c02_clickstat_fatdt1
PARTITION(dt='20131101');
通路某個分區:
SELECT count(*)
FROM c02_clickstat_fatdt1 a
WHERE a.dt >= '20131101' AND a.dt < '20131102';
指定LOCATION位置:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
複制一個空表:
CREATE TABLE empty_key_value_store
LIKE key_value_store;
hive> desc xi;
OK
id int
cont string
dw_ins_date string
Time taken: 0.061 seconds
hive> create table xibak like xi;
OK
Time taken: 0.157 seconds
hive> alter table xibak replace columns (ins_date string);
OK
Time taken: 0.109 seconds
hive> desc xibak;
OK
ins_date string
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Load 操作隻是單純的複制/移動操作,将資料檔案移動到 Hive 表對應的位置。
如:從本地導入資料到表格并追加原表
LOAD DATA LOCAL INPATH `/tmp/pv_2013-06-08_us.txt` INTO TABLE c02 PARTITION(date='2013-06-08', country='US')
從本地導入資料到表格并追加記錄:
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;
從hdfs導入資料到表格并覆寫原表:
LOAD DATA INPATH '/user/admin/SqlldrDat/CnClickstat/20131101/18/clickstat_gp_fatdt0/0' INTO table c02_clickstat_fatdt1 OVERWRITE PARTITION (dt='20131201');
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
導出檔案到本地:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
導出檔案到HDFS:
INSERT OVERWRITE DIRECTORY '/user/admin/SqlldrDat/CnClickstat/20131101/19/clickstat_gp_fatdt0/0' SELECT a.* FROM c02_clickstat_fatdt1 a WHERE dt=’20131201’;
一個源可以同時插入到多個目标表或目标檔案,多目标insert可以用一句話來完成:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2013-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
from tbl1
insert overwrite table test2 select '1,2,3' limit 1
insert overwrite table d select '4,5,6' limit 1;
SELECT explode(explode(adid_list)) AS myCol... # 不支援
3、不支援GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY ,如:
SELECT explode(adid_list) AS myCol ... GROUP BY myCol
2.14EXPLODE
下面是一個示例:
場景:将資料進行轉置,如:
create table test2(mycol array<int>);
insert OVERWRITE table test2 select * from (select array(1,2,3) from a union all select array(7,8,9) from d)c;
hive> select * from test2;
OK
[1,2,3]
[7,8,9]
hive> SELECT explode(myCol) AS myNewCol FROM test2;
OK
1
2
3
7
8
9