天天看点

用户表 订单表 数据_数据仓库-拉链表

在数据仓库的数据模型设计过程中,经常会遇到这样的需求:

1.数据量比较大;

2.表中的部分字段会被update,如用户的地址,产品的描述信息,订单的状态等等;

3.需要查看某一个时间点或者时间段的历史快照信息,比如,查看某一个订单在历史某一个时间点的状态, 比如,查看某一个用户在过去某一段时间内,更新过几次等等;

4.变化的比例和频率不是很大,比如,总共有1000万的会员,每天新增和发生变化的有10万左右;

5.如果对这边表每天都保留一份全量,那么每次全量中会保存很多不变的信息,对存储是极大的浪费;遇到些问题在没解决吗?

拉链历史表,既能满足反应数据的历史状态,又可以最大程度的节省存储。

拉链表案例:

1、有一张订单表,2019-06-20 这天里面有三条订单数据:

用户表 订单表 数据_数据仓库-拉链表

到了2019-06-21这天,表中有5条记录:

用户表 订单表 数据_数据仓库-拉链表

到了2019-06-22日,表中有6条记录:

用户表 订单表 数据_数据仓库-拉链表

这种存储方式有什么问题呢?

1、只保留一份全量,则数据和6月22日的记录一样,如果需要查看6月21日订单001的状态,则无法满足;

2、每天都保留一份全量,则数据仓库中的该表共有14条记录,但好多记录都是重复保存,没有任务变化,如订单002,004,数据量大了,会造成很大的存储浪费;

如果设计成历史拉链表,如下:

用户表 订单表 数据_数据仓库-拉链表

拉链表解释:

说明:

1.dw_begin_date表示该条记录的生命周期开始时间,dw_end_date表示该条记录的生命周期结束时间;

2.dw_end_date = '9999-12-31'表示该条记录目前处于有效状态;

3.如果查询当前所有有效的记录,则select * from order_his where dw_end_date = '9999-12-31'

4.如果查询2012-06-21的历史快照,则select * from order_his where dw_begin_date <= '2012-06-21' and dw_end_date >= '2012-06-21',这条语句会查询到以下记录:

用户表 订单表 数据_数据仓库-拉链表

可以看出,这样的历史拉链表,既能满足对历史数据的需求,又能很大程度的节省存储资源;

拉链表实战:

源系统中订单表结构为:

CREATE TABLE orders_20190821 (orderid INT,createtime STRING,modifiedtime STRING,status STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;CREATE TABLE orders_20190822 (orderid INT,createtime STRING,modifiedtime STRING,status STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;CREATE TABLE orders_20190823 (orderid INT,createtime STRING,modifiedtime STRING,status STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;
           

数据:

2019-08-21订单数据表(order)1,2019-08-18,2019-08-18,创建2,2019-08-18,2019-08-18,创建3,2019-08-19,2019-08-21,支付4,2019-08-19,2019-08-21,完成5,2019-08-19,2019-08-20,支付6,2019-08-20,2019-08-20,创建7,2019-08-20,2019-08-21,支付8,2019-08-21,2019-08-21,创建2019-08-22订单数据表(order)*1,2019-08-18,2019-08-22,支付*2,2019-08-18,2019-08-22,完成3,2019-08-19,2019-08-21,支付4,2019-08-19,2019-08-21,完成5,2019-08-19,2019-08-20,支付*6,2019-08-20,2019-08-22,支付7,2019-08-20,2019-08-21,支付*8,2019-08-21,2019-08-22,支付*9,2019-08-22,2019-08-22,创建*10,2019-08-22,2019-08-22,支付2019-08-23订单数据表1,2019-08-18,2019-08-23,完成2,2019-08-18,2019-08-22,完成3,2019-08-19,2019-08-23,完成4,2019-08-19,2019-08-21,完成5,2019-08-19,2019-08-23,完成6,2019-08-20,2019-08-22,支付7,2019-08-20,2019-08-21,支付8,2019-08-21,2019-08-23,完成9,2019-08-22,2019-08-22,创建10,2019-08-22,2019-08-22,支付11,2019-08-23,2019-08-23,创建12,2019-08-23,2019-08-23,创建13,2019-08-23,2019-08-23,支付
           
load data  inpath '/user/root/csii/orders_20190821' into table orders_20190821;load data  inpath '/user/root/csii/orders_20190822' into table orders_20190822;load data  inpath '/user/root/csii/orders_20190823' into table orders_20190823;
           

ODS层设计:

在数据仓库的ODS层,有一张订单的增量数据表,按天分区,存放每天的增量数据:

CREATE TABLE ods_orders_inc (orderid INT,createtime STRING,modifiedtime STRING,status STRING)PARTITIONED BY (day STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;
           

DW层设计:

在数据仓库的DW层,有一张订单的历史数据拉链表,存放订单的历史状态数据:

CREATE TABLE dw_orders_his (orderid INT,createtime STRING,modifiedtime STRING,status STRING,dw_start_date STRING,dw_end_date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;
           

第一步,抽取全量数据到ODS:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-20')SELECT orderid,createtime,modifiedtime,statusFROM orders_20190821WHERE createtime <= '2019-08-20';
           

第二步,从ODS刷新到DW:

INSERT overwrite TABLE dw_orders_hisSELECT orderid,createtime,modifiedtime,status,createtime AS dw_start_date,'9999-12-31' AS dw_end_dateFROM ods_orders_incWHERE day = '2019-08-20';
           

完成后,DW订单历史表中数据:

hive> select * from dw_orders_his;1       2019-08-18      2019-08-18      创建    2019-08-18      9999-12-31      2       2019-08-18      2019-08-18      创建    2019-08-18      9999-12-313       2019-08-19      2019-08-21      支付    2019-08-19      9999-12-314       2019-08-19      2019-08-21      完成    2019-08-19      9999-12-315       2019-08-19      2019-08-20      支付    2019-08-19      9999-12-316       2019-08-20      2019-08-20      创建    2019-08-20      9999-12-317       2019-08-20      2019-08-21      支付    2019-08-20      9999-12-31
           

增量抽取

每天,从源系统订单表中,将前一天的增量数据抽取到ODS层的增量数据表。 这里的增量需要通过订单表中的创建时间和修改时间来确定:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '${day}')SELECT orderid,createtime,modifiedtime,statusFROM ordersWHERE createtime = '${day}' OR modifiedtime = '${day}';
           

增量刷新历史数据

从2019-08-22开始,需要每天正常刷新前一天(2019-08-21)的增量数据到历史表。

第一步,通过增量抽取,将2019-08-21的数据抽取到ODS:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-21')SELECT orderid,createtime,modifiedtime,statusFROM orders_20190821WHERE modifiedtime = '2019-08-21' or createtime = '2019-08-21';
           

ODS增量表中2019-08-21的数据如下:

hive> select * from ods_orders_inc where day = '2019-08-21';3       2019-08-19      2019-08-21      支付    2019-08-214       2019-08-19      2019-08-21      完成    2019-08-217       2019-08-20      2019-08-21      支付    2019-08-218       2019-08-21      2019-08-21      创建    2019-08-21
           

第二步,通过DW历史数据(数据日期为2019-08-20),和ODS增量数据(2019-08-21),刷新历史表:**

先把数据放到一张临时表中:

DROP TABLE IF EXISTS dw_orders_his_tmp;CREATE TABLE dw_orders_his_tmp AS SELECT orderid,createtime,modifiedtime,status,dw_start_date,dw_end_date FROM (    SELECT a.orderid,    a.createtime,    a.modifiedtime,    a.status,    a.dw_start_date,    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-21' THEN '2019-08-20' ELSE a.dw_end_date END AS dw_end_date     FROM dw_orders_his a     left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-21') b     ON (a.orderid = b.orderid)     UNION ALL     SELECT orderid,    createtime,    modifiedtime,    status,    modifiedtime AS dw_start_date,    '9999-12-31' AS dw_end_date     FROM ods_orders_inc     WHERE day = '2019-08-21' ) x ORDER BY orderid,dw_start_date;
           

最后把临时表中数据插入历史表:

INSERT overwrite TABLE dw_orders_hisSELECT * FROM dw_orders_his_tmp;
           

查看拉链表数据:

用户表 订单表 数据_数据仓库-拉链表

将2019-08-22的增量数据刷新到历史表:

第一步,通过增量抽取,将2019-08-22的数据抽取到ODS:INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-22') SELECT orderid,createtime,modifiedtime,status FROM orders_20190822WHERE createtime = '2019-08-22' OR modifiedtime = '2019-08-22';第二步,通过DW历史数据(数据日期为2019-08-21),和ODS增量数据(2019-08-22)DROP TABLE IF EXISTS dw_orders_his_tmp;CREATE TABLE dw_orders_his_tmp AS SELECT orderid,createtime,modifiedtime,status,dw_start_date,dw_end_date FROM (    SELECT a.orderid,    a.createtime,    a.modifiedtime,    a.status,    a.dw_start_date,    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-22' THEN '2019-08-21' ELSE a.dw_end_date END AS dw_end_date     FROM dw_orders_his a     left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-22') b     ON (a.orderid = b.orderid)     UNION ALL     SELECT orderid,    createtime,    modifiedtime,    status,    modifiedtime AS dw_start_date,    '9999-12-31' AS dw_end_date     FROM ods_orders_inc     WHERE day = '2019-08-22' ) x ORDER BY orderid,dw_start_date; 第三步:最后把临时表中数据插入历史表:INSERT overwrite TABLE dw_orders_his SELECT * FROM dw_orders_his_tmp;
           

刷新完后历史表数据如下:

用户表 订单表 数据_数据仓库-拉链表

查看2019-08-21的历史快照数据:

select * from dw_orders_his where dw_start_date <= '2019-08-21' and dw_end_date >= '2019-08-21';
           

将2019-08-23的增量数据刷新到历史表:

第一步,通过增量抽取,将2019-08-22的数据抽取到ODS:INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-23') SELECT orderid,createtime,modifiedtime,status FROM orders_20190823WHERE createtime = '2019-08-23' OR modifiedtime = '2019-08-23';第二步,通过DW历史数据(数据日期为2019-08-22),和ODS增量数据(2019-08-23)DROP TABLE IF EXISTS dw_orders_his_tmp;CREATE TABLE dw_orders_his_tmp AS SELECT orderid,createtime,modifiedtime,status,dw_start_date,dw_end_date FROM (    SELECT a.orderid,    a.createtime,    a.modifiedtime,    a.status,    a.dw_start_date,    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-23' THEN '2019-08-22' ELSE a.dw_end_date END AS dw_end_date     FROM dw_orders_his a     left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-23') b     ON (a.orderid = b.orderid)     UNION ALL     SELECT orderid,    createtime,    modifiedtime,    status,    modifiedtime AS dw_start_date,    '9999-12-31' AS dw_end_date     FROM ods_orders_inc     WHERE day = '2019-08-23' ) x ORDER BY orderid,dw_start_date;第三步:最后把临时表中数据插入历史表:INSERT overwrite TABLE dw_orders_his SELECT * FROM dw_orders_his_tmp;
           
hive> select * from dw_orders_his order by orderid,dw_start_date;1       2019-08-18      2019-08-18      创建    2019-08-18      2019-08-211       2019-08-18      2019-08-22      支付    2019-08-22      2019-08-221       2019-08-18      2019-08-23      完成    2019-08-23      9999-12-312       2019-08-18      2019-08-18      创建    2019-08-18      2019-08-212       2019-08-18      2019-08-22      完成    2019-08-22      9999-12-313       2019-08-19      2019-08-21      支付    2019-08-19      2019-08-203       2019-08-19      2019-08-21      支付    2019-08-21      2019-08-223       2019-08-19      2019-08-23      完成    2019-08-23      9999-12-314       2019-08-19      2019-08-21      完成    2019-08-19      2019-08-204       2019-08-19      2019-08-21      完成    2019-08-21      9999-12-315       2019-08-19      2019-08-20      支付    2019-08-19      2019-08-225       2019-08-19      2019-08-23      完成    2019-08-23      9999-12-316       2019-08-20      2019-08-20      创建    2019-08-20      2019-08-216       2019-08-20      2019-08-22      支付    2019-08-22      9999-12-317       2019-08-20      2019-08-21      支付    2019-08-20      2019-08-207       2019-08-20      2019-08-21      支付    2019-08-21      9999-12-318       2019-08-21      2019-08-21      创建    2019-08-21      2019-08-218       2019-08-21      2019-08-22      支付    2019-08-22      2019-08-228       2019-08-21      2019-08-23      完成    2019-08-23      9999-12-319       2019-08-22      2019-08-22      创建    2019-08-22      9999-12-3110      2019-08-22      2019-08-22      支付    2019-08-22      9999-12-3111      2019-08-23      2019-08-23      创建    2019-08-23      9999-12-3112      2019-08-23      2019-08-23      创建    2019-08-23      9999-12-3113      2019-08-23      2019-08-23      支付    2019-08-23      9999-12-31
           

查看最新的状态:

hive> select * from dw_orders_his where dw_end_date = '9999-12-31';   
           
用户表 订单表 数据_数据仓库-拉链表