【Rediscovering the Beauty of PostgreSQL】- 20 Why Beer & Diapers Go Best Together

Background

Scenarios:

  • E-commerce, retail, and similar industries: use shopping cart and order data to find the most reasonable product combinations, to guide marketing strategy.
    • Alternatively, treat all items in a user's most recent N orders, or N days of orders, as one large group.
  • Use the keywords appearing in user reviews to find the best keyword combinations, to guide brand strategy.

Challenges:

  • Each combination is a set of items, tags, or keywords, so we need to find the high-frequency combinations (the best matches) among a huge number of existing combinations.
  • Traditional databases do not support multi-valued columns (arrays). Expanding into one row per element inflates the data volume by at least an order of magnitude, and a join plus aggregation is needed to find the best combinations, which is very inefficient.
  • Sliding-window analysis must be computed over large amounts of historical data, so performance is poor.

PG solutions:

  • 1. Built-in arrays, saving at least an order of magnitude in data volume.
  • 2. Built-in inverted (GIN) indexes on arrays, to quickly locate the desired combinations.
  • 3. Built-in element-level statistics on arrays, which can be used to quickly find the best combinations.
  • 4. datasketches approximate answers, supporting millisecond-level real-time output of combinations over massive data.
  • 5. With the topn approach, one record per item per day is enough for real-time sliding-window analysis, something traditional databases cannot do efficiently.

Examples

1. Shopping cart / order scenario

Each item has tags; together they form a tag set.

...
order xxx : [item 1, item 2, ...] -> [tag 1, tag 2, ...]

2. Review scenario

...
review xxx : [token 1, token 2, ...] -> [tag 2, tag 3, ...]

Take the shopping cart / order scenario as an example.

1. Traditional database table structure:

Each order needs N records (depending on how many items the order contains).

create table t (
order_id int8,  -- order id
itemid int  -- item id
);

With the traditional structure, find the best TOP 10 combinations for item ID = 1:

select itemid, count(*) from t where order_id in
(select order_id from t where itemid=1)
group by itemid order by count(*) desc limit 11;

2. PG database: one order needs only one record:

create table t (
order_id serial8 primary key,  -- order id
itemid int[]  -- array of item ids
);

3. Generate test data, using a long-tail (Zipfian) distribution so the effect of approximate search is easy to observe.

\set o1 random_zipfian(1,100000,1.2)
\set o2 random_zipfian(10,100000,1.2)
\set o3 random_zipfian(20,100000,1.2)      
\set o4 random_zipfian(30,100000,1.2)      
\set o5 random_zipfian(40,100000,1.2)      
\set o6 random_zipfian(50,100000,1.2)      
\set o7 random_zipfian(60,100000,1.2)      
\set o8 random_zipfian(70,100000,1.2)      
\set o9 random_zipfian(80,100000,1.2)      
\set o10 random_zipfian(90,100000,1.2)      
insert into t (itemid) values (array[:o1,:o2,:o3,:o4,:o5,:o6,:o7,:o8,:o9,:o10]::int[]);      

3.6 million orders in total:

count      
---------      
3600000      
(1 row)      

4. Create a GIN index on itemid.
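A minimal sketch of that index (the index name is arbitrary):

create index idx_t_itemid on t using gin (itemid);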

Which 10 items go best with item ID 1?

Traditional approach:

create table tt (
order_id int8,  -- order id
itemid int  -- item id
);
create index idx_tt on tt (order_id);      
insert into tt select order_id, unnest(itemid) from t;      
select count(*) from tt;      
-- after expanding items, the row count is 10x that of the PG array table

With the traditional structure, finding the best TOP 10 combinations for item ID = 1 takes 4.8 seconds, even with a hash aggregate and the index.

select itemid, count(*) from tt where order_id in
(select order_id from tt where itemid=1)
group by itemid order by count(*) desc limit 11;
itemid | count      
--------+--------      
1 | 706647      
90 | 157916      
80 | 157076      
70 | 156776      
60 | 155228      
50 | 153810      
40 | 152431      
30 | 149441      
20 | 146573      
10 | 139074      
91 |  78262      
(11 rows)      
Time: 4832.984 ms (00:04.833)      

Method 1:

In PG you can define a new custom aggregate function, for example

array_aggn(anyarray, order, n)

which aggregates arrays and returns a two-dimensional array of the form [element][element frequency], containing the top-N (or bottom-N) elements by frequency.

Such a SQL statement can use the GIN index to quickly locate the order records containing item 1, aggregate the items inside those orders, and return, for example, the 5 most frequent elements together with their frequencies.
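A plain-SQL sketch of the same idea: the @> predicate can use the GIN index to find the orders containing item 1, and the unnest plus count does the element-frequency part that a hypothetical array_aggn aggregate would encapsulate:

select i as itemid, count(*) as freq
from (select unnest(itemid) as i from t where itemid @> array[1]) s
group by i
order by count(*) desc
limit 5;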

How to define custom aggregates:

《PostgreSQL 10: principles and practice of user-defined parallel aggregate functions (with an example of array_agg merging multiple arrays into a single one-dimensional array)》

[《PostgreSQL Oracle compatibility: user-defined parallel aggregate function PARALLEL_ENABLE AGGREGATE》](../201803/20180312_03.md)

《PostgreSQL parallel computing explained, part 9: parallel user-defined aggregates》

Going further

Besides storing items and tags in arrays, you can also store them in a roaringbitmap; again, a new aggregate function would have to be added for the rb type that returns elements and their frequencies.

The benefit of rb storage is compression, which makes it somewhat more efficient.
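A minimal sketch of the storage side, assuming the pg_roaringbitmap extension is installed; the table name t_rb is made up, and the frequency-returning aggregate itself would still have to be written:

create extension if not exists roaringbitmap;
create table t_rb (
order_id serial8 primary key,  -- order id
itemid roaringbitmap           -- item ids stored as a compressed bitmap
);
insert into t_rb (itemid) values (rb_build(array[1,10,20,30]));
-- expand back to an int[] when element-level counting is needed
select order_id, rb_to_array(itemid) from t_rb;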

《PostgreSQL pg_roaringbitmap: user profiles, tags, efficient retrieval》 https://github.com/ChenHuajun/pg_roaringbitmap

Method 2:

Use statistics to get an approximate answer.

For example, to find which items go best with item ID = 1:
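A sketch of the sampling step implied by the output below; the table name tmp_t matches the pg_stats queries further down, but the exact statements are assumptions:

-- materialize the orders that contain item 1
create table tmp_t as select * from t where itemid @> array[1];
-- peek at a few of those orders
select itemid from tmp_t limit 10;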

SELECT 706647      
Time: 926.742 ms      
itemid      
----------------------------------------      
{1,15,616,652,40,122,124,74,47543,167}      
{1,54477,112,35,72,50,76,70,91,103}      
{1,373,338,31,41,50,305,341,82,3478}      
{1,10,20,33,66,97,42797,151,205,316}      
{1,69,49,106,52,55,63,7390,315,90}      
{1,12,38,31,48,77,68,751,217,215}      
{1,10,147,92,41,69,77,74,81,90}      
{1,17,21,37,114,68609,72,73,1098,144}      
{1,37,21,44,44,50,84,70,3202,97}      
{1,5483,25,84,91,56,68,6944,332,190}      
(10 rows)      

The statistics contain element-level statistics for arrays, as shown below:

View "pg_catalog.pg_stats"      
Column         |   Type   | Collation | Nullable | Default      
------------------------+----------+-----------+----------+---------      
schemaname             | name     |           |          |      
tablename              | name     |           |          |      
attname                | name     |           |          |      
inherited              | boolean  |           |          |      
null_frac              | real     |           |          |      
avg_width              | integer  |           |          |      
n_distinct             | real     |           |          |      
most_common_vals       | anyarray |           |          |      
most_common_freqs      | real[]   |           |          |      
histogram_bounds       | anyarray |           |          |      
correlation            | real     |           |          |      
most_common_elems      | anyarray |           |          |      
most_common_elem_freqs | real[]   |           |          |      
elem_count_histogram   | real[]   |           |          |      

Collect statistics:
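A one-line sketch of this step, assuming the tmp_t sample table from above:

analyze tmp_t;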

ANALYZE      
Time: 15.658 ms      

Query the statistics details for the shopping-cart array column:
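A sketch of the query behind the expanded record below, assuming psql's \x expanded display:

\x
select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'tmp_t' and attname = 'itemid';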

-[ RECORD 1 ]----------+----------------------------------------------------------------
most_common_elems      | {1,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,115,116,117,119,120,122,123}      
most_common_elem_freqs | {1,0.20272727,0.08181818,0.053939395,0.03757576,0.03272727,0.02,0.021818181,0.015757576,0.013939394,0.20787878,0.10757576,0.05666667,0.041818183,0.03212121,0.030909091,0.023333333,0.01878788,0.02030303,0.015454546,0.20484848,0.097575754,0.0660606,0.05272727,0.04060606,0.030303031,0.025757575,0.025454545,0.016969698,0.021212121,0.21848485,0.10242424,0.0669697,0.05727273,0.036666665,0.034848485,0.041212123,0.026969697,0.022121212,0.017878788,0.21878788,0.10181818,0.06757576,0.04909091,0.043636363,0.04060606,0.035454545,0.03181818,0.022424242,0.028181817,0.21939394,0.10060606,0.06878788,0.06333333,0.042727273,0.036060605,0.028181817,0.033333335,0.03181818,0.026969697,0.22636363,0.095757574,0.07636364,0.055454545,0.04969697,0.043333333,0.035151515,0.033939395,0.028181817,0.026969697,0.20818181,0.11242424,0.07787879,0.062121212,0.048787877,0.043333333,0.035757575,0.036969695,0.03181818,0.03212121,0.22030303,0.10636364,0.0730303,0.05878788,0.04969697,0.03939394,0.03969697,0.034242425,0.03727273,0.035151515,0.02969697,0.024848485,0.026969697,0.02878788,0.021818181,0.016060606,0.02030303,0.01939394,0.016363636,0.014242425,0.015151516,0.016969698,0.021515151,0.014545455,0.016060606,0.013333334,0.014242425,0.013939394,0.013939394,0.015757576,0.013333334,1,0}      

Interpretation:

《PostgreSQL 9.2 add array elements statistics》

Get the TOP 10 combinations:

The first element is the item itself, so we output 11 rows.

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》
select * from
(select row_number() over(partition by r) as rn,ele from (select unnest(most_common_elems::text::int[]) ele,2 as r from pg_stats where tablename='tmp_t' and attname='itemid') t) t1
join      
(select row_number() over(partition by r) as rn,freq from (select unnest(most_common_elem_freqs) freq,2 as r from pg_stats where tablename='tmp_t' and attname='itemid') t) t2      
on (t1.rn=t2.rn) order by t2.freq desc limit 11;      
rn | ele | rn |    freq      
----+-----+----+------------      
1 |   1 |  1 |          1  -- the item itself appears with probability 100%
52 |  60 | 52 | 0.22878788  -- the next 9 values follow the long-tail (Zipfian) distribution pgbench used to generate the data
62 |  70 | 62 | 0.22030303      
12 |  20 | 12 | 0.21848485      
32 |  40 | 32 |  0.2169697      
72 |  80 | 72 | 0.21636364      
42 |  50 | 42 |       0.21      
82 |  90 | 82 |       0.21      
2 |  10 |  2 | 0.20545454      
22 |  30 | 22 | 0.20060606      
73 |  81 | 73 | 0.11515152  -- this value is an arbitrary one of the 10 second-tier items (their probabilities differ only very slightly)
(11 rows)      
Time: 1.852 ms      

That is, item ID 1 goes best with item IDs 60, 70, 20, 40, 80, 50, 90, 10, 30, and 81, and we also get the probability of each of them appearing together with item 1.

Verify the accuracy of the approximation: almost 100%.

Using unnest we can compute the exact values, which agree almost perfectly with the approximation.
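A sketch of the count query behind the output below, again assuming the tmp_t sample table:

select count(*) from tmp_t;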

count      
--------      
706647      
(1 row)      
select unnest(itemid) i , count(*)/706647.0 from tmp_t group by 1 order by 2 desc limit 11;      
i  |        ?column?      
----+------------------------      
1 | 1.00000000000000000000  -- the item itself appears with probability 100%
90 | 0.22347225701092624748  -- the next 9 values follow the long-tail (Zipfian) distribution pgbench used to generate the data
80 | 0.22228354468355487252      
70 | 0.22185900456663652432      
60 | 0.21966837756333784761      
50 | 0.21766171794403712179      
40 | 0.21571024853993578123      
30 | 0.21147899870798291085      
20 | 0.20742039519024350206      
10 | 0.19680830740100785824      
91 | 0.11075119543421255592  -- this value is an arbitrary one of the 10 second-tier items (their probabilities differ only very slightly)
(11 rows)      
Time: 1204.824 ms (00:01.205)      

Using statistics for approximate best-combination queries improves performance by roughly 1000x (1.852 ms vs. 4,832 ms for the traditional query).

Method 3:

Use the topn extension.

Compute in real time as orders / shopping carts are generated, and store the result in a new, approximate (sketch) data type.

https://github.com/pipelinedb/pipelinedb/blob/master/pipelinedb--1.0.0.sql
https://github.com/apache/datasketches-postgresql
https://github.com/citusdata/postgresql-topn
https://github.com/ozturkosu/cms_topn

Take topn as an example:
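A one-line sketch of enabling the extension (assuming postgresql-topn is already installed on the server):

create extension if not exists topn;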

postgres=# show topn.number_of_counters ;      
topn.number_of_counters      
-------------------------      
1000      
(1 row)      
Time: 0.192 ms      

Each item only needs 1 stored record (i.e., the aggregated counter record).

create table tmp_topn (
itemid int primary key,
groups jsonb
);
insert into tmp_topn select 1, topn_add_agg(i::text) from (      
select unnest(itemid) i from t where itemid @> array[1]      
) t;      
INSERT 0 1      
Time: 1977.195 ms (00:01.977)      
postgres=# select (topn(groups,11)).* from tmp_topn where itemid=1;      
item | frequency      
------+-----------      
1    |    706647      
90   |    157916      
80   |    157076      
70   |    156776      
60   |    155228      
50   |    153810      
40   |    152431      
30   |    149441      
20   |    146573      
10   |    139074      
91   |     78262      
(11 rows)      
Time: 0.706 ms      

When a new order with N items is generated, just incrementally update N records in the table above (the example below simply merges the same counters in again, doubling them).

insert into tmp_topn select 1, topn_add_agg(i::text) from (
select unnest(itemid) i from t where itemid @> array[1]
) t      
on conflict (itemid) do update set      
groups = topn_union(tmp_topn.groups,excluded.groups);      
INSERT 0 1      
Time: 1957.321 ms (00:01.957)      
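-- (assumed) the same top-N query as before, now returning the merged, doubled counters:
select (topn(groups,11)).* from tmp_topn where itemid=1;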
item | frequency      
------+-----------      
1    |   1413294      
90   |    315832      
80   |    314152      
70   |    313552      
60   |    310456      
50   |    307620      
40   |    304862      
30   |    298882      
20   |    293146      
10   |    278148      
91   |    156524      
(11 rows)      
Time: 0.706 ms      

Again roughly a 1000x performance gain, and the gain becomes even more pronounced as the data volume grows.

With the topn approach you can store one record per item per day, which enables real-time sliding-window analysis, something traditional databases cannot do efficiently.
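A minimal sketch of the per-item-per-day table that the query below assumes; the table name xx and the ts and groups columns are taken from that query, while the exact layout is an assumption:

create table xx (
itemid int,      -- item id
ts date,         -- statistics date
groups jsonb,    -- topn counters for this item on this day
primary key (itemid, ts)
);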

-- topn_union_agg merges the per-day counters for sliding-window aggregation
select topn(topn_union_agg(groups),11) from xx      
where ts >= '2021-11-01' and ts < '2021-11-12'      
and itemid=1 ;      

Method 4:

Stream computing (not covered here). pipelinedb is no longer maintained since the team joined Confluent; anyone who wants to tinker can keep maintaining pipelinedb.

You can currently also achieve incremental computation with a job scheduler plus Method 3.

Or use TimescaleDB's continuous aggregates.

https://docs.timescale.com/api/latest/continuous-aggregates/create_materialized_view/
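A minimal sketch of a TimescaleDB continuous aggregate that keeps daily per-item counts up to date; the order_items hypertable and the view name are assumptions, and plain count(*) is used as a simple example:

-- assumes: create extension timescaledb; and a table order_items(ts timestamptz, order_id int8, itemid int)
select create_hypertable('order_items', 'ts');
create materialized view daily_item_counts
with (timescaledb.continuous) as
select time_bucket('1 day', ts) as day, itemid, count(*) as cnt
from order_items
group by time_bucket('1 day', ts), itemid;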

References

1. Top-k algorithm papers

https://www.hlt.inesc-id.pt/~fmmb/wiki/uploads/Work/dict.refd.pdf
https://pipelinedb-doc-cn.readthedocs.io/zh_CN/latest/builtin.html#top-k

2. Stream computing

3. Continuous aggregation

4. Approximate computing

5.

《Why beer and diapers go best together: querying the best product-marketing combinations with HybridDB/PostgreSQL》

6. E-commerce traffic estimation

https://finance.sina.com.cn/tech/2020-11-12/doc-iiznezxs1360582.shtml

From November 1 through 0:00 on November 12, total Tmall Double 11 GMV reached 498.2 billion yuan. Real-time data showed that during Tmall Double 11, more than 450 brands each surpassed 100 million yuan in sales.

At 23:00 on November 11, real-time logistics orders for the 2020 Tmall Double 11 Global Shopping Festival exceeded 2.25 billion, roughly 100 million orders per day on average, a figure about equal to China's total express-parcel volume for all of 2010.

7. Interpreting statistics

8. Long-tail model

《PostgreSQL pgbench supports long-tail data generation: discrete power-law probability distribution (random_zipfian)》

9. How to define custom aggregates

10. Sliding-window analysis

《PostgreSQL count-min sketch top-n probability extension cms_topn (with windows for year-over-year, period-over-period, and sliding-window analysis), one of the core features of stream computing》