A PostgreSQL Operations Case: Pitfalls Encountered with pg_pathman Range Partitioning

I. Background

Recently, in one of our test environments, a postgres process repeatedly exhausted the host's memory and triggered the OOM killer, even causing the host to reboot several times and interrupting some services. The OOM records in the messages log showed the process's anonymous memory (anon) usage reaching tens of GB.

The process appeared to be running nothing more than a simple SELECT, as follows:

(For information-security reasons, the table names and other details in the SQL demonstrations below come from my personal machine and have nothing to do with Ping An's business systems.)

Printing the execution plan of this SQL showed that it scanned more than 20,000 partitions; the table is range-partitioned with pg_pathman. That raised two questions:

(1) Why were the matching partitions not selected, so that every partition was scanned instead?

(2) The range interval is 7 days and the table only holds data from 2018 onwards, so why are there more than 20,000 partitions?

As for question 1, I had seen something similar before: if the condition on the partition column involves a function expression, or a conversion function such as to_date() or to_timestamp(), the matching partitions are not selected and all partitions get scanned; casts of the form ::date or ::timestamp, however, do work.

Addendum:

It later turned out that this only holds for clients such as psql, pgAdmin and Navicat; with the JDBC driver, even the ::timestamp form is sometimes parsed into a T_FuncExpr condition and the partition-selection logic is skipped, so in Java code it is better to pass a genuine timestamp value and drop the cast altogether.

An example:

With the ::timestamp cast, the execution plan scans only the two partitions that cover the query range:
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..71.00 rows=1360 width=12)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2  (cost=0.00..35.50 rows=680 width=12)
         Filter: (add_time >= '2020-01-09 00:00:00'::timestamp without time zone)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3  (cost=0.00..35.50 rows=680 width=12)
         Filter: (add_time < '2020-01-19 00:00:00'::timestamp without time zone)
(5 rows)
           
With to_timestamp(), the execution plan scans all 11 partitions:
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
 Append  (cost=0.00..558.80 rows=110 width=12)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_1  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_4  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_5  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_6  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_7  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_8  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_9  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_10  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_11  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
(23 rows)

postgres=#
           

So rewriting to_timestamp() as ::timestamp worked around the problem for the time being. I had never dug into the underlying cause before, so the rest of this article focuses on the root cause of question 1.

Now for question 2: why are there more than 20,000 partitions? Looking up the physical files through the tables' relfilenode showed that they had all been created at a single moment the previous afternoon. The server log for that time shows one row being inserted into the table, with a partition-key value far in the future. The table happens to use range auto-extension: when an inserted row falls outside all existing partitions, pg_pathman automatically creates a new partition and also fills in every missing partition in between. With a 7-day interval, 20,000 partitions span roughly 20,000 × 7 / 365 ≈ 383 years, so a single row stamped a few centuries ahead is enough. One INSERT produced the 20,000+ partitions: a genuine disaster caused by a single SQL statement.

II. Partition Analysis

So why does the behaviour described in question 1 happen? This is where open source pays off: you can go looking for the answer in the source code yourself.

As is well known, pg_pathman works through hooks that modify the original query tree and plan tree. The PostgreSQL source already provides entry points for such hook-based extensions: at startup, process_shared_preload_libraries() locates the library of each configured extension and runs its _PG_init() function, which performs some initialisation and installs the extension's hook functions; when execution reaches a hook entry point, the hook is simply invoked.
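As a side note, the hook mechanism itself is straightforward. Below is a minimal, generic sketch of how an extension installs a planner hook, written against the PostgreSQL 9.6 hook signatures; the names my_planner and prev_planner_hook are illustrative only, and this is not pg_pathman's actual code (its hook is pathman_planner_hook, seen in the breakpoints later):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/params.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

/* remember any previously installed hook so we can chain to it */
static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
my_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
	/* ... inspect or modify the query tree here ... */
	if (prev_planner_hook)
		return prev_planner_hook(parse, cursorOptions, boundParams);
	return standard_planner(parse, cursorOptions, boundParams);
}

void
_PG_init(void)
{
	/* run by process_shared_preload_libraries() when the library is preloaded */
	prev_planner_hook = planner_hook;
	planner_hook = my_planner;
}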

How does pg_pathman determine which range partitions are needed? It is done by the following function:

/* Given 'value' and 'ranges', return selected partitions list */
void
select_range_partitions(const Datum value,       /* constant taken from the WHERE clause */
						const Oid collid,         /* collation used for the comparison */
						FmgrInfo *cmp_func,       /* comparison function, e.g. timestamp_cmp */
						const RangeEntry *ranges, /* sorted array of partition bounds */
						const int nranges,        /* number of partitions */
						const int strategy,       /* btree strategy number (>=, <, ...) */
						WrapperNode *result)      /* returned partitions */
{
 	/* The function body is fairly long and omitted here; the gdb trace below walks through its overall logic */
}	

           

I skimmed the source code. pg_pathman is not a large code base, but with my limited code-reading skills I could not untangle the logic in a hurry and honestly felt a bit lost. So I chose the clumsiest, yet for me most effective, approach: tracing with gdb.

III. Tracing with GDB

1. Preparation

Session 1: executes the SQL

Session 2: attaches gdb and traces

The OIDs of the tables involved:

oid  |                     relname
-------+-------------------------------------------------
 16781 | qsump_pacloud_oscginfo_activity_detail_info_day -- parent table, 11 partitions in total
 16863 | qsump_pacloud_oscginfo_activity_detail_info_day_2
 16869 | qsump_pacloud_oscginfo_activity_detail_info_day_3
 ....
 16917 | qsump_pacloud_oscginfo_activity_detail_info_day_11
           
2. Debugging the ::timestamp form of the statement

Session 1:

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          31698
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;

|
           

Session 2:

To observe the whole process I set breakpoints on almost all of pg_pathman's hook functions and on several key functions that build the plan tree; only the trace of the range-partition selection logic is shown here.

In the debug output, my annotations are marked as ##comment##.

[[email protected]_zabbix ~]$ gdb --pid 31698
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 31698
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
Loaded symbols for /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007fef912c15e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) b exec_simple_query   
Breakpoint 1 at 0x7b9bc2: file postgres.c, line 867.
(gdb) b pg_plan_queries   
Breakpoint 2 at 0x7b9b25: file postgres.c, line 834.
(gdb) b pathman_rel_pathlist_hook    
Breakpoint 3 at 0x7fef8a86594c: file src/hooks.c, line 263.
(gdb) b pathman_join_pathlist_hook
Breakpoint 4 at 0x7fef8a8652d7: file src/hooks.c, line 79.
(gdb) b pathman_shmem_startup_hook
Breakpoint 5 at 0x7fef8a866772: file src/hooks.c, line 687.
(gdb) b pathman_post_parse_analysis_hook
Breakpoint 6 at 0x7fef8a86650b: file src/hooks.c, line 587.
(gdb) b pathman_planner_hook
Breakpoint 7 at 0x7fef8a8662d3: file src/hooks.c, line 524.
(gdb) b pathman_process_utility_hook
Breakpoint 8 at 0x7fef8a86696b: file src/hooks.c, line 795.
(gdb) b pg_plan_query
Breakpoint 9 at 0x7b9a8e: file postgres.c, line 778.
(gdb) b planner
Breakpoint 10 at 0x6fc596: file planner.c, line 175.
(gdb) b add_partition_filters
Breakpoint 11 at 0x7fef8a86a713: file src/planner_tree_modification.c, line 378.
(gdb) b partition_filter_visitor
Breakpoint 12 at 0x7fef8a86a74a: file src/planner_tree_modification.c, line 390.
(gdb) b get_pathman_relation_info
Breakpoint 13 at 0x7fef8a851c05: file src/relation_info.c, line 361.
(gdb) b cache_parent_of_partition
Breakpoint 14 at 0x7fef8a852e6d: file src/relation_info.c, line 1015.
(gdb) b handle_modification_query
Breakpoint 15 at 0x7fef8a86a3bf: file src/planner_tree_modification.c, line 255.
(gdb) b select_range_partitions   
Breakpoint 16 at 0x7fef8a85940d: file src/pg_pathman.c, line 531.
(gdb) b walk_expr_tree           
Breakpoint 17 at 0x7fef8a85992d: file src/pg_pathman.c, line 717.
(gdb) b handle_opexpr               
Breakpoint 18 at 0x7fef8a85ab0e: file src/pg_pathman.c, line 1317.
(gdb) b IsConstValue
Breakpoint 19 at 0x7fef8a858811: file src/pg_pathman.c, line 114.
(gdb) n
(gdb) set print pretty                       
           

The key function that calls select_range_partitions

(gdb) list                      ##print the body of handle_opexpr##
1328                            int                             strategy;
1329
1330                            tce = lookup_type_cache(prel->ev_type, TYPECACHE_BTREE_OPFAMILY);
1331                            strategy = get_op_opfamily_strategy(expr->opno, tce->btree_opf);
1332                            ##when IsConstValue returns true, handle_const is called##
1333                            if (IsConstValue(param, context))
1334                            {
1335                                    handle_const(ExtractConst(param, context),
1336                                                             expr->inputcollid,
1337                                                             strategy, context, result);
(gdb) n
Breakpoint 19, IsConstValue (node=0x20682f0, context=0x7ffe113be790) at src/pg_pathman.c:114   ##breakpoint 19 hit: check whether the >= and < conditions in our SQL are T_Const nodes##
114             switch (nodeTag(node))
(gdb) p *node   ##the expr node we passed in has node type T_Const##
$4 = {
  type = T_Const
}
(gdb) list     ##print the function body##
109
110     /* Can we transform this node into a Const? */
111     static bool
112     IsConstValue(Node *node, const WalkerContext *context)
113     {
114             switch (nodeTag(node))
115             {       ##returns true when the type is T_Const##
116                     case T_Const:   
117                             return true;
118


(gdb) list
119                     case T_Param:
120                             return WcxtHasExprContext(context);
121
122                     case T_RowExpr:
123                             {
124                                     RowExpr    *row = (RowExpr *) node;
125                                     ListCell   *lc;
126
127                                     /* Can't do anything about RECORD of wrong type */
128                                     if (row->row_typeid != context->prel->ev_type)
(gdb) n
117                             return true;
(gdb)
141     }
(gdb)   ##the function returned true; we enter handle_const, which is about to call select_range_partitions##
handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
1335                                    handle_const(ExtractConst(param, context),
(gdb)      ##we have entered select_range_partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
    result=0x2070a10) at src/pg_pathman.c:531
(gdb) bt   ##print the backtrace to see the call chain##
#0  select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
    result=0x2070a10) at src/pg_pathman.c:540
#1  0x00007fef8a859ea3 in handle_const (c=0x20682f0, collid=0, strategy=4, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:929
#2  0x00007fef8a85abd7 in handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
#3  0x00007fef8a8599d4 in walk_expr_tree (expr=0x2068360, context=0x7ffe113be790) at src/pg_pathman.c:734
#4  0x00007fef8a865bfd in pathman_rel_pathlist_hook (root=0x2067e40, rel=0x206fcc8, rti=1, rte=0x20674b8) at src/hooks.c:345   

    ##lower stack frames omitted##
           

The partition-selection logic inside select_range_partitions

(gdb) 
      ##here value is the lower bound of the query range, i.e. the >= value '2020-01-09 00:00:00'; nranges=11 shows there are 11 partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
    result=0x2070a10) at src/pg_pathman.c:531
531             bool    lossy = false,
(gdb)
540             int             startidx = 0,
(gdb) bt

(gdb)
541                             endidx = nranges - 1,
(gdb)
546             Bound   value_bound = MakeBound(value); /* convert value to Bound */
(gdb)
550             result->found_gap = false;
(gdb)
(gdb) p *cmp_func      ##the comparison function used is timestamp_cmp##
$40 = {
  fn_addr = 0x8b2ffa <timestamp_cmp>, 
  fn_oid = 2045,
  fn_nargs = 2,
  fn_strict = 1 '\001',
  fn_retset = 0 '\000',
  fn_stats = 2 '\002',
  fn_extra = 0x0,
  fn_mcxt = 0x1f82948,
  fn_expr = 0x0

(gdb)
553             if (nranges == 0)
(gdb)                   ##cmp_func is timestamp_cmp, whose return value is: return (dt1 < dt2) ? -1 : ((dt1 > dt2) ? 1 : 0);##
						##cmp_bounds is a small wrapper that calls cmp_func to compare value_bound (the constant from our SQL, i.e. the >= / < value) against a partition's range_min and range_max##
						##cmp_min = -1 means value_bound is less than ranges[i].min##
						##cmp_max = -1 means value_bound is less than ranges[i].max; together the two determine whether ranges[i] is the partition we are looking for##

						##a neat detail: it first compares against ranges[startidx].min (the lower bound of the first partition) and ranges[endidx].max (the upper bound of the last partition) to check whether the value falls within the overall partitioned range at all; if not it returns straight away, otherwise it keeps narrowing the range down##
566                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[startidx].min);
(gdb)                  
567                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[endidx].max);
(gdb)
569                     if ((cmp_min <= 0 && strategy == BTLessStrategyNumber) ||
(gdb) p nranges
$35 = 11                ##11 partitions in total##
(gdb) p value_bound     ##the current range constant is the value after >= in the SQL##
$36 = {
  value = 631843200000000,
  is_infinite = 0 '\000'
}
(gdb) p ranges[startidx] ##ranges[0] is the first partition##
$37 = {
  child_oid = 16857,
  min = {
    value = 631152000000000,  
    is_infinite = 0 '\000'
  },
  max = {
    value = 631756800000000,
    is_infinite = 0 '\000'
  }
}
(gdb) n


(gdb)
646                     else if (is_greater)
(gdb)
647                             startidx = i + 1;
(gdb)
651             }
(gdb)                   ##from here on the partition indexes are manipulated, essentially a binary search over them##
611                     i = startidx + (endidx - startidx) / 2;  
(gdb)
615                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
(gdb)                   ##if is_less is false, the lower value (the >= constant) is not below the current partition's lower bound range_min##
618                     is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber)); 
(gdb)                   ##if is_greater is false, the lower value (the >= constant) is below the current partition's upper bound range_max##
619                     is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb)                   ##the if condition is true, so the partition containing the query's lower bound has been found##
621                     if (!is_less && !is_greater) 
(gdb)
623                             if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625                             else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)                                 
628                                     lossy = true;      
(gdb) p ranges[i]
$38 = {
  child_oid = 16863,                   ##the query's lower bound falls inside the partition with oid=16863, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_2##
  min = {
    value = 631756800000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 632361600000000,
    is_infinite = 0 '\000'
  }
}
(gdb) n
633                             break;
(gdb)
657             switch(strategy)
(gdb)
680                             if (lossy)
(gdb)
682                                     result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb) n
683                                     if (i < nranges - 1)
(gdb)                                   ##append the matched lower-bound partition to result->rangeset##
685                                     lappend_irange(result->rangeset,
(gdb)
684                                             result->rangeset =
(gdb)
697                             break;

           
(gdb)##now the upper bound is matched; value is the < constant from the SQL, '2020-01-19 00:00:00'##
Breakpoint 20, select_range_partitions (value=632707200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=1,
    result=0x2036d30) at src/pg_pathman.c:531
531             bool    lossy = false,


(gdb) n
611                     i = startidx + (endidx - startidx) / 2;
(gdb)


(gdb)
615                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
(gdb)                   ##if is_less is false, the upper value (the < constant) is not below the current partition's lower bound range_min##
618                     is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber));
(gdb)                   ##if is_greater is false, the upper value (the < constant) is below the current partition's upper bound range_max##
619                     is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb)
(gdb) p is_less
$45 = 0 '\000'
(gdb) p is_greater
$46 = 0 '\000'
(gdb)                   ##the if condition is true, so the partition containing the query's upper bound has been found##
621                     if (!is_less && !is_greater)
(gdb)
623                             if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625                             else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)
628                                     lossy = true;
(gdb)
(gdb) p i
$44 = 2
(gdb)
(gdb) p ranges[i]             ##the upper bound falls in oid=16869, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_3##
$43 = {
  child_oid = 16869,
  min = {
    value = 632361600000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 632966400000000,
    is_infinite = 0 '\000'
  }
}

633                             break;
(gdb)
657             switch(strategy)
(gdb)
661                             if (lossy)
(gdb)                            ##append the matched upper-bound partition to result->rangeset##
663                                     result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb)
664                                     if (i > 0)

##at this point all the partitions required by the query have been matched##
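
To consolidate what the trace shows, here is a small self-contained sketch of the same idea: a binary search over sorted, contiguous ranges to find the one containing a value. The Range struct and find_range() are illustrative, not pg_pathman types; the real select_range_partitions additionally handles infinite bounds, the >= / < strategies and the lossy flag seen above.

#include <stdio.h>

typedef struct { long min; long max; } Range;   /* half-open interval [min, max) */

/* Return the index of the range containing 'value', or -1 if none does;
 * mirrors the binary search traced in select_range_partitions above. */
static int
find_range(const Range *ranges, int nranges, long value)
{
	int startidx = 0, endidx = nranges - 1;

	/* quick reject: the value lies outside the whole partitioned interval */
	if (nranges == 0 || value < ranges[0].min || value >= ranges[nranges - 1].max)
		return -1;

	while (startidx <= endidx)
	{
		int i = startidx + (endidx - startidx) / 2;
		int is_less    = (value < ranges[i].min);
		int is_greater = (value >= ranges[i].max);

		if (!is_less && !is_greater)
			return i;               /* found: min <= value < max */
		else if (is_greater)
			startidx = i + 1;       /* continue in the upper half */
		else
			endidx = i - 1;         /* continue in the lower half */
	}
	return -1;
}

int main(void)
{
	/* three "weekly" ranges, analogous to 7-day partitions */
	Range parts[] = { {0, 7}, {7, 14}, {14, 21} };
	printf("%d\n", find_range(parts, 3, 9));    /* prints 1 */
	return 0;
}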
           

Next the partitions are attached to the root node. The loop runs twice, and append_child_relation() adds the two matched partitions to the root node.

384                     parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387                     if (prel->enable_parent)
(gdb)
393                     foreach(lc, ranges)
(gdb)
395                             IndexRange irange = lfirst_irange(lc);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393                     foreach(lc, ranges)
(gdb) p i
$49 = 3
(gdb) p children[2]
$51 = 16869
(gdb)
(gdb) p children[1]
$52 = 16863
(gdb)
           

The cost-estimation and get_cheapest_fractional_path steps are omitted; let's go straight to the final plantree.

pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792             if (log_planner_stats)
(gdb)           ##if the debug_print_plan parameter is enabled, the complete plantree is written to the server log##
817             if (Debug_print_plan)
(gdb)                   
818                     elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$59 = {
  type = T_PlannedStmt,
  commandType = CMD_SELECT,
  queryId = 0,
  hasReturning = 0 '\000',
  hasModifyingCTE = 0 '\000',
  canSetTag = 1 '\001',
  transientPlan = 0 '\000',
  dependsOnRole = 0 '\000',
  parallelModeNeeded = 0 '\000',
  planTree = 0x206fde8,
  rtable = 0x20700b8,
  resultRelations = 0x0,
  utilityStmt = 0x0,
  subplans = 0x0,
  rewindPlanIDs = 0x0,
  rowMarks = 0x0,
  relationOids = 0x2070108,
  invalItems = 0x0,
  nParamExec = 0
} ##the final plan->relationOids list contains three nodes, i.e. the parent table plus the two partitions, which matches the 'good' execution plan we saw earlier##
(gdb) p *plan->relationOids
$60 = {
  type = T_OidList,
  length = 3,
  head = 0x20700e8,
  tail = 0x2070428
}
(gdb) p *plan->relationOids->head ##parent table##
$61 = {
  data = {
    ptr_value = 0x418d,
    int_value = 16781,
    oid_value = 16781
  },
  next = 0x20702d8
}     
		##partition qsump_pacloud_oscginfo_activity_detail_info_day_2##
(gdb) p *plan->relationOids->head->next 
$62 = {
  data = {
    ptr_value = 0x41df,
    int_value = 16863,
    oid_value = 16863
  },
  next = 0x2070428
}
		##partition qsump_pacloud_oscginfo_activity_detail_info_day_3##
(gdb) p *plan->relationOids->tail
$63 = {
  data = {
    ptr_value = 0x41e5,
    int_value = 16869,
    oid_value = 16869
  },
  next = 0x0
}
(gdb) n
822             return plan;
(gdb)
823     }
	##end of trace##
           
3. Debugging the to_timestamp() form of the statement

Only the parts that differ from the previous run are shown.

Session 1:

postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
|

           

Session 2:

##here it differs from before: the node type was T_Const previously, but now it is T_FuncExpr##
Breakpoint 20, IsConstValue (node=0x2071350, context=0x7ffe113be790) at src/pg_pathman.c:114
114             switch (nodeTag(node))
(gdb) p *node
$2 = {
  type = T_FuncExpr
}
(gdb) list
109
110     /* Can we transform this node into a Const? */
111     static bool
112     IsConstValue(Node *node, const WalkerContext *context)
113     {
114             switch (nodeTag(node))
115             {
116                     case T_Const:
117                             return true;
118
(gdb)
119                     case T_Param:
120                             return WcxtHasExprContext(context);
121
122                     case T_RowExpr:
123                             {
124                                     RowExpr    *row = (RowExpr *) node;
125                                     ListCell   *lc;
126
127                                     /* Can't do anything about RECORD of wrong type */
128                                     if (row->row_typeid != context->prel->ev_type)
(gdb)
129                                             return false;
130
131                                     /* Check that args are const values */
132                                     foreach (lc, row->args)
133                                             if (!IsConstValue((Node *) lfirst(lc), context))
134                                                     return false;
135                             }
136                             return true;
137                    ##IsConstValue has no case branch for T_FuncExpr, so it falls through to default and returns false##
138                     default:
(gdb)
139                             return false;
140             }
141     }
142
143     /* Extract a Const from node that has been checked by IsConstValue() */
144     static Const *
145     ExtractConst(Node *node, const WalkerContext *context)
146     {
147             ExprState          *estate;
148             ExprContext        *econtext = context->econtext;
(gdb) n
139                             return false;
(gdb)
141     }
(gdb)
handle_opexpr (expr=0x20713c0, context=0x7ffe113be790, result=0x2079e40) at src/pg_pathman.c:1342
1342                            else if (IsA(param, Param) || IsA(param, Var))
(gdb) n
                ##because IsConstValue returned false, handle_const is never entered and select_range_partitions is never called to pick partitions; instead all partitions are appended to result->rangeset##
1352            result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
(gdb) list
1347                                    return; /* done, exit */
1348                            }
1349                    }
1350            }
1351
1352            result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
1353            result->paramsel = 1.0;
1354    }
1355
1356
(gdb) n
1353            result->paramsel = 1.0;
(gdb)
1354    }
(gdb)
           

All partitions are appended to the root node, and our WHERE condition is treated merely as a filter; no partition selection based on the condition takes place first.

##all 11 partitions are appended to the root node##
(gdb)
384                     parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387                     if (prel->enable_parent)
(gdb)
393                     foreach(lc, ranges)
(gdb)
395                             IndexRange irange = lfirst_irange(lc);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393                     foreach(lc, ranges)
(gdb) p i
$3 = 11
           

Straight to the final plantree:

(gdb)
pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792             if (log_planner_stats)
(gdb)
817             if (Debug_print_plan)
(gdb)
818                     elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$5 = {
  type = T_PlannedStmt,
  commandType = CMD_SELECT,
  queryId = 0,
  hasReturning = 0 '\000',
  hasModifyingCTE = 0 '\000',
  canSetTag = 1 '\001',
  transientPlan = 0 '\000',
  dependsOnRole = 0 '\000',
  parallelModeNeeded = 0 '\000',
  planTree = 0x208f028,
  rtable = 0x208f2f8,
  resultRelations = 0x0,
  utilityStmt = 0x0,
  subplans = 0x0,
  rewindPlanIDs = 0x0,
  rowMarks = 0x0,
  relationOids = 0x208f348,
  invalItems = 0x0,
  nParamExec = 0
}
 		##the final plan->relationOids list contains 12 nodes, i.e. the parent table plus all 11 partitions##
(gdb) p *plan->relationOids
$4 = {
  type = T_OidList,
  length = 12,        
  head = 0x2091338,
  tail = 0x2092218
}
(gdb) p *plan->relationOids->head  ##parent table##
$5 = {
  data = {
    ptr_value = 0x418d,
    int_value = 16781,
    oid_value = 16781
  },
  next = 0x20914b8
}
##the 11th partition, qsump_pacloud_oscginfo_activity_detail_info_day_11##
(gdb) p *plan->relationOids->tail
$6 = {
  data = {
    ptr_value = 0x4215,
    int_value = 16917,
    oid_value = 16917
  },
  next = 0x0
}
(gdb)
           

IV. Summary and Reflections

1. Ideas for a fix

The gdb trace makes the cause clear: IsConstValue() has no case branch for T_FuncExpr, so a T_FuncExpr node falls through to default and no partition selection is performed.

So could the fix be one of the following?

1) Add a T_FuncExpr case branch to IsConstValue() so that partition selection still happens (see the sketch after the IsConstValue() listing below).

2) Skimming the core PostgreSQL source shows that it handles this in several places, for example by folding a T_FuncExpr into a simpler expression. Could pg_pathman likewise preprocess the node type and turn the T_FuncExpr into a T_Const?

I tested the then-latest pg_pathman release, 1.5.11, and it has the same problem, whereas PostgreSQL's native declarative partitioning does not: it handles both forms.

/* Can we transform this node into a Const? */
static bool
IsConstValue(Node *node, const WalkerContext *context)
{
	switch (nodeTag(node))
	{
		case T_Const:
			return true;

		case T_Param:
			return WcxtHasExprContext(context);

		case T_RowExpr:
			{
				RowExpr	   *row = (RowExpr *) node;
				ListCell   *lc;

				/* Can't do anything about RECORD of wrong type */
				if (row->row_typeid != context->prel->ev_type)
					return false;

				/* Check that args are const values */
				foreach (lc, row->args)
					if (!IsConstValue((Node *) lfirst(lc), context))
						return false;
			}
			return true;

		default:
			return false;
	}
}
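
As an untested illustration of direction 1) above (my own sketch, not a pg_pathman patch), IsConstValue()/ExtractConst() could accept a T_FuncExpr by trying to fold it into a Const first. The helper name try_fold_to_const is hypothetical; eval_const_expressions() and estimate_expression_value() are real planner functions.

#include "optimizer/clauses.h"	/* eval_const_expressions() */

/* Hypothetical helper: try to fold an expression (e.g. a FuncExpr over
 * constant arguments) into a Const; returns NULL if it cannot be folded. */
static Const *
try_fold_to_const(Node *node)
{
	/*
	 * eval_const_expressions() only folds immutable functions, and
	 * to_timestamp(text, text) is stable, so this alone does not cover the
	 * case in this article; estimate_expression_value() does fold stable
	 * functions, but it is intended for selectivity estimation, so a real
	 * fix would have to decide whether plan-time evaluation of stable
	 * functions is acceptable here.
	 */
	Node *folded = eval_const_expressions(NULL, node);

	return IsA(folded, Const) ? (Const *) folded : NULL;
}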
           
2. Is a fix necessary?

This is a matter of perspective. In my view the current behaviour is somewhat unfriendly to application developers, who are limited to the ::date style of casting.

I consulted experts in the community and learned that the performance of native declarative partitioning in PostgreSQL 12 is already on a par with pg_pathman, with hardly any gap, so from version 12 onwards it is worth preferring declarative partitioning and dropping the extension. For versions from 9.5 through 11, however, the built-in partitioning performs relatively poorly and pg_pathman may still be required.