天天看点

利用pg_stat_activity做日常检查及异常SQL诊断

AnalyticDB for PostgreSQL(原HybridDB for PostgreSQL,以下简称ADB for PG)作为高性能分析型数据库,可以支持用户对其业务数据进行实时分析,能够让企业敏锐感知市场动态,做出必要决策。本文从ADB for PG用户角度出发,阐述如何借助pg_stat_activity这一系统视图实现实例的基本健康检查。

pg_stat_activity定义

postgres=#  \d+ pg_stat_activity;
                       View "pg_catalog.pg_stat_activity"
      Column      |           Type           | Modifiers | Storage  | Description
------------------+--------------------------+-----------+----------+-------------
 datid            | oid                      |           | plain    | OID of the database this backend is connected to
 datname          | name                     |           | plain    | Name of the database this backend is connected to
 procpid          | integer                  |           | plain    | Process ID of this backend
 sess_id          | integer                  |           | plain    | 
 usesysid         | oid                      |           | plain    | OID of the user logged into this backend
 usename          | name                     |           | plain    | Name of the user logged into this backend
 current_query    | text                     |           | extended | 
 waiting          | boolean                  |           | plain    | True if this backend is currently waiting on a lock
 query_start      | timestamp with time zone |           | plain    | Time when the currently active query was started
 backend_start    | timestamp with time zone |           | plain    | Time when this process was started, i.e., when the client connected to the server
 client_addr      | inet                     |           | plain    |
 client_port      | integer                  |           | plain    |
 application_name | text                     |           | extended |
 xact_start       | timestamp with time zone |           | plain    |
 waiting_reason   | text                     |           | extended | 
View definition:
 SELECT s.datid, d.datname, s.procpid, s.sess_id, s.usesysid, u.rolname AS usename, s.current_query, s.waiting, s.query_start, s.backend_start, s.client_addr, s.client_port, s.application_name, s.xact_start, s.waiting_reason
   FROM pg_database d, pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port, sess_id, waiting_reason), pg_authid u
  WHERE s.datid = d.oid AND s.usesysid = u.oid;           

这是当前ADB for PG的版本输出,其中Description列根据官方文档手动补充。

连接信息

想确认某个用户是否连接到当前DB上:

postgres=# SELECT datname,usename FROM pg_stat_activity WHERE usename = 'joe';
 datname  | usename
----------+---------
 postgres | joe
(1 row)           

进一步确认当前所有的连接信息,包括哪些机器在连接:

postgres=# SELECT datname,usename,client_addr,client_port FROM pg_stat_activity ;
datname  |  usename  |  client_addr   | client_port
----------+-----------+----------------+-------------
 postgres | joe       |  xx.xx.xx.xx   |       60621
 postgres | gpmon     |  xx.xx.xx.xx   |       60312
(9 rows)           

通过上述信息就能确认当前的连接用户和对应的连接机器。

SQL运行信息

当前DB上运行的SQL也可以通过pg_stat_activity来获取,首先需要确认参数track_activities = on已经设置。该参数会默认设置好,只需要确认即可:

postgres=# show track_activities;
 track_activities
------------------
 on
(1 row)           

获取当前用户执行SQL信息:

postgres=# SELECT datname,usename,current_query FROM pg_stat_activity ;
 datname  | usename  |                        current_query
----------+----------+--------------------------------------------------------------
 postgres | postgres | SELECT datname,usename,current_query FROM pg_stat_activity ;
 postgres | joe      | <IDLE>
(2 rows)           

只看当前正在运行的SQL信息:

SELECT datname,usename,current_query
   FROM pg_stat_activity
   WHERE current_query != '<IDLE>' ;           

查看耗时较长的查询:

select
       current_timestamp - query_start as runtime,
       datname,
       usename,
       current_query
    from pg_stat_activity
    where current_query != '<IDLE>'
    order by 1 desc;           

例如输出:

runtime     |    datname     | usename  |                                current_query
-----------------+----------------+----------+------------------------------------------------------------------------------
 00:00:34.248426 | tpch_1000x_col | postgres | select
                                             :         l_returnflag,
                                             :         l_linestatus,
                                             :         sum(l_quantity) as sum_qty,
                                             :         sum(l_extendedprice) as sum_base_price,
                                             :         sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
                                             :         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
                                             :         avg(l_quantity) as avg_qty,
                                             :         avg(l_extendedprice) as avg_price,
                                             :         avg(l_discount) as avg_disc,
                                             :         count(*) as count_order
                                             : from
                                             :         public.lineitem
                                             : where
                                             :         l_shipdate <= date '1998-12-01' - interval '93' day
                                             : group by
                                             :         l_returnflag,
                                             :         l_linestatus
                                             : order by
                                             :         l_returnflag,
                                             :         l_linestatus;
 00:00:00        | postgres       | postgres | select
                                             :        current_timestamp - query_start as runtime,
                                             :        datname,
                                             :        usename,
                                             :        current_query
                                             :     from pg_stat_activity
                                             :     where current_query != '<IDLE>'
                                             :     order by 1 desc;
(2 rows)           

可以看到第一个查询耗时较久,已经运行了34s还没有结束。

异常SQL诊断及修复

如果一个SQL运行很长时间没有结果,需要检查该SQL还在运行中还是已经被block了:

SELECT datname,usename,current_query
   FROM pg_stat_activity
   WHERE waiting;           

需要注意的是这个输出只能获取当前因为lock而被block的SQL,因为其他原因被block的SQL这里获取不到。绝大多数情况下SQL都是因为lock而被block,但也会有一些其他情况例如等待i/o、定时器等。

如果上述SQL有结果输出说明有SQL被lock阻塞,进一步明确相互block的SQL信息:

SELECT
       w.current_query as waiting_query,
       w.procpid as w_pid,
       w.usename as w_user,
       l.current_query as locking_query,
       l.procpid as l_pid,
       l.usename as l_user,
       t.schemaname || '.' || t.relname as tablename
    from pg_stat_activity w
    join pg_locks l1 on w.procpid = l1.pid and not l1.granted
    join pg_locks l2 on l1.relation = l2.relation and l2.granted
    join pg_stat_activity l on  l2.pid = l.procpid
    join pg_stat_user_tables t on l1.relation = t.relid
    where w.waiting;           

通过这个SQL的输出信息就能确认相互block的SQL和对应的执行pid。在明确了SQL的阻塞信息后,可以通过cancel或者kill query的方式进行恢复。

通过cancel取消一个正在运行的query:

SELECT pg_cancel_backend(pid)           

需要在一个运行query的session中执行,如果session本身就是idle的,执行不起作用。另外取消这个query需要花费一定的时间来做清理和事务的回滚。

使用pg_terminate_backend来清理idle session,也可以用来终止query:

SELECT pg_terminate_backend(pid);           

该用户的连接会被断开。尽量避免在正在运行query的进程pid上执行。

需要注意的是文中提到操作需要用户有superuser的权限。