天天看點

AnalyticDB for PostgreSQL 6 新特性解析 - Index Only Scan原理介紹使用示例GP的限制參考

原理介紹

PG中所有的索引都是二級索引,即在進行索引查詢的過程中,需要同時通路索引資料和源表資料。Index Only Scan按照字面意思了解,即在查詢過程中隻需要掃描索引資料。這種掃描方式需要一個前提就是索引中包含了查詢所需要的所有資料(也叫覆寫索引),如出現在SELECT, WHERE, ORDER BY中所引用的列。

由于PG的MVCC機制,在沒有Index only scan之前,任何索引查詢都需要經過通過源表資料進行可見性檢查,如圖所示:

AnalyticDB for PostgreSQL 6 新特性解析 - Index Only Scan原理介紹使用示例GP的限制參考

在索引掃描到過程中,需要通過源表擷取每個Record的可見性資訊。

在PG9.2版本以後,支援了Index Only Scan,如果一個查詢所需要的資料能夠完全可以被索引覆寫,那麼Index Only Scan就會成為一種新的掃描路徑,并且通過Visibility map避免了通過擷取源表進行可見性檢查,提升了查詢性能,如果所示:

AnalyticDB for PostgreSQL 6 新特性解析 - Index Only Scan原理介紹使用示例GP的限制參考

這裡主要依賴了Visibility map的機制,Visibility map中有一個标記位,标記了Page中的元組是否都是可見的,也就意味着如果表沒有被delete、update過或者已經被vacuum過了。

如果Visibility map能夠确認該Index entry所對應的Page都是可見的,那麼就不再擷取源表Record進行可見性判斷了,否則還需要擷取源表元組并進行可見性判斷。

使用示例

GP6版本內建了PG9.4版本,是以也支援了Index Only Scan的特性。

例如存在一張表,并在其中一個列上建立了索引:

postgres=# \d customer_reviews_hp
         Table "public.customer_reviews_hp"
        Column        |      Type       | Modifiers
----------------------+-----------------+-----------
 customer_id          | text            |
 review_date          | date            |
 review_rating        | integer         |
 review_votes         | integer         |
 review_helpful_votes | integer         |
 product_id           | character(10)   |
 product_title        | text            |
 product_sales_rank   | bigint          |
 product_group        | text            |
 product_category     | text            |
 product_subcategory  | text            |
 similar_product_ids  | character(10)[] |
Indexes:
    "c_review_rating" btree (review_rating)
Distributed by: (customer_id)           

查詢:

postgres=# explain analyze select count(*), review_rating from customer_reviews_hp where review_rating > 1 group by 2;
                                                                                       QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------
 Gather Motion 4:1  (slice2; segments: 4)  (cost=49979.36..49979.50 rows=5 width=12) (actual time=782.673..782.726 rows=4 loops=1)
   ->  GroupAggregate  (cost=49979.36..49979.50 rows=2 width=12) (actual time=782.384..782.385 rows=2 loops=1)
         Group Key: customer_reviews_hp.review_rating
         ->  Sort  (cost=49979.36..49979.37 rows=2 width=12) (actual time=782.376..782.377 rows=8 loops=1)
               Sort Key: customer_reviews_hp.review_rating
               Sort Method:  quicksort  Memory: 132kB
               ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.18..49979.30 rows=2 width=12) (actual time=76.538..782.345 rows=8 loops=1)
                     Hash Key: customer_reviews_hp.review_rating
                     ->  GroupAggregate  (cost=0.18..49979.20 rows=2 width=12) (actual time=5.102..73.709 rows=4 loops=1)
                           Group Key: customer_reviews_hp.review_rating
                           ->  Index Only Scan using c_review_rating on customer_reviews_hp  (cost=0.18..41742.09 rows=411854 width=4) (actual time=0.128..643.718 rows=1061311 lo
ops=1)
                                 Index Cond: (review_rating > 1)
                                 Heap Fetches: 0
 Planning time: 0.212 ms
   (slice0)    Executor memory: 220K bytes.
   (slice1)    Executor memory: 156K bytes avg x 4 workers, 156K bytes max (seg0).
   (slice2)    Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).  Work_mem: 33K bytes max.
 Memory used:  2047000kB
 Optimizer: Postgres query optimizer
 Execution time: 783.308 ms
(20 rows)           

由此可見啟用了Index Only Scan。

可以通過enable_indexonlyscan來控制是否使用Index Only Scan,例如同樣上面的查詢設定enable_indexonlyscan為off後,再次執行:

postgres=# explain analyze select count(*), review_rating from customer_reviews_hp where review_rating > 1 group by 2;
                                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--
 Gather Motion 4:1  (slice2; segments: 4)  (cost=49979.36..49979.50 rows=5 width=12) (actual time=951.830..951.840 rows=4 loops=1)
   ->  GroupAggregate  (cost=49979.36..49979.50 rows=2 width=12) (actual time=951.566..951.567 rows=2 loops=1)
         Group Key: customer_reviews_hp.review_rating
         ->  Sort  (cost=49979.36..49979.37 rows=2 width=12) (actual time=951.556..951.556 rows=8 loops=1)
               Sort Key: customer_reviews_hp.review_rating
               Sort Method:  quicksort  Memory: 132kB
               ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.18..49979.30 rows=2 width=12) (actual time=75.010..951.527 rows=8 loops=1)
                     Hash Key: customer_reviews_hp.review_rating
                     ->  GroupAggregate  (cost=0.18..49979.20 rows=2 width=12) (actual time=5.211..77.359 rows=4 loops=1)
                           Group Key: customer_reviews_hp.review_rating
                           ->  Index Scan using c_review_rating on customer_reviews_hp  (cost=0.18..41742.09 rows=411854 width=4) (actual time=0.118..817.460 rows=1061311 loops=1
)
                                 Index Cond: (review_rating > 1)
 Planning time: 0.217 ms
   (slice0)    Executor memory: 156K bytes.
   (slice1)    Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).
   (slice2)    Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).  Work_mem: 33K bytes max.
 Memory used:  2047000kB
 Optimizer: Postgres query optimizer
 Execution time: 952.473 ms
(19 rows)           

隻是用到了索引,不是Index Only Scan,執行時間上增加了将近200ms,下降了20%左右。

但同時需要注意的是,Index Only Scan并不是銀彈,做到Index Only Scan往往需要建立聯合索引,聯合索引本身也會有性能問題,例如影響寫入、更新性能等。需要具體問題具體分析,Index Only Scan隻是多了一種可優化路徑選擇。

GP的限制

  1. Orca優化器不支援Index Only Scan,GP6版本中,隻有PG原生的優化器支援Index Only Scan。
  2. 列存表也不支援Index Only Scan,Index Only Scan依賴Visibility map機制實作,列存表顯然做不到Index Only Scan。
  3. GP上的Index Only Scan在explain analyze時,Heap Fetches顯示不準确,例如:
create table test (a , b ,c);
create table test (a int, b int ,c int);
insert into test values(generate_series(1,100000),generate_series(1,100000),generate_series(1,100000));
create index a_ind on test(a,b,c);

-- Master上執行:
postgres=# explain analyze select * from test where a > 1 order by a;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.17..2463.87 rows=99990 width=12) (actual time=1.169..84.196 rows=99999 loops=1)
   Merge Key: a
   ->  Index Only Scan using a_ind on test  (cost=0.17..2463.87 rows=99990 width=12) (actual time=0.116..44.373 rows=99999 loops=1)
         Index Cond: (a > 1)
         Heap Fetches: 0
 Planning time: 0.685 ms
   (slice0)    Executor memory: 216K bytes.
   (slice1)    Executor memory: 148K bytes (seg0).
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution time: 96.809 ms
(11 rows)           

顯示Heap Fetchs為0,而直接連上segment進行explain analyze:

postgres=# explain analyze select * from test where a > 1 order by a;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using a_ind on test  (cost=0.29..1255.62 rows=33334 width=12) (actual time=0.072..39.561 rows=99999 loops=1)
   Index Cond: (a > 1)
   Heap Fetches: 99999
 Planning time: 0.148 ms
   (slice0)
 Optimizer: Postgres query optimizer
 Execution time: 47.481 ms
(7 rows)           

其實是存在Heap Fetches的,從執行時間上看,Master上的Heap Fetches項顯示不對。

這種情況需要依賴Vacuum來做Visibility Map的清理工作了。正常情況下做下Vacuum analyze就能保證不不需要Heap Fetch。

參考

https://www.postgresql.org/docs/current/indexes-index-only-scans.html

繼續閱讀