前两篇讲的btree和gist的多字段索引,本篇顺理成章地讲一下gin的多字段索引。
前两篇请参考这里:
<a href="http://blog.chinaunix.net/uid-20726500-id-5088527.html" target="_blank">http://blog.chinaunix.net/uid-20726500-id-5088527.html</a>
<a href="http://blog.chinaunix.net/uid-20726500-id-5090166.html" target="_blank">http://blog.chinaunix.net/uid-20726500-id-5090166.html</a>
不像gist和btree,gin天生就适合做多字段索引,不管查询条件覆盖所有索引字段还是仅仅覆盖一个子集,它都可以胜任。
<a href="http://www.postgres.cn/docs/9.3/indexes-multicolumn.html" target="_blank">http://www.postgres.cn/docs/9.3/indexes-multicolumn.html</a>
---------------------------------------------------------------
一个多字段的 GIN 索引可以用于那些查询条件包含索引字段子集的查询中。 不像B-tree 或 GiST,除了查询条件使用的索引字段外,索引的搜索效率是相同的。
gin多字段索引的原理是,索引中的每个key由字段编号加上元素值构成,这样的key作为一个整体构成倒排索引树。
参考:src/backend/access/gin/README
...
* In a single-column index, a key tuple just contains the key datum, but
in a multi-column index, a key tuple contains the pair (column number,
key datum) where the column number is stored as an int2. This is needed
to support different key data types in different columns. This much of
the tuple is built by index_form_tuple according to the usual rules.
The column number (if present) can never be null, but the key datum can
be, in which case a null bitmap is present as usual. (As usual for index
tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.)
为了和前面的btree和gist有个可比性,测试环境和数据还采用上一篇的标量数据,而不是采用gin特有的集合数据,所以要使用btree-gin扩展。
测试环境详见
点击(此处)折叠或打开
postgres=# create extension btree_gin;
CREATE EXTENSION
postgres=# \timing
Timing is on.
postgres=# create index tb1_idx5 on tb1 using gin(c1,c2);
CREATE INDEX
Time: 23119.722 ms
postgres=# select pg_size_pretty(pg_relation_size('tb1_idx5'));
pg_size_pretty
----------------
129 MB
(1 row)
postgres=# explain (analyze,buffers) select count(*) from tb1 where c1=99 and c2=999;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=404.70..404.71 rows=1 width=0) (actual time=15.797..15.798 rows=1 loops=1)
Buffers: shared hit=179 read=2
-> Bitmap Heap Scan on tb1 (cost=21.01..404.45 rows=99 width=0) (actual time=14.545..15.769 rows=92 loops=1)
Recheck Cond: ((c1 = 99) AND (c2 = 999))
Buffers: shared hit=179 read=2
-> Bitmap Index Scan on tb1_idx5 (cost=0.00..20.99 rows=99 width=0) (actual time=14.144..14.144 rows=92 loops=1)
Index Cond: ((c1 = 99) AND (c2 = 999))
Buffers: shared hit=87 read=2
Total runtime: 16.199 ms
(9 rows)
这个速度没有btree和gist的多字段索引快,因为它的原理和BitmapAnd组合索引相似,要扫描2次索引再取交集,但比组合索引的50ms快。
3.3 c1+c2多字段gist索引处理单字段查询
postgres=# explain (analyze,buffers) select count(*) from tb1 where c1=99;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=46880.08..46880.09 rows=1 width=0) (actual time=1569.289..1569.290 rows=1 loops=1)
Buffers: shared hit=142 read=39638
-> Bitmap Heap Scan on tb1 (cost=1097.08..46624.25 rows=102333 width=0) (actual time=37.484..1548.405 rows=99886 loops=1)
Recheck Cond: (c1 = 99)
Rows Removed by Index Recheck: 7363157
Buffers: shared hit=142 read=39638
-> Bitmap Index Scan on tb1_idx5 (cost=0.00..1071.50 rows=102333 width=0) (actual time=35.034..35.034 rows=99886 loops=1)
Index Cond: (c1 = 99)
Buffers: shared hit=78
Total runtime: 1569.548 ms
(10 rows)
postgres=# explain (analyze,buffers) select count(*) from tb1 where c2=999;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=23488.40..23488.41 rows=1 width=0) (actual time=79.362..79.362 rows=1 loops=1)
Buffers: shared hit=1974 read=6910
-> Bitmap Heap Scan on tb1 (cost=110.81..23464.27 rows=9652 width=0) (actual time=7.228..77.332 rows=9892 loops=1)
Recheck Cond: (c2 = 999)
Buffers: shared hit=1974 read=6910
-> Bitmap Index Scan on tb1_idx5 (cost=0.00..108.39 rows=9652 width=0) (actual time=4.468..4.468 rows=9892 loops=1)
Index Cond: (c2 = 999)
Buffers: shared hit=1 read=11
Total runtime: 79.408 ms
查询速度和gist的多字段索引差不多。
PostgreSQL 9.4对gin的性能做了很大提升,下面再在9.4上重复一下上面的测试。
chenhj=# create table tb1(c1 int,c2 int);
CREATE TABLE
chenhj=# insert into tb1 select round(random()*100),round(random()*1000) from generate_series(1,10000000);
INSERT 0 10000000
chenhj=# select pg_size_pretty(pg_table_size('tb1'));
346 MB
chenhj=# create extension btree_gin;
chenhj=# create index tb1_idx5 on tb1 using gin(c1,c2);
chenhj=# select pg_size_pretty(pg_relation_size('tb1_idx5'));
47 MB
9.4里gin索引的大小比9.3小了很多,只有9.3的三分之一。
chenhj=# explain (analyze,buffers) select count(*) from tb1 where c1=99 and c2=999;
Aggregate (cost=423.95..423.96 rows=1 width=0) (actual time=3.399..3.399 rows=1 loops=1)
Buffers: shared hit=166
-> Bitmap Heap Scan on tb1 (cost=25.05..423.69 rows=103 width=0) (actual time=3.151..3.375 rows=117 loops=1)
Heap Blocks: exact=117
Buffers: shared hit=166
-> Bitmap Index Scan on tb1_idx5 (cost=0.00..25.03 rows=103 width=0) (actual time=3.125..3.125 rows=117 loops=1)
Buffers: shared hit=49
Planning time: 0.099 ms
Execution time: 3.448 ms
(11 rows)
比9.3快了很多,但仍然没有btree和gist的多字段索引快,不过差别也不是太大。
查询条件只包含c1
chenhj=# explain (analyze,buffers) select count(*) from tb1 where c1=99;
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=46819.50..46819.51 rows=1 width=0) (actual time=590.436..590.437 rows=1 loops=1)
Buffers: shared hit=105 read=39583 written=700
-> Bitmap Heap Scan on tb1 (cost=981.50..46554.50 rows=106000 width=0) (actual time=49.895..570.300 rows=99790 loops=1)
Heap Blocks: exact=39665
Buffers: shared hit=105 read=39583 written=700
-> Bitmap Index Scan on tb1_idx5 (cost=0.00..955.00 rows=106000 width=0) (actual time=31.225..31.225 rows=99790 loops=1)
Buffers: shared hit=23
Planning time: 0.112 ms
Execution time: 590.496 ms
速度是9.3的2倍多,也比gist和btree都快。
查询条件只包含c2
chenhj=# explain (analyze,buffers) select count(*) from tb1 where c2=999;
Aggregate (cost=23539.39..23539.40 rows=1 width=0) (actual time=92.894..92.894 rows=1 loops=1)
Buffers: shared hit=1950 read=7130 written=3
-> Bitmap Heap Scan on tb1 (cost=95.12..23515.16 rows=9692 width=0) (actual time=5.672..90.431 rows=10178 loops=1)
Heap Blocks: exact=9073
Buffers: shared hit=1950 read=7130 written=3
-> Bitmap Index Scan on tb1_idx5 (cost=0.00..92.69 rows=9692 width=0) (actual time=2.671..2.671 rows=10178 loops=1)
Buffers: shared hit=6 read=1
Planning time: 0.187 ms
Execution time: 94.261 ms
和9.3差不多,和gist也差不多。
1)当查询条件中同时出现多字段索引的所有条件时,多字段索引的速度优于多个字段单独建索引然后通过BitmapAnd组合起来。btree和gist的多字段索引又快于gin多字段索引。
2)多字段索引用在只包含全部索引字段的子集时,除了btree在查询条件中多字段的第一个字段缺席时效率很差外,其它都表现良好。
3)9.4的gin索引的大小比btree和gist小了太多。(但gin有别的问题,在做比较查询时有时效率不高,后面准备再通过例子讲解)
4)9.4的gin索引比9.3的gin优化了太多了
最后引用一下手册中关于使用多字段索引还是多索引组合的说明。加黑的那个限制似乎只对btree的多字段索引有效。
-----------------------------------------------------------------------------------