天天看點

PostgreSQL 大表掃描政策 - BAS_BULKREAD , synchronize_seqscans

标簽

PostgreSQL , 大表掃描 , synchronize_seqscans

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E8%83%8C%E6%99%AF 背景

資料庫BUFFER是一個對實體IO對沖非常有效的元件,是以BUFFER的穩态對IO波動影響很大。

在一些操作時,可能會導緻一些IO的波動,比如刷髒頁,檢查點。但是PostgreSQL在這方面都有很好的排程政策,避免IO的波動。具體見PG文檔的介紹,或如下

《PostgreSQL 9.6 平滑fsync, write原理淺析》

還有一些操作也可能打破BUFFER的穩态,例如全表掃描通路大表,不過還好PG也有對應的AM政策,當表大小超過了四分之一的shared_buffers時,并且對其進行全表掃描時,會使用大表通路政策:

包括每個ring通路多少個BLOCK,以及這些BUFFER的标記,是否優先清除。(進而避免大表全表掃描時,導緻一些較熱的BUFFER被清出。而是優先清出全表掃描的大表BUFFER)

另一個大表全表掃通路可能導緻IO紊亂的原因,如果全表掃描大表的并發比較多,比如有5個并發,如果每個會話都需要掃描一遍全表,實際IO會是5乘以表的大小。

PostgreSQL對此也做了優化。

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#postgresql-%E5%B9%B6%E8%A1%8C%E4%BC%9A%E8%AF%9D%E5%A4%A7%E8%A1%A8%E5%90%8C%E6%AD%A5%E6%89%AB%E6%8F%8F%E5%85%B1%E4%BA%ABio PostgreSQL 并行會話大表同步掃描,共享IO。

設定synchronize_seqscans=true時,如果有并行的會話全表掃描大表,那麼PG會使用共享IO,協調多個會話的掃描步調,使得并行的會話盡量的同時通路同一個BLOCK,最後大家都掃描完整表。

這樣的話從磁盤讀取到SHARED BUFFER的IO可能就不是表大小 乘以 會話數的N備,而是更小,甚至1.

synchronize_seqscans (boolean)

This allows sequential scans of large tables to synchronize with each other, so that concurrent scans read the same block at about the same time and hence share the I/O workload. When this is enabled, a scan might start in the middle of the table and then “wrap around” the end to cover all rows, so as to synchronize with the activity of scans already in progress. This can result in unpredictable changes in the row ordering returned by queries that have no ORDER BY clause. Setting this parameter to off ensures the pre-8.3 behavior in which a sequential scan always starts from the beginning of the table. The default is on.

如果将其設定為OFF,那麼全表掃描都是從0号資料塊開始掃。那麼全表掃描的記錄傳回順序完全會與行号一緻。

而設定為ON時,由于要協調并行通路的步調,全表掃描可能從中間開始掃,是以傳回資料的順序可能會亂序。

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E7%A4%BE%E5%8C%BA%E5%85%A8%E8%A1%A8%E6%89%AB%E6%8F%8F%E7%9A%84%E5%A4%A7%E8%A1%A8%E8%AE%BF%E9%97%AE%E7%AD%96%E7%95%A5 社群全表掃描的大表通路政策

/* ----------------  
 *              initscan - scan code common to heap_beginscan and heap_rescan  
 * ----------------  
 */  
static void  
initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)  
{  
  
  
        /*  
         * If the table is large relative to NBuffers, use a bulk-read access  
         * strategy and enable synchronized scanning (see syncscan.c).  Although  
         * the thresholds for these features could be different, we make them the  
         * same so that there are only two behaviors to tune rather than four.  
         * (However, some callers need to be able to disable one or both of these  
         * behaviors, independently of the size of the table; also there is a GUC  
         * variable that can disable synchronized scanning.)  
         *  
         * Note that heap_parallelscan_initialize has a very similar test; if you  
         * change this, consider changing that one, too.  
         */  
        if (!RelationUsesLocalBuffers(scan->rs_rd) &&  
                scan->rs_nblocks > NBuffers / 4)  
        {  
                allow_strat = scan->rs_allow_strat;  
                allow_sync = scan->rs_allow_sync;  
        }  
        else  
                allow_strat = allow_sync = false;  
  
        if (allow_strat)  
        {  
                /* During a rescan, keep the previous strategy object. */  
                if (scan->rs_strategy == NULL)  
                        scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);  
        }  
  
....  
  
  
        else if (keep_startblock)  
        {  
                /*  
                 * When rescanning, we want to keep the previous startblock setting,  
                 * so that rewinding a cursor doesn't generate surprising results.  
                 * Reset the active syncscan setting, though.  
                 */  
                scan->rs_syncscan = (allow_sync && synchronize_seqscans);  
        }  
        else if (allow_sync && synchronize_seqscans)  
        {  
                scan->rs_syncscan = true;  
                scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);  
        }  
           

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#nbuffers%E5%8F%98%E9%87%8F%E6%8C%87%E7%9A%84%E6%98%AFshared-buffer NBuffers變量指的是SHARED BUFFER

/*  
         * We sometimes multiply the number of shared buffers by two without  
         * checking for overflow, so we mustn't allow more than INT_MAX / 2.  
         */  
        {  
                {"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,  
                        gettext_noop("Sets the number of shared memory buffers used by the server."),  
                        NULL,  
                        GUC_UNIT_BLOCKS  
                },  
                &NBuffers,  
                1024, 16, INT_MAX / 2,  
                NULL, NULL, NULL  
        },  
           

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E7%9B%AE%E5%89%8Dpg%E8%AE%BE%E8%AE%A1%E7%9A%84%E5%87%A0%E7%A7%8D%E8%AE%BF%E9%97%AE%E7%AD%96%E7%95%A5 目前PG設計的幾種通路政策

/* Possible arguments for GetAccessStrategy() */  
typedef enum BufferAccessStrategyType  
{  
        BAS_NORMAL,                                     /* Normal random access */  
        BAS_BULKREAD,                           /* Large read-only scan (hint bit updates are  
                                                                 * ok) */  
        BAS_BULKWRITE,                          /* Large multi-block write (e.g. COPY IN) */  
        BAS_VACUUM                                      /* VACUUM */  
} BufferAccessStrategyType;  
           

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E8%8E%B7%E5%8F%96%E8%AE%BF%E9%97%AE%E7%AD%96%E7%95%A5 擷取通路政策

/*  
 * GetAccessStrategy -- create a BufferAccessStrategy object  
 *  
 * The object is allocated in the current memory context.  
 */  
BufferAccessStrategy  
GetAccessStrategy(BufferAccessStrategyType btype)  
{  
        BufferAccessStrategy strategy;  
        int                     ring_size;  
  
        /*  
         * Select ring size to use.  See buffer/README for rationales.  
         *  
         * Note: if you change the ring size for BAS_BULKREAD, see also  
         * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.  
         */  
        switch (btype)  
        {  
                case BAS_NORMAL:  
                        /* if someone asks for NORMAL, just give 'em a "default" object */  
                        return NULL;  
  
                case BAS_BULKREAD:  
                        ring_size = 256 * 1024 / BLCKSZ;  
                        break;  
                case BAS_BULKWRITE:  
                        ring_size = 16 * 1024 * 1024 / BLCKSZ;  
                        break;  
                case BAS_VACUUM:  
                        ring_size = 256 * 1024 / BLCKSZ;  
                        break;  
  
                default:  
                        elog(ERROR, "unrecognized buffer access strategy: %d",  
                                 (int) btype);  
                        return NULL;            /* keep compiler quiet */  
        }  
  
        /* Make sure ring isn't an undue fraction of shared buffers */  
        ring_size = Min(NBuffers / 8, ring_size);  
  
        /* Allocate the object and initialize all elements to zeroes */  
        strategy = (BufferAccessStrategy)  
                palloc0(offsetof(BufferAccessStrategyData, buffers) +  
                                ring_size * sizeof(Buffer));  
  
        /* Set fields that don't start out zero */  
        strategy->btype = btype;  
        strategy->ring_size = ring_size;  
  
        return strategy;  
}  
           

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E6%A3%80%E6%9F%A5bas_bulkread%E6%A0%87%E8%AE%B0 檢查BAS_BULKREAD标記

/*  
 * StrategyRejectBuffer -- consider rejecting a dirty buffer  
 *  
 * When a nondefault strategy is used, the buffer manager calls this function  
 * when it turns out that the buffer selected by StrategyGetBuffer needs to  
 * be written out and doing so would require flushing WAL too.  This gives us  
 * a chance to choose a different victim.  
 *  
 * Returns true if buffer manager should ask for a new victim, and false  
 * if this buffer should be written and re-used.  
 */  
bool  
StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)  
{  
        /* We only do this in bulkread mode */  
        if (strategy->btype != BAS_BULKREAD)  
                return false;  
  
        /* Don't muck with behavior of normal buffer-replacement strategy */  
        if (!strategy->current_was_in_ring ||  
                strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))  
                return false;  
  
        /*  
         * Remove the dirty buffer from the ring; necessary to prevent infinite  
         * loop if all ring members are dirty.  
         */  
        strategy->buffers[strategy->current] = InvalidBuffer;  
  
        return true;  
}  
           

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E4%BC%98%E5%85%88%E5%B9%B2%E6%8E%89bas_bulkread%E6%A0%87%E8%AE%B0%E7%9A%84buffer 優先幹掉BAS_BULKREAD标記的BUFFER

if (XLogNeedsFlush(lsn) &&  
                                                StrategyRejectBuffer(strategy, buf))  
                                        {  
                                                /* Drop lock/pin and loop around for another buffer */  
                                                LWLockRelease(BufferDescriptorGetContentLock(buf));  
                                                UnpinBuffer(buf, true);  
                                                continue;  
                                        }  
           

https://github.com/digoal/blog/blob/master/201804/20180414_02.md#%E5%8F%82%E8%80%83 參考

src/backend/access/heap/heapam.c

src/backend/access/heap/syncscan.c

src/include/storage/bufmgr.h

src/backend/storage/buffer/freelist.c

src/backend/utils/misc/guc.c

src/backend/storage/buffer/bufmgr.c

繼續閱讀