Block Throttle

block throttle is an important part of block QoS and also the earliest block QoS mechanism. It caps the IOPS/BPS of each cgroup.

Group Hierarchy

Each blkdev creates one throttle group for every block group.

throttle data

Each blkdev has one throttle data, which stores the blk-throttle policy state for that blkdev.

struct throtl_data

During system initialization, a throttle data is allocated each time a blkdev is registered.

blk_alloc_queue
    blkcg_init_queue
        blk_throtl_init(q)
            # allocate throttle data

throttle group

Similarly, each blkcg_gq, i.e. each (cgroup, request queue) pair, has a corresponding throttle group, stored in the @pd[] array of the blkcg_gq.

The block group layer defines a series of policies, such as blk-throttle, iolatency and iocost. Each policy allocates its own policy specific data. Every such structure embeds a struct blkg_policy_data and is ultimately stored in the @pd[] array.

struct blkcg_gq {
    struct blkg_policy_data     *pd[BLKCG_MAX_POLS];
    ...
};

For blk-throttle, the policy specific data is struct throtl_grp.

struct throtl_grp {
    /* must be the first member */
    struct blkg_policy_data pd;
    ...
};
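
Because the embedded struct blkg_policy_data is the first member, the generic pointer stored in @pd[] can be converted back to the enclosing throttle group. The following is a minimal userspace sketch of that conversion. The structures are reduced stand-ins (the plid field and the NULL handling are assumptions for illustration), and container_of() is re-implemented here rather than taken from kernel headers.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (fields reduced for illustration). */
struct blkg_policy_data {
    int plid;                       /* assumed field: index of the policy in pd[] */
};

struct throtl_grp {
    struct blkg_policy_data pd;     /* must be the first member */
    unsigned long disptime;
};

/* Userspace equivalent of the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the throttle group from the generic pointer kept in pd[]. */
static struct throtl_grp *pd_to_tg(struct blkg_policy_data *pd)
{
    return pd ? container_of(pd, struct throtl_grp, pd) : NULL;
}
```

Given a blkcg_gq, a policy would index @pd[] with its own policy id and pass the result through a helper of this kind to reach its private data.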

root cgroup

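During system initialization, each time a blkdev is registered, a blkcg_gq is created to tie that blkdev to the root block group. In the process, the pd_alloc_fn() callback of each policy is called in turn to create the policy specific data.

For blk-throttle, this is where the throttle group of that blkcg_gq is created.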

blk_alloc_queue
    blkcg_init_queue
        blkg_alloc
            # allocate blkcg_gq (@blkg)
            (for each policy)
                blkg->pd[...] = pol->pd_alloc_fn(), e.g., throtl_pd_alloc() // allocate throttle group

child cgroup

The blkcg_gq of a child block group is created lazily, when the child block group submits a bio. The blkcg_gq created at that point is the one for the blkdev that the bio targets.

bio_set_dev
    bio_associate_blkg
        css = blkcg_css() // get css of current task
        bio_associate_blkg_from_css
            blkg_tryget_closest
                blkg_lookup_create(css_to_blkcg(css), queue)
                    blkg_create
                        blkg_alloc
                            # allocate blkcg_gq (@blkg)
                            (for each policy)
                                blkg->pd[...] = pol->pd_alloc_fn(), e.g., throtl_pd_alloc() // allocate throttle group
                            
                        (for each policy)
                            blkg->pd[...] = pol->pd_init_fn(), e.g., throtl_pd_init()

Throttle Routine

throttle limit

Users can configure the limits of a throttle group. The limits are stored in the @iops[]/@bps[] arrays.

struct throtl_grp {
    uint64_t bps[2][LIMIT_CNT]; /* internally used bps limits */
    unsigned int iops[2][LIMIT_CNT]; /* internally used IOPS limits */
    ...
}

@bps[2] holds the read/write BPS limits of the throttle group (the first index selects READ or WRITE).

@iops[2] holds the read/write IOPS limits of the throttle group.

throttle check

The throttle policy limits IO in units of time slices. Throttling kicks in only when the IO submitted within a slice exceeds the quota for that slice. When the next slice begins, the quota is refreshed.

@slice_start[rw]                @slice_end[rw]
--------+-------------------------------+--------
                      ^
                current jiffies

usage in slice

First, each throttle group tracks how much of the block device its block group has used so far within the current slice.

struct throtl_grp {
    /* When did we start a new slice */
    unsigned long slice_start[2];
    unsigned long slice_end[2];

    /* Number of bytes disptached in current slice */
    uint64_t bytes_disp[2];
    /* Number of bio's dispatched in current slice */
    unsigned int io_disp[2];
    ...
}

Within (@slice_start[READ], @slice_end[READ]), @bytes_disp[READ] bytes have been read and @io_disp[READ] read IOs have been dispatched.

Within (@slice_start[WRITE], @slice_end[WRITE]), @bytes_disp[WRITE] bytes have been written and @io_disp[WRITE] write IOs have been dispatched.

The distance between @slice_start[] and @slice_end[] is @throtl_data->throtl_slice, which typically defaults to HZ/10, i.e. 100 ms.

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch
                    # check if within limit
                
                # if pass the throttle check
                throtl_charge_bio
                        @bytes_disp[rw] += bio_size;
                        @io_disp[rw]++;

quota in slice

Next, the quota available so far in the current slice is computed, i.e. (jiffies - @slice_start[rw]) / HZ * limit.

check

During submission, if the usage (iops/bps) in the current slice would exceed the quota available so far in that slice, the bio is throttled.
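
The quota formula and the check can be sketched in userspace C as follows. This is only a model of the formula quoted above for the BPS case: HZ is assumed to be 1000, jiffies wraparound is ignored, and the function names are hypothetical rather than kernel APIs. The kernel additionally rounds the elapsed time to slice boundaries, which this sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ 1000  /* assumption: 1000 jiffies per second */

/* Bytes allowed so far in the slice: (now - slice_start) / HZ * limit. */
static uint64_t bytes_allowed(unsigned long now, unsigned long slice_start,
                              uint64_t bps_limit)
{
    uint64_t elapsed = (uint64_t)(now - slice_start);  /* in jiffies */

    /* Multiply first to keep precision, then divide by HZ. */
    return elapsed * bps_limit / HZ;
}

/* True if dispatching bio_size more bytes stays within the quota so far. */
static bool within_bps_limit(unsigned long now, unsigned long slice_start,
                             uint64_t bps_limit, uint64_t bytes_disp,
                             uint64_t bio_size)
{
    return bytes_disp + bio_size <= bytes_allowed(now, slice_start, bps_limit);
}
```

With a 1 MiB/s limit and 100 ms elapsed, roughly 100 KiB may have been dispatched, so a 4 KiB bio passes on an empty slice but is throttled once the slice has already dispatched 100 KiB. The IOPS check follows the same shape with @io_disp[rw] and the IOPS limit.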

throttle

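As described above, a bio may be throttled during submission. The bio is then queued temporarily in the current throttle group and resubmitted once the quota recovers. The data structures involved are layered as described in the following subsections.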

throttle service queue

Each throttle group maintains a throttle service queue, which holds the bios throttled in that throttle group.

struct throtl_grp {
    /* this group's service queue */
    struct throtl_service_queue service_queue;
    ...
}

Note that the per-blkdev throttle data also maintains a throttle service queue.

struct throtl_data {
    /* service tree for active throtl groups */
    struct throtl_service_queue service_queue;
    ...
}

These throttle service queues are linked into a tree through the @parent_sq field.

struct throtl_service_queue {
    struct throtl_service_queue *parent_sq; /* the parent service_queue */
    ...
};

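Note that in the cgroup v1 (blkio) setup described here, the @parent_sq of every throttle group points to the service queue of the throttle data. On the cgroup v2 unified hierarchy, a child throttle group's @parent_sq points to the service queue of its parent throttle group instead, and the parent/child cases discussed later assume such a hierarchy.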

service queue of throttle data
                            +-------+
                            |       |
                            +-------+
                                ^
                                | @parent_sq
        +-----------------------+
        |                       |
service queue               service queue
of throttle group           of child throttle group
    +-------+               +-------+
    |       |               |       |
    +-------+               +-------+

pending bio list

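A throttled bio is queued on the @queued[rw] list of the service queue of the throttle group in which it was throttled.

Note, however, that throttled bios are not linked directly on the @queued[rw] list. They sit on the @bios list of a qnode, and the @queued[rw] list links the qnodes.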

struct throtl_qnode {
    struct bio_list        bios;     /* queued bios */
    ...
};

qnodes were introduced so that throttle groups share quota fairly in the dispatch phase. Before qnodes existed, throttled bios were queued directly on the @queued[rw] list.

Note that the bios queued on a throttle group's @queued[rw] list may come from that throttle group itself, from a child throttle group, or even from a grandchild. To see why, look at how the throttle check proceeds. The quota of the bio's own throttle group is checked first. If that quota is exhausted, the bio is throttled in this throttle group and queued on its @queued[rw] list. If the quota is still sufficient, the check moves up one level to the parent throttle group. If the parent's quota is exhausted, the bio is throttled in the parent and queued on the parent's @queued[rw] list.

So the @queued[rw] list of a throttle group can hold bios from the group itself as well as from its children, grandchildren, and so on. When the group's quota recovers and it enters the dispatch phase, it can only dispatch the queued bios in list order. Now suppose one child throttle group submits a burst of bios that are throttled and queued on the parent's @queued[rw] list. When the parent later dispatches, most of its quota goes to that child, and the other child/grandchild throttle groups risk starvation.

qnodes were introduced to solve this fairness problem. Each throttle group owns two sets of qnodes, chosen as follows while a bio is submitted:

If the quota of the current throttle group is exhausted and the bio is throttled in the current throttle group, @qnode_on_self[rw] is used: the throttled bio is queued on the @qnode_on_self[rw]->bios list, and @qnode_on_self[rw] is added to the @queued[rw] list of the current throttle group.
If the current throttle group still has quota, the check moves up to the parent throttle group. If the parent's quota is exhausted and the bio is throttled in the parent, @qnode_on_parent[rw] is used: the throttled bio is queued on the @qnode_on_parent[rw]->bios list, and @qnode_on_parent[rw] is added to the @queued[rw] list of the parent throttle group.

struct throtl_grp {
    /*
     * qnode_on_self is used when bios are directly queued to this
     * throtl_grp so that local bios compete fairly with bios
     * dispatched from children.  qnode_on_parent is used when bios are
     * dispatched from this throtl_grp into its parent and will compete
     * with the sibling qnode_on_parents and the parent's
     * qnode_on_self.
     */
    struct throtl_qnode qnode_on_self[2];
    struct throtl_qnode qnode_on_parent[2];
    ...
}

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch // check
                
                # if need to be throttled
                throtl_add_bio_tg
                    throtl_qnode_add_bio // add bio to qnode's @bios list
                                         // add qnode to throttle group's @queued[rw] list
                    sq->nr_queued[rw]++

pending throttle group rbtree

In addition, all throttle groups that hold pending bios are organized into an rbtree, stored in @pending_tree of the service queue of the throttle data.

Note that only throttle groups with pending bios to process are inserted into the @pending_tree rbtree.

struct throtl_service_queue {
    /*
     * RB tree of active children throtl_grp's, which are sorted by
     * their ->disptime.
     */
    struct rb_root        pending_tree; /* RB tree of active tgs */
    ...
};

The values in this rbtree are throttle groups, and the key is each throttle group's dispatch time, i.e. the time at which the earliest dispatchable pending bio in that throttle group can be dispatched.

dispatch time of bio

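What is a dispatch time? As described earlier, the throttle policy limits IO per time slice, and each throttle group records how much quota it has used so far in the current slice. When a submitted bio has to be throttled because the quota is exhausted, three inputs determine a future point in time at which the quota will have recovered enough for the bio to be resubmitted: the configured limit, the quota already consumed in the current slice, and the size of the throttled bio. That point in time is the dispatch time of the pending bio.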
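
A minimal sketch of that calculation for the BPS case is given here. It assumes HZ is 1000 and a non-zero limit, uses hypothetical function names, and rounds the wait up with a ceiling division. The kernel's own computation also accounts for slice rounding, which is omitted.

```c
#include <stdint.h>

#define HZ 1000  /* assumption: 1000 jiffies per second */

/*
 * Jiffies to wait until a bio of bio_size bytes fits, given the bytes already
 * dispatched in the slice (bytes_disp) and the bytes allowed so far
 * (bytes_allowed). bps_limit must be non-zero.
 */
static unsigned long bps_wait_jiffies(uint64_t bps_limit, uint64_t bytes_allowed,
                                      uint64_t bytes_disp, uint64_t bio_size)
{
    uint64_t needed = bytes_disp + bio_size;
    uint64_t extra;

    if (needed <= bytes_allowed)
        return 0;                       /* fits already, no wait */

    extra = needed - bytes_allowed;     /* bytes of quota still missing */

    /* ceil(extra * HZ / bps_limit): time for the missing quota to accrue */
    return (unsigned long)((extra * HZ + bps_limit - 1) / bps_limit);
}

/* Dispatch time of the pending bio: now plus the computed wait. */
static unsigned long bio_disptime(unsigned long now, uint64_t bps_limit,
                                  uint64_t bytes_allowed, uint64_t bytes_disp,
                                  uint64_t bio_size)
{
    return now + bps_wait_jiffies(bps_limit, bytes_allowed, bytes_disp, bio_size);
}
```

For example, with a 1 MiB/s limit and about 100 KiB of quota accrued, a 1 MiB bio on an otherwise idle slice has to wait roughly 0.9 s.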

dispatch time of throttle group

Each throttle group maintains a @disptime field: the nearest dispatch time among all of its pending bios, which is in effect the next time the throttle group will be scheduled.

struct throtl_grp {
    /*
     * Dispatch time in jiffies. This is the estimated time when group
     * will unthrottle and is ready to dispatch more bio. It is used as
     * key to sort active groups in service tree.
     */
    unsigned long disptime;
    ...
}

Whenever a bio is throttled and added to a throttle group, the group's @disptime field is updated.

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch // check
                
                # if need to be throttled, add bio to qnode's @bios list
                throtl_add_bio_tg

                tg_update_disptime
                    # read_wait = time to wait for the first queued read IO
                    # write_wait = time to wait for the first queued write IO
                    # min_wait = min(read_wait, write_wait);
                    # tg->disptime = jiffies + min_wait;

throttle group rbtree

The throttle group is then added to the @pending_tree rbtree of the throttle data's service queue.

All throttle groups in the @pending_tree rbtree are sorted by @tg->disptime, so the leftmost throttle group has the dispatch time nearest to now and is the next one to be scheduled.

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch // check
                
                # if need to be throttled, add bio to qnode's @bios list
                throtl_add_bio_tg
                
                tg_update_disptime
                    # update tg->disptime
                    
                    throtl_enqueue_tg
                        tg_service_queue_add // add throttle group to throttle_data->service_queue's @pending_tree rbtree

schedule dispatch timer

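Then the dispatch timer is armed to process the pending bios queued in the throttle groups.

Note that the timer must not fire until a throttle group's quota has recovered. The @disptime of the leftmost throttle group in the @pending_tree rbtree is the earliest time at which any throttle group becomes dispatchable, so the dispatch timer is armed to expire at that time.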

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch // check
                
                # if need to be throttled, add bio to qnode's @bios list
                throtl_add_bio_tg
                
                tg_update_disptime
                    # update tg->disptime
                    
                    throtl_enqueue_tg
                        tg_service_queue_add // add throttle group to throttle_data->service_queue's @pending_tree rbtree
                
                throtl_schedule_next_dispatch
                    # first_pending_disptime = disptime of the earliest expiring throttle group
                    throtl_schedule_pending_timer(..., first_pending_disptime)

dispatch

dispatch timer

When the dispatch timer fires, it processes the throttle groups in the @pending_tree rbtree.

iterate throttle groups

The dispatch timer first handles the leftmost throttle group in the @pending_tree rbtree, i.e. the one with the nearest dispatch time. Each throttle group may dispatch at most THROTL_GRP_QUANTUM IOs per round (READ and WRITE combined). Afterwards, the group's @disptime is recomputed from the pending bios that remain, and the group is repositioned in the @pending_tree rbtree accordingly (it usually no longer sits at the leftmost position).

The dispatch timer repeats this in a loop: take the leftmost throttle group from the @pending_tree rbtree and dispatch its pending bios, again at most THROTL_GRP_QUANTUM IOs.

The loop ends once the @disptime of the leftmost throttle group lies in the future.

dispatch one throttle group

Within this loop, dispatching the selected throttle group actually means moving the pending bios from the @bios lists of the qnodes on its @queued[rw] list into the group's own @qnode_on_parent[rw], and then adding @qnode_on_parent[rw] to the @queued[rw] list of the throttle data's service queue.

In other words, the pending bios are not issued yet. They are only moved to the @queued[rw] list of the throttle data's service queue, and a dispatch worker is then scheduled to actually issue them.

# throtl_pending_timer
throtl_pending_timer_fn // input @throtl_service_queue is from throtl_data
    throtl_select_dispatch
        # get the leftmost throttle group (@tg) in @pending_tree rbtree
        
        throtl_dispatch_tg(tg)
            # dispatch (75% * THROTL_GRP_QUANTUM) READ IO
                tg_dispatch_one_bio(tg, READ)
                    # get the first qnode from @tg's @queued[READ] list
                    # get the first bio from that qnode
                    # add bio to current throttle group's qnode_on_parent[rw] list
                    # add current throttle group's qnode_on_parent[rw] to throtl_data's throtl_service_queue's @queued[rw] list
                
            # dispatch (25% * THROTL_GRP_QUANTUM) WRITE IO
            ...
    
    queue_work(kthrotld_workqueue, &td->dispatch_work) // schedule @kthrotld_workqueue worker

The number of pending bios a throttle group may process per round is capped at THROTL_GRP_QUANTUM. READ IO is served first, but reads may take at most 75% of THROTL_GRP_QUANTUM in one round.

Note that while processing a single throttle group, each step takes the qnode at the head of the @queued[rw] list, takes one pending bio from the head of that qnode's @bios list, and then moves the qnode to the tail of the @queued[rw] list. The next step takes the new head qnode, and so on.

This round-robin is exactly why qnodes were introduced: all qnodes (i.e. all child/grandchild throttle groups) share the quota of the current throttle group fairly, so that no child/grandchild throttle group starves.
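
The round-robin over qnodes can be modeled with a small userspace sketch. The types below are hypothetical simplifications: a qnode only counts its bios and remembers which throttle group they came from, and the @queued[rw] list is a fixed-size array instead of a kernel list. An emptied qnode leaves the list, as assumed from the description above.

```c
#include <stddef.h>

#define MAX_QN 8    /* assumption: small fixed capacity for the model */

struct qnode {
    int owner;      /* id of the throttle group the bios came from */
    int nr_bios;    /* number of bios still queued on @bios */
};

struct queued_list {
    struct qnode *qn[MAX_QN];   /* FIFO of qnodes, head at index 0 */
    size_t nr;
};

/*
 * Take one bio from the head qnode, then rotate that qnode to the tail if it
 * still holds bios, or drop it from the list if it is now empty.
 * Returns the owner of the bio, or -1 if the list is empty.
 */
static int pop_one_bio(struct queued_list *q)
{
    struct qnode *head;
    int owner;

    if (q->nr == 0)
        return -1;

    head = q->qn[0];
    owner = head->owner;
    head->nr_bios--;

    /* Shift the remaining qnodes one slot towards the head. */
    for (size_t i = 1; i < q->nr; i++)
        q->qn[i - 1] = q->qn[i];

    if (head->nr_bios > 0)
        q->qn[q->nr - 1] = head;    /* rotate to the tail */
    else
        q->nr--;                    /* empty qnode leaves the list */

    return owner;
}
```

If group 1 has three bios queued and group 2 has one, the dispatch order is 1, 2, 1, 1 rather than 1, 1, 1, 2, so the burst from group 1 does not delay group 2 by more than one bio.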

dispatch worker

As described above, the dispatch timer only moves pending bios to the @queued[rw] list of the throttle data's service queue without issuing them. The dispatch worker scheduled afterwards performs the actual submission.

Each block device maintains a @dispatch_work. When the block device has pending bios to issue, a worker thread is scheduled to handle them.

The worker thread simply walks the @queued[rw] list of the throttle data's service queue and hands each pending bio to the block layer.

# @kthrotld_workqueue worker
blk_throtl_dispatch_work_fn
    # for each qnode on throtl_data's throtl_service_queue's @queued[rw] list
        # for each bio in the qnode
            submit_bio_noacct(bio)

slice management

Overall, a slice is a moving window:

during the limit check at submission time, @slice_end[rw] moves forward (extend slice)
once the check passes and the bio is issued, i.e. in the dispatch phase, @slice_start[rw] moves forward (trim slice)

@slice_start[rw]                @slice_end[rw]
--------+-------------------------------+--------

start new slice

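A new slice is started when the first bio is sent to a throttle group, or when a throttle group that was throttled has drained all of its backlog and then receives another bio. In both cases the throttle group is empty (no bio is waiting in it) and the current slice has expired.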

@slice_start[rw]                @slice_end[rw]
--------+-------------------------------+--------
        ^
current jiffies

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch
                    # if throttle group is empty, and current slice used up, start a new slice
                    throtl_start_new_slice
                            @bytes_disp[rw] = 0;
                            @io_disp[rw] = 0;
                            @slice_start[rw] = jiffies;
                            @slice_end[rw] = jiffies + @td->throtl_slice;

extend

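If the current slice has not expired but the time remaining in it, (@slice_end[rw] - jiffies), is less than @throtl_slice, the slice has to be extended. Many calculations in block throttle work in units of @throtl_slice, so the remaining time is rounded up to a multiple of @throtl_slice.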

@slice_start[rw]                @slice_end[rw]
--------+-------------------------------+--------
                      ^
                current jiffies

@slice_start[rw]                                @slice_end[rw]
--------+-------------------------------*-------------+-----------
                      ^
                current jiffies

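Likewise, if the current slice has expired but the throttle group is not empty, i.e. bios are still waiting in it, a new slice cannot be started yet, because dispatching those waiting bios depends on the accounting of the current slice. The current slice is extended in this case too, again rounding the remaining time up to a multiple of @throtl_slice.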

@slice_start[rw]        @slice_end[rw]
--------+--------------------+--------
                                    ^
                            current jiffies

@slice_start[rw]                                        @slice_end[rw]
--------+--------------------*--------------------------------+--------
                                    ^
                            current jiffies

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch
                    # if throttle group is not empty, or current slice not used up
                        # if remained time in current slice smaller than @throtl_slice
                        throtl_extend_slice
                            @slice_end[rw] = jiffies + @td->throtl_slice;
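
The choice between starting a new slice and extending the current one can be sketched as follows. This is a userspace model with hypothetical names: jiffies wraparound is ignored, and rounding the new end up to a multiple of @throtl_slice is an assumption taken from the prose above rather than from the call trace.

```c
#include <stdbool.h>

struct slice {
    unsigned long start;    /* models @slice_start[rw] */
    unsigned long end;      /* models @slice_end[rw] */
};

/* Round x up to the next multiple of step (step must be non-zero). */
static unsigned long round_up_to(unsigned long x, unsigned long step)
{
    return (x + step - 1) / step * step;
}

/*
 * Prepare the slice before the limit check.
 * Returns true if a new slice was started, false otherwise.
 */
static bool prepare_slice(struct slice *s, unsigned long now,
                          unsigned long throtl_slice, bool has_queued)
{
    bool used_up = now > s->end;    /* the slice has expired */

    if (used_up && !has_queued) {
        /* Empty group and expired slice: start a new slice. */
        s->start = now;
        s->end = now + throtl_slice;
        return true;
    }

    /* Otherwise keep the slice, extending it if less than one slice remains. */
    if (s->end < now + throtl_slice)
        s->end = round_up_to(now + throtl_slice, throtl_slice);
    return false;
}
```

The caller would reset @bytes_disp[rw] and @io_disp[rw] whenever the function reports a new slice. In the extend case the counters are kept, which is what lets queued bios be judged against the accounting of the slice they were throttled in.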

trim slice

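As described above, the slice is extended before the throttle limit check. If the check passes and the bio can be issued immediately without waiting, the extended slice has to be trimmed back.

The reason is as follows. If the slice were left extended and the throttle group were later reconfigured with a much smaller limit, the accounting of the current slice, mainly @io_disp[rw]/@bytes_disp[rw], would still reflect the period when the limit was large. Newly submitted bios would then be throttled for a long time after the limit change.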

@slice_start[rw]                @slice_end[rw]
--------+-------------------------------+--------
                                ^
                         current jiffies

                @slice_start[rw]                 @slice_end[rw]
--------*-----------------+-------------*-------------+-----------
                                ^
                         current jiffies

Trimming moves the whole slice forward: @slice_start[rw] moves close to the current jiffies, and @io_disp[rw]/@bytes_disp[rw] are reduced in proportion to how far @slice_start[rw] moved.

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                tg_may_dispatch // pass
                
                # if pass the throttle check
                throtl_charge_bio
                
                throtl_trim_slice
                    @slice_end[rw] = jiffies + @td->throtl_slice;
                    # move @slice_start[rw] around current jiffies
                    # modify @io_disp[rw]/@bytes_disp[rw] proportionally
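
The proportional adjustment can be sketched in userspace C for the BPS counter. The model assumes HZ is 1000, that only whole elapsed slices are trimmed, and that the counter is clamped at zero. The structure and function names are hypothetical, and jiffies wraparound is ignored.

```c
#include <stdint.h>

#define HZ 1000  /* assumption: 1000 jiffies per second */

struct slice_state {
    unsigned long slice_start;  /* models @slice_start[rw] */
    unsigned long slice_end;    /* models @slice_end[rw] */
    uint64_t bytes_disp;        /* models @bytes_disp[rw] */
};

/* Move the slice start forward and give back the quota of the trimmed part. */
static void trim_slice(struct slice_state *s, unsigned long now,
                       unsigned long throtl_slice, uint64_t bps_limit)
{
    unsigned long nr_slices;
    uint64_t bytes_trim;

    s->slice_end = now + throtl_slice;

    /* Only whole elapsed slices are trimmed. */
    nr_slices = (now - s->slice_start) / throtl_slice;
    if (nr_slices == 0)
        return;

    /* Quota that accrued during the trimmed slices. */
    bytes_trim = bps_limit * throtl_slice * nr_slices / HZ;

    if (s->bytes_disp >= bytes_trim)
        s->bytes_disp -= bytes_trim;
    else
        s->bytes_disp = 0;

    s->slice_start += nr_slices * throtl_slice;
}
```

After trimming, the counters describe only the recent part of the slice, so a later limit change is evaluated against recent usage instead of a long history. The IOPS counter @io_disp[rw] would be adjusted the same way with the IOPS limit.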

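During dispatch, waiting bios are taken out of the throttle group and tg_may_dispatch() is called again to check whether each one can be issued. A bio within the limit is dispatched; otherwise it keeps waiting.

Here too, tg_may_dispatch() checks the time remaining in the current slice. If it is less than @throtl_slice, the slice is extended so that the remaining time is rounded up to a multiple of @throtl_slice.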

# throtl_pending_timer
throtl_pending_timer_fn
    throtl_select_dispatch
        # get the leftmost throttle group (@tg) in @pending_tree rbtree
        
        throtl_dispatch_tg(tg)
            bio = throtl_peek_queued()
            tg_may_dispatch(bio, ...)
                # if remained time in current slice smaller than @throtl_slice
                throtl_extend_slice
                    @slice_end[rw] = jiffies + @td->throtl_slice;

trim

If the check passes during dispatch, the extended slice is trimmed back after the bio is issued, for the same reason as above.

# throtl_pending_timer
throtl_pending_timer_fn
    throtl_select_dispatch
        # get the leftmost throttle group (@tg) in @pending_tree rbtree
        
        throtl_dispatch_tg(tg)
            bio = throtl_peek_queued()
            tg_may_dispatch(bio, ...) // pass
            
            tg_dispatch_one_bio
                throtl_charge_bio
                throtl_trim_slice

example

example 1

If the bio is throttled in the current throttle group itself, it is queued on that group's @qnode_on_self, and the qnode is queued in the current throttle group.

When the dispatch phase reaches this throttle group, a bio is taken from the first qnode on the group's @queued list and moved to the group's @qnode_on_parent. That qnode (@qnode_on_parent) is then moved to the @queued list of the throttle data.

The dispatch worker scheduled afterwards issues the pending bios on the throttle data's @queued list.

example 2

If the current throttle group has enough quota, the check walks up level by level. If the bio is throttled at some parent throttle group, it is queued on the @qnode_on_parent of the child one level below that parent, and that qnode is queued in the parent. This marks the bios in the qnode as coming from the child one level below rather than directly from the parent itself.

When the dispatch phase reaches that throttle group (the parent holding the bio), a bio is similarly taken from the first qnode on its @queued list and moved to that group's @qnode_on_parent. The qnode (@qnode_on_parent) is then moved to the throttle data's @queued list.

example 3

Similarly, if the current throttle group has enough quota, the check walks up level by level. If the bio is throttled at some parent throttle group, it is queued on the @qnode_on_parent of the child one level below that parent, and that qnode is queued in the parent.

The dispatch phase is similar as well. When that throttle group is scheduled, a bio is taken from the first qnode on its @queued list and moved to that group's @qnode_on_parent.

If that @qnode_on_parent is already on some @queued list (for example the @queued list of the next parent up, or the throttle data's @queued list), it is not moved again.

If that @qnode_on_parent is on the @queued list of the next parent up, then when that parent is scheduled, the bio is similarly moved to the parent's @qnode_on_parent, and that qnode (@qnode_on_parent) is moved to the throttle data's @queued list.

Tunable

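A block group is a set of processes that share the same block IO control policy. In the cgroup filesystem each block group corresponds to a directory, and the configuration files in that directory set the parameters of the group's block IO control policy.

The directory of a block group using the Throttling Limit Policy contains the following configuration files.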

blkio.throttle.read_bps_device

This file sets the upper limit on the read rate of this block group on the given block device, in bytes/second.

echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

In the example below, the block group reads the block device with major/minor number 8:16 at no more than 1 MB/s (1048576 bytes/second).

echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

blkio.throttle.write_bps_device

This file sets the upper limit on the write rate of this block group on the given block device, in bytes/second.

echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

blkio.throttle.read_iops_device

This file sets the upper limit on the read IO rate of this block group on the given block device, in IOs (bios) per second.

When both read_bps_device and read_iops_device are set for a block device, reads on that device are subject to both limits.

echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

blkio.throttle.write_iops_device

This file sets the upper limit on the write IO rate of this block group on the given block device, in IOs (bios) per second.

When both write_bps_device and write_iops_device are set for a block device, writes on that device are subject to both limits.

echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Statistics

io_serviced/io_service_bytes

blkio.throttle.io_serviced and blkio.throttle.io_service_bytes report what the throttle groups of this block group have issued so far: the former counts IOs, the latter counts data in bytes.

A block group may configure throttling for several blkdevs. Each configured blkdev has its own throttle group, so one block group can correspond to multiple throttle groups.

# cat /sys/fs/cgroup/blkio/blkio.throttle.io_serviced
253:16 Read 380
253:16 Write 158342
253:16 Sync 158590
253:16 Async 132
253:16 Discard 0
253:16 Total 158722
253:0 Read 15390
253:0 Write 60458
253:0 Sync 38000
253:0 Async 37848
253:0 Discard 0
253:0 Total 75848
Total 234570

Reading blkio.throttle.io_serviced or blkio.throttle.io_service_bytes iterates over all throttle groups of the current block cgroup and prints the amount each one has issued.

Each throttle group keeps its own statistics.

struct throtl_grp {
    struct blkg_rwstat stat_bytes;
    struct blkg_rwstat stat_ios;
    ...
}

struct blkg_rwstat {
    struct percpu_counter    cpu_cnt[BLKG_RWSTAT_NR];
    ...
};

The statistics are kept along two dimensions:

by IO type: READ/WRITE/DISCARD
by operation type: SYNC/ASYNC

enum blkg_rwstat_type {
    BLKG_RWSTAT_READ,
    BLKG_RWSTAT_WRITE,
    BLKG_RWSTAT_SYNC,
    BLKG_RWSTAT_ASYNC,
    BLKG_RWSTAT_DISCARD,
    ...
};

The two dimensions are fully orthogonal: every IO is counted once in each of them, i.e.

TOTAL = READ + WRITE + DISCARD = SYNC + ASYNC
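
The two-dimensional accounting can be modeled with plain counters. In this sketch the enum and function names are hypothetical and per-CPU counters are replaced by ordinary integers. The only behavior it illustrates is that each IO is added to exactly one counter per dimension, which is what makes both sums equal the total.

```c
#include <stdbool.h>
#include <stdint.h>

enum rwstat_type {
    RWSTAT_READ,
    RWSTAT_WRITE,
    RWSTAT_SYNC,
    RWSTAT_ASYNC,
    RWSTAT_DISCARD,
    RWSTAT_NR
};

enum io_op { IO_READ, IO_WRITE, IO_DISCARD };

struct rwstat {
    uint64_t cnt[RWSTAT_NR];    /* plain counters instead of percpu_counter */
};

/* Account val once by IO type and once by sync/async. */
static void rwstat_add(struct rwstat *st, enum io_op op, bool sync, uint64_t val)
{
    if (op == IO_DISCARD)
        st->cnt[RWSTAT_DISCARD] += val;
    else if (op == IO_WRITE)
        st->cnt[RWSTAT_WRITE] += val;
    else
        st->cnt[RWSTAT_READ] += val;

    if (sync)
        st->cnt[RWSTAT_SYNC] += val;
    else
        st->cnt[RWSTAT_ASYNC] += val;
}

/* Total as reported on the "Total" line: READ + WRITE + DISCARD. */
static uint64_t rwstat_total(const struct rwstat *st)
{
    return st->cnt[RWSTAT_READ] + st->cnt[RWSTAT_WRITE] + st->cnt[RWSTAT_DISCARD];
}
```

The sample output earlier in this section is consistent with this: for device 253:16, Read 380 + Write 158342 + Discard 0 and Sync 158590 + Async 132 both give 158722.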

charge when split

When a bio is split, the current logic charges only the original bio before the split, so the limits apply to the pre-split iops/bps.

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                if bio_flagged(bio, BIO_THROTTLED): return
                
                # if pass the throttle check
                throtl_charge_bio
                        @bytes_disp[rw] += bio_size;
                        @io_disp[rw]++;
                
                bio_set_flag(bio, BIO_THROTTLED)

blk_queue_split
    blk_bio_segment_split
        bio_split
            bio_clone_fast
                __bio_clone_fast
                    if (bio_flagged(bio_src, BIO_THROTTLED))
                        bio_set_flag(bio, BIO_THROTTLED);

An original bio that passes the throttle check is flagged BIO_THROTTLED. When it is later split, every split bio inherits the BIO_THROTTLED flag. When the throttle check then runs on those split bios, it sees the flag and lets them through without applying any limit.

statistics when split

The accounting of io_serviced and io_service_bytes, however, behaves differently when a bio is split.

In 4.19, io_serviced and io_service_bytes are accounted as follows:

submit_bio
    generic_make_request
        generic_make_request_checks
            blkcg_bio_issue_check
                blk_throtl_bio
        
                # if blk_throtl_bio() returns false, i.e., not throttled
                if (!bio_flagged(bio, BIO_QUEUE_ENTERED))
                    blkg_rwstat_add(&blkg->stat_bytes, bio->bi_iter.bi_size);
                blkg_rwstat_add(&blkg->stat_ios, 1);

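When a split happens, blk_queue_split() flags the bio it is about to resubmit (*bio, i.e. the remain part of the original bio) with BIO_QUEUE_ENTERED.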

original struct bio
+-------------------------------+
|                               |
+-------------------------------+

cloned struct bio      original struct bio
+-------+           +-----------------------+
| split |           |       remain          |
+-------+           +-----------------------+

blk_queue_split
    bio_set_flag(*bio, BIO_QUEUE_ENTERED)

So io_serviced counts split bios repeatedly, while io_service_bytes does not.

Taking io_serviced as an example:

When submit_bio() is first called on the original bio, io_serviced is incremented.

original struct bio
+-------------------------------+
|                               |
+-------------------------------+

submit_bio (original bio)
    generic_make_request
        generic_make_request_checks
            blkcg_bio_issue_check
                blk_throtl_bio
                blkg_rwstat_add(&blkg->stat_ios, 1); // count for original bio

When the original bio is then split, generic_make_request() is called recursively on the remain bio, and io_serviced is incremented again.

original struct bio
+-------------------------------+
|                               |
+-------------------------------+

cloned struct bio      original struct bio
+-------+           +-----------------------+
| split |           |       remain          |
+-------+           +-----------------------+

submit_bio (original bio)
    generic_make_request
        q->make_request_fn(), e.g., blk_mq_make_request()
            blk_queue_split
                split = blk_bio_segment_split() // split
                bio_set_flag(*bio, BIO_QUEUE_ENTERED) // remain bio (*bio) is flagged with BIO_QUEUE_ENTERED
                
                generic_make_request(remain)
                    generic_make_request_checks
                        blkcg_bio_issue_check
                            blk_throtl_bio
                            blkg_rwstat_add(&blkg->stat_ios, 1); // count for remain bio
                            # buffer the remain bio in bio_list temporarily
                
                # go on handling split bio

Commit f73316482977ac401ac37245c9df48079d4e11f3 ("blk-cgroup: reimplement basic IO stats using cgroup rstat"), introduced in v5.5, reworked the implementation of these statistics and as a side effect changed how io_serviced is counted: it no longer counts split bios repeatedly.

submit_bio
    submit_bio_noacct
        submit_bio_checks
            blk_throtl_bio
                if (bio_flagged(bio, BIO_THROTTLED)): return
                
                blkg_rwstat_add(@stat_bytes, bio->bi_iter.bi_size);
                blkg_rwstat_add(@stat_ios, 1);

statistic observation

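Taken literally, the IOPS/BPS configured for block throttle limit the amount of IO a block cgroup issues per second. Observing the effect of the limit, however, involves some subtleties.

The block throttle implementation meets the limit when averaged over a longer period, but per-second figures from tools such as iostat may be below or above the configured IOPS/BPS.

The following uses WRITE BPS = 1024 KB/s as an example.

BPS reported by iostat may exceed the configured BPS

In the iostat output below, one second shows a WRITE BPS of 1404 KB/s.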

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    4.00    4.00  1024.00  1024.00   512.00     0.00    1.00    0.50    1.50   0.25   0.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    4.00    7.00  1024.00  1404.00   441.45     0.01    1.55    0.50    2.14   0.64   0.70

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    4.00    4.00  1024.00  1024.00   512.00     0.01    1.25    0.75    1.75   0.38   0.30

The following is the WRITE IO submission sequence captured with ftrace.

[1]    kworker/56:1-559   [056] ....  7033.714639: block_bio_queue: 8,16 W 23071232 + 2048 [kworker/56:1]

[2]    kworker/56:1-559   [056] ....  7034.714647: block_bio_queue: 8,16 W 27265536 + 2048 [kworker/56:1]
[3]    kworker/56:1-559   [056] ....  7034.827649: block_bio_queue: 8,16 W 4196864 + 232 [kworker/56:1]
[4]    kworker/56:1-559   [056] ....  7035.085652: block_bio_queue: 8,16 W 4197096 + 528 [kworker/56:1]
    
[5]    kworker/56:1-559   [056] ....  7036.085659: block_bio_queue: 8,16 W 6294016 + 2048 [kworker/56:1]

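The block throttle algorithm waits until enough quota has accumulated and then issues the whole IO. For example, a 1024 KB IO has to wait until time [1] before it is issued, and likewise for the other IOs. Along the timeline, block throttle issues IO strictly according to the configured IOPS/BPS limits, but within an arbitrary 1 s window the amount actually issued can fall short of or exceed the configured limit.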

[1]                  [2] [3]     [4]                 [5]
...+--------------------+--------------------+---+------+--------------------+...
                                        <-------------------->
                                                    1s

In the sequence above, if iostat samples the window 7034.5~7035.5, it sees 2808 (2048+232+528) sectors of WRITE IO issued within that 1 s, i.e. 1404 KB/s.
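
The window arithmetic can be reproduced with a short C sketch. It assumes 512-byte sectors and treats each trace line as a (timestamp, sector count) pair. The structure and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512     /* assumption: 512-byte sectors */

struct trace_event {
    double ts;              /* timestamp in seconds from the trace line */
    uint32_t nr_sectors;    /* the "+ N" sector count of the bio */
};

/* KB issued by events whose timestamp falls in [start, end). */
static uint64_t kb_in_window(const struct trace_event *ev, size_t n,
                             double start, double end)
{
    uint64_t sectors = 0;

    for (size_t i = 0; i < n; i++)
        if (ev[i].ts >= start && ev[i].ts < end)
            sectors += ev[i].nr_sectors;

    return sectors * SECTOR_SIZE / 1024;
}
```

Fed with the five trace events above, the window 7034.5 to 7035.5 yields 1404 KB, while the window 7033.5 to 7034.5 yields exactly 1024 KB. The same stream therefore looks over limit or on limit depending only on where the sampling window falls.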

BPS reported by iostat may be lower than the configured BPS

In the iostat output below, one second shows a WRITE BPS of 904 KB/s.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00   ......   9.00  1024.00   904.00    32.00     0.01    0.22    0.38    0.04   0.10   1.20

kworker/22:1-532   [022] ....  2020.029000: block_bio_queue: 8,16 W 27264512 + 256 [kworker/22:1]

    kworker/22:1-532   [022] ....  2020.154001: block_bio_queue: 8,16 W 16778752 + 256 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.279007: block_bio_queue: 8,16 W 23070208 + 256 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.279010: block_bio_queue: 8,16 W 2098072 + 8 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.279011: block_bio_queue: 8,16 W 2098080 + 8 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.411003: block_bio_queue: 8,16 W 14681600 + 256 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.536005: block_bio_queue: 8,16 W 4195840 + 256 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.661005: block_bio_queue: 8,16 W 6293248 + 256 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.786006: block_bio_queue: 8,16 W 8390400 + 256 [kworker/22:1]
    kworker/22:1-532   [022] ....  2020.911008: block_bio_queue: 8,16 W 31460352 + 256 [kworker/22:1]
    
    kworker/22:1-532   [022] ....  2021.036009: block_bio_queue: 8,16 W 25167616 + 256 [kworker/22:1]

If iostat samples the window 2020.030~2021.030, it sees 1808 (256*7+16) sectors of WRITE IO issued within that 1 s, i.e. 904 KB/s.