Linux Kernel Debugging Techniques: Detecting Processes Deadlocked in the D State

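A Linux process can be in one of several states, such as the TASK_RUNNING runnable state, the EXIT_DEAD terminated state, and the TASK_INTERRUPTIBLE state, in which a sleeping process can still receive signals (see include/linux/sched.h). One of the waiting states is TASK_UNINTERRUPTIBLE, known as the D state. A process in this state does not handle signals and can only be woken by wake_up. Many code paths put a process into this state: taking a contended mutex may do so, and waiting for an I/O resource to become ready (the wait_event mechanism) sometimes does too. Normally a process stays in the D state only briefly, but if an I/O device fails or processes deadlock, a process can stay there indefinitely and never return to TASK_RUNNING. To make such cases easier to spot, the kernel provides the hung task mechanism, which detects processes that stay in the D state for a long time and issues a warning. This article walks through the source code of the hung task mechanism and then gives a demonstration.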

I. Analysis of the hung task mechanism

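The hung task mechanism was introduced in quite early kernel versions. This article uses the relatively recent Linux 4.1.15 source as its example. The code is short and lives in kernel/hung_task.c.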

First, the overall flow diagram and the design idea:

Figure: D-state deadlock detection flow

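The core idea is to create a kernel monitoring thread that periodically scans every process (task) in the D state and compares its context switch count between two consecutive checks. If a task has not been scheduled at all between two checks, it has been in the D state the whole time and is very likely deadlocked. The kernel then logs a warning with the task's basic information, a stack backtrace, and saved register contents to help kernel developers locate the problem.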
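The comparison described above can be modeled in user space. This is only a minimal sketch under stated assumptions, not kernel code: `struct fake_task` and `check_task()` are hypothetical names that mirror the counters discussed in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a task; not the kernel's task_struct. */
struct fake_task {
    unsigned long nvcsw;             /* voluntary context switches */
    unsigned long nivcsw;            /* involuntary context switches */
    unsigned long last_switch_count; /* total recorded at the previous check */
};

/*
 * Returns true when the switch total has not changed since the previous
 * check, meaning the task was never scheduled in between. Otherwise the
 * new total is recorded and false is returned.
 */
bool check_task(struct fake_task *t)
{
    unsigned long switch_count;

    assert(t != NULL);
    switch_count = t->nvcsw + t->nivcsw;

    if (switch_count == 0)      /* never switched out yet: skip it */
        return false;

    if (switch_count != t->last_switch_count) {
        t->last_switch_count = switch_count;
        return false;           /* it ran in between: not hung */
    }
    return true;                /* no scheduling between two checks */
}
```

Calling `check_task()` twice on a task whose counters do not move returns false the first time (the total is recorded) and true the second time, which is why a report needs at least one full check interval.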

The implementation is analyzed in detail below:


static int __init hung_task_init(void)
{
	atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
	watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

	return 0;
}
subsys_initcall(hung_task_init);

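If the mechanism is enabled in the kernel configuration, hung_task_init() is called during the subsys initcall stage of boot. It first registers a callback on the kernel's panic_notifier_list notifier chain: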

static struct notifier_block panic_block = {
	.notifier_call = hung_task_panic,
};

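When the kernel panics, hung_task_panic() is called; its role is covered later. Initialization then calls kthread_run() to create a thread named khungtaskd that runs watchdog() and is woken immediately so that it can be scheduled. This is the background kernel thread dedicated to detecting tasks deadlocked in the D state.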

/*
 * kthread which checks for tasks stuck in D state
 */
static int watchdog(void *dummy)
{
	set_user_nice(current, 0);

	for ( ; ; ) {
		unsigned long timeout = sysctl_hung_task_timeout_secs;

		while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
			timeout = sysctl_hung_task_timeout_secs;

		if (atomic_xchg(&reset_hung_task, 0))
			continue;

		check_hung_uninterruptible_tasks(timeout);
	}

	return 0;
}

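The thread first sets its nice value to 0, the normal priority, so that it does not disturb other processes. It then enters the main loop, which runs once per timeout period. Each iteration starts by sleeping for sysctl_hung_task_timeout_secs seconds; the default comes from the kernel configuration option CONFIG_DEFAULT_HUNG_TASK_TIMEOUT and is 120 s. If the sleep ends early, the timeout is reread and the thread sleeps again. After waking, the thread tests the atomic flag reset_hung_task with atomic_xchg(), which also clears it; if the flag was set, this round of checking is skipped. The flag is set through reset_hung_task_detector() (very few places in the kernel call this interface):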

void reset_hung_task_detector(void)
{
	atomic_set(&reset_hung_task, 1);
}
EXPORT_SYMBOL_GPL(reset_hung_task_detector);
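Returning to the sleep in the watchdog loop: its length comes from timeout_jiffies(), which is not shown in the listing above. The sketch below models that conversion in user space. It assumes the helper multiplies the timeout by the tick rate and treats 0 as "sleep without limit", which is consistent with the kernel's own hint that writing 0 to hung_task_timeout_secs disables the message. `SKETCH_HZ`, `SKETCH_MAX_TIMEOUT`, and `sketch_timeout_jiffies()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_HZ 100UL                               /* assumed tick rate */
#define SKETCH_MAX_TIMEOUT ((unsigned long)LONG_MAX)  /* stands in for "sleep forever" */

/* Converts a timeout in seconds to ticks; 0 means no periodic wake-up. */
unsigned long sketch_timeout_jiffies(unsigned long timeout_secs)
{
    /* Keep the multiplication from overflowing in this sketch. */
    assert(timeout_secs <= SKETCH_MAX_TIMEOUT / SKETCH_HZ);

    return timeout_secs ? timeout_secs * SKETCH_HZ : SKETCH_MAX_TIMEOUT;
}
```

With an assumed tick rate of 100, the default 120 s timeout becomes 12000 ticks, while a timeout of 0 maps to the largest possible sleep, so the checking function is effectively never reached.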

The last step of the loop is the checking function check_hung_uninterruptible_tasks(), whose argument is the timeout.

/*
 * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
 * a really long time (120 seconds). If that happens, print out
 * a warning.
 */
static void check_hung_uninterruptible_tasks(unsigned long timeout)
{
	int max_count = sysctl_hung_task_check_count;
	int batch_count = HUNG_TASK_BATCHING;
	struct task_struct *g, *t;

	/*
	 * If the system crashed already then all bets are off,
	 * do not report extra hung tasks:
	 */
	if (test_taint(TAINT_DIE) || did_panic)
		return;

	rcu_read_lock();
	for_each_process_thread(g, t) {
		if (!max_count--)
			goto unlock;
		if (!--batch_count) {
			batch_count = HUNG_TASK_BATCHING;
			if (!rcu_lock_break(g, t))
				goto unlock;
		}
		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
		if (t->state == TASK_UNINTERRUPTIBLE)
			check_hung_task(t, timeout);
	}
 unlock:
	rcu_read_unlock();
}

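The function first checks whether the kernel has already died (oops) or panicked. If so, the kernel has crashed and further checking is pointless, so the function returns immediately. Note that the did_panic flag is set by hung_task_panic(), the panic notifier callback mentioned earlier: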

static int
hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
{
	did_panic = 1;

	return NOTIFY_DONE;
}

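If the kernel has not crashed, the function walks all tasks in the system one by one. The walk runs under the RCU read lock, so to avoid holding the lock too long when there are many tasks, batch_count limits each uninterrupted stretch to HUNG_TASK_BATCHING tasks; after each batch, rcu_lock_break() briefly releases the lock. The user can also cap the total number of tasks checked through max_count = sysctl_hung_task_check_count, which defaults to PID_MAX_LIMIT, the maximum number of PIDs, and can be changed with the sysctl command.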
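The interaction of the two counters is easier to see in a small user-space model. The sketch below only reproduces the counting logic of the loop. `struct scan_result` and `scan_tasks()` are hypothetical names, and the lock break is reduced to a counter, so it says nothing about real RCU behavior.

```c
#include <assert.h>

/* Hypothetical summary of one scan; used only by this sketch. */
struct scan_result {
    int checked;      /* tasks that were actually examined */
    int lock_breaks;  /* times the lock would have been dropped and retaken */
};

/*
 * Walks ntasks tasks, stopping after max_count of them and counting a
 * lock break every `batching` tasks, in the same order as the kernel loop.
 */
struct scan_result scan_tasks(int ntasks, int max_count, int batching)
{
    struct scan_result r = {0, 0};
    int batch_count = batching;
    int i;

    assert(ntasks >= 0 && max_count >= 0 && batching > 0);

    for (i = 0; i < ntasks; i++) {
        if (!max_count--)
            break;                  /* overall limit reached */
        if (!--batch_count) {
            batch_count = batching; /* start a new batch */
            r.lock_breaks++;
        }
        r.checked++;
    }
    return r;
}
```

For example, scanning 3000 tasks with a batch size of 1024 examines all of them and breaks the lock twice, while a max_count of 5 stops the scan after five tasks regardless of how many exist.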

for_each_process_thread() iterates over every task in the system. Only tasks whose state is exactly TASK_UNINTERRUPTIBLE are checked for a timeout, by calling check_hung_task() with the task_struct and the timeout (120 s):

static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
	unsigned long switch_count = t->nvcsw + t->nivcsw;

	/*
	 * Ensure the task is not frozen.
	 * Also, skip vfork and any other user process that freezer should skip.
	 */
	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
		return;

	/*
	 * When a freshly created task is scheduled once, changes its state to
	 * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
	 * musn't be checked.
	 */
	if (unlikely(!switch_count))
		return;

	if (switch_count != t->last_switch_count) {
		t->last_switch_count = switch_count;
		return;
	}

	trace_sched_process_hang(t);

	if (!sysctl_hung_task_warnings)
		return;

	if (sysctl_hung_task_warnings > 0)
		sysctl_hung_task_warnings--;

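The sum of t->nvcsw and t->nivcsw is the total number of context switches since the task was created: t->nvcsw counts the times the task gave up the CPU voluntarily, and t->nivcsw counts the times it was preempted. The function then tests two conditions: (1) a frozen task is skipped; (2) a task whose switch count is 0 is skipped.

Next, the switch count saved at the previous check is compared with the current one. If they differ, the task was scheduled during this timeout period (120 s), so the saved value is updated and the function returns. Otherwise the task has not been scheduled for a full timeout period and has been in the D state the whole time. trace_sched_process_hang() then fires the sched_process_hang tracepoint so that tracing tools can record the event. After that the function checks sysctl_hung_task_warnings, the number of warnings that may still be printed; it can also be set with the sysctl command and defaults to 10. With the defaults, a task stuck in the D state produces a warning every 2 minutes until 10 warnings in total have been printed, after which no more appear. The warning code follows: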

	/*
	 * Ok, the task did not get scheduled for more than 2 minutes,
	 * complain:
	 */
	pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
		t->comm, t->pid, timeout);
	pr_err("      %s %s %.*s\n",
		print_tainted(), init_utsname()->release,
		(int)strcspn(init_utsname()->version, " "),
		init_utsname()->version);
	pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
		" disables this message.\n");
	sched_show_task(t);
	debug_show_held_locks(t);

	touch_nmi_watchdog();

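This prints, to the console and the log, the hung task's name, PID, the timeout, the kernel taint flags, the kernel release and version, and the kernel stack backtrace together with register information. If lock debugging is enabled, the locks the task holds are printed as well. Finally, touch_nmi_watchdog() is called to keep the NMI watchdog from timing out (this is not a concern on my ARM setup).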

	if (sysctl_hung_task_panic) {
		trigger_all_cpu_backtrace();
		panic("hung_task: blocked tasks");
	}
}

Finally, if sysctl_hung_task_panic is set, the kernel panics immediately (the value can be set in the kernel configuration or through sysctl).
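The tunables mentioned in this section are exposed as small text files under /proc/sys/kernel/, for example hung_task_timeout_secs, hung_task_warnings, hung_task_panic, and hung_task_check_count. As a rough illustration, the helper below reads one such value from user space. `read_long_from_file()` is a hypothetical name, and the sketch assumes the file holds a single decimal integer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads one integer from a text file such as
 * /proc/sys/kernel/hung_task_timeout_secs.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
int read_long_from_file(const char *path, long *value)
{
    FILE *fp;
    int ok;

    assert(path != NULL && value != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

On a kernel built with the hung task detector, passing "/proc/sys/kernel/hung_task_timeout_secs" should yield the current timeout in seconds; on a kernel without it the file is absent and the helper returns -1.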

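II. Demonstration

Test environment: Raspberry Pi B (Linux 4.1.15)

1. First check the kernel configuration to confirm that the hung task mechanism is enabled (the option is CONFIG_DETECT_HUNG_TASK).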

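2. Write a test module that deadlocks on purpose:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/mutex.h>

DEFINE_MUTEX(dlock);

static int __init dlock_init(void)
{
	mutex_lock(&dlock);
	mutex_lock(&dlock);	/* locking the same mutex again never returns */
	return 0;
}

static void __exit dlock_exit(void)
{
	return;
}

module_init(dlock_init);
module_exit(dlock_exit);
MODULE_LICENSE("GPL");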

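This example defines a mutex and then locks it twice in the module's init function, deliberately creating a deadlock (mutex_lock() calls __mutex_lock_slowpath(), which puts the process into the TASK_UNINTERRUPTIBLE state). Once in the D state, the process cannot exit. It can be seen with the ps command: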

root@apple:~# busybox ps
PID   USER     TIME   COMMAND
......
  521 root       0:00 insmod dlock.ko

Then check the state of the process; it has indeed entered the D state.

root@apple:~# cat /proc/521/status
Name:	insmod
State:	D (disk sleep)
Tgid:	521
Ngid:	0
Pid:	521
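The same check can be automated. The following sketch extracts the state letter from the text of /proc/<pid>/status, where the field is written as `State:` at the start of a line. `parse_state_letter()` is a hypothetical helper, and it works on a string that the caller has already read from the file.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns the one-letter state code from the text of /proc/<pid>/status,
 * or '\0' when no "State:" line is present.
 */
char parse_state_letter(const char *status_text)
{
    const char *p = status_text;

    assert(status_text != NULL);

    /* Find "State:" at the start of a line, not inside another field. */
    while ((p = strstr(p, "State:")) != NULL) {
        if (p == status_text || p[-1] == '\n')
            break;
        p++;
    }
    if (p == NULL)
        return '\0';

    p += strlen("State:");
    while (*p == ' ' || *p == '\t')
        p++;                        /* skip the separator after the colon */
    return (*p == '\n') ? '\0' : *p;
}
```

A return value of 'D' corresponds to the "D (disk sleep)" line shown above.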

After two minutes the debug serial console prints the following, and the report repeats every two minutes:

[ 360.625466] INFO: task insmod:521 blocked for more than 120 seconds.
[ 360.631878] Tainted: G O 4.1.15 #5
[ 360.637042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 360.644986] [] (__schedule) from [] (schedule+0x40/0xa4)
[ 360.652129] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
[ 360.660570] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
[ 360.670142] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
[ 360.678432] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
[ 360.686480] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
[ 360.694976] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
[ 360.703170] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
[ 360.711284] [] (load_module) from [] (sys_init_module+0xdc/0x130)
[ 360.719239] [] (sys_init_module) from [] (ret_fast_syscall+0x0/0x54)
[ 480.725351] INFO: task insmod:521 blocked for more than 120 seconds.
[ 480.731759] Tainted: G O 4.1.15 #5
[ 480.736917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.744842] [] (__schedule) from [] (schedule+0x40/0xa4)
[ 480.752029] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
[ 480.760479] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
[ 480.770066] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
[ 480.778363] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
[ 480.786402] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
[ 480.794897] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
[ 480.803085] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
[ 480.811188] [] (load_module) from [] (sys_init_module+0xdc/0x130)
[ 480.819113] [] (sys_init_module) from [] (ret_fast_syscall+0x0/0x54)
[ 600.825353] INFO: task insmod:521 blocked for more than 120 seconds.
[ 600.831759] Tainted: G O 4.1.15 #5
[ 600.836916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.844865] [] (__schedule) from [] (schedule+0x40/0xa4)
[ 600.852005] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
[ 600.860445] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
[ 600.870014] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
[ 600.878303] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
[ 600.886339] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
[ 600.894835] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
[ 600.903023] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
[ 600.911133] [] (load_module) from [] (sys_init_module+0xdc/0x130)
[ 600.919059] [] (sys_init_module) from [] (ret_fast_syscall+0x0/0x54)
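When such logs are collected automatically, the first line of each report is the easiest one to match. The sketch below pulls the task name, PID, and timeout out of that line. `parse_hung_task_line()` is a hypothetical helper, and it assumes the line keeps the "task NAME:PID blocked for more than N seconds" shape produced by the pr_err() call discussed earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parses a line such as
 *   "[ 360.625466] INFO: task insmod:521 blocked for more than 120 seconds."
 * On success stores the task name, pid and timeout and returns 1;
 * returns 0 when the line is not a hung task report.
 */
int parse_hung_task_line(const char *line, char *comm, size_t comm_len,
                         int *pid, long *secs)
{
    static const char head[] = "task ";
    static const char tail[] = " blocked for more than ";
    const char *start, *end, *colon = NULL, *p;
    size_t n;

    assert(line != NULL && comm != NULL && comm_len > 0);
    assert(pid != NULL && secs != NULL);

    end = strstr(line, tail);
    start = strstr(line, head);
    if (end == NULL || start == NULL || start >= end)
        return 0;
    start += strlen(head);

    /* A task name may contain ':', so use the last one before " blocked". */
    for (p = start; p < end; p++)
        if (*p == ':')
            colon = p;
    if (colon == NULL || colon == start)
        return 0;

    n = (size_t)(colon - start);
    if (n >= comm_len)
        n = comm_len - 1;           /* truncate names that do not fit */
    memcpy(comm, start, n);
    comm[n] = '\0';

    if (sscanf(colon + 1, "%d", pid) != 1)
        return 0;
    if (sscanf(end + strlen(tail), "%ld", secs) != 1)
        return 0;
    return 1;
}
```

Backtrace lines and unrelated kernel messages do not contain the "blocked for more than" marker, so the helper returns 0 for them and a log scanner can simply skip those lines.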

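III. Summary

D-state deadlocks are fairly common in driver development and are not easy to track down. With the kernel's hung task mechanism, developers only need to capture and keep this diagnostic output to locate the problem quickly.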

Author: 圍城

Source: 51CTO
