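Blog: blog.focus-linux.net linuxfocus.blog.chinaunix.net
Weibo: weibo.com/glinuxer
QQ tech group: 4367710
The copyleft of this article belongs to [email protected]. It is released under the GPL and may be freely copied and reposted. When reposting, please keep the document intact and credit the original author and the original link. Commercial use of any kind is strictly prohibited.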
========================================================================================================
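The previous article covered how epoll adds itself to the wait queue of each monitored descriptor. That is only the first step. As mentioned last time, epoll is in fact also a blocking operation; it simply monitors many file descriptors at once. Now let's look at the implementation of epoll_wait->ep_poll.
Since epoll blocks, it needs a wait queue. It cannot use the wait queues of the monitored file descriptors for this, though. epoll is itself backed by a virtual file system, and the return value of epoll_create is also a file descriptor. On Unix, everything is a file, after all.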
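Because an epoll descriptor is an ordinary file descriptor, it can itself be handed to poll(2). The following is a minimal userspace sketch of that, assuming a Linux system; the function name poll_on_epoll_fd is made up for this illustration, and setup failures simply abort through assert().

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Polls an epoll descriptor with poll(2) before and after a pipe that it
 * watches becomes readable, and stores the two revents values in *before
 * and *after. Hypothetical demo helper, not part of any library.
 */
void poll_on_epoll_fd(short *before, short *after)
{
    int pipefd[2];
    int epfd, rc;
    ssize_t written;
    struct epoll_event ev = { .events = EPOLLIN };
    struct pollfd pfd = { .events = POLLIN };

    rc = pipe(pipefd);
    assert(rc == 0);
    epfd = epoll_create1(0);          /* the return value is a file descriptor */
    assert(epfd >= 0);

    ev.data.fd = pipefd[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);
    assert(rc == 0);

    pfd.fd = epfd;                    /* monitor the epoll descriptor itself */
    rc = poll(&pfd, 1, 0);
    assert(rc >= 0);
    *before = pfd.revents;            /* the pipe is still empty */

    written = write(pipefd[1], "x", 1);
    assert(written == 1);

    rc = poll(&pfd, 1, 0);
    assert(rc >= 0);
    *after = pfd.revents;             /* the watched pipe now has data */

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
}
```

Calling it from a small main and printing the two values should show no events before the write and POLLIN set afterwards.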
So the relevant part of ep_poll looks like this:
    init_waitqueue_entry(&wait, current);
    __add_wait_queue_exclusive(&ep->wq, &wait);
    for (;;) {
        /*
         * We don't want to sleep if the ep_poll_callback() sends us
         * a wakeup in between. That's why we set the task state
         * to TASK_INTERRUPTIBLE before doing the checks.
         */
        set_current_state(TASK_INTERRUPTIBLE);
        if (ep_events_available(ep) || timed_out)
            break;
        if (signal_pending(current)) {
            res = -EINTR;
            break;
        }
        /* Drop the lock while sleeping, retake it before rechecking. */
        spin_unlock_irqrestore(&ep->lock, flags);
        if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
            timed_out = 1;
        spin_lock_irqsave(&ep->lock, flags);
    }
    __remove_wait_queue(&ep->wq, &wait);
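Two of the exit paths of this loop, the timeout and the pending signal, are easy to observe from userspace. The sketch below assumes a Linux system; wait_until_timeout, wait_until_signal and on_alarm are names invented for this example, and the 50 ms and 2000 ms values are arbitrary.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for sigaction and setitimer under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/time.h>
#include <unistd.h>

/* The handler does nothing; its delivery is what interrupts the sleep. */
static void on_alarm(int sig)
{
    (void)sig;
}

/* Sleeps on an empty epoll set until the 50 ms timeout expires. */
int wait_until_timeout(void)
{
    struct epoll_event out;
    int epfd, n;

    epfd = epoll_create1(0);
    assert(epfd >= 0);
    n = epoll_wait(epfd, &out, 1, 50);
    close(epfd);
    return n;
}

/*
 * Sleeps on an empty epoll set while a one-shot timer delivers SIGALRM
 * after about 50 ms. Returns the errno left by epoll_wait, or 0 if the
 * call did not fail.
 */
int wait_until_signal(void)
{
    struct sigaction sa, old;
    struct itimerval timer;
    struct epoll_event out;
    int epfd, rc, n, err;

    epfd = epoll_create1(0);
    assert(epfd >= 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGALRM, &sa, &old);
    assert(rc == 0);

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 50 * 1000;      /* fire once, no interval */
    rc = setitimer(ITIMER_REAL, &timer, NULL);
    assert(rc == 0);

    n = epoll_wait(epfd, &out, 1, 2000);     /* interrupted long before 2 s */
    err = (n < 0) ? errno : 0;

    sigaction(SIGALRM, &old, NULL);          /* restore the previous handler */
    close(epfd);
    return err;
}
```

The first function corresponds to the timed_out branch and returns 0 events; the second corresponds to the signal_pending branch and reports EINTR.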
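In ep_poll, epoll_wait adds the current process to epoll's own wait queue. This raises a question. The previous article showed that epoll has already put an entry on the wait queue of every monitored descriptor, and now there is a second wait queue that belongs to epoll itself. Why?
Answering this requires going back to ep_ptable_queue_proc (readers who do not remember this function should reread the previous article). That function calls init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);, which sets the callback of the wait queue entry that epoll places on the monitored descriptor to ep_poll_callback. Compare this with init_waitqueue_entry, which ep_poll calls: that function sets the entry's callback to default_wake_function.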
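The difference between the two kinds of entry can be modeled in a few lines of userspace C. This is a toy model only, not kernel code: every type and function below (wait_entry, wait_queue, toy_epoll, toy_default_wake, toy_poll_callback and so on) is a simplified stand-in invented for this sketch, and the real kernel finds the owning epoll item with container_of rather than a data pointer.

```c
#include <assert.h>
#include <stddef.h>

struct wait_entry;
typedef int (*wake_func_t)(struct wait_entry *entry);

/* Simplified stand-in for a kernel wait queue entry. */
struct wait_entry {
    wake_func_t func;            /* callback run when the queue is woken */
    void *data;                  /* a task or a toy_epoll, depending on func */
    struct wait_entry *next;
};

struct wait_queue { struct wait_entry *head; };
struct task { int runnable; };

struct toy_epoll {
    int ready_count;             /* stands in for the ready list */
    struct wait_queue wq;        /* stands in for ep->wq */
};

static void queue_add(struct wait_queue *q, struct wait_entry *e)
{
    e->next = q->head;
    q->head = e;
}

/* Models a wakeup: run the callback of every entry on the queue. */
static void queue_wake(struct wait_queue *q)
{
    struct wait_entry *e;

    for (e = q->head; e != NULL; e = e->next) {
        assert(e->func != NULL);
        e->func(e);
    }
}

/* Models default_wake_function: make the sleeping task runnable. */
static int toy_default_wake(struct wait_entry *e)
{
    ((struct task *)e->data)->runnable = 1;
    return 1;
}

/* Models ep_poll_callback: record readiness, then wake epoll's own queue. */
static int toy_poll_callback(struct wait_entry *e)
{
    struct toy_epoll *ep = e->data;

    ep->ready_count++;
    queue_wake(&ep->wq);
    return 1;
}

/* Returns 1 if waking the "socket" queue ends up waking the sleeper. */
int toy_two_level_wakeup(void)
{
    struct task sleeper = { 0 };
    struct toy_epoll ep = { 0, { NULL } };
    struct wait_queue socket_wq = { NULL };
    struct wait_entry on_epoll = { toy_default_wake, &sleeper, NULL };
    struct wait_entry on_socket = { toy_poll_callback, &ep, NULL };

    queue_add(&ep.wq, &on_epoll);        /* the role of ep_poll */
    queue_add(&socket_wq, &on_socket);   /* the role of ep_ptable_queue_proc */

    queue_wake(&socket_wq);              /* "data arrived on the socket" */
    return sleeper.runnable == 1 && ep.ready_count == 1;
}
```

The point of the model is that the entry on the monitored descriptor's queue never refers to a task directly. It only records readiness and forwards the wakeup to the second queue, where the sleeping task is actually registered.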
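So when a monitored file descriptor performs a wakeup, for example when a socket receives data, the call chain sk_data_ready->sock_def_readable->wake_up_interruptible_sync_poll->.... eventually runs the callback of each wait queue entry. For epoll, that callback is ep_poll_callback.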
    static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
    {
        int pwake = 0;
        unsigned long flags;
        struct epitem *epi = ep_item_from_wait(wait);
        struct eventpoll *ep = epi->ep;

        spin_lock_irqsave(&ep->lock, flags);

        /*
         * If the event mask does not contain any poll(2) event, we consider the
         * descriptor to be disabled. This condition is likely the effect of the
         * EPOLLONESHOT bit that disables the descriptor when an event is received,
         * until the next EPOLL_CTL_MOD will be issued.
         */
        if (!(epi->event.events & ~EP_PRIVATE_BITS))
            goto out_unlock;

        /*
         * Check the events coming with the callback. At this stage, not
         * every device reports the events in the "key" parameter of the
         * callback. We need to be able to handle both cases here, hence the
         * test for "key" != NULL before the event match test.
         */
        if (key && !((unsigned long) key & epi->event.events))
            goto out_unlock;

        /*
         * If we are transferring events to userspace, we can hold no locks
         * (because we're accessing user memory, and because of linux f_op->poll()
         * semantics). All the events that happen during that period of time are
         * chained in ep->ovflist and requeued later on.
         */
        if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
            if (epi->next == EP_UNACTIVE_PTR) {
                epi->next = ep->ovflist;
                ep->ovflist = epi;
            }
            goto out_unlock;
        }

        /* If this file is already in the ready list we exit soon */
        if (!ep_is_linked(&epi->rdllink))
            list_add_tail(&epi->rdllink, &ep->rdllist);

        /*
         * Wake up ( if active ) both the eventpoll wait list and the ->poll()
         * wait list.
         */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;

    out_unlock:
        spin_unlock_irqrestore(&ep->lock, flags);

        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&ep->poll_wait);

        return 1;
    }
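The first check in ep_poll_callback, the one for a disabled descriptor, has a visible effect in userspace when EPOLLONESHOT is used. The following is a small sketch of that behavior on Linux; oneshot_demo is a name invented for this example, and a Unix socket pair stands in for a real network socket.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Stores three epoll_wait results in results[]: after the first send,
 * after a second send while the descriptor is disabled, and after
 * re-arming with EPOLL_CTL_MOD. Hypothetical demo helper.
 */
void oneshot_demo(int results[3])
{
    int sv[2];
    int epfd, rc;
    ssize_t sent;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };
    struct epoll_event out;

    assert(results != NULL);
    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    epfd = epoll_create1(0);
    assert(epfd >= 0);

    ev.data.fd = sv[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);

    sent = send(sv[1], "a", 1, 0);
    assert(sent == 1);
    results[0] = epoll_wait(epfd, &out, 1, 0);   /* reported once, then disabled */

    sent = send(sv[1], "b", 1, 0);               /* arrives while disabled */
    assert(sent == 1);
    results[1] = epoll_wait(epfd, &out, 1, 0);   /* not reported */

    rc = epoll_ctl(epfd, EPOLL_CTL_MOD, sv[0], &ev);   /* re-arm the descriptor */
    assert(rc == 0);
    results[2] = epoll_wait(epfd, &out, 1, 0);   /* reported again */

    close(epfd);
    close(sv[0]);
    close(sv[1]);
}
```

The middle result is the interesting one: unread data is sitting in the socket and a fresh wakeup has been delivered, yet nothing is reported until EPOLL_CTL_MOD restores the event mask.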
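The comments in ep_poll_callback are clear enough that the purpose of every line is easy to follow. In particular, these two lines
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
check whether any entry is waiting on epoll's own wait queue and, if so, perform the wakeup. From the epoll user's point of view, if the userspace process is currently blocked in epoll_wait, then ep->wq cannot be empty, so the process is woken up at this point and moved to the scheduler's run queue.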
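The whole path, from a wakeup on the monitored descriptor to the process returning from epoll_wait, can be exercised with a parent that blocks and a child that writes. This is a minimal sketch assuming Linux; blocked_waiter_is_woken is a made-up name, and the 100 ms delay only makes it likely, not certain, that the parent is already asleep when the write happens.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for usleep under strict -std= modes */
#endif
#include <assert.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Blocks in epoll_wait until a child process writes to a watched pipe.
 * Returns the epoll_wait result. Hypothetical demo helper.
 */
int blocked_waiter_is_woken(void)
{
    int pipefd[2];
    int epfd, rc, n;
    pid_t pid;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;

    rc = pipe(pipefd);
    assert(rc == 0);
    epfd = epoll_create1(0);
    assert(epfd >= 0);

    ev.data.fd = pipefd[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);
    assert(rc == 0);

    pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        /* Child: give the parent time to block, then make the pipe readable. */
        usleep(100 * 1000);
        _exit(write(pipefd[1], "x", 1) == 1 ? 0 : 1);
    }

    /* Parent: sleeps in epoll_wait until the write arrives or 5 s pass. */
    n = epoll_wait(epfd, &out, 1, 5000);

    waitpid(pid, NULL, 0);
    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

The function should return 1 well before the 5 second timeout, because the child's write triggers the callback chain described above and wakes the parent.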
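These two articles have largely sorted out how epoll monitors multiple descriptors and how it gets notified. On the monitoring side they still leave out epoll's internal data structures: how the descriptors are stored and how their state is maintained. There are already plenty of articles online about that, though. Perhaps I will write another two articles on the topic later.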