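Contents
I. mkdir handling on the client side
Flow of sending the request
Contents of the request
Flow of handling the reply
Postscript
II. mkdir handling on the MDS side
General MDS handling of client requests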
Locker::process_request_cap_release
Server::handle_client_mkdir
Server::rdlock_path_xlock_dentry
Locker::acquire_locks
Locker::handle_client_caps
Locker::acquire_locks
Server::prepare_new_inode
CInode::get_or_open_dirfrag
MDCache::predirty_journal_parents
Locker::issue_new_caps
Server::journal_and_reply
Client----------------------op:mkdir-------------->MDS
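mkdir creates a directory. The client does not create the directory itself; it sends a mkdir request (op CEPH_MDS_OP_MKDIR) to the MDS, the MDS performs the mkdir, and it returns the metadata of the newly created directory. The client's job is simply to send the request and handle the reply.
Example: mkdir /mnt/ceph-fuse/test
I. mkdir handling on the client side
Flow of sending the request
Contents of the request
Two kinds of request are involved: MetaRequest and MClientRequest.
Most of MetaRequest is filled in by make_request and send_request, so its contents are much the same for every op; only the parts that differ are examined here.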
struct MetaRequest {
private:
InodeRef _inode, _old_inode, _other_inode; // _inode points to the inode of the parent directory of the directory being created
// here _inode->ino = 1
Dentry *_dentry; // associated with path; _dentry->dir is the Dir of the parent directory, _dentry->name = "test"
public:
ceph_mds_request_head head; // head.op = CEPH_MDS_OP_MKDIR
filepath path, path2; // path.ino = 0x1 (inode number of the parent directory), path.path = "test"
......
int dentry_drop, dentry_unless; // dentry_drop = CEPH_CAP_FILE_SHARED = "Fs"; during send_request the "Fs" cap on the parent directory's Inode is released
// dentry_unless = CEPH_CAP_FILE_EXCL = "Fx"
vector<MClientRequest::Release> cap_releases; // cap_releases.push_back(MClientRequest::Release(rel, ""))
......
ceph::cref_t<MClientReply> reply; // the reply
//possible responses
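bool got_unsafe; // set to true when the unsafe reply is received
xlist<MetaRequest*>::item item; // linked into the session's requests list
xlist<MetaRequest*>::item unsafe_item; // linked into the session's unsafe_requests list after the unsafe reply is received
xlist<MetaRequest*>::item unsafe_dir_item; // after the unsafe reply, if the request touches the parent directory (creating or removing a file or directory in it), linked into the unsafe_ops list of the parent directory's Inode
xlist<MetaRequest*>::item unsafe_target_item; // after the unsafe reply, if the request needs the target inode's information, linked into the unsafe_ops list of the target Inode
// all four list items above are removed from their lists once the safe reply is received
InodeRef target; // target points to the Inode of the newly created directory, assembled from the MDS reply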
}
MClientRequest is filled in by the generic functions build_client_request and send_request, so most of its contents are likewise the same across ops.
class MClientRequest : public Message {
public:
mutable struct ceph_mds_request_head head; // head.op = CEPH_MDS_OP_MKDIR
// head.flags = CEPH_MDS_FLAG_WANT_DENTRY
// path arguments
filepath path, path2; // path.ino = 0x1 (inode number of the parent directory), path.path = "test"
......
}
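As the code shows, the two most important parts of a request sent to the MDS are:
- op: each op is handled by a different mechanism;
- filepath path: path.ino is the inode number of the parent directory, and path.path is the name of the directory to create.
With these two, the MDS knows under which directory to create the new directory.
Flow of handling the reply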
class MClientReply : public Message {
public:
// reply data
struct ceph_mds_reply_head head {};
/* client reply */
struct ceph_mds_reply_head {
__le32 op;
__le32 result;
__le32 mdsmap_epoch;
__u8 safe; /* true if committed to disk; safe is 1 once the update has been flushed to disk, or when no flush is needed */
__u8 is_dentry, is_target; /* true if dentry, target inode records are included with reply; is_dentry = 1, is_target = 1*/
}
bufferlist trace_bl; // trace_bl carries the actual payload, used to update the target inode
}
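In the end, the cap on the "test" directory inode has cap.issued == "pAsxLsXsxFsx" and cap.implemented == "pAsxLsXsxFsx".
Postscript
On Linux, a subdirectory and a file in the same directory cannot share a name; for example, the directory test/ cannot contain both a directory named "test1" and a file named "test1". The reason becomes clear once lookup is understood. Take mkdir /test/test1: the path is resolved first, i.e. lookup 0x1/test returns the inode of test, say with inode number 0x2. Next comes lookup 0x2/test1, which fetches the Dentry for "test1" under test and then the Inode from that Dentry. If a file named test1 already existed before mkdir /test/test1, this lookup returns the Inode of that file with result 0, and mkdir then fails with the error "file or directory already exists" (EEXIST).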
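The name clash described above can be sketched with a toy directory that keeps one entry per name regardless of type. ToyDir, EntryType and both member functions are hypothetical names invented for this illustration; they are not Ceph client code.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Hypothetical, simplified model: a directory maps each name to exactly one
// entry, whether that entry is a file or a directory.
enum class EntryType { File, Dir };

struct ToyDir {
    std::map<std::string, EntryType> entries;

    // Returns 0 on success, -EEXIST if a lookup of the name already succeeds.
    int mkdir(const std::string& name) {
        if (entries.count(name))  // lookup found a dentry: the name is taken
            return -EEXIST;
        entries[name] = EntryType::Dir;
        return 0;
    }

    // Same rule for files: the type of the existing entry does not matter.
    int create_file(const std::string& name) {
        if (entries.count(name))
            return -EEXIST;
        entries[name] = EntryType::File;
        return 0;
    }
};
```

Creating the file "test1" and then calling mkdir("test1") on the same ToyDir returns -EEXIST, because the lookup is by name only.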
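II. mkdir handling on the MDS side
This part looks at how the MDS processes mkdir. Example: mkdir /test/a
General MDS handling of client requests
(Figure: general processing flow)
As the flow shows, before the mkdir request itself is processed, the cap_release message attached to the request is handled first, in Locker::process_request_cap_release.
Locker::process_request_cap_release
process_request_cap_release handles the ceph_mds_request_release& item carried in the request. The caps in item are the caps the client holds on the parent directory; for mkdir /test/a they are the caps the client holds on "test", the parent directory of "a". When sending the mkdir request the client drops its "Fs" cap: its caps on the "test" inode were "pAsLsXsFs", and after dropping "Fs" they are "pAsLsXs".
A condensed version of process_request_cap_release follows.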
void Locker::process_request_cap_release(MDRequestRef& mdr, client_t client,
const ceph_mds_request_release& item, std::string_view dname)
{ // item comes from the client; dname = "" (the client did not set dname when sending)
inodeno_t ino = (uint64_t)item.ino; // ino = inode number of "test"
uint64_t cap_id = item.cap_id;
int caps = item.caps; // caps = "pAsLsXs"
int wanted = item.wanted; // wanted = 0
int seq = item.seq;
int issue_seq = item.issue_seq;
int mseq = item.mseq;
CInode *in = mdcache->get_inode(ino); // get the CInode of "test"
Capability *cap = in->get_client_cap(client);
cap->confirm_receipt(seq, caps); // _issued and _pending of this client's cap on the "test" CInode become "pAsLsXs"
adjust_cap_wanted(cap, wanted, issue_seq); // set wanted in the cap
eval(in, CEPH_CAP_LOCKS);
......
}
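In short, the client's caps recorded in the MDS's cached CInode of "test" are brought in line with the client (the client dropped "Fs", so the caps recorded for that client in the cached CInode drop it too): _issued and _pending in the cap become "pAsLsXs". The purpose is to avoid sending a revoke message to this client during acquire_locks.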
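A minimal sketch of this bookkeeping follows. The bit values, ToyCapability and toy_cap_string are assumptions made for the illustration and are not taken from the Ceph headers; the real confirm_receipt also tracks sequence numbers and pending revokes, which the sketch leaves out.

```cpp
#include <cassert>
#include <string>

// Illustrative bit values for this sketch only (assumed, not Ceph's layout).
constexpr unsigned TOY_CAP_PIN          = 1u << 0;  // "p"
constexpr unsigned TOY_CAP_AUTH_SHARED  = 1u << 1;  // "As"
constexpr unsigned TOY_CAP_LINK_SHARED  = 1u << 2;  // "Ls"
constexpr unsigned TOY_CAP_XATTR_SHARED = 1u << 3;  // "Xs"
constexpr unsigned TOY_CAP_FILE_SHARED  = 1u << 4;  // "Fs"

// Hypothetical stand-in for the per-client Capability kept in a CInode.
struct ToyCapability {
    unsigned issued = 0;   // what the client is known to hold
    unsigned pending = 0;  // what the MDS last told the client it may hold

    // Simplified: adopt the caps the client reports and never add bits
    // to pending as a side effect of a release.
    void confirm_receipt(unsigned caps) {
        issued = caps;
        pending &= caps;
    }
};

// Render a cap mask in the "pAsLsXsFs" notation used in the text.
std::string toy_cap_string(unsigned caps) {
    std::string s;
    if (caps & TOY_CAP_PIN) s += "p";
    if (caps & TOY_CAP_AUTH_SHARED) s += "As";
    if (caps & TOY_CAP_LINK_SHARED) s += "Ls";
    if (caps & TOY_CAP_XATTR_SHARED) s += "Xs";
    if (caps & TOY_CAP_FILE_SHARED) s += "Fs";
    return s;
}
```

Starting from "pAsLsXsFs" and confirming a release that clears only TOY_CAP_FILE_SHARED leaves both masks at "pAsLsXs", so a later check of pending against the allowed caps finds nothing to revoke for this client.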
Server::handle_client_mkdir
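Once the cap_release message has been handled, Server::dispatch_client_request dispatches the request and, based on the op, runs Server::handle_client_mkdir. The processing can be divided into 7 main steps.
The steps are described below, followed by the code at the end of this section:
1. Obtain the CDentry of directory "a" and the locks on the metadata that must be locked (collected into rdlocks, wrlocks and xlocks). This is done by Server::rdlock_path_xlock_dentry.
2. Take the locks, via Locker::acquire_locks. If locking does not succeed, i.e. caps held by some clients have to be revoked first (presumably because other clients hold caps that conflict with this request), a C_MDS_RetryRequest is created and added to the waiting queue of the "test" CInode; the request is taken out and processed again once the locking conditions are met.
3. If locking succeeds, continue and create the CInode of "a", via Server::prepare_new_inode.
4. Create the CDir of "a", via CInode::get_or_open_dirfrag.
5. Update the metadata in the CDirs and CInodes on the path from "a" up to the root "/" and fill in the "mkdir" event, via MDCache::predirty_journal_parents.
6. Create the Capability for "a", via Locker::issue_new_caps.
7. Record the "mkdir" event, send the first reply and submit the journal entry, via Server::journal_and_reply.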
void Server::handle_client_mkdir(MDRequestRef& mdr)
{
MClientRequest *req = mdr->client_request;
set<SimpleLock*> rdlocks, wrlocks, xlocks;
// get the CDentry of "a" and the metadata locks to take; fills rdlocks, wrlocks and xlocks; dn is the CDentry of "a"
CDentry *dn = rdlock_path_xlock_dentry(mdr, 0, rdlocks, wrlocks, xlocks, false, false, false);
......
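CDir *dir = dn->get_dir(); // dir is the CDir of "test"
CInode *diri = dir->get_inode(); // diri is the CInode of "test"
rdlocks.insert(&diri->authlock); // add the authlock of the "test" CInode to rdlocks
// try to acquire the locks; in this walkthrough one lock cannot be acquired yet, so the function returns here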
if (!mds->locker->acquire_locks(mdr, rdlocks, wrlocks, xlocks))
return;
......
}
Server::rdlock_path_xlock_dentry
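The function does the following:
1. Obtains the CDentry of "a".
2. Collects into rdlocks, wrlocks and xlocks the locks the operation needs:
rdlocks: the lock in the CDentry of "a",
and the snaplocks of the CInodes of "/" and "test" (from the root down to the parent directory)
wrlocks: the filelock and nestlock of the "test" CInode
xlocks: the lock in the CDentry of "a" (a SimpleLock)
The code is as follows: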
CDentry* Server::rdlock_path_xlock_dentry(MDRequestRef& mdr, int n, set<SimpleLock*>& rdlocks, set<SimpleLock*>& wrlocks, set<SimpleLock*>& xlocks,
bool okexist, bool mustexist, bool alwaysxlock, file_layout_t **layout)
{ // n = 0; rdlocks, wrlocks and xlocks are all empty; okexist = mustexist = alwaysxlock = false; layout = 0
const filepath& refpath = n ? mdr->get_filepath2() : mdr->get_filepath(); // refpath = path: path.ino = 0x10000000001, path.path = "a"
client_t client = mdr->get_client();
CDir *dir = traverse_to_auth_dir(mdr, mdr->dn[n], refpath); // get the CDir of "test"
CInode *diri = dir->get_inode(); // get the CInode of "test"
std::string_view dname = refpath.last_dentry(); // dname = "a"
CDentry *dn;
if (mustexist) { ...... // mustexist = false
} else {
dn = prepare_null_dentry(mdr, dir, dname, okexist); // get the CDentry of "a"
if (!dn)
return 0;
}
mdr->dn[n].push_back(dn); // n = 0, i.e. mdr->dn[0][0] = dn;
CDentry::linkage_t *dnl = dn->get_linkage(client, mdr); // in dnl, remote_ino = 0 && inode = 0
mdr->in[n] = dnl->get_inode(); // mdr->in[0] = 0
// -- lock --
for (int i=0; i<(int)mdr->dn[n].size(); i++) // (int)mdr->dn[n].size() = 1
rdlocks.insert(&mdr->dn[n][i]->lock); // put the lock of the CDentry of "a" into rdlocks
if (alwaysxlock || dnl->is_null()) // dnl->is_null() is true
xlocks.insert(&dn->lock); // new dn, xlock: put the lock of the CDentry of "a" into xlocks
else ......
// next, the filelock and nestlock of the CInode that owns the "test" CDir both go into wrlocks
wrlocks.insert(&dn->get_dir()->inode->filelock); // also, wrlock on dir mtime
wrlocks.insert(&dn->get_dir()->inode->nestlock); // also, wrlock on dir mtime
if (layout) ......
else
mds->locker->include_snap_rdlocks(rdlocks, dn->get_dir()->inode); // put the snaplock of every CInode on the path into rdlocks, i.e. from "test" up to "/"
return dn;
}
The new CDentry for "a" is created in prepare_null_dentry:
CDentry* Server::prepare_null_dentry(MDRequestRef& mdr, CDir *dir, std::string_view dname, bool okexist)
{ // dir is the CDir of "test", dname = "a"
// does it already exist?
CDentry *dn = dir->lookup(dname);
if (dn) {......} // the lookup finds nothing, so dn is NULL
// create
dn = dir->add_null_dentry(dname, mdcache->get_global_snaprealm()->get_newest_seq() + 1); // create the CDentry
dn->mark_new(); // sets the "new" bit in state (state | 1)
return dn;
}
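That is, Server::prepare_null_dentry first searches the items of the CDir of the parent directory "test" for a CDentry named "a", and creates a new CDentry if none is found. Studying the MDS without looking at the metadata details makes it easy to get lost, so the CDentry class definition follows. CDentry inherits from LRUObject because CDentry is cached metadata, and a simple LRU algorithm keeps the cache size in balance. The meaning of its member variables comes first.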
class CDentry : public MDSCacheObject, public LRUObject, public Counter<CDentry> {
......
// member variables
public:
__u32 hash; // hash of "a", computed by ceph_str_hash_rjenkins
snapid_t first, last;
elist<CDentry*>::item item_dirty, item_dir_dirty;
elist<CDentry*>::item item_stray;
// lock
static LockType lock_type; // LockType CDentry::lock_type(CEPH_LOCK_DN)
static LockType versionlock_type; // LockType CDentry::versionlock_type(CEPH_LOCK_DVERSION)
SimpleLock lock; // initially lock.type->type = CEPH_LOCK_DN, lock.state = LOCK_SYNC
LocalLock versionlock; // initially versionlock.type->type = CEPH_LOCK_DVERSION, versionlock.state = LOCK_LOCK
mempool::mds_co::map<client_t,ClientLease*> client_lease_map;
protected:
CDir *dir = nullptr; // dir is the CDir of the parent directory, i.e. the CDir of "test"
linkage_t linkage; // holds the CInode; during mkdir the CInode does not exist yet, so linkage is empty
mempool::mds_co::list<linkage_t> projected; // a change to the CDentry's linkage is not applied to linkage directly;
// instead a temporary linkage_t holding the new value is created and stored in projected,
// and once the journal entry is flushed the temporary value is assigned to linkage and removed.
// projected therefore holds the pending values of linkage_t.
version_t version = 0;
version_t projected_version = 0; // what it will be when i unlock/commit.
private:
mempool::mds_co::string name; // file or directory name, name = "a"
public:
struct linkage_t { // linkage_t mainly stores the CInode pointer
CInode *inode = nullptr;
inodeno_t remote_ino = 0;
unsigned char remote_d_type = 0;
......
};
}
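The projected list follows a stage-then-commit pattern, which a toy model can make concrete. ToyDentry, ToyLinkage and the three member functions below are hypothetical names chosen for this sketch; the real CDentry carries versions, snapshots and pins that are omitted here.

```cpp
#include <cassert>
#include <list>

// Hypothetical stand-in for linkage_t: 0 means "null dentry, no inode yet".
struct ToyLinkage {
    int inode_no = 0;
};

struct ToyDentry {
    ToyLinkage linkage;               // committed value
    std::list<ToyLinkage> projected;  // staged values, oldest first

    // Stage a change without touching the committed linkage.
    void push_projected(int ino) { projected.push_back(ToyLinkage{ino}); }

    // The newest staged value if there is one, else the committed value.
    const ToyLinkage& get_projected() const {
        return projected.empty() ? linkage : projected.back();
    }

    // Called once the journal entry for the oldest staged change is durable.
    void pop_projected() {
        if (projected.empty()) return;  // nothing staged
        linkage = projected.front();
        projected.pop_front();
    }
};
```

After push_projected(0x1001) the committed linkage is still null while get_projected() already reports 0x1001; only pop_projected() makes the change visible in linkage.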
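Next, rdlocks, wrlocks and xlocks are filled in, and the locks in these sets are then acquired. Metadata may be modified only after all the required locks are held.
Locker::acquire_locks
Before acquire_locks runs, it helps to know which locks are to be acquired:
- rdlock and xlock on the lock of the CDentry of "a" (one oddity: once a lock is xlocked there is no need to rdlock it as well, and in fact only the xlock is taken below), because the contents of the CDentry of "a" will be read and written;
- wrlock on the filelock and nestlock of "test", the parent directory of "a", because dirstat and rstat in the inode of the "test" CInode will be modified;
- rdlock on the authlock of "test", because the permission-related fields of "test" (mode, uid, gid and so on) will be read;
- the rest are snaplocks, which relate to snapshots and are not discussed here.
The reasons for these locks, in more detail:
1. The authlock of the "test" CInode is rdlocked because Server::prepare_new_inode reads the mode of the "test" CInode, as follows: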
if (diri->inode.mode & S_ISGID) {
dout(10) << " dir is sticky" << dendl;
in->inode.gid = diri->inode.gid;
if (S_ISDIR(mode)) {
dout(10) << " new dir also sticky" << dendl;
in->inode.mode |= S_ISGID;
}
}
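2. The filelock and nestlock of the "test" CInode are wrlocked because MDCache::predirty_journal_parents later modifies dirstat and rstat in the inode_t of the "test" CInode: dirstat is protected by filelock, and rstat by nestlock.
3. The CDentry of "a" is xlocked because its linkage_t will be filled in later (the CInode pointer and so on).
4. Later on, the versionlock of the CInode is also wrlocked, because the version in the CInode's inode_t is modified; the nestlock of the "/" CInode is wrlocked as well.
The code of Locker::acquire_locks runs to several hundred lines; it is split here into three steps.
Step one sorts out xlocks, wrlocks and rdlocks. The three containers may contain duplicate locks, so every lock is put into one overall set (sorted).
- First xlocks is traversed: the lock of the CDentry of "a" goes into sorted, the CDentry of "a" goes into mustpin, and the versionlock of the CDentry of "a" goes into wrlocks;
- then wrlocks is traversed: the versionlock of the CDentry of "a" and the filelock and nestlock of the "test" CInode go into sorted, and the "test" CInode goes into mustpin;
- finally rdlocks is traversed: the lock of the CDentry of "a", the authlock and snaplock of the "test" CInode, and the snaplock of the "/" CInode go into sorted, and the "/" CInode goes into mustpin.
The code is as follows: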
bool Locker::acquire_locks(MDRequestRef& mdr, set<SimpleLock*> &rdlocks, set<SimpleLock*> &wrlocks, set<SimpleLock*> &xlocks,
map<SimpleLock*,mds_rank_t> *remote_wrlocks, CInode *auth_pin_freeze, bool auth_pin_nonblock)
{ // remote_wrlocks = NULL, auth_pin_freeze = NULL, auth_pin_nonblock = false
client_t client = mdr->get_client();
set<SimpleLock*, SimpleLock::ptr_lt> sorted; // sort everything we will lock
set<MDSCacheObject*> mustpin; // items to authpin
// xlocks: traverse xlocks; at this point it holds a single lock, the lock of the CDentry of "a"
for (set<SimpleLock*>::iterator p = xlocks.begin(); p != xlocks.end(); ++p) {
SimpleLock *lock = *p;
sorted.insert(lock); // put the lock of the CDentry of "a" into sorted
mustpin.insert(lock->get_parent()); // put the CDentry into mustpin
// augment xlock with a versionlock?
if ((*p)->get_type() == CEPH_LOCK_DN) {
CDentry *dn = (CDentry*)lock->get_parent(); // dn is the CDentry of "a"
if (mdr->is_master()) {
// master. wrlock versionlock so we can pipeline dentry updates to journal.
wrlocks.insert(&dn->versionlock); // put the versionlock of the CDentry of "a" into wrlocks
} else { ...... }
} ......
}
// wrlocks: traverse wrlocks; it now holds three locks: the versionlock of the CDentry of "a",
// and the filelock and nestlock of the "test" CInode
for (set<SimpleLock*>::iterator p = wrlocks.begin(); p != wrlocks.end(); ++p) {
MDSCacheObject *object = (*p)->get_parent();
sorted.insert(*p); // add the three locks to sorted
if (object->is_auth())
mustpin.insert(object); // add the "test" CInode to mustpin
else if ......
}
// rdlocks: rdlocks holds 4 locks: the lock of the CDentry of "a",
// the authlock and snaplock of the "test" CInode, and the snaplock of the "/" CInode
for (set<SimpleLock*>::iterator p = rdlocks.begin();p != rdlocks.end();++p) {
MDSCacheObject *object = (*p)->get_parent();
sorted.insert(*p); // add the 4 locks to sorted
if (object->is_auth())
mustpin.insert(object); // add the "/" CInode to mustpin
else if ......
}
......
}
To sum up, sorted now holds 7 locks: the lock and versionlock of the CDentry of "a", the filelock, nestlock, authlock and snaplock of the "test" CInode, and the snaplock of the "/" directory.
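The deduplication in step one can be sketched with plain string sets. The lock names ("a:dn", "test:ifile" and so on), LockSet and build_sorted are made up for this illustration, and the real sorted set orders SimpleLock pointers with SimpleLock::ptr_lt rather than lexicographically.

```cpp
#include <cassert>
#include <set>
#include <string>

using LockSet = std::set<std::string>;

// Merge the three request sets into one duplicate-free set.
// Assumed naming rule for this sketch: a dentry lock is "<name>:dn" and its
// versionlock is "<name>:dversion". Xlocking a dentry lock also adds the
// dentry's versionlock to wrlocks, as the walkthrough describes.
LockSet build_sorted(const LockSet& rdlocks, LockSet& wrlocks, const LockSet& xlocks) {
    const std::string suffix = ":dn";
    LockSet sorted;
    for (const std::string& l : xlocks) {
        sorted.insert(l);
        if (l.size() >= suffix.size() &&
            l.compare(l.size() - suffix.size(), suffix.size(), suffix) == 0) {
            wrlocks.insert(l.substr(0, l.size() - suffix.size()) + ":dversion");
        }
    }
    sorted.insert(wrlocks.begin(), wrlocks.end());
    sorted.insert(rdlocks.begin(), rdlocks.end());  // duplicates collapse here
    return sorted;
}
```

With rdlocks = {a:dn, test:iauth, test:isnap, /:isnap}, wrlocks = {test:ifile, test:inest} and xlocks = {a:dn}, the result has 7 entries, matching the count above: "a:dn" appears in two input sets but only once in sorted.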
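Step two auth_pins the metadata. From step one, the MDSCacheObjects to auth_pin are the CDentry of "a", the CInode of "test" and the CInode of "/". These three are traversed first to see whether each can be auth_pinned, which involves two checks: auth and pin. If the current MDS is not the auth for the MDSCacheObject, the request has to be sent to the auth MDS to do the auth_pin. If the MDSCacheObject is frozen or freezing, it cannot be auth_pinned; the request joins a wait queue until auth_pin becomes possible, and the function returns false immediately. Only when everything can be auth_pinned does the actual auth_pin follow, incrementing auth_pins in each MDSCacheObject. The code is as follows: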
bool Locker::acquire_locks(MDRequestRef& mdr, set<SimpleLock*> &rdlocks, set<SimpleLock*> &wrlocks, set<SimpleLock*> &xlocks,
map<SimpleLock*,mds_rank_t> *remote_wrlocks, CInode *auth_pin_freeze, bool auth_pin_nonblock)
{
......
// AUTH PINS
map<mds_rank_t, set<MDSCacheObject*> > mustpin_remote; // mds -> (object set)
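// can i auth pin them all now? (check whether auth_pin is possible)
// traverse mustpin, which holds three elements: the CDentry of "a", the CInode of "test", the CInode of "/"
marker.message = "failed to authpin local pins";
for (set<MDSCacheObject*>::iterator p = mustpin.begin();p != mustpin.end(); ++p) {
MDSCacheObject *object = *p;
if (mdr->is_auth_pinned(object)) {...... }// checks whether mdr's auth_pins already contains this MDSCacheObject; if so it is already auth_pinned
if (!object->is_auth()) { ...... } // not the auth node: the CDentry/CInode is added to mustpin_remote, and when auth_pinning below an MMDSSlaveRequest is sent to the auth MDS to handle it;
// the CDentry/CInode is added to waiting_on_slave and the function returns
int err = 0;
if (!object->can_auth_pin(&err)) { // whether a CDentry can be auth_pinned depends on whether the CDir of its parent directory ("test") can_auth_pin:
// whether the "test" CDir is auth, and whether it is frozen or freezing.
// If auth_pin is not possible, add_waiter and return, to be woken up and retried later.
// Whether a CInode can be auth_pinned depends on whether the CInode is auth, whether the inode is frozen or freezing, and whether its auth_pin is frozen;
// it also depends on whether the CInode's CDentry can_auth_pin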
if (err == MDSCacheObject::ERR_EXPORTING_TREE) {
marker.message = "failed to authpin, subtree is being exported";
} else if (err == MDSCacheObject::ERR_FRAGMENTING_DIR) {
marker.message = "failed to authpin, dir is being fragmented";
} else if (err == MDSCacheObject::ERR_EXPORTING_INODE) {
marker.message = "failed to authpin, inode is being exported";
}
object->add_waiter(MDSCacheObject::WAIT_UNFREEZE, new C_MDS_RetryRequest(mdcache, mdr));
......
return false;
}
}
// ok, grab local auth pins
for (set<MDSCacheObject*>::iterator p = mustpin.begin(); p != mustpin.end(); ++p) {
MDSCacheObject *object = *p;
if (mdr->is_auth_pinned(object)) { ...... }
else if (object->is_auth()) {
mdr->auth_pin(object); // do the auth_pin, i.e. auth_pins++ in the object
}
......
}
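The check-everything-then-pin shape of step two can be sketched as follows. ToyObject and auth_pin_all are hypothetical; in particular the sketch treats a non-auth object as simply blocked, whereas the walkthrough explains that the real code forwards the pin to the auth MDS, and it returns false where the real code also registers a waiter.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an MDSCacheObject.
struct ToyObject {
    bool auth = true;     // is this MDS authoritative for the object?
    bool frozen = false;  // frozen or freezing
    int auth_pins = 0;

    bool can_auth_pin() const { return auth && !frozen; }
};

// Phase 1: verify that every object can be pinned.
// Phase 2: pin them all.
// If any object is blocked, nothing is pinned and false is returned
// (the real code parks a retry context on the blocked object here).
bool auth_pin_all(const std::vector<ToyObject*>& mustpin) {
    for (const ToyObject* o : mustpin) {
        if (!o->can_auth_pin())
            return false;
    }
    for (ToyObject* o : mustpin)
        ++o->auth_pins;
    return true;
}
```

Checking first avoids a half-pinned state: if the "test" CInode is frozen, the CDentry of "a" is not pinned either, and the whole request can simply be retried later.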
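Step three does the actual locking. After the operations above, the set of locks to take has changed:
wrlocks now also contains the versionlock of the CDentry of "a".
sorted holds 7 locks: the versionlock and lock of the CDentry of "a", the snaplock of the "/" directory, and the snaplock, filelock, authlock and nestlock of the "test" CInode.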
bool Locker::acquire_locks(MDRequestRef& mdr, set<SimpleLock*> &rdlocks, set<SimpleLock*> &wrlocks, set<SimpleLock*> &xlocks,
map<SimpleLock*,mds_rank_t> *remote_wrlocks, CInode *auth_pin_freeze, bool auth_pin_nonblock)
{
......
// caps i'll need to issue
set<CInode*> issue_set;
bool result = false;
// acquire locks.
// make sure they match currently acquired locks.
set<SimpleLock*, SimpleLock::ptr_lt>::iterator existing = mdr->locks.begin();
for (set<SimpleLock*, SimpleLock::ptr_lt>::iterator p = sorted.begin(); p != sorted.end(); ++p) {
bool need_wrlock = !!wrlocks.count(*p); // first comes the versionlock of the CDentry of "a"
bool need_remote_wrlock = !!(remote_wrlocks && remote_wrlocks->count(*p));
// lock
if (xlocks.count(*p)) {
marker.message = "failed to xlock, waiting";
// xlock_start on the lock of the CDentry of "a"; the lock state goes LOCK_SYNC --> LOCK_SYNC_LOCK --> LOCK_LOCK (simple_lock) --> LOCK_LOCK_XLOCK --> LOCK_PREXLOCK (simple_xlock)
// --> LOCK_XLOCK (xlock_start)
if (!xlock_start(*p, mdr)) // take the xlock first
goto out;
dout(10) << " got xlock on " << **p << " " << *(*p)->get_parent() << dendl;
} else if (need_wrlock || need_remote_wrlock) {
if (need_wrlock && !mdr->wrlocks.count(*p)) {
marker.message = "failed to wrlock, waiting";
// nowait if we have already gotten remote wrlock
if (!wrlock_start(*p, mdr, need_remote_wrlock)) // take the wrlock
goto out;
dout(10) << " got wrlock on " << **p << " " << *(*p)->get_parent() << dendl;
}
} else {
marker.message = "failed to rdlock, waiting";
if (!rdlock_start(*p, mdr)) // take the rdlock
goto out;
dout(10) << " got rdlock on " << **p << " " << *(*p)->get_parent() << dendl;
}
}
......
out:
issue_caps_set(issue_set);
return result;
}
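Traversal of sorted begins.
- wrlock the versionlock of the CDentry of "a". Whether a wrlock is possible depends only on whether the lock is already xlocked; here the wrlock can be taken directly. No lock state switch is involved (versionlock is a LocalLock).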
bool can_wrlock() const {
return !is_xlocked();
}
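- xlock the lock of the CDentry of "a" (a SimpleLock), i.e. run xlock_start. The lock starts in LOCK_SYNC, a state that cannot be xlocked directly. The detailed check is not covered here; it will be expanded when locks are studied later.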
bool can_xlock(client_t client) const {
return get_sm()->states[state].can_xlock == ANY ||
(get_sm()->states[state].can_xlock == AUTH && parent->is_auth()) ||
(get_sm()->states[state].can_xlock == XCL && client >= 0 && get_xlock_by_client() == client);
}
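The simplelock array defined in locks.cc shows that get_sm()->states[state].can_xlock == 0 for this state, which does not satisfy the xlock condition (the value must be nonzero), so the lock has to switch state.
First, Locker::simple_lock moves the lock to LOCK_LOCK: LOCK_SYNC --> LOCK_SYNC_LOCK --> LOCK_LOCK. The LOCK_SYNC_LOCK --> LOCK_LOCK step requires several conditions to be checked: whether the lock is leased; whether it is rdlocked; and whether the CDentry has replicas on other MDSs, in which case a LOCK_AC_LOCK message must be sent to each MDS holding a replica so that it locks too. All conditions are met here because directory "a" is only just being created. LOCK_LOCK still does not permit xlock, so Locker::simple_xlock continues the switch: LOCK_LOCK --> LOCK_LOCK_XLOCK --> LOCK_PREXLOCK. In LOCK_PREXLOCK the xlock can be taken, and the state finally becomes LOCK_XLOCK.
- rdlock the snaplocks of the "/" and "test" CInodes (simple_lock type). Both are in LOCK_SYNC, which allows rdlock directly, so no state switch is involved.
- wrlock the filelock of the "test" CInode. The lock starts in LOCK_SYNC, which does not allow wrlock, so Locker::simple_lock has to switch its state. The lock first moves to the intermediate state LOCK_SYNC_LOCK, and a check then decides whether it can move on to LOCK_LOCK. CInode::issued_caps_need_gather finds that another client holds the "Fs" cap on the "test" directory inode (the filelock is now in LOCK_SYNC_LOCK, and in this state clients may hold only "Fc"; no other "F" cap is allowed), so the filelock of the "test" CInode cannot move to LOCK_LOCK. Locker::issue_caps has to revoke the "Fs" cap held by the other clients.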
void Locker::simple_lock(SimpleLock *lock, bool *need_issue)
{ // need_issue = NULL
CInode *in = 0;
if (lock->get_cap_shift()) // the lock type is CEPH_LOCK_IFILE, so cap_shift is 8
in = static_cast<CInode *>(lock->get_parent());
int old_state = lock->get_state(); // old_state = LOCK_SYNC
switch (lock->get_state()) {
case LOCK_SYNC: lock->set_state(LOCK_SYNC_LOCK); break;
......}
int gather = 0;
if (lock->is_leased()) { ...... }
if (lock->is_rdlocked()) gather++;
if (in && in->is_head()) {
if (in->issued_caps_need_gather(lock)) { // in->issued_caps_need_gather(lock) = true
if (need_issue) *need_issue = true;
else issue_caps(in);
gather++;
}
}
......
if (gather) {
lock->get_parent()->auth_pin(lock);
......
} else { ...... }
}
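The state walk described for the dentry lock (simple_lock, then simple_xlock, then xlock_start) can be sketched as a tiny state machine that stalls in the intermediate state while something still has to be gathered. LockState, ToyLock and toy_xlock_start are hypothetical names, and the only gather condition modeled is outstanding rdlocks; leases, replicas and issued caps are left out.

```cpp
#include <cassert>
#include <vector>

enum class LockState { SYNC, SYNC_LOCK, LOCK, LOCK_XLOCK, PREXLOCK, XLOCK };

// Hypothetical stand-in for a SimpleLock.
struct ToyLock {
    LockState state = LockState::SYNC;
    int rdlocks = 0;  // readers that must drain before SYNC_LOCK can become LOCK
    std::vector<LockState> history{LockState::SYNC};

    void set(LockState s) {
        state = s;
        history.push_back(s);
    }
};

// Returns true once the lock is in XLOCK; returns false if it has to wait
// in an intermediate state (the caller would retry after being woken).
bool toy_xlock_start(ToyLock& l) {
    if (l.state == LockState::SYNC)
        l.set(LockState::SYNC_LOCK);       // simple_lock: start the switch
    if (l.state == LockState::SYNC_LOCK) {
        if (l.rdlocks > 0)
            return false;                  // gather != 0: stay in SYNC_LOCK
        l.set(LockState::LOCK);
    }
    if (l.state == LockState::LOCK) {
        l.set(LockState::LOCK_XLOCK);      // simple_xlock
        l.set(LockState::PREXLOCK);
    }
    if (l.state == LockState::PREXLOCK)
        l.set(LockState::XLOCK);           // xlock_start takes the xlock
    return l.state == LockState::XLOCK;
}
```

For the freshly created dentry nothing needs gathering, so one call walks through all six states. A lock with an outstanding reader stops in SYNC_LOCK, which is the same situation the filelock of "test" ends up in because of the issued "Fs" cap.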
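The code of issue_caps follows. It traverses the Capability of each client stored in client_caps of the "test" directory's CInode. get_caps_allowed_by_type computes the caps clients are allowed as "pAsLsXsFc", while a client holds "pAsLsXsFs", so a CEPH_CAP_OP_REVOKE message is sent to that client to make it release "Fs".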
bool Locker::issue_caps(CInode *in, Capability *only_cap)
{
// allowed caps are determined by the lock mode.
int all_allowed = in->get_caps_allowed_by_type(CAP_ANY); // all_allowed = "pAsLsXsFc"
int loner_allowed = in->get_caps_allowed_by_type(CAP_LONER); // loner_allowed = "pAsLsXsFc"
int xlocker_allowed = in->get_caps_allowed_by_type(CAP_XLOCKER); // xlocker_allowed = "pAsLsXsFc"
// count conflicts with
int nissued = 0;
// client caps
map<client_t, Capability>::iterator it;
if (only_cap) ...... // only_cap = NULL
else it = in->client_caps.begin();
for (; it != in->client_caps.end(); ++it) {
Capability *cap = &it->second;
if (cap->is_stale()) continue; // skip caps that are stale
// do not issue _new_ bits when size|mtime is projected
int allowed;
if (loner == it->first) ......
else allowed = all_allowed; // allowed = all_allowed = "pAsLsXsFc"
// add in any xlocker-only caps (for locks this client is the xlocker for)
allowed |= xlocker_allowed & in->get_xlocker_mask(it->first); // allowed |= 0
int pending = cap->pending(); // pending = "pAsLsXsFs"
int wanted = cap->wanted(); // wanted = "AsLsXsFsx"
// are there caps that the client _wants_ and can have, but aren't pending?
// or do we need to revoke?
if (((wanted & allowed) & ~pending) || // missing wanted+allowed caps
(pending & ~allowed)) { // need to revoke ~allowed caps. // (pending & ~allowed) = "Fs"
// issue
nissued++;
// include caps that clients generally like, while we're at it.
int likes = in->get_caps_liked(); // likes = "pAsxLsxXsxFsx"
int before = pending; // before = "pAsLsXsFs"
long seq;
if (pending & ~allowed)
// (wanted|likes) & allowed & pending = ("AsLsXsFsx" | "pAsxLsxXsxFsx") & "pAsLsXsFc" & "pAsLsXsFs" = "pAsLsXs"
seq = cap->issue((wanted|likes) & allowed & pending); // if revoking, don't issue anything new.
else ......
int after = cap->pending(); // after = "pAsLsXs"
if (cap->is_new()) { ......
} else {
int op = (before & ~after) ? CEPH_CAP_OP_REVOKE : CEPH_CAP_OP_GRANT; // op = CEPH_CAP_OP_REVOKE
if (op == CEPH_CAP_OP_REVOKE) {
revoking_caps.push_back(&cap->item_revoking_caps);
revoking_caps_by_client[cap->get_client()].push_back(&cap->item_client_revoking_caps);
cap->set_last_revoke_stamp(ceph_clock_now());
cap->reset_num_revoke_warnings();
}
auto m = MClientCaps::create(op, in->ino(), in->find_snaprealm()->inode->ino(),cap->get_cap_id(),
cap->get_last_seq(), after, wanted, 0, cap->get_mseq(), mds->get_osd_epoch_barrier());
in->encode_cap_message(m, cap);
mds->send_message_client_counted(m, it->first);
}
}
}
return (nissued == 0); // true if no re-issued, no callbacks
}
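The revoke-or-grant decision inside the loop boils down to a few mask operations, sketched below. The bit values, ToyIssue and toy_issue are assumptions for this illustration only; the real function also handles loner and xlocker masks, sequence numbers and message sending.

```cpp
#include <cassert>

// Illustrative bit values for this sketch only (assumed, not Ceph's layout).
constexpr unsigned kPin = 1u << 0;  // "p"
constexpr unsigned kAs  = 1u << 1;
constexpr unsigned kAx  = 1u << 2;
constexpr unsigned kLs  = 1u << 3;
constexpr unsigned kLx  = 1u << 4;
constexpr unsigned kXs  = 1u << 5;
constexpr unsigned kXx  = 1u << 6;
constexpr unsigned kFs  = 1u << 7;
constexpr unsigned kFx  = 1u << 8;
constexpr unsigned kFc  = 1u << 9;

struct ToyIssue {
    bool revoke;           // true if some held bit is no longer allowed
    unsigned new_pending;  // what the client may hold afterwards
};

ToyIssue toy_issue(unsigned pending, unsigned wanted, unsigned likes, unsigned allowed) {
    if (pending & ~allowed) {
        // Revoking: keep only bits the client already has; add nothing new.
        return {true, (wanted | likes) & allowed & pending};
    }
    if ((wanted & allowed) & ~pending) {
        // Granting: the client wants something it may have but lacks.
        return {false, (wanted | likes) & allowed};
    }
    return {false, pending};  // nothing to change
}
```

With pending "pAsLsXsFs", allowed "pAsLsXsFc", wanted "AsLsXsFsx" and likes "pAsxLsxXsxFsx", the function reports a revoke and a new pending mask of "pAsLsXs"; the one bit taken away is "Fs".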
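After the revoke message has been sent, Locker::wrlock_start breaks out of its loop, creates a C_MDS_RetryRequest and adds it to the wait queue; the request is taken out and executed again once the lock reaches a stable state.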
bool Locker::wrlock_start(SimpleLock *lock, MDRequestRef& mut, bool nowait)
{ // nowait = false
......
while (1) {
// wrlock?
// the sm of the ScatterLock is sm_filelock and its states table is filelock; the CInode's filelock->state is now LOCK_SYNC_LOCK and filelock[LOCK_SYNC_LOCK].can_wrlock == 0, so wrlock is not possible
if (lock->can_wrlock(client) && (!want_scatter || lock->get_state() == LOCK_MIX)) { ...... }
......
if (!lock->is_stable()) break; // filelock->state is LOCK_SYNC_LOCK, which is not stable, so break out of the loop
......
}
if (!nowait) {
dout(7) << "wrlock_start waiting on " << *lock << " on " << *lock->get_parent() << dendl;
lock->add_waiter(SimpleLock::WAIT_STABLE, new C_MDS_RetryRequest(mdcache, mut)); // queue a C_MDS_RetryRequest(mdcache, mut) until the filelock of the "test" CInode becomes stable
nudge_log(lock);
}
return false;
}
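The park-and-retry pattern used here can be sketched with callbacks standing in for C_MDS_RetryRequest. ToyWaitLock, toy_wrlock_start and toy_became_stable are hypothetical names; the sketch collapses "the lock is stable and allows wrlock" into a single flag.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a lock with a waiter queue.
struct ToyWaitLock {
    bool stable = true;
    std::vector<std::function<void()>> waiters;

    void add_waiter(std::function<void()> retry) {
        waiters.push_back(std::move(retry));
    }

    // Hand out all waiters and leave the queue empty, so that a retry
    // which fails again can register itself afresh.
    std::vector<std::function<void()>> take_waiting() {
        std::vector<std::function<void()>> out;
        out.swap(waiters);
        return out;
    }
};

// Returns true if the wrlock could be taken. Otherwise a retry callback is
// parked on the lock and false is returned. attempts counts the calls.
bool toy_wrlock_start(ToyWaitLock& lock, int& attempts) {
    ++attempts;
    if (lock.stable)
        return true;
    lock.add_waiter([&lock, &attempts] { toy_wrlock_start(lock, attempts); });
    return false;
}

// Called when the lock reaches a stable state, e.g. after the revoked cap
// has been released by the client.
void toy_became_stable(ToyWaitLock& lock) {
    lock.stable = true;
    std::vector<std::function<void()>> ready = lock.take_waiting();
    for (auto& retry : ready)
        retry();
}
```

The first attempt fails and leaves one waiter on the lock; when toy_became_stable runs, the parked request is retried and the second attempt succeeds.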
The client then replies with a caps message whose op is CEPH_CAP_OP_UPDATE. The MDS handles this message in Locker::handle_client_caps.
Locker::handle_client_caps
void Locker::handle_client_caps(const MClientCaps::const_ref &m)
{
client_t client = m->get_source().num();
snapid_t follows = m->get_snap_follows(); // follows = 0
auto op = m->get_op(); // op = CEPH_CAP_OP_UPDATE
auto dirty = m->get_dirty(); // dirty = 0
Session *session = mds->get_session(m);
......
CInode *head_in = mdcache->get_inode(m->get_ino()); // head_in is the CInode of "test"
Capability *cap = 0;
cap = head_in->get_client_cap(client); // get this client's cap
bool need_unpin = false;
// flushsnap?
if (cap->get_cap_id() != m->get_cap_id()) { ...... }
else {
CInode *in = head_in;
// head inode, and cap
MClientCaps::ref ack;
int caps = m->get_caps(); // caps = "pAsLsXs"
cap->confirm_receipt(m->get_seq(), caps); // cap->_issued = "pAsLsXs", cap->_pending = "pAsLsXs"
// filter wanted based on what we could ever give out (given auth/replica status)
bool need_flush = m->flags & MClientCaps::FLAG_SYNC;
int new_wanted = m->get_wanted() & head_in->get_caps_allowed_ever(); // m->get_wanted() = 0
if (new_wanted != cap->wanted()) { // cap->wanted() = "AsLsXsFsx"
......
adjust_cap_wanted(cap, new_wanted, m->get_issue_seq()); // set wanted to 0
}
if (updated) { ...... }
else {
bool did_issue = eval(in, CEPH_CAP_LOCKS); //
......
}
if (need_flush)
mds->mdlog->flush();
}
out:
if (need_unpin)
head_in->auth_unpin(this);
}
bool Locker::eval(CInode *in, int mask, bool caps_imported)
{ // in points to the CInode of the "test" directory, mask = 2496, caps_imported = false
bool need_issue = caps_imported; // need_issue = false
MDSInternalContextBase::vec finishers;
retry:
if (mask & CEPH_LOCK_IFILE) // the filelock state is LOCK_SYNC_LOCK here, which is not stable
eval_any(&in->filelock, &need_issue, &finishers, caps_imported);
if (mask & CEPH_LOCK_IAUTH) // the authlock state is LOCK_SYNC here
eval_any(&in->authlock, &need_issue, &finishers, caps_imported);
if (mask & CEPH_LOCK_ILINK) // the linklock state is LOCK_SYNC here
eval_any(&in->linklock, &need_issue, &finishers, caps_imported);
if (mask & CEPH_LOCK_IXATTR) // the xattrlock state is LOCK_SYNC here
eval_any(&in->xattrlock, &need_issue, &finishers, caps_imported);
if (mask & CEPH_LOCK_INEST)
eval_any(&in->nestlock, &need_issue, &finishers, caps_imported);
if (mask & CEPH_LOCK_IFLOCK)
eval_any(&in->flocklock, &need_issue, &finishers, caps_imported);
if (mask & CEPH_LOCK_IPOLICY)
eval_any(&in->policylock, &need_issue, &finishers, caps_imported);
// drop loner?
......
finish_contexts(g_ceph_context, finishers);
if (need_issue && in->is_head())
issue_caps(in);
dout(10) << "eval done" << dendl;
return need_issue;
}
void Locker::eval_gather(SimpleLock *lock, bool first, bool *pneed_issue, MDSInternalContextBase::vec *pfinishers)
{ // first = false
int next = lock->get_next_state(); // next = LOCK_LOCK
CInode *in = 0;
bool caps = lock->get_cap_shift(); // cap_shift = 8, so caps is true
if (lock->get_type() != CEPH_LOCK_DN)
in = static_cast<CInode *>(lock->get_parent()); // the CInode of "test"
bool need_issue = false;
int loner_issued = 0, other_issued = 0, xlocker_issued = 0;
if (caps && in->is_head()) {
in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued, lock->get_cap_shift(), lock->get_cap_mask());
// result: loner_issued = 0, other_issued = 0, xlocker_issued = 0
......
}
#define IS_TRUE_AND_LT_AUTH(x, auth) (x && ((auth && x <= AUTH) || (!auth && x < AUTH)))
bool auth = lock->get_parent()->is_auth();
if (!lock->is_gathering() && // gather_set is empty, i.e. no other MDS has to be waited for, so the lock is not gathering
(IS_TRUE_AND_LT_AUTH(lock->get_sm()->states[next].can_rdlock, auth) || !lock->is_rdlocked()) &&
(IS_TRUE_AND_LT_AUTH(lock->get_sm()->states[next].can_wrlock, auth) || !lock->is_wrlocked()) &&
(IS_TRUE_AND_LT_AUTH(lock->get_sm()->states[next].can_xlock, auth) || !lock->is_xlocked()) &&
(IS_TRUE_AND_LT_AUTH(lock->get_sm()->states[next].can_lease, auth) || !lock->is_leased()) &&
!(lock->get_parent()->is_auth() && lock->is_flushing()) && // i.e. wait for scatter_writebehind!
(!caps || ((~lock->gcaps_allowed(CAP_ANY, next) & other_issued) == 0 &&
(~lock->gcaps_allowed(CAP_LONER, next) & loner_issued) == 0 &&
(~lock->gcaps_allowed(CAP_XLOCKER, next) & xlocker_issued) == 0)) &&
lock->get_state() != LOCK_SYNC_MIX2 && // these states need an explicit trigger from the auth mds
lock->get_state() != LOCK_MIX_SYNC2
) {
if (!lock->get_parent()->is_auth()) { // on a replica, send a message to the auth MDS and let it take the lock
......
} else {
......
}
lock->set_state(next); // move the lock to LOCK_LOCK
if (lock->get_parent()->is_auth() && lock->is_stable())
lock->get_parent()->auth_unpin(lock);
// drop loner before doing waiters
if (pfinishers)
// take out the C_MDS_RetryRequest queued earlier for the mkdir and put it into pfinishers
lock->take_waiting(SimpleLock::WAIT_STABLE|SimpleLock::WAIT_WR|SimpleLock::WAIT_RD|SimpleLock::WAIT_XLOCK, *pfinishers);
...
if (caps && in->is_head()) need_issue = true;
if (lock->get_parent()->is_auth() && lock->is_stable()) try_eval(lock, &need_issue);
}
if (need_issue) {
if (pneed_issue)
*pneed_issue = true;
else if (in->is_head())
issue_caps(in);
}
}
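The long condition in the function above can be read as "nothing still outstanding conflicts with the next state". A reduced sketch of that gate follows; ToyGatherState, can_advance and the two bit constants are invented for this illustration, and only three of the real conditions are modeled.

```cpp
#include <cassert>

// Illustrative bit values for this sketch only (assumed, not Ceph's layout).
constexpr unsigned kToyFs = 1u << 0;  // "Fs"
constexpr unsigned kToyFc = 1u << 1;  // "Fc"

// Hypothetical snapshot of what is still outstanding on a lock that sits
// in a transitional state such as LOCK_SYNC_LOCK.
struct ToyGatherState {
    bool gathering_replicas = false;  // still waiting for other MDSs
    int num_rdlock = 0;               // readers still holding the lock
    bool next_allows_rdlock = false;  // does the next state tolerate readers?
    unsigned issued = 0;              // cap bits clients still hold
    unsigned allowed_in_next = 0;     // cap bits the next state permits
};

// True if the lock may move on to its next state.
bool can_advance(const ToyGatherState& s) {
    if (s.gathering_replicas)
        return false;
    if (s.num_rdlock > 0 && !s.next_allows_rdlock)
        return false;
    // No client may hold a cap bit that the next state forbids.
    return (~s.allowed_in_next & s.issued) == 0;
}
```

While a client still holds "Fs" and the next state permits only "Fc", can_advance returns false. Once the client has released "Fs", issued is 0 and the lock may move to its next state, at which point the queued mkdir request is woken up.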