That SSDs will replace HDDs is inevitable. The SSD interface has moved from SATA to PCIe, and the host protocol from AHCI to NVMe, with each transition bringing another jump in performance. The NVMe physical layer is built on high-speed PCIe: a single PCIe 3.0 lane already runs at 8 Gb/s, and x2/x4/x8... N lanes together give N times the bandwidth, which is very powerful. On top of that, the NVMe protocol itself is lean and efficient, further reducing protocol overhead and yielding very fast transfers in the end.
QEMU + BUILDROOT
Out of curiosity and interest, I wanted to learn NVMe. For the specifications, the natural choices are the latest PCIe 3.0 and NVMe 1.2 documents. But staying at the level of documents and specs is just armchair theorizing; deep understanding requires some hands-on practice. There are essentially two paths: study the NVMe module of the firmware on the device side, or study the NVMe driver on the host side. Since I no longer work on SSDs, the first path is basically closed, so it has to be the second. An actual NVMe SSD would be great for experiments, but I don't have one, and presumably most readers don't either. Fortunately there is always a workaround prepared for us: QEMU!
qemu is an emulator that can emulate x86, ARM, PowerPC, and much more, and it supports emulating an NVMe device (though for now the nvme device apparently only works on x86; I tried an ARM+PCI setup and qemu reported the nvme device as unsupported). For the host OS, Linux is the obvious choice, since we can conveniently get all of the NVMe-related source code.
So... how do we actually get started? Linux is just a kernel; we still need a rootfs and various other pieces. What we want to study here is the NVMe driver, not how to build a Linux system from scratch, so we need a quick and convenient route. This is where buildroot comes in: with it, everything is taken care of! ^_^ The prerequisite, of course, is a Linux host (I recommend Ubuntu: I like it, it has a large user base so problems are easy to look up online, and the latest release is 16.04) with qemu installed.
The latest buildroot release can be downloaded from: https://buildroot.org/downloads/buildroot-2016.05.tar.bz2
After extracting it, run:
make qemu_x86_64_defconfig
make
When the build finishes, use the command described in board/qemu/x86_64/readme.txt to boot the freshly built Linux system in qemu:
qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append root=/dev/vda -net nic,model=virtio -net user
By default the system log goes to the virtual machine's "screen" rather than to the shell; it cannot be scrolled back, which makes debugging inconvenient. We need a couple of changes to redirect the log to the Linux shell. First, edit the .config file in the buildroot directory, changing the getty port option `BR2_TARGET_GENERIC_GETTY_PORT="tty1"` to `BR2_TARGET_GENERIC_GETTY_PORT="ttyS0"`.
Then rebuild. Once the build completes, run the modified command below and we get what we want:
make
qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append "console=ttyS0 root=/dev/vda" -net nic,model=virtio -net user -serial stdio
Next, we modify the commands a bit more to add NVMe support:
qemu-img create -f raw nvme.img 1G
qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append "console=ttyS0 root=/dev/vda" -net nic,model=virtio -net user -serial stdio -drive file=nvme.img,if=none,format=raw,id=drv0 -device nvme,drive=drv0,serial=foo
Once the Linux system is up, we can see the NVMe-related devices under /dev:
# ls -l /dev
crw------- 1 root root 253, 0 Jun 3 13:00 nvme0
brw------- 1 root root 259, 0 Jun 3 13:00 nvme0n1
At this point we can pause the hands-on work and go study the NVMe code. Whenever we hit a question, we can modify the code and run it in qemu to see the effect. Great!
RTFSC - Read The Fucking Source Code
The NVMe driver analysis below is based on Linux kernel 4.5.3. Why this version? Mainly because it is the default kernel chosen by buildroot-2016.05. We could switch the kernel version manually, but I won't go into that here. The NVMe code lives in drivers/nvme; there are not many files, essentially just two: core.c and pci.c.
To analyze a driver, first find its entry point. module_init declares nvme_init as this driver's entry; it is called automatically during Linux boot.
static int __init nvme_init(void)
{
int result;
init_waitqueue_head(&nvme_kthread_wait);
nvme_workq = alloc_workqueue("nvme", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
if (!nvme_workq)
return -ENOMEM;
result = nvme_core_init();
if (result < 0)
goto kill_workq;
result = pci_register_driver(&nvme_driver);
if (result)
goto core_exit;
return 0;
core_exit:
nvme_core_exit();
kill_workq:
destroy_workqueue(nvme_workq);
return result;
}
static void __exit nvme_exit(void)
{
pci_unregister_driver(&nvme_driver);
nvme_core_exit();
destroy_workqueue(nvme_workq);
BUG_ON(nvme_thread && !IS_ERR(nvme_thread));
_nvme_check_size();
}
module_init(nvme_init);
module_exit(nvme_exit);
nvme_init flow:
- Create a global workqueue. With this workqueue in place, many pieces of work can be queued onto it for execution; the two important ones we will meet later, scan_work and reset_work, are both scheduled on it.
- Call nvme_core_init.
- Call pci_register_driver.
int __init nvme_core_init(void)
{
int result;
result = register_blkdev(nvme_major, "nvme");
if (result < 0)
return result;
else if (result > 0)
nvme_major = result;
result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
&nvme_dev_fops);
if (result < 0)
goto unregister_blkdev;
else if (result > 0)
nvme_char_major = result;
nvme_class = class_create(THIS_MODULE, "nvme");
if (IS_ERR(nvme_class)) {
result = PTR_ERR(nvme_class);
goto unregister_chrdev;
}
return 0;
unregister_chrdev:
__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
unregister_blkdev:
unregister_blkdev(nvme_major, "nvme");
return result;
}
nvme_core_init flow:
- Call register_blkdev to register a block device named nvme.
- Call __register_chrdev to register a character device named nvme. These registrations do not create anything under /dev; they show up in /proc/devices, representing the major number assigned to a device type, not a device instance. Note that character and block major numbers live in separate namespaces, so identical numbers can refer to completely unrelated devices. Here the character and block nvme majors both happen to be 253.
# cat /proc/devices
Character devices:
  1 mem
  2 pty
  3 ttyp
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  7 vcs
 10 misc
 13 input
 29 fb
116 alsa
128 ptm
136 pts
180 usb
189 usb_device
226 drm
253 nvme
254 bsg

Block devices:
259 blkext
  8 sd
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
128 sd
129 sd
130 sd
131 sd
132 sd
133 sd
134 sd
135 sd
253 nvme
254 virtblk
An attentive reader may have noticed that the character device nvme0 we listed earlier indeed has major number 253, but the block device nvme0n1 has major 259, i.e. blkext. Why is that? Hold that thought; we will come back to it when the device instance gets registered later.
Back in nvme_init, pci_register_driver registers a PCI driver. A few things matter here. One is the vendor id / device id table: one entry is PCI_VDEVICE(INTEL, 0x5845). With this table, the driver can be matched against the devices enumerated on the PCI bus, so that the correct driver gets loaded.
static const struct pci_device_id nvme_id_table[] = {
{ PCI_VDEVICE(INTEL, 0x0953),
.driver_data = NVME_QUIRK_STRIPE_SIZE, },
{ PCI_VDEVICE(INTEL, 0x5845), /* Qemu emulated controller */
.driver_data = NVME_QUIRK_IDENTIFY_CNS, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);
static struct pci_driver nvme_driver = {
.name = "nvme",
.id_table = nvme_id_table,
.probe = nvme_probe,
.remove = nvme_remove,
.shutdown = nvme_shutdown,
.driver = {
.pm = &nvme_dev_pm_ops,
},
.err_handler = &nvme_err_handler,
};
In Linux we can use the lspci command to inspect the current PCI devices, and we find the NVMe device's id is indeed 0x5845:
# lspci -k
00:00.0 Class 0600: 8086:1237
00:01.0 Class 0601: 8086:7000
00:01.1 Class 0101: 8086:7010 ata_piix
00:01.3 Class 0680: 8086:7113
00:02.0 Class 0300: 1234:1111 bochs-drm
00:03.0 Class 0200: 1af4:1000 virtio-pci
00:04.0 Class 0108: 8086:5845 nvme
00:05.0 Class 0100: 1af4:1001 virtio-pci
The other important thing pci_register_driver does is install the probe function. Once a device matches the driver, the driver's probe function is invoked to actually load the driver. So after nvme_init returns, this driver does nothing at all until the PCI bus enumerates the NVMe device, at which point our nvme_probe is called.
static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
int node, result = -ENOMEM;
struct nvme_dev *dev;
node = dev_to_node(&pdev->dev);
if (node == NUMA_NO_NODE)
set_dev_node(&pdev->dev, 0);
dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
if (!dev)
return -ENOMEM;
dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
GFP_KERNEL, node);
if (!dev->entry)
goto free;
dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
GFP_KERNEL, node);
if (!dev->queues)
goto free;
dev->dev = get_device(&pdev->dev);
pci_set_drvdata(pdev, dev);
result = nvme_dev_map(dev);
if (result)
goto free;
INIT_LIST_HEAD(&dev->node);
INIT_WORK(&dev->scan_work, nvme_dev_scan);
INIT_WORK(&dev->reset_work, nvme_reset_work);
INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
mutex_init(&dev->shutdown_lock);
init_completion(&dev->ioq_wait);
result = nvme_setup_prp_pools(dev);
if (result)
goto put_pci;
result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops,
id->driver_data);
if (result)
goto release_pools;
queue_work(nvme_workq, &dev->reset_work);
return 0;
release_pools:
nvme_release_prp_pools(dev);
put_pci:
put_device(dev->dev);
nvme_dev_unmap(dev);
free:
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
return result;
}
nvme_probe flow:
- Allocate dev, dev->entry and dev->queues. entry holds MSI-related information, one per CPU core. queues holds one I/O queue per core, plus one admin queue shared by all cores. Strictly speaking, a "queue" here means a pair of submission queue and completion queue.
- Call nvme_dev_map.
- Initialize the three work items and hook up their callback functions.
- Call nvme_setup_prp_pools.
- Call nvme_init_ctrl.
- Schedule dev->reset_work on the workqueue, i.e. schedule the nvme_reset_work function.
static int nvme_dev_map(struct nvme_dev *dev)
{
int bars;
struct pci_dev *pdev = to_pci_dev(dev->dev);
bars = pci_select_bars(pdev, IORESOURCE_MEM);
if (!bars)
return -ENODEV;
if (pci_request_selected_regions(pdev, bars, "nvme"))
return -ENODEV;
dev->bar = ioremap(pci_resource_start(pdev, 0), 8192);
if (!dev->bar)
goto release;
return 0;
release:
pci_release_regions(pdev);
return -ENODEV;
}
nvme_dev_map flow:
- Call pci_select_bars. Its return value is a mask with one bit per BAR (base address register); a set bit means the corresponding BAR is non-zero. This comes from the PCI spec, which defines six 32-bit BAR registers in a PCI device's configuration space, each describing a region of memory (MMIO or I/O) on the device. We can add some debug code here to understand things better; after modifying the code, rebuilding the kernel in the buildroot directory is quick, and we can then run it to see the result. I checked the return value of pci_select_bars: it is 0x11, meaning bar0 and bar4 are non-zero.
- Call pci_request_selected_regions. One of its arguments is the mask returned earlier by pci_select_bars; its job is to claim those BARs so that nobody else can use them.
Without calling pci_request_selected_regions, /proc/iomem looks like this:
# cat /proc/iomem
00000000-febfffff : PCI Bus 0000:00
  fd000000-fdffffff : 0000:00:02.0
    fd000000-fdffffff : bochs-drm
  feb80000-febbffff : 0000:00:02.0
  febc0000-febcffff : 0000:00:03.0
  febd0000-febd1fff : 0000:00:04.0
  febd2000-febd2fff : 0000:00:02.0
    febd2000-febd2fff : bochs-drm
  febd3000-febd3fff : 0000:00:03.0
  febd4000-febd4fff : 0000:00:04.0
  febd5000-febd5fff : 0000:00:05.0
With pci_request_selected_regions called, /proc/iomem gains two nvme entries: bar0 corresponds to physical address 0xfebd0000, and bar4 to 0xfebd4000.
# cat /proc/iomem
00000000-febfffff : PCI Bus 0000:00
  fd000000-fdffffff : 0000:00:02.0
    fd000000-fdffffff : bochs-drm
  feb80000-febbffff : 0000:00:02.0
  febc0000-febcffff : 0000:00:03.0
  febd0000-febd1fff : 0000:00:04.0
    febd0000-febd1fff : nvme
  febd2000-febd2fff : 0000:00:02.0
    febd2000-febd2fff : bochs-drm
  febd3000-febd3fff : 0000:00:03.0
  febd4000-febd4fff : 0000:00:04.0
    febd4000-febd4fff : nvme
  febd5000-febd5fff : 0000:00:05.0
- Call ioremap. As mentioned, bar0 sits at physical address 0xfebd0000. In Linux we cannot access a physical address directly; it must be mapped to a virtual address, which is exactly what ioremap does. Once mapped, accesses through dev->bar operate directly on the NVMe device's registers. Note, however, that the code does not pick which BAR to map based on the pci_select_bars return value; it hard-codes bar0, because the NVMe spec mandates that bar0 is the memory-mapped register base. bar4 is vendor-specific, and I am not yet sure what it is for.
static int nvme_setup_prp_pools(struct nvme_dev *dev)
{
dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
PAGE_SIZE, PAGE_SIZE, 0);
if (!dev->prp_page_pool)
return -ENOMEM;
/* Optimisation for I/Os between 4k and 128k */
dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
256, 256, 0);
if (!dev->prp_small_pool) {
dma_pool_destroy(dev->prp_page_pool);
return -ENOMEM;
}
return 0;
}
Back in nvme_probe, nvme_setup_prp_pools mainly creates DMA pools, from which other DMA functions can later obtain memory. One pool hands out 256-byte chunks and the other 4K chunks, an optimization for PRP lists of different lengths.
int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
const struct nvme_ctrl_ops *ops, unsigned long quirks)
{
int ret;
INIT_LIST_HEAD(&ctrl->namespaces);
mutex_init(&ctrl->namespaces_mutex);
kref_init(&ctrl->kref);
ctrl->dev = dev;
ctrl->ops = ops;
ctrl->quirks = quirks;
ret = nvme_set_instance(ctrl);
if (ret)
goto out;
ctrl->device = device_create_with_groups(nvme_class, ctrl->dev,
MKDEV(nvme_char_major, ctrl->instance),
dev, nvme_dev_attr_groups,
"nvme%d", ctrl->instance);
if (IS_ERR(ctrl->device)) {
ret = PTR_ERR(ctrl->device);
goto out_release_instance;
}
get_device(ctrl->device);
dev_set_drvdata(ctrl->device, ctrl);
ida_init(&ctrl->ns_ida);
spin_lock(&dev_list_lock);
list_add_tail(&ctrl->node, &nvme_ctrl_list);
spin_unlock(&dev_list_lock);
return 0;
out_release_instance:
nvme_release_instance(ctrl);
out:
return ret;
}
Back in nvme_probe: the main thing nvme_init_ctrl does is create, via device_create_with_groups, the character device named nvme0 that we saw earlier.
With this character device in place, we can operate on it through interfaces such as open and ioctl:
static const struct file_operations nvme_dev_fops = {
.owner = THIS_MODULE,
.open = nvme_dev_open,
.release = nvme_dev_release,
.unlocked_ioctl = nvme_dev_ioctl,
.compat_ioctl = nvme_dev_ioctl,
};
The 0 in nvme0 comes from nvme_set_instance, which mainly uses ida_get_new to obtain a unique index value.
static int nvme_set_instance(struct nvme_ctrl *ctrl)
{
int instance, error;
do {
if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
return -ENODEV;
spin_lock(&dev_list_lock);
error = ida_get_new(&nvme_instance_ida, &instance);
spin_unlock(&dev_list_lock);
} while (error == -EAGAIN);
if (error)
return -ENODEV;
ctrl->instance = instance;
return 0;
}
Back in nvme_probe once more: dev->reset_work gets scheduled, which means nvme_reset_work is invoked.
nvme_reset_work — a very long work
static void nvme_reset_work(struct work_struct *work)
{
struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
int result = -ENODEV;
if (WARN_ON(test_bit(NVME_CTRL_RESETTING, &dev->flags)))
goto out;
/*
* If we're called to reset a live controller first shut it down before
* moving on.
*/
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
set_bit(NVME_CTRL_RESETTING, &dev->flags);
result = nvme_pci_enable(dev);
if (result)
goto out;
result = nvme_configure_admin_queue(dev);
if (result)
goto out;
nvme_init_queue(dev->queues[0], 0);
result = nvme_alloc_admin_tags(dev);
if (result)
goto out;
result = nvme_init_identify(&dev->ctrl);
if (result)
goto out;
result = nvme_setup_io_queues(dev);
if (result)
goto out;
dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
result = nvme_dev_list_add(dev);
if (result)
goto out;
/*
* Keep the controller around but remove all namespaces if we don't have
* any working I/O queue.
*/
if (dev->online_queues < 2) {
dev_warn(dev->dev, "IO queues not created\n");
nvme_remove_namespaces(&dev->ctrl);
} else {
nvme_start_queues(&dev->ctrl);
nvme_dev_add(dev);
}
clear_bit(NVME_CTRL_RESETTING, &dev->flags);
return;
out:
nvme_remove_dead_ctrl(dev, result);
}
nvme_reset_work flow:
- First use the NVME_CTRL_RESETTING flag to make sure nvme_reset_work is not re-entered.
- Call nvme_pci_enable.
- Call nvme_configure_admin_queue.
- Call nvme_init_queue.
- Call nvme_alloc_admin_tags.
- Call nvme_init_identify.
- Call nvme_setup_io_queues.
- Call nvme_dev_list_add.
static int nvme_pci_enable(struct nvme_dev *dev)
{
u64 cap;
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);
if (pci_enable_device_mem(pdev))
return result;
dev->entry[0].vector = pdev->irq;
pci_set_master(pdev);
if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)) &&
dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(32)))
goto disable;
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
result = -ENODEV;
goto disable;
}
/*
* Some devices don't advertise INTx interrupts, pre-enable a single
* MSIX vec for setup. We'll adjust this later.
*/
if (!pdev->irq) {
result = pci_enable_msix(pdev, dev->entry, 1);
if (result < 0)
goto disable;
}
cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
dev->q_depth = min_t(int, NVME_CAP_MQES(cap) + 1, NVME_Q_DEPTH);
dev->db_stride = 1 << NVME_CAP_STRIDE(cap);
dev->dbs = dev->bar + 4096;
/*
* Temporary fix for the Apple controller found in the MacBook8, and
* some MacBook7, to avoid controller resets and data loss.
*/
if (pdev->vendor == PCI_VENDOR_ID_APPLE && pdev->device == 0x2001) {
dev->q_depth = 2;
dev_warn(dev->dev, "detected Apple NVMe controller, set "
"queue depth=%u to work around controller resets\n",
dev->q_depth);
}
if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 2))
dev->cmb = nvme_map_cmb(dev);
pci_enable_pcie_error_reporting(pdev);
pci_save_state(pdev);
return 0;
disable:
pci_disable_device(pdev);
return result;
}
nvme_pci_enable flow:
- Call pci_enable_device_mem to enable the NVMe device's memory space, i.e. the bar0 region mapped earlier.
- After that, accesses such as readl(dev->bar + NVME_REG_CSTS) operate directly on the controller registers, i.e. the register table defined in the NVMe spec.
- PCI has two interrupt modes: INTx and MSI. If INTx is not supported, MSI is enabled. Here INTx is used, with irq number 11. Nonetheless dev->entry[0].vector, the MSI vector slot for the admin queue, is still assigned, suggesting that MSI may be used later.
# cat /proc/interrupts
           CPU0
  0:   IO-APIC-edge      timer
  1:   IO-APIC-edge      i8042
  4:   IO-APIC-edge      serial
  9:   IO-APIC-fasteoi   acpi
 10:   IO-APIC-fasteoi   virtio1
 11:   IO-APIC-fasteoi   virtio0, nvme0q0, nvme0q1
 12:   IO-APIC-edge      i8042
 14:   IO-APIC-edge      ata_piix
 15:   IO-APIC-edge      ata_piix
- Read some configuration parameters from the CAP register and set dev->dbs to dev->bar + 4096. The 4096 comes from the doorbell registers starting at offset 0x1000 in the register map.
- If the NVMe protocol version is 1.2 or above, call nvme_map_cmb to map the controller memory buffer. qemu 2.5 implements NVMe 1.1, so this is not supported here; but since CMB is a new 1.2 feature I will analyze it anyway. CMB's main purpose is to move SQ/CQ storage from host memory to device memory to improve performance and latency. What the function does is not very different from the earlier PCI mapping work, with one point worth noting: the mapping here uses ioremap_wc instead of ioremap.
static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
{
u64 szu, size, offset;
u32 cmbloc;
resource_size_t bar_size;
struct pci_dev *pdev = to_pci_dev(dev->dev);
void __iomem *cmb;
dma_addr_t dma_addr;
if (!use_cmb_sqes)
return NULL;
dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
if (!(NVME_CMB_SZ(dev->cmbsz)))
return NULL;
cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
szu = (u64)1 << (12 + 4 * NVME_CMB_SZU(dev->cmbsz));
size = szu * NVME_CMB_SZ(dev->cmbsz);
offset = szu * NVME_CMB_OFST(cmbloc);
bar_size = pci_resource_len(pdev, NVME_CMB_BIR(cmbloc));
if (offset > bar_size)
return NULL;
/*
* Controllers may support a CMB size larger than their BAR,
* for example, due to being behind a bridge. Reduce the CMB to
* the reported size of the BAR
*/
if (size > bar_size - offset)
size = bar_size - offset;
dma_addr = pci_resource_start(pdev, NVME_CMB_BIR(cmbloc)) + offset;
cmb = ioremap_wc(dma_addr, size);
if (!cmb)
return NULL;
dev->cmb_dma_addr = dma_addr;
dev->cmb_size = size;
return cmb;
}
First a look at ioremap: it is simply equivalent to ioremap_nocache. Remapped regions are very often device registers, and caching them would cause unpredictable behavior, so by default the cache is bypassed. ioremap_wc, on the other hand, uses a caching mechanism called write combining. As far as I know this is an x86 feature with no ARM counterpart. To be clear, memory marked WC has nothing to do with the L1/L2/L3 caches; x86 has a separate set of WC buffers (typically 64 bytes each, with a platform-dependent number of buffers). Reads from WC memory go straight to memory, bypassing the cache. Writes are where the difference lies: new data is not written to memory immediately but is staged in a WC buffer, and only when the buffer fills up or certain instructions execute is it flushed to memory in one burst. It is still a caching mechanism of sorts; the reason it can be used here is that the order in which SQ/CQ entries are written does not matter: nobody may read them until the doorbell register is written or the device raises an interrupt, so this optimization is safe.
/*
* The default ioremap() behavior is non-cached:
*/
static inline void __iomem *ioremap(resource_size_t offset, unsigned long size)
{
return ioremap_nocache(offset, size);
}
Back to nvme_reset_work, next up is nvme_configure_admin_queue.
static int nvme_configure_admin_queue(struct nvme_dev *dev)
{
int result;
u32 aqa;
u64 cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
struct nvme_queue *nvmeq;
dev->subsystem = readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 1) ?
NVME_CAP_NSSRC(cap) : 0;
if (dev->subsystem &&
(readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_NSSRO))
writel(NVME_CSTS_NSSRO, dev->bar + NVME_REG_CSTS);
result = nvme_disable_ctrl(&dev->ctrl, cap);
if (result < 0)
return result;
nvmeq = dev->queues[0];
if (!nvmeq) {
nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH);
if (!nvmeq)
return -ENOMEM;
}
aqa = nvmeq->q_depth - 1;
aqa |= aqa << 16;
writel(aqa, dev->bar + NVME_REG_AQA);
lo_hi_writeq(nvmeq->sq_dma_addr, dev->bar + NVME_REG_ASQ);
lo_hi_writeq(nvmeq->cq_dma_addr, dev->bar + NVME_REG_ACQ);
result = nvme_enable_ctrl(&dev->ctrl, cap);
if (result)
goto free_nvmeq;
nvmeq->cq_vector = 0;
result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
if (result) {
nvmeq->cq_vector = -1;
goto free_nvmeq;
}
return result;
free_nvmeq:
nvme_free_queues(dev, 0);
return result;
}
nvme_configure_admin_queue flow:
- Learn from the CAP register whether Subsystem Reset is supported.
- Call nvme_disable_ctrl.
- Call nvme_alloc_queue.
- Call nvme_enable_ctrl.
- Call queue_request_irq.
int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
{
int ret;
ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
ctrl->ctrl_config &= ~NVME_CC_ENABLE;
ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
if (ret)
return ret;
return nvme_wait_ready(ctrl, cap, false);
}
ctrl->ops here is the nvme_pci_ctrl_ops passed in earlier by nvme_init_ctrl; reg_write32 disables the device through the NVME_REG_CC register.
static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
{
*val = readl(to_nvme_dev(ctrl)->bar + off);
return 0;
}
static int nvme_pci_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val)
{
writel(val, to_nvme_dev(ctrl)->bar + off);
return 0;
}
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.io_incapable = nvme_pci_io_incapable,
.reset_ctrl = nvme_pci_reset_ctrl,
.free_ctrl = nvme_pci_free_ctrl,
};
Then the driver reads the status register NVME_REG_CSTS to wait until the device has really stopped. The timeout ceiling is computed from the Timeout field of the CAP register, each unit representing 500 ms.
static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
{
unsigned long timeout =
((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
int ret;
while ((ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts)) == 0) {
if ((csts & NVME_CSTS_RDY) == bit)
break;
msleep(100);
if (fatal_signal_pending(current))
return -EINTR;
if (time_after(jiffies, timeout)) {
dev_err(ctrl->dev,
"Device not ready; aborting %s\n", enabled ?
"initialisation" : "reset");
return -ENODEV;
}
}
return ret;
}
Back in nvme_configure_admin_queue, now nvme_alloc_queue.
static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
int depth)
{
struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
if (!nvmeq)
return NULL;
nvmeq->cqes = dma_zalloc_coherent(dev->dev, CQ_SIZE(depth),
&nvmeq->cq_dma_addr, GFP_KERNEL);
if (!nvmeq->cqes)
goto free_nvmeq;
if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth))
goto free_cqdma;
nvmeq->q_dmadev = dev->dev;
nvmeq->dev = dev;
snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
dev->ctrl.instance, qid);
spin_lock_init(&nvmeq->q_lock);
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
nvmeq->q_depth = depth;
nvmeq->qid = qid;
nvmeq->cq_vector = -1;
dev->queues[qid] = nvmeq;
/* make sure queue descriptor is set before queue count, for kthread */
mb();
dev->queue_count++;
return nvmeq;
free_cqdma:
dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes,
nvmeq->cq_dma_addr);
free_nvmeq:
kfree(nvmeq);
return NULL;
}
nvme_alloc_queue flow:
- Call dma_zalloc_coherent to allocate DMA-capable memory for the completion queue. nvmeq->cqes is the virtual address of the allocation, for kernel use; nvmeq->cq_dma_addr is its physical address, for the DMA controller.
- nvmeq->irqname is the name later used when registering the interrupt. From nvme%dq%d we can see these become the nvme0q0 and nvme0q1 seen earlier: one for the admin queue, one for the I/O queue.
- Call nvme_alloc_sq_cmds to handle the submission queue: if the NVMe version is 1.2 or above and the CMB supports submission queues, use the CMB; otherwise allocate memory with dma_alloc_coherent just like the completion queue.
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
if (qid && dev->cmb && use_cmb_sqes && NVME_CMB_SQS(dev->cmbsz)) {
unsigned offset = (qid - 1) * roundup(SQ_SIZE(depth),
dev->ctrl.page_size);
nvmeq->sq_dma_addr = dev->cmb_dma_addr + offset;
nvmeq->sq_cmds_io = dev->cmb + offset;
} else {
nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
&nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
}
return 0;
}
Back again in nvme_configure_admin_queue: nvme_enable_ctrl. Nothing special here; it can be understood simply as the inverse of the nvme_disable_ctrl analyzed above.
int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
{
/*
* Default to a 4K page size, with the intention to update this
* path in the future to accommodate architectures with differing
* kernel and IO page sizes.
*/
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12, page_shift = 12;
int ret;
if (page_shift < dev_page_min) {
dev_err(ctrl->dev,
"Minimum device page size %u too large for host (%u)\n",
1 << dev_page_min, 1 << page_shift);
return -ENODEV;
}
ctrl->page_size = 1 << page_shift;
ctrl->ctrl_config = NVME_CC_CSS_NVM;
ctrl->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
ctrl->ctrl_config |= NVME_CC_ARB_RR | NVME_CC_SHN_NONE;
ctrl->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
ctrl->ctrl_config |= NVME_CC_ENABLE;
ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
if (ret)
return ret;
return nvme_wait_ready(ctrl, cap, true);
}
Back to nvme_configure_admin_queue for the last function, queue_request_irq. Its main job is to install the interrupt handler; by default it does not use threaded interrupt handling, but handles interrupts in interrupt context.
static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
const char *name)
{
if (use_threaded_interrupts)
return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
nvme_irq_check, nvme_irq, IRQF_SHARED,
name, nvmeq);
return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
IRQF_SHARED, name, nvmeq);
}
Returning all the way to nvme_reset_work, next is nvme_init_queue. From the arguments passed to nvme_init_queue we can tell that it is queue 0, the admin queue, that is being initialized here.
static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
{
struct nvme_dev *dev = nvmeq->dev;
spin_lock_irq(&nvmeq->q_lock);
nvmeq->sq_tail = 0;
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
dev->online_queues++;
spin_unlock_irq(&nvmeq->q_lock);
}
This function does not do much: it initializes some members of nvme_queue. q_db points at the doorbell register corresponding to this queue.
Back to nvme_reset_work, next nvme_alloc_admin_tags.
static int nvme_alloc_admin_tags(struct nvme_dev *dev)
{
if (!dev->ctrl.admin_q) {
dev->admin_tagset.ops = &nvme_mq_admin_ops;
dev->admin_tagset.nr_hw_queues = 1;
/*
* Subtract one to leave an empty queue entry for 'Full Queue'
* condition. See NVM-Express 1.2 specification, section 4.1.2.
*/
dev->admin_tagset.queue_depth = NVME_AQ_BLKMQ_DEPTH - 1;
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
dev->admin_tagset.driver_data = dev;
if (blk_mq_alloc_tag_set(&dev->admin_tagset))
return -ENOMEM;
dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
if (IS_ERR(dev->ctrl.admin_q)) {
blk_mq_free_tag_set(&dev->admin_tagset);
return -ENOMEM;
}
if (!blk_get_queue(dev->ctrl.admin_q)) {
nvme_dev_remove_admin(dev);
dev->ctrl.admin_q = NULL;
return -ENODEV;
}
} else
blk_mq_start_stopped_hw_queues(dev->ctrl.admin_q, true);
return 0;
}
nvme_alloc_admin_tags flow:
- blk_mq_alloc_tag_set allocates the tag set, associates it with the request_queue, and initializes queue_depth (254) requests (index 0-253). The init function is the nvme_admin_init_request passed in via nvme_mq_admin_ops.
static struct blk_mq_ops nvme_mq_admin_ops = {
.queue_rq = nvme_queue_rq,
.complete = nvme_complete_rq,
.map_queue = blk_mq_map_queue,
.init_hctx = nvme_admin_init_hctx,
.exit_hctx = nvme_admin_exit_hctx,
.init_request = nvme_admin_init_request,
.timeout = nvme_timeout,
};
static int nvme_admin_init_request(void *data, struct request *req,
unsigned int hctx_idx, unsigned int rq_idx,
unsigned int numa_node)
{
struct nvme_dev *dev = data;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct nvme_queue *nvmeq = dev->queues[0];
BUG_ON(!nvmeq);
iod->nvmeq = nvmeq;
return 0;
}
- blk_mq_init_queue initializes the request_queue assigned to dev->ctrl.admin_q. It calls nvme_admin_init_hctx, and also calls nvme_admin_init_request to initialize the request at index 254, which is a bit odd.
static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
unsigned int hctx_idx)
{
struct nvme_dev *dev = data;
struct nvme_queue *nvmeq = dev->queues[0];
WARN_ON(hctx_idx != 0);
WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
WARN_ON(nvmeq->tags);
hctx->driver_data = nvmeq;
nvmeq->tags = &dev->admin_tagset.tags[0];
return 0;
}
Back to nvme_reset_work: nvme_init_identify.
int nvme_init_identify(struct nvme_ctrl *ctrl)
{
struct nvme_id_ctrl *id;
u64 cap;
int ret, page_shift;
ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
if (ret) {
dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
return ret;
}
ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
if (ret) {
dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
return ret;
}
page_shift = NVME_CAP_MPSMIN(cap) + 12;
if (ctrl->vs >= NVME_VS(1, 1))
ctrl->subsystem = NVME_CAP_NSSRC(cap);
ret = nvme_identify_ctrl(ctrl, &id);
if (ret) {
dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
return -EIO;
}
ctrl->oncs = le16_to_cpup(&id->oncs);
atomic_set(&ctrl->abort_limit, id->acl + 1);
ctrl->vwc = id->vwc;
memcpy(ctrl->serial, id->sn, sizeof(id->sn));
memcpy(ctrl->model, id->mn, sizeof(id->mn));
memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
if (id->mdts)
ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
else
ctrl->max_hw_sectors = UINT_MAX;
if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
unsigned int max_hw_sectors;
ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
if (ctrl->max_hw_sectors) {
ctrl->max_hw_sectors = min(max_hw_sectors,
ctrl->max_hw_sectors);
} else {
ctrl->max_hw_sectors = max_hw_sectors;
}
}
nvme_set_queue_limits(ctrl, ctrl->admin_q);
kfree(id);
return 0;
}
nvme_init_identify flow:
- Call nvme_identify_ctrl.
- Call nvme_set_queue_limits.
First, nvme_identify_ctrl:
int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
{
struct nvme_command c = { };
int error;
/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
c.identify.opcode = nvme_admin_identify;
c.identify.cns = cpu_to_le32(1);
*id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
if (!*id)
return -ENOMEM;
error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
sizeof(struct nvme_id_ctrl));
if (error)
kfree(*id);
return error;
}
nvme_identify_ctrl flow:
- Build a command with opcode nvme_admin_identify (0x6). Functions like cpu_to_le32 appear here because the message formats defined in the NVMe spec are little-endian, while the host may be a little-endian x86, a big-endian ARM, or something else; these helpers convert between the host representation and little-endian, keeping the code portable.
- Send the command to the NVMe device via nvme_submit_sync_cmd.
int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
{
struct request *req;
int ret;
req = nvme_alloc_request(q, cmd, 0);
if (IS_ERR(req))
return PTR_ERR(req);
req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
if (buffer && bufflen) {
ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
if (ret)
goto out;
}
blk_execute_rq(req->q, NULL, req, 0);
if (result)
*result = (u32)(uintptr_t)req->special;
ret = req->errors;
out:
blk_mq_free_request(req);
return ret;
}
__nvme_submit_sync_cmd flow:
- Call nvme_alloc_request.
- Call blk_rq_map_kern.
- Call blk_execute_rq, which ends up invoking the function behind the queue_rq pointer, nvme_queue_rq.
First nvme_alloc_request: blk_mq_alloc_request takes a request from the request_queue, and then some of the request's attributes are initialized. These attributes matter a lot; many of them directly affect the control flow later.
struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, unsigned int flags)
{
bool write = cmd->common.opcode & 1;
struct request *req;
req = blk_mq_alloc_request(q, write, flags);
if (IS_ERR(req))
return req;
req->cmd_type = REQ_TYPE_DRV_PRIV;
req->cmd_flags |= REQ_FAILFAST_DRIVER;
req->__data_len = 0;
req->__sector = (sector_t) -1;
req->bio = req->biotail = NULL;
req->cmd = (unsigned char *)cmd;
req->cmd_len = sizeof(struct nvme_command);
req->special = (void *)0;
return req;
}
If both the buffer and bufflen arguments passed to __nvme_submit_sync_cmd are non-empty, blk_rq_map_kern is executed. In nvme_alloc_request earlier, req->__data_len was set to 0; but once blk_rq_map_kern has run, req->__data_len becomes non-zero: the size of the mapped region. In my tests it is in units of pages (4096 bytes).
Now nvme_queue_rq. This function is crucial: it performs the final submission of the command.
static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
struct nvme_ns *ns = hctx->queue->queuedata;
struct nvme_queue *nvmeq = hctx->driver_data;
struct nvme_dev *dev = nvmeq->dev;
struct request *req = bd->rq;
struct nvme_command cmnd;
int ret = BLK_MQ_RQ_QUEUE_OK;
/*
* If formatted with metadata, require the block layer provide a buffer
* unless this namespace is formatted such that the metadata can be
* stripped/generated by the controller with PRACT=1.
*/
if (ns && ns->ms && !blk_integrity_rq(req)) {
if (!(ns->pi_type && ns->ms == 8) &&
req->cmd_type != REQ_TYPE_DRV_PRIV) {
blk_mq_end_request(req, -EFAULT);
return BLK_MQ_RQ_QUEUE_OK;
}
}
ret = nvme_init_iod(req, dev);
if (ret)
return ret;
if (req->cmd_flags & REQ_DISCARD) {
ret = nvme_setup_discard(nvmeq, ns, req, &cmnd);
} else {
if (req->cmd_type == REQ_TYPE_DRV_PRIV)
memcpy(&cmnd, req->cmd, sizeof(cmnd));
else if (req->cmd_flags & REQ_FLUSH)
nvme_setup_flush(ns, &cmnd);
else
nvme_setup_rw(ns, req, &cmnd);
if (req->nr_phys_segments)
ret = nvme_map_data(dev, req, &cmnd);
}
if (ret)
goto out;
cmnd.common.command_id = req->tag;
blk_mq_start_request(req);
spin_lock_irq(&nvmeq->q_lock);
if (unlikely(nvmeq->cq_vector < 0)) {
if (ns && !test_bit(NVME_NS_DEAD, &ns->flags))
ret = BLK_MQ_RQ_QUEUE_BUSY;
else
ret = BLK_MQ_RQ_QUEUE_ERROR;
spin_unlock_irq(&nvmeq->q_lock);
goto out;
}
__nvme_submit_cmd(nvmeq, &cmnd);
nvme_process_cq(nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
return BLK_MQ_RQ_QUEUE_OK;
out:
nvme_free_iod(dev, req);
return ret;
}
nvme_queue_rq flow:
- Call nvme_init_iod.
- Call nvme_map_data.
- Call blk_mq_start_request.
- Call __nvme_submit_cmd.
- Call nvme_process_cq.
static int nvme_init_iod(struct request *rq, struct nvme_dev *dev)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(rq);
int nseg = rq->nr_phys_segments;
unsigned size;
if (rq->cmd_flags & REQ_DISCARD)
size = sizeof(struct nvme_dsm_range);
else
size = blk_rq_bytes(rq);
if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
iod->sg = kmalloc(nvme_iod_alloc_size(dev, size, nseg), GFP_ATOMIC);
if (!iod->sg)
return BLK_MQ_RQ_QUEUE_BUSY;
} else {
iod->sg = iod->inline_sg;
}
iod->aborted = 0;
iod->npages = -1;
iod->nents = 0;
iod->length = size;
return 0;
}
A look at nvme_init_iod: the very first call, blk_mq_rq_to_pdu, is rather puzzling.
/*
* Driver command data is immediately after the request. So subtract request
* size to get back to the original request, add request size to get the PDU.
*/
static inline struct request *blk_mq_rq_from_pdu(void *pdu)
{
return pdu - sizeof(struct request);
}
static inline void *blk_mq_rq_to_pdu(struct request *rq)
{
return rq + 1;
}
Reading the comment, together with dev->admin_tagset.cmd_size = nvme_cmd_size(dev) in nvme_alloc_admin_tags, shows that every request is allocated with extra space placed immediately after it, sized by cmd_size. So the driver's private data sits right behind the request, and as the functions below show, that extra space holds more than just the nvme_iod:
| Region 1 | Region 2 | Region 3 |
| --- | --- | --- |
| sizeof(struct nvme_iod) | sizeof(struct scatterlist) * nseg | sizeof(__le64 *) * nvme_npages(size, dev) |
static int nvme_npages(unsigned size, struct nvme_dev *dev)
{
unsigned nprps = DIV_ROUND_UP(size + dev->ctrl.page_size,
dev->ctrl.page_size);
return DIV_ROUND_UP(8 * nprps, PAGE_SIZE - 8);
}
static unsigned int nvme_iod_alloc_size(struct nvme_dev *dev,
unsigned int size, unsigned int nseg)
{
return sizeof(__le64 *) * nvme_npages(size, dev) +
sizeof(struct scatterlist) * nseg;
}
static unsigned int nvme_cmd_size(struct nvme_dev *dev)
{
return sizeof(struct nvme_iod) +
nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
}
Because req->cmd_type was set to REQ_TYPE_DRV_PRIV earlier, the command is copied straight in with memcpy. Then, based on:
/*
* Max size of iod being embedded in the request payload
*/
#define NVME_INT_PAGES 2
#define NVME_INT_BYTES(dev) (NVME_INT_PAGES * (dev)->ctrl.page_size)
If the request's data has more than 2 segments, or a total length greater than 2 nvme pages, a separate allocation is made for iod->sg; otherwise iod->sg simply points at the tail of struct nvme_iod (Region 1), i.e. the start of the scatterlist area (Region 2).
Back in nvme_queue_rq: if req->nr_phys_segments is non-zero, nvme_map_data is called.
static int nvme_map_data(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct request_queue *q = req->q;
enum dma_data_direction dma_dir = rq_data_dir(req) ?
DMA_TO_DEVICE : DMA_FROM_DEVICE;
int ret = BLK_MQ_RQ_QUEUE_ERROR;
sg_init_table(iod->sg, req->nr_phys_segments);
iod->nents = blk_rq_map_sg(q, req, iod->sg);
if (!iod->nents)
goto out;
ret = BLK_MQ_RQ_QUEUE_BUSY;
if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir))
goto out;
if (!nvme_setup_prps(dev, req, blk_rq_bytes(req)))
goto out_unmap;
ret = BLK_MQ_RQ_QUEUE_ERROR;
if (blk_integrity_rq(req)) {
if (blk_rq_count_integrity_sg(q, req->bio) != 1)
goto out_unmap;
sg_init_table(&iod->meta_sg, 1);
if (blk_rq_map_integrity_sg(q, req->bio, &iod->meta_sg) != 1)
goto out_unmap;
if (rq_data_dir(req))
nvme_dif_remap(req, nvme_dif_prep);
if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir))
goto out_unmap;
}
/* Write the prepared prp entries into the two prp fields of the command; see nvme_setup_prps for details */
cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
if (blk_integrity_rq(req))
cmnd->rw.metadata = cpu_to_le64(sg_dma_address(&iod->meta_sg));
return BLK_MQ_RQ_QUEUE_OK;
out_unmap:
dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
out:
return ret;
}
nvme_map_data flow:
- call sg_init_table, which initializes the scatterlist stored in Region 2.
- call blk_rq_map_sg
- call dma_map_sg
- call nvme_setup_prps, a long function without a single comment. It actually splits into three parts, matching the three ways PRPs can be set up: a single prp entry describes the whole transfer; two prp entries are needed; or a prp list is needed. I have added explanatory comments in the code below.
static bool nvme_setup_prps(struct nvme_dev *dev, struct request *req,
int total_len)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = total_len; /* total number of bytes the prps must cover */
struct scatterlist *sg = iod->sg;
int dma_len = sg_dma_len(sg); /* length of the data described by the first sg */
u64 dma_addr = sg_dma_address(sg); /* physical address of the first sg's data */
u32 page_size = dev->ctrl.page_size; /* nvme page size */
int offset = dma_addr & (page_size - 1); /* offset of that address within the nvme page */
__le64 *prp_list;
__le64 **list = iod_list(req);
dma_addr_t prp_dma;
int nprps, i;
/* Case 1: a single prp entry describes the whole transfer */
length -= (page_size - offset);
if (length <= 0)
return true;
dma_len -= (page_size - offset);
if (dma_len) {
dma_addr += (page_size - offset);
} else {
/* dma_len is 0: the first sg's data sits entirely in one nvme page and
is covered by a single prp, so move on to the next sg. */
sg = sg_next(sg);
dma_addr = sg_dma_address(sg);
dma_len = sg_dma_len(sg);
}
/* Case 2: two prp entries describe the whole transfer */
if (length <= page_size) {
iod->first_dma = dma_addr;
return true;
}
/* Case 3: from here on a prp list is needed. Each prp entry is 8 bytes, so the number of entries decides which dma pool is the better fit */
nprps = DIV_ROUND_UP(length, page_size);
if (nprps <= (256 / 8)) {
pool = dev->prp_small_pool;
iod->npages = 0;
} else {
pool = dev->prp_page_pool;
iod->npages = 1;
}
prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
if (!prp_list) {
iod->first_dma = dma_addr;
iod->npages = -1;
return false;
}
list[0] = prp_list;
iod->first_dma = prp_dma;
i = 0;
for (;;) {
/* this prp page is full; get another one from the pool to hold the remaining prp entries */
if (i == page_size >> 3) {
__le64 *old_prp_list = prp_list;
prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
if (!prp_list)
return false;
list[iod->npages++] = prp_list;
prp_list[0] = old_prp_list[i - 1];
old_prp_list[i - 1] = cpu_to_le64(prp_dma);
i = 1;
}
/* store the entries of the prp list */
prp_list[i++] = cpu_to_le64(dma_addr);
dma_len -= page_size;
dma_addr += page_size;
length -= page_size;
if (length <= 0)
break;
if (dma_len > 0)
continue;
BUG_ON(dma_len < 0);
sg = sg_next(sg);
dma_addr = sg_dma_address(sg);
dma_len = sg_dma_len(sg);
}
return true;
}
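The three cases reduce to simple length/offset arithmetic. Below is a userspace sketch of the same classification nvme_setup_prps performs; the helper names are hypothetical, not kernel code, and the list-sizing helper only covers the list case.

```c
#include <assert.h>

/* Classify a transfer the way nvme_setup_prps does. page_size is the nvme
 * page size; offset is where the first byte sits within an nvme page. */
enum prp_layout { PRP_ONE, PRP_TWO, PRP_LIST };

static enum prp_layout classify_prps(int total_len, int offset, int page_size)
{
	/* bytes left after the first prp entry (which covers up to page end) */
	int length = total_len - (page_size - offset);

	if (length <= 0)
		return PRP_ONE;   /* PRP1 alone covers the transfer */
	if (length <= page_size)
		return PRP_TWO;   /* PRP2 holds a second data pointer */
	return PRP_LIST;          /* PRP2 points to a prp list instead */
}

/* Number of list entries needed once we are in the PRP_LIST case
 * (DIV_ROUND_UP of the remaining length, as in the driver). */
static int prp_list_entries(int remaining_len, int page_size)
{
	return (remaining_len + page_size - 1) / page_size;
}
```

With a 4 KiB nvme page, a page-aligned 8 KiB transfer needs exactly two entries, while anything longer spills into a list, whose entry count the driver then compares against 256/8 to pick the small or the full-page dma pool.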
Back in nvme_queue_rq, __nvme_submit_cmd is called. It is simple but important: it copies the command into the submission queue, then writes the index just past the last command to the doorbell register to tell the device to start processing.
static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
struct nvme_command *cmd)
{
u16 tail = nvmeq->sq_tail;
if (nvmeq->sq_cmds_io)
memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
else
memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
if (++tail == nvmeq->q_depth)
tail = 0;
writel(tail, nvmeq->q_db);
nvmeq->sq_tail = tail;
}
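The copy, advance and wrap of that ring can be modelled in a few lines of userspace C; `fake_sq`, `fake_cmd` and the depth of 4 are stand-ins chosen just for the sketch.

```c
#include <assert.h>
#include <string.h>

struct fake_cmd { unsigned char bytes[64]; };

struct fake_sq {
	struct fake_cmd cmds[4];  /* q_depth == 4 for the sketch */
	unsigned q_depth;
	unsigned sq_tail;
};

/* Copy the command into the slot at sq_tail, advance the tail with
 * wrap-around, and return the value that would go to the doorbell. */
static unsigned sq_submit(struct fake_sq *q, const struct fake_cmd *cmd)
{
	unsigned tail = q->sq_tail;

	memcpy(&q->cmds[tail], cmd, sizeof(*cmd));
	if (++tail == q->q_depth)
		tail = 0;             /* wrap, as __nvme_submit_cmd does */
	q->sq_tail = tail;
	return tail;                  /* doorbell value */
}
```

The returned value is exactly what the driver writes to the SQ tail doorbell: the index one past the last submitted command, modulo q_depth.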
Now look at the last function on the nvme_queue_rq path, __nvme_process_cq; as the name says, it handles the completion queue.
static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
{
u16 head, phase;
head = nvmeq->cq_head;
phase = nvmeq->cq_phase;
for (;;) {
struct nvme_completion cqe = nvmeq->cqes[head];
u16 status = le16_to_cpu(cqe.status);
struct request *req;
if ((status & 1) != phase)
break;
nvmeq->sq_head = le16_to_cpu(cqe.sq_head);
if (++head == nvmeq->q_depth) {
head = 0;
phase = !phase;
}
if (tag && *tag == cqe.command_id)
*tag = -1;
if (unlikely(cqe.command_id >= nvmeq->q_depth)) {
dev_warn(nvmeq->q_dmadev,
"invalid id %d completed on queue %d\n",
cqe.command_id, le16_to_cpu(cqe.sq_id));
continue;
}
/*
* AEN requests are special as they don't time out and can
* survive any kind of queue freeze and often don't respond to
* aborts. We don't even bother to allocate a struct request
* for them but rather special case them here.
*/
if (unlikely(nvmeq->qid == 0 &&
cqe.command_id >= NVME_AQ_BLKMQ_DEPTH)) {
nvme_complete_async_event(nvmeq->dev, &cqe);
continue;
}
req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
u32 result = le32_to_cpu(cqe.result);
req->special = (void *)(uintptr_t)result;
}
blk_mq_complete_request(req, status >> 1);
}
/* If the controller ignores the cq head doorbell and continuously
* writes to the queue, it is theoretically possible to wrap around
* the queue twice and mistakenly return IRQ_NONE. Linux only
* requires that 0.1% of your interrupts are handled, so this isn't
* a big problem.
*/
if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
return;
if (likely(nvmeq->cq_vector >= 0))
writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
nvmeq->cq_head = head;
nvmeq->cq_phase = phase;
nvmeq->cqe_seen = 1;
}
First look at the completion entry format defined by the nvme spec: each entry is 16 bytes.
Recall that both the admin queue and the io queues consist of one or more submission queues plus a completion queue, and that every queue, submission or completion, is managed through head and tail variables. The host adds new work by advancing a submission queue's tail. The device takes commands off the submission queue for processing and updates head, but how does that head get back to the host? The answer lies in the SQ Head Pointer field of the completion entry. The completion queue works the other way around: the device appends entries and advances tail. And how does that tail reach the host? Here a different mechanism is used: the P bit in the Status Field.
P stands for Phase Tag. When a completion queue is first created, it is zero-initialized. On the device's first pass through the queue, each entry it writes has its Phase Tag set to 1, so by checking the Phase Tag the host can tell how many new entries have been added. When the device reaches the bottom of the queue it wraps back to index 0; on this second pass it writes the Phase Tag as 0. And so it alternates: a pass of 1s, then 0s, then 1s again... Finally, once the host has handled the completions, it updates the completion queue's head and informs the device through the doorbell.
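A minimal userspace model of that handshake, assuming a depth-4 completion queue and a host-side phase starting at 1 as in the driver; `fake_cq` and `cq_drain` are invented stand-ins, not kernel code.

```c
#include <assert.h>

struct fake_cqe { unsigned short status; };  /* bit 0 is the Phase Tag */

struct fake_cq {
	struct fake_cqe cqes[4];  /* zero-initialized, like a fresh CQ */
	unsigned head;
	unsigned phase;           /* starts at 1: first pass expects P == 1 */
	unsigned depth;
};

/* Consume every currently-valid entry: an entry is new only while its
 * phase bit matches the phase the host expects for this pass, exactly
 * the (status & 1) != phase test in __nvme_process_cq. */
static int cq_drain(struct fake_cq *q)
{
	int seen = 0;

	while ((q->cqes[q->head].status & 1) == q->phase) {
		seen++;
		if (++q->head == q->depth) {
			q->head = 0;
			q->phase = !q->phase;  /* next pass expects flipped P */
		}
	}
	return seen;
}
```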
That covers the main mechanism of __nvme_process_cq, but why nvme_queue_rq calls it at the end is a bit puzzling. Even without this explicit call, the device raises an interrupt (INTx/MSI) to the host after processing a submission, and the host's interrupt handler calls __nvme_process_cq anyway.
static irqreturn_t nvme_irq(int irq, void *data)
{
irqreturn_t result;
struct nvme_queue *nvmeq = data;
spin_lock(&nvmeq->q_lock);
nvme_process_cq(nvmeq);
result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
nvmeq->cqe_seen = 0;
spin_unlock(&nvmeq->q_lock);
return result;
}
In __nvme_process_cq, each request that has been handled is completed via blk_mq_complete_request, which triggers the nvme_complete_rq callback we registered earlier.
static void nvme_complete_rq(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct nvme_dev *dev = iod->nvmeq->dev;
int error = 0;
nvme_unmap_data(dev, req);
if (unlikely(req->errors)) {
if (nvme_req_needs_retry(req, req->errors)) {
nvme_requeue_req(req);
return;
}
if (req->cmd_type == REQ_TYPE_DRV_PRIV)
error = req->errors;
else
error = nvme_error_status(req->errors);
}
if (unlikely(iod->aborted)) {
dev_warn(dev->dev,
"completing aborted command with status: %04x\n",
req->errors);
}
blk_mq_end_request(req, error);
}
After that long stretch of analysis, nvme_identify_ctrl is finally covered. One level up, in nvme_init_identify, there is little left to do, so let's go further up to nvme_reset_work and see what nvme_setup_io_queues, called after nvme_init_identify, does.
static int nvme_setup_io_queues(struct nvme_dev *dev)
{
struct nvme_queue *adminq = dev->queues[0];
struct pci_dev *pdev = to_pci_dev(dev->dev);
int result, i, vecs, nr_io_queues, size;
nr_io_queues = num_possible_cpus();
result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
if (result < 0)
return result;
/*
* Degraded controllers might return an error when setting the queue
* count. We still want to be able to bring them online and offer
* access to the admin queue, as that might be only way to fix them up.
*/
if (result > 0) {
dev_err(dev->dev, "Could not set queue count (%d)\n", result);
nr_io_queues = 0;
result = 0;
}
if (dev->cmb && NVME_CMB_SQS(dev->cmbsz)) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
dev->q_depth = result;
else
nvme_release_cmb(dev);
}
size = db_bar_size(dev, nr_io_queues);
if (size > 8192) {
iounmap(dev->bar);
do {
dev->bar = ioremap(pci_resource_start(pdev, 0), size);
if (dev->bar)
break;
if (!--nr_io_queues)
return -ENOMEM;
size = db_bar_size(dev, nr_io_queues);
} while (1);
dev->dbs = dev->bar + 4096;
adminq->q_db = dev->dbs;
}
/* Deregister the admin queue's interrupt */
free_irq(dev->entry[0].vector, adminq);
/*
* If we enable msix early due to not intx, disable it again before
* setting up the full range we need.
*/
if (!pdev->irq)
pci_disable_msix(pdev);
for (i = 0; i < nr_io_queues; i++)
dev->entry[i].entry = i;
vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
if (vecs < 0) {
vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
if (vecs < 0) {
vecs = 1;
} else {
for (i = 0; i < vecs; i++)
dev->entry[i].vector = i + pdev->irq;
}
}
/*
* Should investigate if there's a performance win from allocating
* more queues than interrupt vectors; it might allow the submission
* path to scale better, even if the receive path is limited by the
* number of interrupts.
*/
nr_io_queues = vecs;
dev->max_qid = nr_io_queues;
result = queue_request_irq(dev, adminq, adminq->irqname);
if (result) {
adminq->cq_vector = -1;
goto free_queues;
}
/* Free previously allocated queues that are no longer usable */
nvme_free_queues(dev, nr_io_queues + 1);
return nvme_create_io_queues(dev);
free_queues:
nvme_free_queues(dev, 1);
return result;
}
nvme_setup_io_queues flow:
- call nvme_set_queue_count
- call nvme_create_io_queues
nvme_set_queue_count issues a set features command with feature id 0x7 to set the number of io queues.
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
u32 q_count = (*count - 1) | ((*count - 1) << 16);
u32 result;
int status, nr_io_queues;
status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
&result);
if (status)
return status;
nr_io_queues = min(result & 0xffff, result >> 16) + 1;
*count = min(*count, nr_io_queues);
return 0;
}
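The bit packing here is worth checking by hand. In the Number of Queues feature both fields are 0's-based: dword 11 carries the requested submission and completion queue counts in its low and high halves, and the completion dword reports the granted counts the same way. A small sketch with hypothetical helper names mirrors the math above:

```c
#include <assert.h>
#include <stdint.h>

/* Build dword 11 for the Number of Queues feature (FID 0x7):
 * requested SQ count - 1 in bits 15:0, requested CQ count - 1 in 31:16. */
static uint32_t q_count_dword11(int want)
{
	return (uint32_t)(want - 1) | ((uint32_t)(want - 1) << 16);
}

/* Decode the completion dword: take the smaller of the granted SQ and CQ
 * counts and undo the 0's-based encoding, as nvme_set_queue_count does. */
static int q_count_granted(uint32_t result)
{
	uint32_t sqs = result & 0xffff;
	uint32_t cqs = result >> 16;

	return (int)(sqs < cqs ? sqs : cqs) + 1;
}
```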
The requested queue count is carried in dword 11 of the set features command.
int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
dma_addr_t dma_addr, u32 *result)
{
struct nvme_command c;
memset(&c, 0, sizeof(c));
c.features.opcode = nvme_admin_set_features;
c.features.prp1 = cpu_to_le64(dma_addr);
c.features.fid = cpu_to_le32(fid);
c.features.dword11 = cpu_to_le32(dword11);
return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
}
Back in nvme_setup_io_queues, on to nvme_create_io_queues.
static int nvme_create_io_queues(struct nvme_dev *dev)
{
unsigned i;
int ret = 0;
for (i = dev->queue_count; i <= dev->max_qid; i++) {
if (!nvme_alloc_queue(dev, i, dev->q_depth)) {
ret = -ENOMEM;
break;
}
}
for (i = dev->online_queues; i <= dev->queue_count - 1; i++) {
ret = nvme_create_queue(dev->queues[i], i);
if (ret) {
nvme_free_queues(dev, i);
break;
}
}
/*
* Ignore failing Create SQ/CQ commands, we can continue with less
* than the desired amount of queues, and even a controller without
* I/O queues can still be used to issue admin commands. This might
* be useful to upgrade a buggy firmware for example.
*/
return ret >= 0 ? 0 : ret;
}
nvme_alloc_queue was analyzed earlier, so no need to repeat it; the main thing to look at is nvme_create_queue.
static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
struct nvme_dev *dev = nvmeq->dev;
int result;
nvmeq->cq_vector = qid - 1;
result = adapter_alloc_cq(dev, qid, nvmeq);
if (result < 0)
return result;
result = adapter_alloc_sq(dev, qid, nvmeq);
if (result < 0)
goto release_cq;
result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
if (result < 0)
goto release_sq;
nvme_init_queue(nvmeq, qid);
return result;
release_sq:
adapter_delete_sq(dev, qid);
release_cq:
adapter_delete_cq(dev, qid);
return result;
}
adapter_alloc_cq sends the create io completion queue command, opcode 0x5.
static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
/*
* Note: we (ab)use the fact that the prp fields survive if no data
* is attached to the request.
*/
memset(&c, 0, sizeof(c));
c.create_cq.opcode = nvme_admin_create_cq;
c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr);
c.create_cq.cqid = cpu_to_le16(qid);
c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
c.create_cq.cq_flags = cpu_to_le16(flags);
c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}
adapter_alloc_sq sends the create io submission queue command, opcode 0x1.
static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;
/*
* Note: we (ab)use the fact that the prp fields survive if no data
* is attached to the request.
*/
memset(&c, 0, sizeof(c));
c.create_sq.opcode = nvme_admin_create_sq;
c.create_sq.prp1 = cpu_to_le64(nvmeq->sq_dma_addr);
c.create_sq.sqid = cpu_to_le16(qid);
c.create_sq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
c.create_sq.sq_flags = cpu_to_le16(flags);
c.create_sq.cqid = cpu_to_le16(qid);
return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}
Finally the interrupt is registered; this is the nvme0q1 interrupt we listed earlier.
Back in nvme_reset_work, look at nvme_dev_list_add. Its main job is to start a kernel thread running nvme_kthread, a function we will analyze later.
static int nvme_dev_list_add(struct nvme_dev *dev)
{
bool start_thread = false;
spin_lock(&dev_list_lock);
if (list_empty(&dev_list) && IS_ERR_OR_NULL(nvme_thread)) {
start_thread = true;
nvme_thread = NULL;
}
list_add(&dev->node, &dev_list);
spin_unlock(&dev_list_lock);
if (start_thread) {
nvme_thread = kthread_run(nvme_kthread, NULL, "nvme");
wake_up_all(&nvme_kthread_wait);
} else
wait_event_killable(nvme_kthread_wait, nvme_thread);
if (IS_ERR_OR_NULL(nvme_thread))
return nvme_thread ? PTR_ERR(nvme_thread) : -EINTR;
return 0;
}
Back in nvme_reset_work again, next up is nvme_start_queues.
void nvme_start_queues(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list) {
queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, ns->queue);
blk_mq_start_stopped_hw_queues(ns->queue, true);
blk_mq_kick_requeue_list(ns->queue);
}
mutex_unlock(&ctrl->namespaces_mutex);
}
Back in nvme_reset_work for the last function, nvme_dev_add. blk_mq_alloc_tag_set was already called once before, to allocate the tag set for the admin queue; this time it allocates the tag set for the io queues.
static int nvme_dev_add(struct nvme_dev *dev)
{
if (!dev->ctrl.tagset) {
dev->tagset.ops = &nvme_mq_ops;
dev->tagset.nr_hw_queues = dev->online_queues - 1;
dev->tagset.timeout = NVME_IO_TIMEOUT;
dev->tagset.numa_node = dev_to_node(dev->dev);
dev->tagset.queue_depth =
min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
dev->tagset.cmd_size = nvme_cmd_size(dev);
dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;
if (blk_mq_alloc_tag_set(&dev->tagset))
return 0;
dev->ctrl.tagset = &dev->tagset;
}
nvme_queue_scan(dev);
return 0;
}
Finally, nvme_queue_scan schedules another work item, whose function is nvme_dev_scan.
static void nvme_queue_scan(struct nvme_dev *dev)
{
/*
* Do not queue new scan work when a controller is reset during
* removal.
*/
if (test_bit(NVME_CTRL_REMOVING, &dev->flags))
return;
queue_work(nvme_workq, &dev->scan_work);
}
At this point nvme_reset_work is completely finished. Let's move on to the next work item and take the next hill!
nvme_dev_scan: another lengthy work item
static void nvme_dev_scan(struct work_struct *work)
{
struct nvme_dev *dev = container_of(work, struct nvme_dev, scan_work);
if (!dev->tagset.tags)
return;
nvme_scan_namespaces(&dev->ctrl);
nvme_set_irq_hints(dev);
}
It calls nvme_scan_namespaces.
void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
{
struct nvme_id_ctrl *id;
unsigned nn;
if (nvme_identify_ctrl(ctrl, &id))
return;
mutex_lock(&ctrl->namespaces_mutex);
nn = le32_to_cpu(id->nn);
if (ctrl->vs >= NVME_VS(1, 1) &&
!(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
if (!nvme_scan_ns_list(ctrl, nn))
goto done;
}
__nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
done:
list_sort(NULL, &ctrl->namespaces, ns_cmp);
mutex_unlock(&ctrl->namespaces_mutex);
kfree(id);
}
It first issues an identify command to the device via nvme_identify_ctrl, then hands the retrieved namespace count to __nvme_scan_namespaces.
static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
{
struct nvme_ns *ns, *next;
unsigned i;
lockdep_assert_held(&ctrl->namespaces_mutex);
for (i = 1; i <= nn; i++)
nvme_validate_ns(ctrl, i);
list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
if (ns->ns_id > nn)
nvme_ns_remove(ns);
}
}
nvme_validate_ns is called once for each namespace. The nvme spec defines namespace id 0 as meaning the namespace feature is not used, and 0xFFFFFFFF as matching any namespace, so ordinary namespace ids start at 1.
static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
struct nvme_ns *ns;
ns = nvme_find_ns(ctrl, nsid);
if (ns) {
if (revalidate_disk(ns->disk))
nvme_ns_remove(ns);
} else
nvme_alloc_ns(ctrl, nsid);
}
First it checks whether the namespace already exists. During initialization no namespaces have been created yet, so the lookup is bound to return NULL.
static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
struct nvme_ns *ns;
lockdep_assert_held(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list) {
if (ns->ns_id == nsid)
return ns;
if (ns->ns_id > nsid)
break;
}
return NULL;
}
Since no matching namespace is found, it has to be created.
static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
struct nvme_ns *ns;
struct gendisk *disk;
int node = dev_to_node(ctrl->dev);
lockdep_assert_held(&ctrl->namespaces_mutex);
/* allocate the nvme_ns structure */
ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
return;
/* allocate an index number for this ns; the ida mechanism was covered earlier */
ns->instance = ida_simple_get(&ctrl->ns_ida, 1, 0, GFP_KERNEL);
if (ns->instance < 0)
goto out_free_ns;
/* create a request queue */
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_release_instance;
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
ns->queue->queuedata = ns;
ns->ctrl = ctrl;
/* create a gendisk */
disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_queue;
kref_init(&ns->kref);
ns->ns_id = nsid;
ns->disk = disk;
ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
nvme_set_queue_limits(ctrl, ns->queue);
/* initialize the gendisk */
disk->major = nvme_major;
disk->first_minor = 0;
disk->fops = &nvme_fops;
disk->private_data = ns;
disk->queue = ns->queue;
disk->driverfs_dev = ctrl->device;
disk->flags = GENHD_FL_EXT_DEVT;
/* nvme0n1 under /dev */
sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
if (nvme_revalidate_disk(ns->disk))
goto out_free_disk;
/* add the newly created namespace to the ctrl->namespaces list */
list_add_tail(&ns->list, &ctrl->namespaces);
kref_get(&ctrl->kref);
if (ns->type == NVME_NS_LIGHTNVM)
return;
/* alloc_disk_node only creates the disk; add_disk is what actually adds it
to the system so it can be operated on. What gets created here is
/dev/nvme0n1, but why is its major number 259? add_disk calls a function
named blk_alloc_devt, which contains logic deciding whether to register
the device under the major we set ourselves or under BLOCK_EXT_MAJOR.
Why exactly it is done this way I have not yet figured out; pointers are
welcome. */
add_disk(ns->disk);
if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
&nvme_ns_attr_group))
pr_warn("%s: failed to create sysfs group for identification\n",
ns->disk->disk_name);
return;
out_free_disk:
kfree(disk);
out_free_queue:
blk_cleanup_queue(ns->queue);
out_release_instance:
ida_simple_remove(&ctrl->ns_ida, ns->instance);
out_free_ns:
kfree(ns);
}
nvme_alloc_ns contains one important function, nvme_revalidate_disk, that is worth following into.
static int nvme_revalidate_disk(struct gendisk *disk)
{
struct nvme_ns *ns = disk->private_data;
struct nvme_id_ns *id;
u8 lbaf, pi_type;
u16 old_ms;
unsigned short bs;
if (test_bit(NVME_NS_DEAD, &ns->flags)) {
set_capacity(disk, 0);
return -ENODEV;
}
if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
__func__, ns->ctrl->instance, ns->ns_id);
return -ENODEV;
}
if (id->ncap == 0) {
kfree(id);
return -ENODEV;
}
if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
if (nvme_nvm_register(ns->queue, disk->disk_name)) {
dev_warn(ns->ctrl->dev,
"%s: LightNVM init failure\n", __func__);
kfree(id);
return -ENODEV;
}
ns->type = NVME_NS_LIGHTNVM;
}
if (ns->ctrl->vs >= NVME_VS(1, 1))
memcpy(ns->eui, id->eui64, sizeof(ns->eui));
if (ns->ctrl->vs >= NVME_VS(1, 2))
memcpy(ns->uuid, id->nguid, sizeof(ns->uuid));
old_ms = ns->ms;
lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
ns->lba_shift = id->lbaf[lbaf].ds;
ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
/*
* If identify namespace failed, use default 512 byte block size so
* block layer can use before failing read/write for capacity.
*/
if (ns->lba_shift == 0)
ns->lba_shift = 9;
bs = 1 << ns->lba_shift;
/* XXX: PI implementation requires metadata equal t10 pi tuple size */
pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
id->dps & NVME_NS_DPS_PI_MASK : 0;
blk_mq_freeze_queue(disk->queue);
if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
ns->ms != old_ms ||
bs != queue_logical_block_size(disk->queue) ||
(ns->ms && ns->ext)))
blk_integrity_unregister(disk);
ns->pi_type = pi_type;
blk_queue_logical_block_size(ns->queue, bs);
if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
nvme_init_integrity(ns);
if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
set_capacity(disk, 0);
else
/* id->nsze is the Namespace Size, i.e. the total number of LBAs in this
namespace. ns->lba_shift is 9 here, meaning 512 bytes per sector, so
nsze * 512 works out to the size of the disk image we created with
qemu-img at the start. */
set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
nvme_config_discard(ns);
blk_mq_unfreeze_queue(disk->queue);
kfree(id);
return 0;
}
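The shift at the end of the set_capacity call converts LBAs to the 512-byte sectors the block layer counts in: shifting by (lba_shift - 9) is a no-op for 512-byte LBAs and multiplies by 8 for 4K LBAs. A tiny sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* set_capacity() takes a count of 512-byte sectors, so the LBA count nsze
 * is scaled by the ratio between the LBA size (1 << lba_shift) and 512. */
static uint64_t capacity_in_512b_sectors(uint64_t nsze, unsigned lba_shift)
{
	return nsze << (lba_shift - 9);
}
```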
Back in nvme_alloc_ns: once add_disk has run, the nvme_open and nvme_revalidate_disk callbacks registered in disk->fops will be invoked.
static const struct block_device_operations nvme_fops = {
.owner = THIS_MODULE,
.ioctl = nvme_ioctl,
.compat_ioctl = nvme_compat_ioctl,
.open = nvme_open,
.release = nvme_release,
.getgeo = nvme_getgeo,
.revalidate_disk= nvme_revalidate_disk,
.pr_ops = &nvme_pr_ops,
};
static int nvme_open(struct block_device *bdev, fmode_t mode)
{
return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
}
static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
{
struct nvme_ns *ns;
spin_lock(&dev_list_lock);
ns = disk->private_data;
if (ns && !kref_get_unless_zero(&ns->kref))
ns = NULL;
spin_unlock(&dev_list_lock);
return ns;
}
Back in nvme_dev_scan, nvme_set_irq_hints is called to do some interrupt-affinity tuning.
static void nvme_set_irq_hints(struct nvme_dev *dev)
{
struct nvme_queue *nvmeq;
int i;
for (i = 0; i < dev->online_queues; i++) {
nvmeq = dev->queues[i];
if (!nvmeq->tags || !(*nvmeq->tags))
continue;
irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
blk_mq_tags_cpumask(*nvmeq->tags));
}
}
At this point nvme_dev_scan is finished too, and the initialization of the nvme driver is essentially complete. What it leaves behind are /dev/nvme0 and /dev/nvme0n1, the two interfaces user space operates on, plus the unassuming kernel thread nvme_kthread, quietly polling once a second. The thread does very little: it checks whether a controller needs to be reset and whether there are unhandled completion messages. What I don't understand, though, is why polling is needed at all; isn't CQ processing supposed to be driven entirely by interrupts?
static int nvme_kthread(void *data)
{
struct nvme_dev *dev, *next;
while (!kthread_should_stop()) {
set_current_state(TASK_INTERRUPTIBLE);
spin_lock(&dev_list_lock);
list_for_each_entry_safe(dev, next, &dev_list, node) {
int i;
u32 csts = readl(dev->bar + NVME_REG_CSTS);
/*
* Skip controllers currently under reset.
*/
if (work_pending(&dev->reset_work) || work_busy(&dev->reset_work))
continue;
if ((dev->subsystem && (csts & NVME_CSTS_NSSRO)) ||
csts & NVME_CSTS_CFS) {
if (queue_work(nvme_workq, &dev->reset_work)) {
dev_warn(dev->dev,
"Failed status: %x, reset controller\n",
readl(dev->bar + NVME_REG_CSTS));
}
continue;
}
for (i = 0; i < dev->queue_count; i++) {
struct nvme_queue *nvmeq = dev->queues[i];
if (!nvmeq)
continue;
spin_lock_irq(&nvmeq->q_lock);
nvme_process_cq(nvmeq);
while (i == 0 && dev->ctrl.event_limit > 0)
nvme_submit_async_event(dev);
spin_unlock_irq(&nvmeq->q_lock);
}
}
spin_unlock(&dev_list_lock);
schedule_timeout(round_jiffies_relative(HZ));
}
return 0;
}
Space being limited, reads and writes after initialization will be covered in a separate article.
Of course, as my study of nvme is still at the beginner stage, this code walkthrough is bound to be insufficiently detailed or even wrong in places. Please do point out any mistakes so I can verify and fix them; thanks.