
Analysis of the NVMe driver code in Linux (loading and initialization)

It is all but inevitable that SSDs will replace HDDs. The SSD interface has moved from SATA to PCIe, and the upper-layer protocol from AHCI to NVMe, each step bringing another jump in performance. The NVMe physical layer sits on top of the high-speed PCIe interface: a single PCIe 3.0 lane already runs at 8 GT/s (roughly 1 GB/s of usable bandwidth), and x2/x4/x8 ... N lanes together give N times that, which is very powerful. On top of this, the NVMe protocol itself is lean and efficient, which further cuts protocol overhead and yields very fast transfers.

QEMU + BUILDROOT

Out of curiosity and interest I wanted to learn NVMe. For the specs, naturally, read the latest PCIe 3.0 and NVMe 1.2 documents. But staying at the document level is armchair theory; some hands-on practice is needed to really understand it. There are basically two ways to practice: study the NVMe module of the firmware on the device side, or study the NVMe driver on the host side. Since I no longer work on SSDs, the first path is essentially closed, so the second it is. An actual NVMe SSD would be ideal for experiments, but I don't have one, and I suspect most people don't either. Fortunately there is a workaround prepared for us: qemu!

qemu is an emulator that can emulate x86, ARM, PowerPC and so on, and it supports emulating an NVMe device (although for now the NVMe device seems to be supported only on x86; I tried an ARM + PCI setup and qemu refused to attach the nvme device). For the host OS the obvious choice is Linux, since we can easily get all of the NVMe-related source code.

So... how do we actually get started? Linux is only a kernel; we also need a rootfs and various other pieces. What we want to study here is the NVMe driver, not how to build a Linux system from scratch, so we need a quick and convenient path. This is where buildroot comes in: with it, everything is taken care of. ^_^ The prerequisites are a Linux host (I recommend Ubuntu; I like it, many people use it so problems are easy to look up online, and 16.04 is the latest at the time of writing) with qemu installed.

The latest buildroot release can be downloaded here: https://buildroot.org/downloads/buildroot-2016.05.tar.bz2

After unpacking it, run:

make qemu_x86_64_defconfig
make
           

Once the build finishes, the freshly built Linux system can be booted in qemu with the command described in board/qemu/x86_64/readme.txt.

qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append root=/dev/vda -net nic,model=virtio -net user
           

By default the system log is redirected to the virtual machine's emulated "screen" rather than to our shell; it cannot be scrolled back, which makes debugging awkward. We need to change a couple of things to redirect the log to the Linux shell. The first step is to edit the .config file in the buildroot directory and change the console setting as sketched below.
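The original before/after screenshots are not reproduced here, so take the following as a sketch of the likely change rather than a verbatim diff: the buildroot option that selects the port getty runs on is BR2_TARGET_GENERIC_GETTY_PORT, and pointing it at the serial port matches the console=ttyS0 / -serial stdio options used in the command below.

# .config in the buildroot top directory (assumed original value)
BR2_TARGET_GENERIC_GETTY_PORT="tty1"

# changed to the emulated serial port
BR2_TARGET_GENERIC_GETTY_PORT="ttyS0"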

Then rebuild. When the build is done, run the modified command below and we get the result we want.

make

qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append "console=ttyS0 root=/dev/vda" -net nic,model=virtio -net user -serial stdio
           

Next, let's extend the command a little to add NVMe support.

qemu-img create -f raw nvme.img 1G

qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append "console=ttyS0 root=/dev/vda" -net nic,model=virtio -net user -serial stdio -drive file=nvme.img,if=none,format=raw,id=drv0 -device nvme,drive=drv0,serial=foo
           

Once the Linux system is up, the NVMe-related devices show up under /dev.

# ls -l /dev
crw-------    1 root     root      253,   0 Jun  3 13:00 nvme0
brw-------    1 root     root      259,   0 Jun  3 13:00 nvme0n1
           

At this point we can pause the hands-on part and go study the NVMe code. Whenever a question comes up, we can modify the code and run it in qemu to see the effect. Great!

RTFSC - Read The Fucking Source Code

The driver analysis below is based on Linux kernel 4.5.3. Why this version? Mainly because it is what buildroot-2016.05 selects by default. The kernel version can be changed by hand, but I won't go into that here. The NVMe code lives under drivers/nvme; there are not many files, mainly core.c and pci.c.

To analyze a driver, first find its entry point. module_init declares nvme_init as the entry point of this driver; it is called automatically while Linux boots (or when the module is loaded).

static int __init nvme_init(void)
{
    int result;

    init_waitqueue_head(&nvme_kthread_wait);

    nvme_workq = alloc_workqueue("nvme", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
    if (!nvme_workq)
        return -ENOMEM;

    result = nvme_core_init();
    if (result < 0)
        goto kill_workq;

    result = pci_register_driver(&nvme_driver);
    if (result)
        goto core_exit;
    return 0;

 core_exit:
    nvme_core_exit();
 kill_workq:
    destroy_workqueue(nvme_workq);
    return result;
}

static void __exit nvme_exit(void)
{
    pci_unregister_driver(&nvme_driver);
    nvme_core_exit();
    destroy_workqueue(nvme_workq);
    BUG_ON(nvme_thread && !IS_ERR(nvme_thread));
    _nvme_check_size();
}

module_init(nvme_init);
module_exit(nvme_exit);
           

nvme_init flow analysis:

  • Create a global workqueue. Once it exists, many work items can be queued on it; the two important ones we will meet later, scan_work and reset_work, are both scheduled on this workqueue.
  • Call nvme_core_init.
  • Call pci_register_driver.
int __init nvme_core_init(void)
{
    int result;

    result = register_blkdev(nvme_major, "nvme");
    if (result < 0)
        return result;
    else if (result > 0)
        nvme_major = result;

    result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
                            &nvme_dev_fops);
    if (result < 0)
        goto unregister_blkdev;
    else if (result > 0)
        nvme_char_major = result;

    nvme_class = class_create(THIS_MODULE, "nvme");
    if (IS_ERR(nvme_class)) {
        result = PTR_ERR(nvme_class);
        goto unregister_chrdev;
    }

    return 0;

 unregister_chrdev:
    __unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 unregister_blkdev:
    unregister_blkdev(nvme_major, "nvme");
    return result;
}
           

nvme_core_init flow analysis:

  • Call register_blkdev to register a block device named nvme.
  • Call __register_chrdev to register a character device named nvme. These registrations do not show up under /dev; they appear in /proc/devices and stand for the major number assigned to a class of devices, not for a device instance. Note that character-device and block-device major numbers live in separate namespaces, so the same number can refer to completely unrelated devices. Here the character and block majors we obtain both happen to be 253.
# cat /proc/devices 
Character devices:
  1 mem
  2 pty
  3 ttyp
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  7 vcs
 10 misc
 13 input
 29 fb
116 alsa
128 ptm
136 pts
180 usb
189 usb_device
226 drm
253 nvme
254 bsg

Block devices:
259 blkext
  8 sd
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
128 sd
129 sd
130 sd
131 sd
132 sd
133 sd
134 sd
135 sd
253 nvme
254 virtblk

Careful readers may notice that the character device nvme0 listed earlier indeed has major 253, but the block device nvme0n1 has major 259, i.e. blkext. Why is that? Hold that thought; we will come back to it when the device instance is registered later.

Back in nvme_init, pci_register_driver registers a PCI driver. A couple of things matter here. One is the vendor id / device id table: one entry is PCI_VDEVICE(INTEL, 0x5845), and it is what lets this driver match the device enumerated on the PCI bus so that the right driver gets loaded.

static const struct pci_device_id nvme_id_table[] = {
    { PCI_VDEVICE(INTEL, 0x0953),
        .driver_data = NVME_QUIRK_STRIPE_SIZE, },
    { PCI_VDEVICE(INTEL, 0x5845),   /* Qemu emulated controller */
        .driver_data = NVME_QUIRK_IDENTIFY_CNS, },
    { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
    { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
    { 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);

static struct pci_driver nvme_driver = {
    .name       = "nvme",
    .id_table   = nvme_id_table,
    .probe      = nvme_probe,
    .remove     = nvme_remove,
    .shutdown   = nvme_shutdown,
    .driver     = {
        .pm = &nvme_dev_pm_ops,
    },
    .err_handler    = &nvme_err_handler,
};
           

Inside Linux we can look at the current PCI devices with lspci and see that the NVMe device's device id is indeed 0x5845.

# lspci -k
00:00.0 Class 0600: 8086:1237
00:01.0 Class 0601: 8086:7000
00:01.1 Class 0101: 8086:7010 ata_piix
00:01.3 Class 0680: 8086:7113
00:02.0 Class 0300: 1234:1111 bochs-drm
00:03.0 Class 0200: 1af4:1000 virtio-pci
00:04.0 Class 0108: 8086:5845 nvme
00:05.0 Class 0100: 1af4:1001 virtio-pci
           

The other important thing pci_register_driver does is install the probe function. Once a device matches the driver, that driver's probe function is called to actually bring the driver up. So after nvme_init returns, the driver does nothing until the PCI bus enumerates the NVMe device, at which point our nvme_probe is invoked.

static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int node, result = -ENOMEM;
    struct nvme_dev *dev;

    node = dev_to_node(&pdev->dev);
    if (node == NUMA_NO_NODE)
        set_dev_node(&pdev->dev, 0);

    dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
    if (!dev)
        return -ENOMEM;
    dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
                            GFP_KERNEL, node);
    if (!dev->entry)
        goto free;
    dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
                            GFP_KERNEL, node);
    if (!dev->queues)
        goto free;

    dev->dev = get_device(&pdev->dev);
    pci_set_drvdata(pdev, dev);

    result = nvme_dev_map(dev);
    if (result)
        goto free;

    INIT_LIST_HEAD(&dev->node);
    INIT_WORK(&dev->scan_work, nvme_dev_scan);
    INIT_WORK(&dev->reset_work, nvme_reset_work);
    INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
    mutex_init(&dev->shutdown_lock);
    init_completion(&dev->ioq_wait);

    result = nvme_setup_prp_pools(dev);
    if (result)
        goto put_pci;

    result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops,
            id->driver_data);
    if (result)
        goto release_pools;

    queue_work(nvme_workq, &dev->reset_work);
    return 0;

 release_pools:
    nvme_release_prp_pools(dev);
 put_pci:
    put_device(dev->dev);
    nvme_dev_unmap(dev);
 free:
    kfree(dev->queues);
    kfree(dev->entry);
    kfree(dev);
    return result;
}
           

nvme_probe flow analysis:

  • Allocate dev, dev->entry and dev->queues. entry holds the MSI-related information, one entry per CPU core. queues holds one I/O queue per core plus one admin queue shared by all cores. Strictly speaking, a "queue" here means a pair of submission queue and completion queue.
  • Call nvme_dev_map.
  • Initialize the three work items and hook up their callback functions.
  • Call nvme_setup_prp_pools.
  • Call nvme_init_ctrl.
  • Schedule dev->reset_work on the workqueue, i.e. schedule nvme_reset_work.
static int nvme_dev_map(struct nvme_dev *dev)
{
    int bars;
    struct pci_dev *pdev = to_pci_dev(dev->dev);

    bars = pci_select_bars(pdev, IORESOURCE_MEM);
    if (!bars)
        return -ENODEV;
    if (pci_request_selected_regions(pdev, bars, "nvme"))
        return -ENODEV;

    dev->bar = ioremap(pci_resource_start(pdev, 0), 8192);
    if (!dev->bar)
        goto release;

    return 0;
 release:
    pci_release_regions(pdev);
    return -ENODEV;
}
           

nvme_dev_map flow analysis:

  • Call pci_select_bars. Its return value is a mask with one bit per BAR (base address register); a set bit means that BAR is non-zero. This comes from the PCI spec: a PCI device's configuration space contains six 32-bit BAR registers, each describing a memory or I/O region on the device. We can add a little debug code here to understand this better (see the sketch after this list); after changing the code, rebuilding the kernel from the buildroot directory is quick, and we can run it and look at the result. I checked the return value of pci_select_bars: it is 0x11, meaning BAR0 and BAR4 are non-zero.
  • Call pci_request_selected_regions. One of its arguments is the mask returned by pci_select_bars; it reserves those BARs so nothing else can claim them.

    Without the call to pci_request_selected_regions, /proc/iomem looks like this:

# cat /proc/iomem 
-febfffff : PCI Bus :
  fd000000-fdffffff : ::
    fd000000-fdffffff : bochs-drm
  feb80000-febbffff : ::
  febc0000-febcffff : ::
  febd0000-febd1fff : ::
  febd2000-febd2fff : ::
    febd2000-febd2fff : bochs-drm
  febd3000-febd3fff : ::
  febd4000-febd4fff : ::
  febd5000-febd5fff : ::
           

With pci_request_selected_regions called, /proc/iomem gains two nvme entries: BAR0 corresponds to physical address 0xfebd0000 and BAR4 to 0xfebd4000.

# cat /proc/iomem 
-febfffff : PCI Bus :
  fd000000-fdffffff : ::
    fd000000-fdffffff : bochs-drm
  feb80000-febbffff : ::
  febc0000-febcffff : ::
  febd0000-febd1fff : ::
    febd0000-febd1fff : nvme
  febd2000-febd2fff : ::
    febd2000-febd2fff : bochs-drm
  febd3000-febd3fff : ::
  febd4000-febd4fff : ::
    febd4000-febd4fff : nvme
  febd5000-febd5fff : ::
           
  • Call ioremap. As noted above, BAR0 corresponds to physical address 0xfebd0000. In Linux we cannot access a physical address directly; it must be mapped to a virtual address, which is exactly what ioremap does. Once mapped, we can access dev->bar to operate directly on the NVMe controller's registers. Note that the code does not pick which BAR to map from pci_select_bars' return value; it hard-codes BAR0, because the NVMe spec mandates that BAR0 is the memory-mapped base of the controller registers. BAR4 is vendor-specific, and I am not yet sure what it is used for.
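As mentioned in the first bullet, a throwaway debug print is enough to see the BAR mask and the BAR0 address for yourself. This line is not part of the upstream driver, just an illustration of the kind of instrumentation meant above; drop it into nvme_dev_map right after pci_select_bars and rebuild:

    /* debug aid (not in the upstream driver): dump the BAR mask and BAR0 */
    dev_info(dev->dev, "pci_select_bars() mask = 0x%x, BAR0 = 0x%llx\n",
            bars, (unsigned long long)pci_resource_start(pdev, 0));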
static int nvme_setup_prp_pools(struct nvme_dev *dev)
{
    dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                        PAGE_SIZE, PAGE_SIZE, 0);
    if (!dev->prp_page_pool)
        return -ENOMEM;

    /* Optimisation for I/Os between 4k and 128k */
    dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
                        256, 256, 0);
    if (!dev->prp_small_pool) {
        dma_pool_destroy(dev->prp_page_pool);
        return -ENOMEM;
    }
    return 0;
}
           

Back in nvme_probe, nvme_setup_prp_pools mainly creates DMA pools; later on, other DMA helpers can grab memory from these pools. One pool hands out 256-byte chunks and the other 4 KB (PAGE_SIZE) chunks, mainly to optimize for PRP lists of different lengths (a small usage sketch follows).
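As a reminder of how such a pool is consumed later (nvme_setup_prps does essentially this), here is a minimal sketch of the dma_pool API. first_prp_dma below is a hypothetical data-page address, not something taken from the driver:

    dma_addr_t prp_dma;
    __le64 *prp_list;

    /* grab one 256-byte PRP list from the small pool */
    prp_list = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC, &prp_dma);
    if (prp_list) {
        prp_list[0] = cpu_to_le64(first_prp_dma);  /* hypothetical data page address */
        /* ... hand prp_dma to the controller inside a command ... */
        dma_pool_free(dev->prp_small_pool, prp_list, prp_dma);
    }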

int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
        const struct nvme_ctrl_ops *ops, unsigned long quirks)
{
    int ret;

    INIT_LIST_HEAD(&ctrl->namespaces);
    mutex_init(&ctrl->namespaces_mutex);
    kref_init(&ctrl->kref);
    ctrl->dev = dev;
    ctrl->ops = ops;
    ctrl->quirks = quirks;

    ret = nvme_set_instance(ctrl);
    if (ret)
        goto out;

    ctrl->device = device_create_with_groups(nvme_class, ctrl->dev,
                MKDEV(nvme_char_major, ctrl->instance),
                dev, nvme_dev_attr_groups,
                "nvme%d", ctrl->instance);
    if (IS_ERR(ctrl->device)) {
        ret = PTR_ERR(ctrl->device);
        goto out_release_instance;
    }
    get_device(ctrl->device);
    dev_set_drvdata(ctrl->device, ctrl);
    ida_init(&ctrl->ns_ida);

    spin_lock(&dev_list_lock);
    list_add_tail(&ctrl->node, &nvme_ctrl_list);
    spin_unlock(&dev_list_lock);

    return 0;
out_release_instance:
    nvme_release_instance(ctrl);
out:
    return ret;
}
           

Back in nvme_probe: the main thing nvme_init_ctrl does is create, via device_create_with_groups, the character device named nvme0 that we saw earlier.

With this character device in place we can operate on the controller through open, ioctl and similar interfaces (a userspace sketch follows the fops listing below).

static const struct file_operations nvme_dev_fops = {
    .owner      = THIS_MODULE,
    .open       = nvme_dev_open,
    .release    = nvme_dev_release,
    .unlocked_ioctl = nvme_dev_ioctl,
    .compat_ioctl   = nvme_dev_ioctl,
};
           
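To make that concrete, here is a small userspace sketch (not from the original article) that opens /dev/nvme0 and sends an Identify Controller admin command (opcode 0x06) through NVME_IOCTL_ADMIN_CMD, using the uapi header linux/nvme_ioctl.h. Treat it as an illustration rather than production code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    struct nvme_admin_cmd cmd;
    void *buf = NULL;
    int fd = open("/dev/nvme0", O_RDWR);

    if (fd < 0 || posix_memalign(&buf, 4096, 4096))
        return 1;

    memset(&cmd, 0, sizeof(cmd));
    memset(buf, 0, 4096);
    cmd.opcode = 0x06;              /* Identify */
    cmd.cdw10 = 1;                  /* CNS = 1: Identify Controller */
    cmd.addr = (unsigned long)buf;  /* 4 KB data buffer */
    cmd.data_len = 4096;

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) == 0)
        printf("model: %.40s\n", (char *)buf + 24);  /* MN field starts at byte 24 */

    free(buf);
    close(fd);
    return 0;
}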

The 0 in nvme0 comes from nvme_set_instance, which mainly uses ida_get_new to obtain a unique index.

static int nvme_set_instance(struct nvme_ctrl *ctrl)
{
    int instance, error;

    do {
        if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
            return -ENODEV;

        spin_lock(&dev_list_lock);
        error = ida_get_new(&nvme_instance_ida, &instance);
        spin_unlock(&dev_list_lock);
    } while (error == -EAGAIN);

    if (error)
        return -ENODEV;

    ctrl->instance = instance;
    return 0;
}
           

Back in nvme_probe once more: dev->reset_work gets scheduled, which means nvme_reset_work is about to run.

nvme_reset_work — a very long work

static void nvme_reset_work(struct work_struct *work)
{
    struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
    int result = -ENODEV;

    if (WARN_ON(test_bit(NVME_CTRL_RESETTING, &dev->flags)))
        goto out;

    /*
     * If we're called to reset a live controller first shut it down before
     * moving on.
     */
    if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
        nvme_dev_disable(dev, false);

    set_bit(NVME_CTRL_RESETTING, &dev->flags);

    result = nvme_pci_enable(dev);
    if (result)
        goto out;

    result = nvme_configure_admin_queue(dev);
    if (result)
        goto out;

    nvme_init_queue(dev->queues[0], 0);
    result = nvme_alloc_admin_tags(dev);
    if (result)
        goto out;

    result = nvme_init_identify(&dev->ctrl);
    if (result)
        goto out;

    result = nvme_setup_io_queues(dev);
    if (result)
        goto out;

    dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;

    result = nvme_dev_list_add(dev);
    if (result)
        goto out;

    /*
     * Keep the controller around but remove all namespaces if we don't have
     * any working I/O queue.
     */
    if (dev->online_queues < 2) {
        dev_warn(dev->dev, "IO queues not created\n");
        nvme_remove_namespaces(&dev->ctrl);
    } else {
        nvme_start_queues(&dev->ctrl);
        nvme_dev_add(dev);
    }

    clear_bit(NVME_CTRL_RESETTING, &dev->flags);
    return;

 out:
    nvme_remove_dead_ctrl(dev, result);
}
           

nvme_reset_work flow analysis:

  • First use the NVME_CTRL_RESETTING flag to make sure nvme_reset_work cannot be entered twice.
  • Call nvme_pci_enable.
  • Call nvme_configure_admin_queue.
  • Call nvme_init_queue.
  • Call nvme_alloc_admin_tags.
  • Call nvme_init_identify.
  • Call nvme_setup_io_queues.
  • Call nvme_dev_list_add.
static int nvme_pci_enable(struct nvme_dev *dev)
{
    u64 cap;
    int result = -ENOMEM;
    struct pci_dev *pdev = to_pci_dev(dev->dev);

    if (pci_enable_device_mem(pdev))
        return result;

    dev->entry[0].vector = pdev->irq;
    pci_set_master(pdev);

    if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)) &&
        dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(32)))
        goto disable;

    if (readl(dev->bar + NVME_REG_CSTS) == -1) {
        result = -ENODEV;
        goto disable;
    }

    /*
     * Some devices don't advertse INTx interrupts, pre-enable a single
     * MSIX vec for setup. We'll adjust this later.
     */
    if (!pdev->irq) {
        result = pci_enable_msix(pdev, dev->entry, 1);
        if (result < 0)
            goto disable;
    }

    cap = lo_hi_readq(dev->bar + NVME_REG_CAP);

    dev->q_depth = min_t(int, NVME_CAP_MQES(cap) + 1, NVME_Q_DEPTH);
    dev->db_stride = 1 << NVME_CAP_STRIDE(cap);
    dev->dbs = dev->bar + 4096;

    /*
     * Temporary fix for the Apple controller found in the MacBook8, and
     * some MacBook7, to avoid controller resets and data loss.
     */
    if (pdev->vendor == PCI_VENDOR_ID_APPLE && pdev->device == 0x2001) {
        dev->q_depth = 2;
        dev_warn(dev->dev, "detected Apple NVMe controller, set "
            "queue depth=%u to work around controller resets\n",
            dev->q_depth);
    }

    if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 2))
        dev->cmb = nvme_map_cmb(dev);

    pci_enable_pcie_error_reporting(pdev);
    pci_save_state(pdev);
    return 0;

 disable:
    pci_disable_device(pdev);
    return result;
}
           

nvme_pci_enable flow analysis:

  • Call pci_enable_device_mem to enable the NVMe device's memory space, i.e. the BAR0 region mapped earlier.
  • After that, readl(dev->bar + NVME_REG_CSTS) and friends can operate directly on the controller registers, i.e. the register table defined in the NVMe spec.
    [Figure: NVMe controller register map from the spec, not reproduced here]
  • PCI has two interrupt schemes: the legacy INTx pin and MSI/MSI-X. If the legacy interrupt is not available, a single MSI-X vector is enabled here for setup. In this environment the legacy interrupt is used, IRQ 11, but dev->entry[0].vector, the MSI vector slot of the admin queue, is still assigned, so MSI may well be used later.
# cat /proc/interrupts 
           CPU0       
  :            IO-APIC   -edge      timer
  :             IO-APIC   -edge      i8042
  :           IO-APIC   -edge      serial
  :             IO-APIC   -fasteoi   acpi
 :           IO-APIC  -fasteoi   virtio1
 :            IO-APIC  -fasteoi   virtio0, nvme0q0, nvme0q1
 :           IO-APIC  -edge      i8042
 :             IO-APIC  -edge      ata_piix
 :             IO-APIC  -edge      ata_piix
           
  • Read a few configuration parameters from the CAP register and set dev->dbs to dev->bar + 4096; the 4096 comes from the doorbell registers starting at offset 0x1000 in the register map above.
  • If the controller implements NVMe 1.2 or later, call nvme_map_cmb to map the controller memory buffer. The NVMe device in qemu 2.5 implements NVMe 1.1, so this path is not taken here, but since the CMB is a new 1.2 feature I will go through it anyway. The point of the CMB is to move SQ/CQ storage from host memory into device memory to improve performance and latency. The function itself is not very different from the earlier PCI mapping work, with one thing worth noting: it maps with ioremap_wc instead of ioremap.
static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
{
    u64 szu, size, offset;
    u32 cmbloc;
    resource_size_t bar_size;
    struct pci_dev *pdev = to_pci_dev(dev->dev);
    void __iomem *cmb;
    dma_addr_t dma_addr;

    if (!use_cmb_sqes)
        return NULL;

    dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
    if (!(NVME_CMB_SZ(dev->cmbsz)))
        return NULL;

    cmbloc = readl(dev->bar + NVME_REG_CMBLOC);

    szu = (u64)1 << (12 + 4 * NVME_CMB_SZU(dev->cmbsz));
    size = szu * NVME_CMB_SZ(dev->cmbsz);
    offset = szu * NVME_CMB_OFST(cmbloc);
    bar_size = pci_resource_len(pdev, NVME_CMB_BIR(cmbloc));

    if (offset > bar_size)
        return NULL;

    /*
     * Controllers may support a CMB size larger than their BAR,
     * for example, due to being behind a bridge. Reduce the CMB to
     * the reported size of the BAR
     */
    if (size > bar_size - offset)
        size = bar_size - offset;

    dma_addr = pci_resource_start(pdev, NVME_CMB_BIR(cmbloc)) + offset;
    cmb = ioremap_wc(dma_addr, size);
    if (!cmb)
        return NULL;

    dev->cmb_dma_addr = dma_addr;
    dev->cmb_size = size;
    return cmb;
}
           
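To make the CMBSZ/CMBLOC arithmetic above concrete, here is a worked example with invented register values (qemu 2.5 does not expose a CMB at all): with CMBSZ.SZU = 0 the size unit is 4 KiB, so CMBSZ.SZ = 256 and CMBLOC.OFST = 2 give

szu    = 1 << (12 + 4 * 0)  = 4096 bytes
size   = 4096 * 256         = 1 MiB of controller memory
offset = 4096 * 2           = 8 KiB into the BAR selected by CMBLOC.BIR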

First a word about ioremap: it is simply ioremap_nocache. What we remap is very often device registers, and caching those would cause unpredictable behaviour, so by default the mapping is uncached. ioremap_wc instead uses a caching mode called write combining. As far as I know this is implemented on x86 rather than ARM. The first thing to understand is that once a region is marked WC it has nothing to do with the L1/L2/L3 caches. x86 has a separate set of WC buffers (typically 64 bytes each, with a platform-dependent number of them). Reads from WC memory go straight to memory, bypassing the cache. Writes are different: new data is not written to memory immediately but parked in a WC buffer, and only when the buffer fills up, or certain instructions are executed, is it flushed to memory in one burst. It is still a caching mechanism of sorts, and it can be used here because the order in which SQ/CQ contents are written does not matter: nothing will read them until the doorbell register is written or the device raises an interrupt, so this optimization is safe.

/*
 * The default ioremap() behavior is non-cached:
 */
static inline void __iomem *ioremap(resource_size_t offset, unsigned long size)
{
    return ioremap_nocache(offset, size);
}
           

Back in nvme_reset_work, on to nvme_configure_admin_queue.

static int nvme_configure_admin_queue(struct nvme_dev *dev)
{
    int result;
    u32 aqa;
    u64 cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
    struct nvme_queue *nvmeq;

    dev->subsystem = readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 1) ?
                        NVME_CAP_NSSRC(cap) : 0;

    if (dev->subsystem &&
        (readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_NSSRO))
        writel(NVME_CSTS_NSSRO, dev->bar + NVME_REG_CSTS);

    result = nvme_disable_ctrl(&dev->ctrl, cap);
    if (result < 0)
        return result;

    nvmeq = dev->queues[0];
    if (!nvmeq) {
        nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH);
        if (!nvmeq)
            return -ENOMEM;
    }

    aqa = nvmeq->q_depth - 1;
    aqa |= aqa << 16;

    writel(aqa, dev->bar + NVME_REG_AQA);
    lo_hi_writeq(nvmeq->sq_dma_addr, dev->bar + NVME_REG_ASQ);
    lo_hi_writeq(nvmeq->cq_dma_addr, dev->bar + NVME_REG_ACQ);

    result = nvme_enable_ctrl(&dev->ctrl, cap);
    if (result)
        goto free_nvmeq;

    nvmeq->cq_vector = 0;
    result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
    if (result) {
        nvmeq->cq_vector = -1;
        goto free_nvmeq;
    }

    return result;

 free_nvmeq:
    nvme_free_queues(dev, 0);
    return result;
}
           

nvme_configure_admin_queue flow analysis:

  • Learn from the CAP register whether Subsystem Reset is supported.
  • Call nvme_disable_ctrl.
  • Call nvme_alloc_queue.
  • Call nvme_enable_ctrl.
  • Call queue_request_irq.
int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
{
    int ret;

    ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
    ctrl->ctrl_config &= ~NVME_CC_ENABLE;

    ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
    if (ret)
        return ret;
    return nvme_wait_ready(ctrl, cap, false);
}
           

ctrl->ops here is the nvme_pci_ctrl_ops passed in earlier by nvme_init_ctrl; reg_write32 disables the controller through the NVME_REG_CC register.

static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
{
    *val = readl(to_nvme_dev(ctrl)->bar + off);
    return 0;
}

static int nvme_pci_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val)
{
    writel(val, to_nvme_dev(ctrl)->bar + off);
    return 0;
}

static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
    .reg_read32     = nvme_pci_reg_read32,
    .reg_write32        = nvme_pci_reg_write32,
    .reg_read64     = nvme_pci_reg_read64,
    .io_incapable       = nvme_pci_io_incapable,
    .reset_ctrl     = nvme_pci_reset_ctrl,
    .free_ctrl      = nvme_pci_free_ctrl,
};
           

然後通過讀取狀态寄存器NVME_REG_CSTS來等待裝置真正停止。逾時上限是根據CAP寄存器的Timeout域來計算出來的,每個機關代表500ms。

static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
{
    unsigned long timeout =
        ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
    u32 csts, bit = enabled ? NVME_CSTS_RDY : 0;
    int ret;

    while ((ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts)) == 0) {
        if ((csts & NVME_CSTS_RDY) == bit)
            break;

        msleep(100);
        if (fatal_signal_pending(current))
            return -EINTR;
        if (time_after(jiffies, timeout)) {
            dev_err(ctrl->dev,
                "Device not ready; aborting %s\n", enabled ?
                        "initialisation" : "reset");
            return -ENODEV;
        }
    }

    return ret;
}
           
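A quick worked example of that timeout (the CAP.TO value here is made up): if CAP.TO reads 15, nvme_wait_ready polls CSTS every 100 ms for at most

timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) jiffies
        = (15 + 1) * 500 ms = 8 seconds

before giving up with "Device not ready".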

Back in nvme_configure_admin_queue, let's look at nvme_alloc_queue.

static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
                            int depth)
{
    struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
    if (!nvmeq)
        return NULL;

    nvmeq->cqes = dma_zalloc_coherent(dev->dev, CQ_SIZE(depth),
                      &nvmeq->cq_dma_addr, GFP_KERNEL);
    if (!nvmeq->cqes)
        goto free_nvmeq;

    if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth))
        goto free_cqdma;

    nvmeq->q_dmadev = dev->dev;
    nvmeq->dev = dev;
    snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
            dev->ctrl.instance, qid);
    spin_lock_init(&nvmeq->q_lock);
    nvmeq->cq_head = 0;
    nvmeq->cq_phase = 1;
    nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
    nvmeq->q_depth = depth;
    nvmeq->qid = qid;
    nvmeq->cq_vector = -1;
    dev->queues[qid] = nvmeq;

    /* make sure queue descriptor is set before queue count, for kthread */
    mb();
    dev->queue_count++;

    return nvmeq;

 free_cqdma:
    dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes,
                            nvmeq->cq_dma_addr);
 free_nvmeq:
    kfree(nvmeq);
    return NULL;
}
           

nvme_alloc_queue flow analysis:

  • Call dma_zalloc_coherent to allocate DMA-capable memory for the completion queue. nvmeq->cqes is the virtual address of that memory, used by the kernel; nvmeq->cq_dma_addr is its physical (bus) address, used by the DMA engine.
  • nvmeq->irqname is the name used when the interrupt is registered later; from the nvme%dq%d format you can see it produces the nvme0q0 and nvme0q1 we listed earlier, one for the admin queue and one for the I/O queue.
  • Call nvme_alloc_sq_cmds to handle the submission queue: if the controller is NVMe 1.2 or newer and the CMB supports submission queues, place the SQ in the CMB; otherwise allocate it with dma_alloc_coherent just like the completion queue.
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
                int qid, int depth)
{
    if (qid && dev->cmb && use_cmb_sqes && NVME_CMB_SQS(dev->cmbsz)) {
        unsigned offset = (qid - 1) * roundup(SQ_SIZE(depth),
                              dev->ctrl.page_size);
        nvmeq->sq_dma_addr = dev->cmb_dma_addr + offset;
        nvmeq->sq_cmds_io = dev->cmb + offset;
    } else {
        nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
                    &nvmeq->sq_dma_addr, GFP_KERNEL);
        if (!nvmeq->sq_cmds)
            return -ENOMEM;
    }

    return 0;
}
           

Back again in nvme_configure_admin_queue, nvme_enable_ctrl is nothing special; think of it roughly as the inverse of the nvme_disable_ctrl analyzed above.

int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
{
    /*
     * Default to a 4K page size, with the intention to update this
     * path in the future to accomodate architectures with differing
     * kernel and IO page sizes.
     */
    unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12, page_shift = 12;
    int ret;

    if (page_shift < dev_page_min) {
        dev_err(ctrl->dev,
            "Minimum device page size %u too large for host (%u)\n",
             1 << dev_page_min, 1 << page_shift);
        return -ENODEV;
    }

    ctrl->page_size = 1 << page_shift;

    ctrl->ctrl_config = NVME_CC_CSS_NVM;
    ctrl->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
    ctrl->ctrl_config |= NVME_CC_ARB_RR | NVME_CC_SHN_NONE;
    ctrl->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
    ctrl->ctrl_config |= NVME_CC_ENABLE;

    ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
    if (ret)
        return ret;
    return nvme_wait_ready(ctrl, cap, true);
}
           

Back in nvme_configure_admin_queue, the last function is queue_request_irq. Its main job is to install the interrupt handler; by default it does not use threaded interrupt handling but handles the interrupt in interrupt context.

static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
                            const char *name)
{
    if (use_threaded_interrupts)
        return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
                    nvme_irq_check, nvme_irq, IRQF_SHARED,
                    name, nvmeq);
    return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
                IRQF_SHARED, name, nvmeq);
}
           

Returning all the way to nvme_reset_work, next is nvme_init_queue. From the arguments passed to it we can see that it initializes queue 0, the admin queue.

static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
{
    struct nvme_dev *dev = nvmeq->dev;

    spin_lock_irq(&nvmeq->q_lock);
    nvmeq->sq_tail = 0;
    nvmeq->cq_head = 0;
    nvmeq->cq_phase = 1;
    nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
    memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
    dev->online_queues++;
    spin_unlock_irq(&nvmeq->q_lock);
}
           

This function does not do much: it initializes a few members of nvme_queue. q_db points at the doorbell register of this queue (a quick offset check follows).
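As a sanity check of that q_db calculation, the doorbell layout defined by the NVMe spec is as follows (the stride value is assumed to be 0 here):

SQ y Tail doorbell: BAR0 + 0x1000 + (2y)     * (4 << CAP.DSTRD)
CQ y Head doorbell: BAR0 + 0x1000 + (2y + 1) * (4 << CAP.DSTRD)

with CAP.DSTRD = 0 (db_stride = 1):
  admin queue (y = 0): SQ tail @ 0x1000, CQ head @ 0x1004
  I/O queue 1 (y = 1): SQ tail @ 0x1008, CQ head @ 0x100c

which matches dev->dbs[qid * 2 * dev->db_stride], since dev->dbs is a u32 pointer starting at BAR0 + 0x1000.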

Back in nvme_reset_work, on to nvme_alloc_admin_tags.

static int nvme_alloc_admin_tags(struct nvme_dev *dev)
{
    if (!dev->ctrl.admin_q) {
        dev->admin_tagset.ops = &nvme_mq_admin_ops;
        dev->admin_tagset.nr_hw_queues = 1;

        /*
         * Subtract one to leave an empty queue entry for 'Full Queue'
         * condition. See NVM-Express 1.2 specification, section 4.1.2.
         */
        dev->admin_tagset.queue_depth = NVME_AQ_BLKMQ_DEPTH - 1;
        dev->admin_tagset.timeout = ADMIN_TIMEOUT;
        dev->admin_tagset.numa_node = dev_to_node(dev->dev);
        dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
        dev->admin_tagset.driver_data = dev;

        if (blk_mq_alloc_tag_set(&dev->admin_tagset))
            return -ENOMEM;

        dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
        if (IS_ERR(dev->ctrl.admin_q)) {
            blk_mq_free_tag_set(&dev->admin_tagset);
            return -ENOMEM;
        }
        if (!blk_get_queue(dev->ctrl.admin_q)) {
            nvme_dev_remove_admin(dev);
            dev->ctrl.admin_q = NULL;
            return -ENODEV;
        }
    } else
        blk_mq_start_stopped_hw_queues(dev->ctrl.admin_q, true);

    return 0;
}
           

nvme_alloc_admin_tags flow analysis:

  • blk_mq_alloc_tag_set allocates a tag set that will be associated with the request_queue and initializes queue_depth (254) requests (index 0-253). The init function is nvme_admin_init_request, supplied via nvme_mq_admin_ops.
static struct blk_mq_ops nvme_mq_admin_ops = {
    .queue_rq   = nvme_queue_rq,
    .complete   = nvme_complete_rq,
    .map_queue  = blk_mq_map_queue,
    .init_hctx  = nvme_admin_init_hctx,
    .exit_hctx      = nvme_admin_exit_hctx,
    .init_request   = nvme_admin_init_request,
    .timeout    = nvme_timeout,
};

static int nvme_admin_init_request(void *data, struct request *req,
                unsigned int hctx_idx, unsigned int rq_idx,
                unsigned int numa_node)
{
    struct nvme_dev *dev = data;
    struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
    struct nvme_queue *nvmeq = dev->queues[0];

    BUG_ON(!nvmeq);
    iod->nvmeq = nvmeq;
    return 0;
}
           
  • blk_mq_init_queue initializes a request_queue and assigns it to dev->ctrl.admin_q. It calls nvme_admin_init_hctx, and it also ends up calling nvme_admin_init_request to initialize the request with index 254, which is a bit curious.
static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                unsigned int hctx_idx)
{
    struct nvme_dev *dev = data;
    struct nvme_queue *nvmeq = dev->queues[0];

    WARN_ON(hctx_idx != 0);
    WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
    WARN_ON(nvmeq->tags);

    hctx->driver_data = nvmeq;
    nvmeq->tags = &dev->admin_tagset.tags[0];
    return 0;
}
           

Back in nvme_reset_work, next up is nvme_init_identify.

int nvme_init_identify(struct nvme_ctrl *ctrl)
{
    struct nvme_id_ctrl *id;
    u64 cap;
    int ret, page_shift;

    ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
    if (ret) {
        dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
        return ret;
    }

    ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
    if (ret) {
        dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
        return ret;
    }
    page_shift = NVME_CAP_MPSMIN(cap) + 12;

    if (ctrl->vs >= NVME_VS(1, 1))
        ctrl->subsystem = NVME_CAP_NSSRC(cap);

    ret = nvme_identify_ctrl(ctrl, &id);
    if (ret) {
        dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
        return -EIO;
    }

    ctrl->oncs = le16_to_cpup(&id->oncs);
    atomic_set(&ctrl->abort_limit, id->acl + 1);
    ctrl->vwc = id->vwc;
    memcpy(ctrl->serial, id->sn, sizeof(id->sn));
    memcpy(ctrl->model, id->mn, sizeof(id->mn));
    memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
    if (id->mdts)
        ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
    else
        ctrl->max_hw_sectors = UINT_MAX;

    if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
        unsigned int max_hw_sectors;

        ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
        max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
        if (ctrl->max_hw_sectors) {
            ctrl->max_hw_sectors = min(max_hw_sectors,
                            ctrl->max_hw_sectors);
        } else {
            ctrl->max_hw_sectors = max_hw_sectors;
        }
    }

    nvme_set_queue_limits(ctrl, ctrl->admin_q);

    kfree(id);
    return 0;
}
           

nvme_init_identify flow analysis:

  • Call nvme_identify_ctrl.
  • Call nvme_set_queue_limits.

Let's look at nvme_identify_ctrl first.

int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
{
    struct nvme_command c = { };
    int error;

    /* gcc-4.4.4 (at least) has issues with initializers and anon unions */
    c.identify.opcode = nvme_admin_identify;
    c.identify.cns = cpu_to_le32(1);

    *id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
    if (!*id)
        return -ENOMEM;

    error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
            sizeof(struct nvme_id_ctrl));
    if (error)
        kfree(*id);
    return error;
}
           

nvme_identify_ctrl flow analysis:

  • Build a command with opcode nvme_admin_identify (0x6). Helpers such as cpu_to_le32 show up here because the NVMe spec defines its structures as little-endian, while the host may be little-endian x86, big-endian ARM or something else; these helpers convert between host order and little-endian so the code stays portable (a short sketch follows this list).
  • Send the command to the device with nvme_submit_sync_cmd.
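A tiny illustration of what cpu_to_le32 buys us (not from the driver itself): on a little-endian x86 host the conversion compiles away, on a big-endian host it byte-swaps, and either way the bytes that reach the controller are what the spec expects:

    __le32 cns = cpu_to_le32(1);
    /* memory layout on any host: 01 00 00 00 -- the little-endian format the spec defines */
    /* le32_to_cpu() / le16_to_cpu() perform the reverse conversion when parsing completions */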
int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
        void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
{
    struct request *req;
    int ret;

    req = nvme_alloc_request(q, cmd, 0);
    if (IS_ERR(req))
        return PTR_ERR(req);

    req->timeout = timeout ? timeout : ADMIN_TIMEOUT;

    if (buffer && bufflen) {
        ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
        if (ret)
            goto out;
    }

    blk_execute_rq(req->q, NULL, req, 0);
    if (result)
        *result = (u32)(uintptr_t)req->special;
    ret = req->errors;
 out:
    blk_mq_free_request(req);
    return ret;
}
           

__nvme_submit_sync_cmd flow analysis:

  • Call nvme_alloc_request.
  • Call blk_rq_map_kern.
  • Call blk_execute_rq, which ends up invoking the function behind the queue_rq pointer, nvme_queue_rq.

First nvme_alloc_request: blk_mq_alloc_request takes a request from the request_queue, and then a number of its fields are initialized. These fields matter a lot; many of them directly determine which code paths are taken later.

struct request *nvme_alloc_request(struct request_queue *q,
        struct nvme_command *cmd, unsigned int flags)
{
    bool write = cmd->common.opcode & 1;
    struct request *req;

    req = blk_mq_alloc_request(q, write, flags);
    if (IS_ERR(req))
        return req;

    req->cmd_type = REQ_TYPE_DRV_PRIV;
    req->cmd_flags |= REQ_FAILFAST_DRIVER;
    req->__data_len = 0;
    req->__sector = (sector_t) -1;
    req->bio = req->biotail = NULL;

    req->cmd = (unsigned char *)cmd;
    req->cmd_len = sizeof(struct nvme_command);
    req->special = (void *)0;

    return req;
}
           

If the buffer and bufflen passed to __nvme_submit_sync_cmd are both non-empty, blk_rq_map_kern is executed. nvme_alloc_request had set req->__data_len to 0, but once blk_rq_map_kern has run, req->__data_len becomes non-zero, namely the size of the mapped region. In my tests it appears to be page (4096-byte) granular.

Now for nvme_queue_rq. This function is very important: it performs the final submission of the command.

static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
             const struct blk_mq_queue_data *bd)
{
    struct nvme_ns *ns = hctx->queue->queuedata;
    struct nvme_queue *nvmeq = hctx->driver_data;
    struct nvme_dev *dev = nvmeq->dev;
    struct request *req = bd->rq;
    struct nvme_command cmnd;
    int ret = BLK_MQ_RQ_QUEUE_OK;

    /*
     * If formated with metadata, require the block layer provide a buffer
     * unless this namespace is formated such that the metadata can be
     * stripped/generated by the controller with PRACT=1.
     */
    if (ns && ns->ms && !blk_integrity_rq(req)) {
        if (!(ns->pi_type && ns->ms == 8) &&
                    req->cmd_type != REQ_TYPE_DRV_PRIV) {
            blk_mq_end_request(req, -EFAULT);
            return BLK_MQ_RQ_QUEUE_OK;
        }
    }

    ret = nvme_init_iod(req, dev);
    if (ret)
        return ret;

    if (req->cmd_flags & REQ_DISCARD) {
        ret = nvme_setup_discard(nvmeq, ns, req, &cmnd);
    } else {
        if (req->cmd_type == REQ_TYPE_DRV_PRIV)
            memcpy(&cmnd, req->cmd, sizeof(cmnd));
        else if (req->cmd_flags & REQ_FLUSH)
            nvme_setup_flush(ns, &cmnd);
        else
            nvme_setup_rw(ns, req, &cmnd);

        if (req->nr_phys_segments)
            ret = nvme_map_data(dev, req, &cmnd);
    }

    if (ret)
        goto out;

    cmnd.common.command_id = req->tag;
    blk_mq_start_request(req);

    spin_lock_irq(&nvmeq->q_lock);
    if (unlikely(nvmeq->cq_vector < 0)) {
        if (ns && !test_bit(NVME_NS_DEAD, &ns->flags))
            ret = BLK_MQ_RQ_QUEUE_BUSY;
        else
            ret = BLK_MQ_RQ_QUEUE_ERROR;
        spin_unlock_irq(&nvmeq->q_lock);
        goto out;
    }
    __nvme_submit_cmd(nvmeq, &cmnd);
    nvme_process_cq(nvmeq);
    spin_unlock_irq(&nvmeq->q_lock);
    return BLK_MQ_RQ_QUEUE_OK;
out:
    nvme_free_iod(dev, req);
    return ret;
}
           

nvme_queue_rq flow analysis:

  • Call nvme_init_iod.
  • Call nvme_map_data.
  • Call blk_mq_start_request.
  • Call __nvme_submit_cmd.
  • Call nvme_process_cq.
static int nvme_init_iod(struct request *rq, struct nvme_dev *dev)
{
    struct nvme_iod *iod = blk_mq_rq_to_pdu(rq);
    int nseg = rq->nr_phys_segments;
    unsigned size;

    if (rq->cmd_flags & REQ_DISCARD)
        size = sizeof(struct nvme_dsm_range);
    else
        size = blk_rq_bytes(rq);

    if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
        iod->sg = kmalloc(nvme_iod_alloc_size(dev, size, nseg), GFP_ATOMIC);
        if (!iod->sg)
            return BLK_MQ_RQ_QUEUE_BUSY;
    } else {
        iod->sg = iod->inline_sg;
    }

    iod->aborted = 0;
    iod->npages = -1;
    iod->nents = 0;
    iod->length = size;
    return ;
}
           

Looking at nvme_init_iod, the very first call, blk_mq_rq_to_pdu, is puzzling at first.

/*
 * Driver command data is immediately after the request. So subtract request
 * size to get back to the original request, add request size to get the PDU.
 */
static inline struct request *blk_mq_rq_from_pdu(void *pdu)
{
    return pdu - sizeof(struct request);
}
static inline void *blk_mq_rq_to_pdu(struct request *rq)
{
    return rq + 1;
}
           

Reading the comment, together with dev->admin_tagset.cmd_size = nvme_cmd_size(dev) from nvme_alloc_admin_tags, it becomes clear: when requests are allocated, each request gets extra space placed right behind it, of the size given by cmd_size. So the PDU sits immediately after the request. And as you can see below, the extra space is more than just struct nvme_iod; it consists of three regions:

Region 1: sizeof(struct nvme_iod)
Region 2: sizeof(struct scatterlist) * nseg
Region 3: sizeof(__le64 *) * nvme_npages(size, dev)
static int nvme_npages(unsigned size, struct nvme_dev *dev)
{
    unsigned nprps = DIV_ROUND_UP(size + dev->ctrl.page_size,
                      dev->ctrl.page_size);
    return DIV_ROUND_UP(8 * nprps, PAGE_SIZE - 8);
}

static unsigned int nvme_iod_alloc_size(struct nvme_dev *dev,
        unsigned int size, unsigned int nseg)
{
    return sizeof(__le64 *) * nvme_npages(size, dev) +
            sizeof(struct scatterlist) * nseg;
}

static unsigned int nvme_cmd_size(struct nvme_dev *dev)
{
    return sizeof(struct nvme_iod) +
        nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
}
           

Because req->cmd_type was set to REQ_TYPE_DRV_PRIV earlier, the command is simply copied in with memcpy. Then, based on:

/*
 * Max size of iod being embedded in the request payload
 */
#define NVME_INT_PAGES      2
#define NVME_INT_BYTES(dev) (NVME_INT_PAGES * (dev)->ctrl.page_size)
           

if the request has more than 2 physical segments to transfer, or its total length exceeds 2 NVMe pages, a separate allocation is made for iod->sg; otherwise iod->sg simply points at the end of struct nvme_iod (region 1), i.e. the start of the scatterlist area (region 2). A worked example follows.
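A worked example with made-up numbers: a 16 KiB request split into 4 segments, with a 4 KiB controller page size, exceeds both limits (nseg = 4 > NVME_INT_PAGES and 16384 > NVME_INT_BYTES(dev) = 8192), so iod->sg is kmalloc'ed:

size = 16384, nseg = 4, page_size = 4096             (invented request)
nprps           = DIV_ROUND_UP(16384 + 4096, 4096) = 5
nvme_npages     = DIV_ROUND_UP(8 * 5, 4096 - 8)    = 1   (one PRP page is enough)
allocation size = 1 * sizeof(__le64 *) + 4 * sizeof(struct scatterlist)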

Back in nvme_queue_rq: if req->nr_phys_segments is non-zero, nvme_map_data is called.

static int nvme_map_data(struct nvme_dev *dev, struct request *req,
        struct nvme_command *cmnd)
{
    struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
    struct request_queue *q = req->q;
    enum dma_data_direction dma_dir = rq_data_dir(req) ?
            DMA_TO_DEVICE : DMA_FROM_DEVICE;
    int ret = BLK_MQ_RQ_QUEUE_ERROR;

    sg_init_table(iod->sg, req->nr_phys_segments);
    iod->nents = blk_rq_map_sg(q, req, iod->sg);
    if (!iod->nents)
        goto out;

    ret = BLK_MQ_RQ_QUEUE_BUSY;
    if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir))
        goto out;

    if (!nvme_setup_prps(dev, req, blk_rq_bytes(req)))
        goto out_unmap;

    ret = BLK_MQ_RQ_QUEUE_ERROR;
    if (blk_integrity_rq(req)) {
        if (blk_rq_count_integrity_sg(q, req->bio) != 1)
            goto out_unmap;

        sg_init_table(&iod->meta_sg, 1);
        if (blk_rq_map_integrity_sg(q, req->bio, &iod->meta_sg) != 1)
            goto out_unmap;

        if (rq_data_dir(req))
            nvme_dif_remap(req, nvme_dif_prep);

        if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir))
            goto out_unmap;
    }

    /* Write the prepared PRP entries into the two PRP fields of the command; see nvme_setup_prps for details */
    cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
    cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
    if (blk_integrity_rq(req))
        cmnd->rw.metadata = cpu_to_le64(sg_dma_address(&iod->meta_sg));
    return BLK_MQ_RQ_QUEUE_OK;

out_unmap:
    dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
out:
    return ret;
}
           

nvme_map_data flow analysis:

  • Call sg_init_table; this initializes the scatterlist stored in region 2.
  • Call blk_rq_map_sg.
  • Call dma_map_sg.
  • Call nvme_setup_prps, a long function without a single comment. It really falls into three parts, matching the three ways PRPs can be set up: a single PRP entry covers the whole transfer, two PRP entries are enough, or a full PRP list is needed. I have added comments to the code below to explain.
static bool nvme_setup_prps(struct nvme_dev *dev, struct request *req,
        int total_len)
{
    struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
    struct dma_pool *pool;
    int length = total_len; /* total number of bytes the PRPs have to cover */
    struct scatterlist *sg = iod->sg;
    int dma_len = sg_dma_len(sg); /* length of the data described by the first sg */
    u64 dma_addr = sg_dma_address(sg); /* physical address of the data described by the first sg */
    u32 page_size = dev->ctrl.page_size; /* the NVMe page size */
    int offset = dma_addr & (page_size - 1); /* offset of that address within an NVMe page */
    __le64 *prp_list;
    __le64 **list = iod_list(req);
    dma_addr_t prp_dma;
    int nprps, i;

    /* Case 1: a single PRP entry describes the whole transfer */
    length -= (page_size - offset);
    if (length <= 0)
        return true;

    dma_len -= (page_size - offset);
    if (dma_len) {
        dma_addr += (page_size - offset);
    } else {
        /* dma_len is 0: the data of the first sg fits entirely within one
           NVMe page and is covered by a single PRP, so move on to the next sg. */
        sg = sg_next(sg);
        dma_addr = sg_dma_address(sg);
        dma_len = sg_dma_len(sg);
    }

    /* Case 2: two PRP entries are enough to describe the whole transfer */
    if (length <= page_size) {
        iod->first_dma = dma_addr;
        return true;
    }

    /* Case 3: a PRP list is needed. Each PRP entry is 8 bytes, so pick the
       DMA pool according to how many entries are required */
    nprps = DIV_ROUND_UP(length, page_size);
    if (nprps <= (256 / 8)) {
        pool = dev->prp_small_pool;
        iod->npages = 0;
    } else {
        pool = dev->prp_page_pool;
        iod->npages = 1;
    }

    prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
    if (!prp_list) {
        iod->first_dma = dma_addr;
        iod->npages = -1;
        return false;
    }
    list[0] = prp_list;
    iod->first_dma = prp_dma;
    i = 0;
    for (;;) {
        /* the current PRP page is full: take another one from the pool
           and chain it to hold the remaining PRP entries */
        if (i == page_size >> 3) {
            __le64 *old_prp_list = prp_list;
            prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
            if (!prp_list)
                return false;
            list[iod->npages++] = prp_list;
            prp_list[0] = old_prp_list[i - 1];
            old_prp_list[i - 1] = cpu_to_le64(prp_dma);
            i = 1;
        }
        /* fill in the PRP list entries */
        prp_list[i++] = cpu_to_le64(dma_addr);
        dma_len -= page_size;
        dma_addr += page_size;
        length -= page_size;
        if (length <= 0)
            break;
        if (dma_len > 0)
            continue;
        BUG_ON(dma_len < 0);
        sg = sg_next(sg);
        dma_addr = sg_dma_address(sg);
        dma_len = sg_dma_len(sg);
    }

    return true;
}
           

Back in nvme_queue_rq, __nvme_submit_cmd is called. It is simple but important: it copies the command into the submission queue and then writes the index of the newest command into the doorbell register to tell the device to go process it.

static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
                        struct nvme_command *cmd)
{
    u16 tail = nvmeq->sq_tail;

    if (nvmeq->sq_cmds_io)
        memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
    else
        memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

    if (++tail == nvmeq->q_depth)
        tail = 0;
    writel(tail, nvmeq->q_db);
    nvmeq->sq_tail = tail;
}
           

Finally, the last function called in nvme_queue_rq is nvme_process_cq, a thin wrapper around __nvme_process_cq; as the name suggests, it handles the completion queue.

static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
{
    u16 head, phase;

    head = nvmeq->cq_head;
    phase = nvmeq->cq_phase;

    for (;;) {
        struct nvme_completion cqe = nvmeq->cqes[head];
        u16 status = le16_to_cpu(cqe.status);
        struct request *req;

        if ((status & 1) != phase)
            break;
        nvmeq->sq_head = le16_to_cpu(cqe.sq_head);
        if (++head == nvmeq->q_depth) {
            head = 0;
            phase = !phase;
        }

        if (tag && *tag == cqe.command_id)
            *tag = -1;

        if (unlikely(cqe.command_id >= nvmeq->q_depth)) {
            dev_warn(nvmeq->q_dmadev,
                "invalid id %d completed on queue %d\n",
                cqe.command_id, le16_to_cpu(cqe.sq_id));
            continue;
        }

        /*
         * AEN requests are special as they don't time out and can
         * survive any kind of queue freeze and often don't respond to
         * aborts.  We don't even bother to allocate a struct request
         * for them but rather special case them here.
         */
        if (unlikely(nvmeq->qid == 0 &&
                cqe.command_id >= NVME_AQ_BLKMQ_DEPTH)) {
            nvme_complete_async_event(nvmeq->dev, &cqe);
            continue;
        }

        req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
        if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
            u32 result = le32_to_cpu(cqe.result);
            req->special = (void *)(uintptr_t)result;
        }
        blk_mq_complete_request(req, status >> 1);

    }

    /* If the controller ignores the cq head doorbell and continuously
     * writes to the queue, it is theoretically possible to wrap around
     * the queue twice and mistakenly return IRQ_NONE.  Linux only
     * requires that 0.1% of your interrupts are handled, so this isn't
     * a big problem.
     */
    if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
        return;

    if (likely(nvmeq->cq_vector >= 0))
        writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
    nvmeq->cq_head = head;
    nvmeq->cq_phase = phase;

    nvmeq->cqe_seen = 1;
}
           

First, the completion entry format defined by the NVMe spec: 16 bytes per entry.

[Figure: NVMe completion queue entry layout, 16 bytes per entry, not reproduced here]

Recall that both the admin queue and the I/O queues consist of submission queues and a completion queue, and that each kind of queue is managed with head and tail indices. The host adds new work to a submission queue by advancing its tail; the device takes entries off the submission queue and advances the head. But how does that head value get back to the host? The answer is the SQ Head Pointer field of the completion entry. For the completion queue it is the other way round: the device appends entries and advances the tail. How does the host learn that tail? That uses another mechanism, the P (Phase Tag) bit in the Status Field.

[Figure: completion entry Status Field, including the Phase Tag (P) bit, not reproduced here]

The P bit, in full the Phase Tag, is initialized to 0 everywhere when the completion queue is created. On its first pass through the queue the device sets the Phase Tag of every entry it writes to 1, so the host can tell from the Phase Tag how many new entries have been added. When the device reaches the bottom of the queue it wraps around to index 0, and on this second pass it writes the Phase Tag as 0 again, and so on: a pass of 1s, a pass of 0s, another pass of 1s... Finally, once the host has consumed completions, it updates the completion queue head and tells the device through the doorbell. A tiny walk-through follows.
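A tiny walk-through (queue depth 4, invented completions) of how the host tells fresh entries from stale ones:

queue depth = 4, all P bits start at 0, host cq_phase starts at 1

device posts slots 0,1 (1st pass):  P = 1 1 0 0   host consumes 0,1 (P == cq_phase), cq_head = 2
device posts slots 2,3:             P = 1 1 1 1   host consumes 2,3 and wraps: cq_head = 0, cq_phase = 0
device posts slot 0 (2nd pass):     P = 0 1 1 1   slot 0 now matches cq_phase == 0, so it is consumed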

That covers the main mechanism of __nvme_process_cq, but I am a bit puzzled about why it is called at the end of nvme_queue_rq at all: even without this explicit call, the device will raise an interrupt (INTx/MSI) after processing a submission, and the host's interrupt handler calls __nvme_process_cq anyway.

static irqreturn_t nvme_irq(int irq, void *data)
{
    irqreturn_t result;
    struct nvme_queue *nvmeq = data;
    spin_lock(&nvmeq->q_lock);
    nvme_process_cq(nvmeq);
    result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
    nvmeq->cqe_seen = 0;
    spin_unlock(&nvmeq->q_lock);
    return result;
}
           

For every request it has finished with, __nvme_process_cq calls blk_mq_complete_request, which in turn triggers the nvme_complete_rq we registered earlier.

static void nvme_complete_rq(struct request *req)
{
    struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
    struct nvme_dev *dev = iod->nvmeq->dev;
    int error = 0;

    nvme_unmap_data(dev, req);

    if (unlikely(req->errors)) {
        if (nvme_req_needs_retry(req, req->errors)) {
            nvme_requeue_req(req);
            return;
        }

        if (req->cmd_type == REQ_TYPE_DRV_PRIV)
            error = req->errors;
        else
            error = nvme_error_status(req->errors);
    }

    if (unlikely(iod->aborted)) {
        dev_warn(dev->dev,
            "completing aborted command with status: %04x\n",
            req->errors);
    }

    blk_mq_end_request(req, error);
}
           

After that long detour, nvme_identify_ctrl is finally done. Going back up to nvme_init_identify, there is essentially nothing left to do there, so back up to nvme_reset_work to see what nvme_setup_io_queues does after nvme_init_identify.

static int nvme_setup_io_queues(struct nvme_dev *dev)
{
    struct nvme_queue *adminq = dev->queues[];
    struct pci_dev *pdev = to_pci_dev(dev->dev);
    int result, i, vecs, nr_io_queues, size;

    nr_io_queues = num_possible_cpus();
    result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
    if (result < 0)
        return result;

    /*
     * Degraded controllers might return an error when setting the queue
     * count.  We still want to be able to bring them online and offer
     * access to the admin queue, as that might be only way to fix them up.
     */
    if (result > 0) {
        dev_err(dev->dev, "Could not set queue count (%d)\n", result);
        nr_io_queues = 0;
        result = 0;
    }

    if (dev->cmb && NVME_CMB_SQS(dev->cmbsz)) {
        result = nvme_cmb_qdepth(dev, nr_io_queues,
                sizeof(struct nvme_command));
        if (result > 0)
            dev->q_depth = result;
        else
            nvme_release_cmb(dev);
    }

    size = db_bar_size(dev, nr_io_queues);
    if (size > 8192) {
        iounmap(dev->bar);
        do {
            dev->bar = ioremap(pci_resource_start(pdev, 0), size);
            if (dev->bar)
                break;
            if (!--nr_io_queues)
                return -ENOMEM;
            size = db_bar_size(dev, nr_io_queues);
        } while (1);
        dev->dbs = dev->bar + 4096;
        adminq->q_db = dev->dbs;
    }

    /* Deregister the admin queue's interrupt */
    free_irq(dev->entry[0].vector, adminq);

    /*
     * If we enable msix early due to not intx, disable it again before
     * setting up the full range we need.
     */
    if (!pdev->irq)
        pci_disable_msix(pdev);

    for (i = 0; i < nr_io_queues; i++)
        dev->entry[i].entry = i;
    vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
    if (vecs < 0) {
        vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
        if (vecs < 0) {
            vecs = 1;
        } else {
            for (i = 0; i < vecs; i++)
                dev->entry[i].vector = i + pdev->irq;
        }
    }

    /*
     * Should investigate if there's a performance win from allocating
     * more queues than interrupt vectors; it might allow the submission
     * path to scale better, even if the receive path is limited by the
     * number of interrupts.
     */
    nr_io_queues = vecs;
    dev->max_qid = nr_io_queues;

    result = queue_request_irq(dev, adminq, adminq->irqname);
    if (result) {
        adminq->cq_vector = -1;
        goto free_queues;
    }

    /* Free previously allocated queues that are no longer usable */
    nvme_free_queues(dev, nr_io_queues + 1);
    return nvme_create_io_queues(dev);

 free_queues:
    nvme_free_queues(dev, 1);
    return result;
}
           

nvme_setup_io_queues flow analysis:

  • Call nvme_set_queue_count.
  • Call nvme_create_io_queues.

nvme_set_queue_count sends a Set Features command with feature id 0x7 to set the number of I/O queues.

int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
    u32 q_count = (*count - 1) | ((*count - 1) << 16);
    u32 result;
    int status, nr_io_queues;

    status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
            &result);
    if (status)
        return status;

    nr_io_queues = min(result & 0xffff, result >> 16) + 1;
    *count = min(*count, nr_io_queues);
    return 0;
}
           

The requested queue count goes into dword 11 of the Set Features command (a worked example of the encoding follows the code below).

int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
                    dma_addr_t dma_addr, u32 *result)
{
    struct nvme_command c;

    memset(&c, 0, sizeof(c));
    c.features.opcode = nvme_admin_set_features;
    c.features.prp1 = cpu_to_le64(dma_addr);
    c.features.fid = cpu_to_le32(fid);
    c.features.dword11 = cpu_to_le32(dword11);

    return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
}
           
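A worked example of the dword-11 encoding used above (the controller's answer is invented): both halves are zero-based counts, the low 16 bits for submission queues and the high 16 bits for completion queues:

requested *count = 1:  q_count = (1 - 1) | ((1 - 1) << 16) = 0x00000000
controller result      = 0x00010001   (invented: 2 SQs and 2 CQs granted)
nr_io_queues           = min(0x0001, 0x0001) + 1 = 2
*count                 = min(1, 2) = 1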

Back in nvme_setup_io_queues, on to nvme_create_io_queues.

static int nvme_create_io_queues(struct nvme_dev *dev)
{
    unsigned i;
    int ret = 0;

    for (i = dev->queue_count; i <= dev->max_qid; i++) {
        if (!nvme_alloc_queue(dev, i, dev->q_depth)) {
            ret = -ENOMEM;
            break;
        }
    }

    for (i = dev->online_queues; i <= dev->queue_count - 1; i++) {
        ret = nvme_create_queue(dev->queues[i], i);
        if (ret) {
            nvme_free_queues(dev, i);
            break;
        }
    }

    /*
     * Ignore failing Create SQ/CQ commands, we can continue with less
     * than the desired aount of queues, and even a controller without
     * I/O queues an still be used to issue admin commands.  This might
     * be useful to upgrade a buggy firmware for example.
     */
    return ret >= 0 ? 0 : ret;
}
           

nvme_alloc_queue was analyzed earlier, so no need to repeat it; the interesting part is nvme_create_queue.

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
    struct nvme_dev *dev = nvmeq->dev;
    int result;

    nvmeq->cq_vector = qid - 1;
    result = adapter_alloc_cq(dev, qid, nvmeq);
    if (result < 0)
        return result;

    result = adapter_alloc_sq(dev, qid, nvmeq);
    if (result < 0)
        goto release_cq;

    result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
    if (result < 0)
        goto release_sq;

    nvme_init_queue(nvmeq, qid);
    return result;

 release_sq:
    adapter_delete_sq(dev, qid);
 release_cq:
    adapter_delete_cq(dev, qid);
    return result;
}
           

adapter_alloc_cq sends a Create I/O Completion Queue command, opcode 0x5.

static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
                        struct nvme_queue *nvmeq)
{
    struct nvme_command c;
    int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;

    /*
     * Note: we (ab)use the fact the the prp fields survive if no data
     * is attached to the request.
     */
    memset(&c, 0, sizeof(c));
    c.create_cq.opcode = nvme_admin_create_cq;
    c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr);
    c.create_cq.cqid = cpu_to_le16(qid);
    c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
    c.create_cq.cq_flags = cpu_to_le16(flags);
    c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);

    return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}
           

adapter_alloc_sq sends a Create I/O Submission Queue command, opcode 0x1.

static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
                        struct nvme_queue *nvmeq)
{
    struct nvme_command c;
    int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;

    /*
     * Note: we (ab)use the fact the the prp fields survive if no data
     * is attached to the request.
     */
    memset(&c, 0, sizeof(c));
    c.create_sq.opcode = nvme_admin_create_sq;
    c.create_sq.prp1 = cpu_to_le64(nvmeq->sq_dma_addr);
    c.create_sq.sqid = cpu_to_le16(qid);
    c.create_sq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
    c.create_sq.sq_flags = cpu_to_le16(flags);
    c.create_sq.cqid = cpu_to_le16(qid);

    return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}
           

Finally the interrupt is registered; that is the nvme0q1 we listed earlier.

Back in nvme_reset_work once more, nvme_dev_list_add mainly starts a kernel thread, nvme_kthread; we will analyze that function later.

static int nvme_dev_list_add(struct nvme_dev *dev)
{
    bool start_thread = false;

    spin_lock(&dev_list_lock);
    if (list_empty(&dev_list) && IS_ERR_OR_NULL(nvme_thread)) {
        start_thread = true;
        nvme_thread = NULL;
    }
    list_add(&dev->node, &dev_list);
    spin_unlock(&dev_list_lock);

    if (start_thread) {
        nvme_thread = kthread_run(nvme_kthread, NULL, "nvme");
        wake_up_all(&nvme_kthread_wait);
    } else
        wait_event_killable(nvme_kthread_wait, nvme_thread);

    if (IS_ERR_OR_NULL(nvme_thread))
        return nvme_thread ? PTR_ERR(nvme_thread) : -EINTR;

    return 0;
}
           

Continuing in nvme_reset_work, next is nvme_start_queues.

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
    struct nvme_ns *ns;

    mutex_lock(&ctrl->namespaces_mutex);
    list_for_each_entry(ns, &ctrl->namespaces, list) {
        queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, ns->queue);
        blk_mq_start_stopped_hw_queues(ns->queue, true);
        blk_mq_kick_requeue_list(ns->queue);
    }
    mutex_unlock(&ctrl->namespaces_mutex);
}
           

Still in nvme_reset_work, the last function is nvme_dev_add. blk_mq_alloc_tag_set was already called once to allocate the tag set for the admin queue; this time it allocates the tag set for the I/O queues.

static int nvme_dev_add(struct nvme_dev *dev)
{
    if (!dev->ctrl.tagset) {
        dev->tagset.ops = &nvme_mq_ops;
        dev->tagset.nr_hw_queues = dev->online_queues - 1;
        dev->tagset.timeout = NVME_IO_TIMEOUT;
        dev->tagset.numa_node = dev_to_node(dev->dev);
        dev->tagset.queue_depth =
                min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
        dev->tagset.cmd_size = nvme_cmd_size(dev);
        dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
        dev->tagset.driver_data = dev;

        if (blk_mq_alloc_tag_set(&dev->tagset))
            return ;
        dev->ctrl.tagset = &dev->tagset;
    }
    nvme_queue_scan(dev);
    return ;
}
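
To make the tag set/request queue relationship clearer, here is a condensed, self-contained sketch of the usual blk-mq pairing (illustrative only, not the driver's code): a driver fills in one blk_mq_tag_set, and every disk created afterwards gets its own request queue from that set, which is exactly what nvme_alloc_ns does later via blk_mq_init_queue(ctrl->tagset).

#include <linux/blk-mq.h>

static struct request_queue *example_make_queue(struct blk_mq_tag_set *set,
                                                const struct blk_mq_ops *ops,
                                                void *driver_data)
{
    memset(set, 0, sizeof(*set));
    set->ops          = ops;                   /* ->queue_rq() and friends */
    set->nr_hw_queues = 1;                     /* nvme uses one per online I/O queue */
    set->queue_depth  = 64;                    /* arbitrary example depth */
    set->numa_node    = NUMA_NO_NODE;
    set->cmd_size     = 0;                     /* per-request driver payload size */
    set->flags        = BLK_MQ_F_SHOULD_MERGE;
    set->driver_data  = driver_data;

    if (blk_mq_alloc_tag_set(set))             /* allocates tags for all hw queues */
        return NULL;

    /* one request_queue per disk; returns an ERR_PTR on failure */
    return blk_mq_init_queue(set);
}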
           

Finally, nvme_queue_scan is called, which schedules yet another work item, the function nvme_dev_scan. How scan_work gets wired up to nvme_dev_scan in the first place is sketched after the function below.

static void nvme_queue_scan(struct nvme_dev *dev)
{
    /*
     * Do not queue new scan work when a controller is reset during
     * removal.
     */
    if (test_bit(NVME_CTRL_REMOVING, &dev->flags))
        return;
    queue_work(nvme_workq, &dev->scan_work);
}
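
For context, scan_work and reset_work are ordinary work_struct members of struct nvme_dev that were bound to their handlers back at probe time, so queueing one of them on nvme_workq is all it takes to get the corresponding function to run. Roughly (a sketch; check nvme_probe in pci.c for the real wiring):

    /* done once at probe time: bind the work items to their handlers */
    INIT_WORK(&dev->scan_work, nvme_dev_scan);
    INIT_WORK(&dev->reset_work, nvme_reset_work);

    /* later, from anywhere in the driver: run nvme_dev_scan on the "nvme" workqueue */
    queue_work(nvme_workq, &dev->scan_work);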
           

With that, nvme_reset_work is completely finished. Let's move on to the next work item and conquer the next hill!

nvme_dev_scan - yet another lengthy work

static void nvme_dev_scan(struct work_struct *work)
{
    struct nvme_dev *dev = container_of(work, struct nvme_dev, scan_work);

    if (!dev->tagset.tags)
        return;
    nvme_scan_namespaces(&dev->ctrl);
    nvme_set_irq_hints(dev);
}
           

It then calls nvme_scan_namespaces.

void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
{
    struct nvme_id_ctrl *id;
    unsigned nn;

    if (nvme_identify_ctrl(ctrl, &id))
        return;

    mutex_lock(&ctrl->namespaces_mutex);
    nn = le32_to_cpu(id->nn);
    if (ctrl->vs >= NVME_VS(1, 1) &&
        !(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
        if (!nvme_scan_ns_list(ctrl, nn))
            goto done;
    }
    __nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
 done:
    list_sort(NULL, &ctrl->namespaces, ns_cmp);
    mutex_unlock(&ctrl->namespaces_mutex);
    kfree(id);
}
           

nvme_identify_ctrl is called first to send an Identify command to the device, and the namespace count it returns is then passed to __nvme_scan_namespaces.
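
For reference, Identify Controller is admin opcode 0x6 with CNS set to 1, and the 4KB structure it returns carries the nn (number of namespaces) field used above. A condensed sketch of what nvme_identify_ctrl boils down to (not a verbatim copy of core.c):

static int example_identify_ctrl(struct nvme_ctrl *ctrl, struct nvme_id_ctrl **id)
{
    struct nvme_command c = { };

    c.identify.opcode = nvme_admin_identify;    /* 0x6 */
    c.identify.cns = cpu_to_le32(1);            /* CNS 1 = identify controller */

    *id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
    if (!*id)
        return -ENOMEM;

    /* synchronous admin command; the id structure is the data buffer */
    return nvme_submit_sync_cmd(ctrl->admin_q, &c, *id,
                                sizeof(struct nvme_id_ctrl));
}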

static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
{
    struct nvme_ns *ns, *next;
    unsigned i;

    lockdep_assert_held(&ctrl->namespaces_mutex);

    for (i = 1; i <= nn; i++)
        nvme_validate_ns(ctrl, i);

    list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
        if (ns->ns_id > nn)
            nvme_ns_remove(ns);
    }
}
           

nvme_validate_ns is called once for each namespace. The NVMe spec defines namespace ID 0 as "namespace not used" and 0xFFFFFFFF as a broadcast value matching any namespace, so valid namespace IDs start from 1.
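
In other words, only the IDs 1 through nn name a concrete namespace. A tiny illustrative helper (hypothetical, not part of the driver) makes the rule explicit:

#include <linux/types.h>

/* hypothetical helper: does nsid refer to one specific namespace? */
static bool example_nsid_is_specific(u32 nsid, u32 nn)
{
    return nsid >= 1 && nsid <= nn && nsid != 0xffffffff;
}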

static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
    struct nvme_ns *ns;

    ns = nvme_find_ns(ctrl, nsid);
    if (ns) {
        if (revalidate_disk(ns->disk))
            nvme_ns_remove(ns);
    } else
        nvme_alloc_ns(ctrl, nsid);
}
           

First it checks whether the namespace already exists. During initialization no namespace has been created yet, so this is guaranteed to return NULL.

static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
    struct nvme_ns *ns;

    lockdep_assert_held(&ctrl->namespaces_mutex);

    list_for_each_entry(ns, &ctrl->namespaces, list) {
        if (ns->ns_id == nsid)
            return ns;
        if (ns->ns_id > nsid)
            break;
    }
    return NULL;
}
           

Since the namespace cannot be found, it has to be created.

static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
    struct nvme_ns *ns;
    struct gendisk *disk;
    int node = dev_to_node(ctrl->dev);

    lockdep_assert_held(&ctrl->namespaces_mutex);

    /* allocate the nvme_ns structure */
    ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
    if (!ns)
        return;

    /* allocate an index number for this ns; the ida mechanism was covered earlier */
    ns->instance = ida_simple_get(&ctrl->ns_ida, 1, 0, GFP_KERNEL);
    if (ns->instance < 0)
        goto out_free_ns;

    /* create a request queue */
    ns->queue = blk_mq_init_queue(ctrl->tagset);
    if (IS_ERR(ns->queue))
        goto out_release_instance;
    queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
    ns->queue->queuedata = ns;
    ns->ctrl = ctrl;

    /* allocate a gendisk */
    disk = alloc_disk_node(0, node);
    if (!disk)
        goto out_free_queue;

    kref_init(&ns->kref);
    ns->ns_id = nsid;
    ns->disk = disk;
    ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */

    blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
    nvme_set_queue_limits(ctrl, ns->queue);

    /* initialize the gendisk */
    disk->major = nvme_major;
    disk->first_minor = 0;
    disk->fops = &nvme_fops;
    disk->private_data = ns;
    disk->queue = ns->queue;
    disk->driverfs_dev = ctrl->device;
    disk->flags = GENHD_FL_EXT_DEVT;
    /* this becomes nvme0n1 under /dev */
    sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, ns->instance);

    if (nvme_revalidate_disk(ns->disk))
        goto out_free_disk;

    /* 把這個建立的namespace加入到ctrl->namespaces連結清單裡去 */
    list_add_tail(&ns->list, &ctrl->namespaces);
    kref_get(&ctrl->kref);
    if (ns->type == NVME_NS_LIGHTNVM)
        return;

    /* alloc_disk_node only creates the disk; add_disk is what actually adds it to
    the system so it can be operated on. What gets created here is /dev/nvme0n1,
    but why is its major number 259? add_disk calls a function blk_alloc_devt, which
    contains logic deciding whether to register the device with the major number we
    set ourselves or with BLOCK_EXT_MAJOR. Why it works this way I have not fully
    figured out yet; if anyone knows, please enlighten me (see also the note right
    after this function). */
    add_disk(ns->disk);
    if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
                    &nvme_ns_attr_group))
        pr_warn("%s: failed to create sysfs group for identification\n",
            ns->disk->disk_name);
    return;
 out_free_disk:
    kfree(disk);
 out_free_queue:
    blk_cleanup_queue(ns->queue);
 out_release_instance:
    ida_simple_remove(&ctrl->ns_ida, ns->instance);
 out_free_ns:
    kfree(ns);
}
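
About the major-number-259 question in the comment above: as far as I can tell (worth double-checking against blk_alloc_devt in block/genhd.c), it follows from alloc_disk_node(0, node). With minors set to 0, the "consecutive minor range" branch can never be taken, so the disk always receives a dynamically allocated extended dev_t, and the major of that extended range, BLOCK_EXT_MAJOR, is defined as 259. The decisive check looks roughly like this (paraphrased, not a verbatim copy):

    /* inside blk_alloc_devt() */
    if (part->partno < disk->minors) {
        /* nvme passed minors == 0 to alloc_disk_node(), so this is never taken */
        *devt = MKDEV(disk->major, disk->first_minor + part->partno);
        return 0;
    }

    /* otherwise an index is allocated from the extended range:
     * major BLOCK_EXT_MAJOR (259), minor chosen dynamically */
    *devt = MKDEV(BLOCK_EXT_MAJOR, blk_mangle_minor(idx));
    return 0;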
           

There is one important function inside nvme_alloc_ns, nvme_revalidate_disk, that we need to step into.

static int nvme_revalidate_disk(struct gendisk *disk)
{
    struct nvme_ns *ns = disk->private_data;
    struct nvme_id_ns *id;
    u8 lbaf, pi_type;
    u16 old_ms;
    unsigned short bs;

    if (test_bit(NVME_NS_DEAD, &ns->flags)) {
        set_capacity(disk, 0);
        return -ENODEV;
    }
    if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
        dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
                __func__, ns->ctrl->instance, ns->ns_id);
        return -ENODEV;
    }
    if (id->ncap == 0) {
        kfree(id);
        return -ENODEV;
    }

    if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
        if (nvme_nvm_register(ns->queue, disk->disk_name)) {
            dev_warn(ns->ctrl->dev,
                "%s: LightNVM init failure\n", __func__);
            kfree(id);
            return -ENODEV;
        }
        ns->type = NVME_NS_LIGHTNVM;
    }

    if (ns->ctrl->vs >= NVME_VS(1, 1))
        memcpy(ns->eui, id->eui64, sizeof(ns->eui));
    if (ns->ctrl->vs >= NVME_VS(1, 2))
        memcpy(ns->uuid, id->nguid, sizeof(ns->uuid));

    old_ms = ns->ms;
    lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
    ns->lba_shift = id->lbaf[lbaf].ds;
    ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
    ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);

    /*
     * If identify namespace failed, use default 512 byte block size so
     * block layer can use before failing read/write for 0 capacity.
     */
    if (ns->lba_shift == 0)
        ns->lba_shift = 9;
    bs = 1 << ns->lba_shift;
    /* XXX: PI implementation requires metadata equal t10 pi tuple size */
    pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
                    id->dps & NVME_NS_DPS_PI_MASK : 0;

    blk_mq_freeze_queue(disk->queue);
    if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
                ns->ms != old_ms ||
                bs != queue_logical_block_size(disk->queue) ||
                (ns->ms && ns->ext)))
        blk_integrity_unregister(disk);

    ns->pi_type = pi_type;
    blk_queue_logical_block_size(ns->queue, bs);

    if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
        nvme_init_integrity(ns);
    if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
        set_capacity(disk, );
    else
        /* id->nsze is the Namespace Size, i.e. the total number of LBAs this
        namespace contains (easy to confirm by debugging). ns->lba_shift is 9 here,
        i.e. 512 bytes per sector, so nsze * 512 works out to exactly the size of
        the disk image we generated with qemu-img at the beginning. A worked example
        follows the function. */
        set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));

    if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
        nvme_config_discard(ns);
    blk_mq_unfreeze_queue(disk->queue);

    kfree(id);
    return 0;
}
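
One detail worth spelling out in the set_capacity call: the block layer counts capacity in 512-byte sectors, which is why nsze is shifted by (ns->lba_shift - 9) rather than by lba_shift itself. A worked example with made-up numbers:

/*
 * Made-up numbers, purely illustrative:
 *   nsze      = 2097152 LBAs
 *   lba_shift = 9   (512-byte LBAs)
 *   capacity  = 2097152 << (9 - 9) = 2097152 sectors
 *             = 2097152 * 512 bytes = 1 GiB
 * With 4096-byte LBAs (lba_shift = 12) the same nsze would instead be
 * shifted left by 3, i.e. each LBA counts as 8 sectors.
 */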
           

Back in nvme_alloc_ns: once add_disk has been executed, the nvme_open and nvme_revalidate_disk callbacks registered in disk->fops will be invoked.

static const struct block_device_operations nvme_fops = {
    .owner      = THIS_MODULE,
    .ioctl      = nvme_ioctl,
    .compat_ioctl   = nvme_compat_ioctl,
    .open       = nvme_open,
    .release    = nvme_release,
    .getgeo     = nvme_getgeo,
    .revalidate_disk= nvme_revalidate_disk,
    .pr_ops     = &nvme_pr_ops,
};
           
static int nvme_open(struct block_device *bdev, fmode_t mode)
{
    return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
}

static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
{
    struct nvme_ns *ns;

    spin_lock(&dev_list_lock);
    ns = disk->private_data;
    if (ns && !kref_get_unless_zero(&ns->kref))
        ns = NULL;
    spin_unlock(&dev_list_lock);

    return ns;
}
           

Back in nvme_dev_scan, nvme_set_irq_hints is called. It mainly performs some interrupt-affinity optimization.

static void nvme_set_irq_hints(struct nvme_dev *dev)
{
    struct nvme_queue *nvmeq;
    int i;

    for (i = 0; i < dev->online_queues; i++) {
        nvmeq = dev->queues[i];

        if (!nvmeq->tags || !(*nvmeq->tags))
            continue;

        irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
                    blk_mq_tags_cpumask(*nvmeq->tags));
    }
}
           

At this point nvme_dev_scan has finished as well, and the initialization of the nvme driver is essentially complete. What it leaves behind are the two interfaces under /dev for user space to operate on, nvme0 and nvme0n1, plus an unassuming kernel thread, nvme_kthread, that polls once a second. The thread does very little: it checks whether a controller needs to be reset and whether there are unprocessed completion entries. One thing I have not figured out, though: isn't CQ processing supposed to be interrupt driven? Why is polling still needed?

static int nvme_kthread(void *data)
{
    struct nvme_dev *dev, *next;

    while (!kthread_should_stop()) {
        set_current_state(TASK_INTERRUPTIBLE);
        spin_lock(&dev_list_lock);
        list_for_each_entry_safe(dev, next, &dev_list, node) {
            int i;
            u32 csts = readl(dev->bar + NVME_REG_CSTS);

            /*
             * Skip controllers currently under reset.
             */
            if (work_pending(&dev->reset_work) || work_busy(&dev->reset_work))
                continue;

            if ((dev->subsystem && (csts & NVME_CSTS_NSSRO)) ||
                            csts & NVME_CSTS_CFS) {
                if (queue_work(nvme_workq, &dev->reset_work)) {
                    dev_warn(dev->dev,
                        "Failed status: %x, reset controller\n",
                        readl(dev->bar + NVME_REG_CSTS));
                }
                continue;
            }
            for (i = 0; i < dev->queue_count; i++) {
                struct nvme_queue *nvmeq = dev->queues[i];
                if (!nvmeq)
                    continue;
                spin_lock_irq(&nvmeq->q_lock);
                nvme_process_cq(nvmeq);

                while (i == 0 && dev->ctrl.event_limit > 0)
                    nvme_submit_async_event(dev);
                spin_unlock_irq(&nvmeq->q_lock);
            }
        }
        spin_unlock(&dev_list_lock);
        schedule_timeout(round_jiffies_relative(HZ));
    }
    return 0;
}
           

Due to limited space, the read/write path after initialization will be covered in a separate article.

Of course, since my study of nvme is still at the beginner stage, this code analysis is bound to contain places that are not detailed enough or even plain wrong. Please do point out any mistakes so that I can verify and fix them. Thanks!
