Error: ncclCommInitRank failed.

Environment

4 GeForce GTX 1080 GPUs
docker image nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0

Code

Pull the image from the nnabla-ext-cuda-multi-gpu repository: docker pull nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0
Start a container: docker run -it --rm --gpus all nnabla/nnabla-ext-cuda-multi-gpu:py36-cuda102-mpi3.1.6-v1.14.0
Add test.py:

import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context
extension_module = "cudnn"
ctx = get_extension_context(extension_module)
comm = C.MultiProcessCommunicator(ctx)
comm.init()
print(f'size={comm.size}, device_id={comm.rank}')

Running mpiexec -np 4 python test.py raises an exception. (The exception occurs only when more than 2 GPUs are used.)

Bug

The exception raised is:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    comm.init()
  File "communicator.pyx", line 121, in nnabla.communicator.Communicator.init
RuntimeError: target_specific error in init
/home/gitlab-runner/builds/g9zRZKFe/2/nnabla/builders/all/nnabla-ext-cuda/src/nbla/cuda/communicator/multi_process_data_parallel_communicator.cu:358
ncclCommInitRank failed.

Set NCCL_DEBUG=INFO to get more detailed output:

mpiexec -np 4 -x NCCL_DEBUG=INFO python test.py

...
0db89117f3b2:87:87 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
0db89117f3b2:87:87 [2] NCCL INFO include/shm.h:41 -> 2

0db89117f3b2:87:87 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6d2dacd576938b74-0-3-2 (size 9637888)
...

The log shows that shared memory has run out, yet checking the GPUs with nvidia-smi shows that GPU memory is barely in use:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 27%   30C    P8     5W / 180W |    815MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:02:00.0 Off |                  N/A |
| 27%   33C    P8     6W / 180W |      4MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 00000000:03:00.0 Off |                  N/A |
| 28%   35C    P8     5W / 180W |      4MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 00000000:04:00.0  On |                  N/A |
| 28%   34C    P8     6W / 180W |      4MiB /  8118MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

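Cause

The exception occurs because NCCL cannot create its shared-memory files in /dev/shm. Docker's default /dev/shm size is 64MB, which is too small, so NCCL runs out of space once more than 2 GPUs are used. /dev/shm is backed by host RAM, not GPU memory, which is why nvidia-smi shows nothing unusual.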

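Earlier versions did not hit this error: when NCCL was updated from 2.6 to 2.7, communication between GPUs changed from P2P to shared-memory segments, so the problem does not appear with NCCL versions below 2.7.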
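
As a rough sanity check on these numbers, the sketch below (Python, standard library only) reports the size and free space of a shared-memory mount and counts how many segments of the size shown in the NCCL log (9637888 bytes) would fit. The helper names shm_report and segments_that_fit are made up for illustration. How many segments NCCL actually allocates for a given number of GPUs is not stated here, so treat the count only as an order-of-magnitude guide.

```python
import os
import shutil

MIB = 1024 * 1024
# Segment size taken from the "size 9637888" NCCL WARN line in the log.
LOGGED_SEGMENT_BYTES = 9637888


def segments_that_fit(capacity_bytes, segment_bytes=LOGGED_SEGMENT_BYTES):
    """Number of whole segments of segment_bytes that fit in capacity_bytes."""
    return capacity_bytes // segment_bytes


def shm_report(path="/dev/shm", segment_bytes=LOGGED_SEGMENT_BYTES):
    """Return (total_mib, free_mib, segments_that_fit_in_free_space) for path."""
    usage = shutil.disk_usage(path)
    return (usage.total / MIB,
            usage.free / MIB,
            segments_that_fit(usage.free, segment_bytes))


if __name__ == "__main__":
    print(segments_that_fit(64 * MIB))   # Docker's default size -> 6
    print(segments_that_fit(256 * MIB))  # with --shm-size=256m -> 27
    if os.path.isdir("/dev/shm"):
        total, free, n = shm_report()
        print(f"/dev/shm: total={total:.0f} MiB, free={free:.0f} MiB, fits {n} segments")
```

By this arithmetic a 64 MiB /dev/shm holds at most 6 segments of that size, while 256 MiB holds 27.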

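Solution

There are three options:

1. Add the setting NCCL_SHM_DISABLE=1 to /etc/nccl.conf or ~/.nccl.conf. NCCL then stops using shared memory, at the cost of lower performance.
2. Mount the host's /dev/shm into the container, i.e. docker run -v /dev/shm:/dev/shm ... . This leaves stale files behind on the host.
3. Increase the container's shared-memory size at run time, i.e. docker run --shm-size=256m ... .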
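
The first option can also be applied per process instead of through a config file, since NCCL reads NCCL_SHM_DISABLE from the environment. The following is a minimal sketch and not part of nnabla or NCCL. The helper name disable_nccl_shm_if_small and the 256 MiB threshold (borrowed from the --shm-size=256m example) are assumptions, and the call is assumed to run before comm.init(), so that NCCL has not been initialized yet.

```python
import os
import shutil


def disable_nccl_shm_if_small(min_free_bytes, shm_path="/dev/shm"):
    """Set NCCL_SHM_DISABLE=1 for this process if shm_path has too little free space.

    Returns True if the variable was set, False otherwise.
    min_free_bytes is a caller-chosen threshold (an assumption, not an NCCL value).
    """
    free = shutil.disk_usage(shm_path).free
    if free < min_free_bytes:
        os.environ["NCCL_SHM_DISABLE"] = "1"
        return True
    return False


# Hypothetical use in test.py, placed before comm.init():
#   if disable_nccl_shm_if_small(256 * 1024 * 1024):
#       print("small /dev/shm: NCCL shared memory disabled, expect lower performance")
```

Each rank started by mpiexec runs its own check. Enlarging /dev/shm with --shm-size remains the option that avoids the performance penalty.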

References

NCCL configuration
https://github.com/NVIDIA/nccl/issues/290
https://github.com/PaddlePaddle/Paddle/pull/28484
https://github.com/horovod/horovod/issues/2395
https://github.com/NVIDIA/nccl/issues/406 (using NCCL_SHM_DISABLE=1 may reduce performance)
