
Notes on installing the GPU build of TensorFlow in a container: nvidia-docker

1. Install nvidia-docker; see https://github.com/NVIDIA/nvidia-docker/ for details.

2. Once it is installed, test that CUDA is usable:

docker run --gpus all nvidia/cuda:10.0-base-centos7 nvidia-smi

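3. If it works, you will see the output of the nvidia-smi command. Now start on your own container: first create one to use like a virtual machine. Here I chose the nvidia/cuda:10.0-base-centos7 image.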

docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck3' nvidia/cuda:10.0-base-centos7 /bin/bash

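3.1. I later found that nvidia/cuda:10.0-base-centos7 is a fairly minimal image and is not quite enough: when TensorFlow sets up the GPU from Python later on, it reports errors of the form: Could not dlopen library 'libcublas.so.10.0'. So I started trying a more complete base image: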

docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck4' nvidia/cuda:10.1-cudnn7-devel-centos7 /bin/bash

The nvidia/cuda:10.1-cudnn7-devel-centos7 image is very large; the rest of the process continues in much the same way.
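
One way to tell whether an image ships the library named in that dlopen error is to search its library directories before installing anything else. The helper below is only a minimal sketch: the function name find_shared_lib and the directories in the usage comment are assumptions for illustration, not part of the original setup.

```shell
# Search one or more directories for shared libraries whose file name
# starts with the given prefix (hypothetical helper, not from the post).
find_shared_lib() {
    name=$1
    shift
    # Print every matching path; permission errors are silenced.
    find "$@" -name "${name}*" 2>/dev/null
}

# Example use inside the container (the directories are assumptions):
#   find_shared_lib libcublas.so /usr/local /usr/lib64
```

If the command prints nothing, the library is not in the searched directories, which would be consistent with the dlopen error above.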

4. Enter the container just created (the nvidia-smi command works correctly inside it)

docker exec -it andp_buck3 /bin/bash

5. Install the Python environment, following my previous blog post. I assume the files needed there are already available, in my tools directory:

cd /run/projects/tools/
cd openssl-1.1.1
./config --prefix=/usr/local/openssl shared zlib
make && make install
# Single quotes keep $LD_LIBRARY_PATH unexpanded, so it is evaluated whenever the profile is sourced
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openssl/lib' >> $HOME/.bash_profile
source $HOME/.bash_profile
openssl version
cd ../Python-3.7.4
./configure --prefix=/usr/local/python374 --enable-optimizations --enable-shared --with-openssl=/usr/local/openssl
make && make install
ln -s /usr/local/python374/bin/pip3 /usr/bin/
ln -s /usr/local/python374/bin/python3 /usr/bin/

6. Try installing tensorflow-gpu

pip3 install tensorflow-gpu==1.14 -i https://mirrors.aliyun.com/pypi/simple/

7. The installation completes without problems, but import tensorflow still fails with:

ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found

So I went through the relevant steps from the later part of my previous post again; here is a more streamlined summary of them:

8. Since gcc 9.2 had already been built earlier, first try linking directly to the .so that was built outside the container:

ln -sf /run/projects/tools/glibc-2.30/build/math/libm.so.6 /lib64/libm.so.6

After that, import tensorflow reports a different error: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20'
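
Both errors mean that a system library does not carry the version tag TensorFlow was linked against. A rough way to see which tags a given .so contains is to scan it for the tag strings, much like the strings and grep checks used later in these notes. This is a minimal sketch: version_tags and has_version_tag are hypothetical helper names, and the scan lists every tag embedded in the file (both provided and required ones), so treat it as a quick check rather than a precise one.

```shell
# Print the distinct version tags with the given prefix (for example GLIBC
# or GLIBCXX) that appear anywhere in a shared object.
version_tags() {
    # -a treats the binary as text, -o prints only the matching part
    LC_ALL=C grep -a -o "${2}_[0-9][0-9.]*" "$1" | sort -u
}

# Succeed if the file contains exactly the given tag, e.g. GLIBC_2.23.
has_version_tag() {
    version_tags "$1" "${2%%_*}" | grep -qxF "$2"
}

# Example use inside the container:
#   version_tags /lib64/libm.so.6 GLIBC
#   has_version_tag /lib64/libstdc++.so.6 GLIBCXX_3.4.20 && echo present
```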

9. gcc 9.2.0 had been installed outside the container, so the gcc 9.2.0 installation still has to be repeated inside the container:

cd /make-4.2.1
./configure --prefix=$HOME/local
make -v

make && make install 
/root/local/bin/make -v
mv /usr/bin/make /usr/bin/make3
ln -s /root/local/bin/make /usr/bin/make
make -v
gmake
gmake -v
cd ../gcc-9.2.0
./contrib/download_prerequisites
mkdir build 
cd build/
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib
# This fails on missing dependencies: configure: error: Building GCC requires GMP 4.2+, MPFR 2.4.0+ and MPC 0.8.0+.
yum install wget bzip2 gcc gcc-c++ glibc-headers   # may not be required
yum install autoconf   # may not be required

yum install gmp
yum install mpfr
yum install libmpc-devel bison

../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib   # this time it succeeds
make && make install   # takes a long time

gcc -v
echo -e '\nexport PATH=/usr/local/gcc-9.2.0/bin:$PATH\n' >> ~/.bash_profile 
source ~/.bash_profile 
gcc -v
ln -sv /usr/local/gcc-9.2.0/include/ /usr/include/gcc
ldconfig -v
ldconfig -p | grep gcc    # list the cache to verify
gcc -v
cd ../../glibc-2.30/build/
LD_LIBRARY_PATH='' 
../configure  --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make 
make install
find / -name 'glibc*'
strings  math/libm.so.6 | grep GLIBC_2.23
mv /lib64/libm.so.6 /lib64/libm.so.6.old
cp math/libm.so.6 /lib64/libm.so.6
find / -name 'libstdc++.so.6*'
strings /usr/lib64/libstdc++.so.6.0.19 | grep CXXABI_1.3
strings /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 | grep CXXABI_1.3
# Move the old library aside first; ln -s fails while the target name still exists
mv /usr/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6.old1
ln -s /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
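
The same back-up-then-replace pattern is applied to both libm.so.6 and libstdc++.so.6 above, and the order matters because ln -s refuses to overwrite an existing name. It could be wrapped in a small helper; the sketch below is only an illustration, and the name replace_with_symlink and the .old backup suffix are assumptions.

```shell
# Replace the library at $2 with a symlink to $1, keeping the original
# as "$2.old" (hypothetical helper, not from the post).
replace_with_symlink() {
    new_lib=$1
    target=$2
    # Refuse to touch the target if the replacement does not exist.
    [ -e "$new_lib" ] || { echo "missing: $new_lib" >&2; return 1; }
    # Move the existing file or symlink aside before linking.
    if [ -e "$target" ] || [ -L "$target" ]; then
        mv "$target" "$target.old" || return 1
    fi
    ln -s "$new_lib" "$target"
}

# Example use inside the container:
#   replace_with_symlink /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
```

Checking for the replacement first means a mistyped path leaves the system library in place instead of removing it.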

           

10. Testing shows that import tensorflow now works. After that, install a few more packages to round off the environment:

pip3 install pandas ipython sqlalchemy pymysql psycopg2-binary pyhive scipy numpy -i https://mirrors.aliyun.com/pypi/simple/

Now you can happily train TensorFlow projects on the GPU inside the container.

11. After confirming that a training project runs, commit the container and push the image:

# docker commit -m 'for tensorflow-gpu-py374' -a='antony314' 3ff2d3cfa0ba antony314/centos76:v2.2
sha256:21f3b71f9939226f1d817c19ed88f14fa0c2ff5e76eed7b5b17b9fa9463801cf
# docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
antony314/centos76      v2.2                21f3b71f9939        2 minutes ago       4.29GB
antony314/centos76      v2.1                52427f8da2c5        2 months ago        1.84GB
antony314/centos76      v2                  c65e5d82b7d4        2 months ago        1.84GB
antony314/centos76      v1                  241bcf6311b7        2 months ago        611MB
tensorflow/tensorflow   latest              d64a95598d6c        2 months ago        1.03GB
nvidia/cuda             10.0-base-centos7   e9f670f1d5b9        3 months ago        254MB
nvidia/cuda             9.0-base            1443caa429f9        3 months ago        137MB
nvidia/cuda             10.0-base           5026b20f9c3d        3 months ago        110MB
antony314/centos76      7.6init             2cf0fa81ce78        4 months ago        202MB
# docker push antony314/centos76:v2.2
The push refers to repository [docker.io/antony314/centos76]
711e037a5568: Pushed 
74f64c7f6830: Mounted from nvidia/cuda 
ccbc602e5359: Mounted from nvidia/cuda 
a71b7655dacc: Mounted from nvidia/cuda 
5d01beb4238f: Mounted from nvidia/cuda 
877b494a9f30: Mounted from nvidia/cuda 
v2.2: digest: sha256:a742553910d749b1d1a2ab22d85e2f0145af301c6dbca4b89becf1c3b6266129 size: 1577
           

Finally, I am setting aside for now a problem that has long been a headache: the container keeps getting bigger.