1. Install nvidia-docker; see https://github.com/NVIDIA/nvidia-docker/
2. Once installed, test that CUDA is available:
docker run --gpus all nvidia/cuda:10.0-base-centos7 nvidia-smi
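3. If it works, you will see the output of the nvidia-smi command. Now start on your own container: first create one to use like a virtual machine. Here I chose the nvidia/cuda:10.0-base-centos7 image.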
docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck3' nvidia/cuda:10.0-base-centos7 /bin/bash
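3.1. I later found that nvidia/cuda:10.0-base-centos7 is probably too minimal a base image: when TensorFlow (Python) sets up the GPU at the end, it reports errors of the form: Could not dlopen library 'libcublas.so.10.0'. So I started trying a more complete base image: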
docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck4' nvidia/cuda:10.1-cudnn7-devel-centos7 /bin/bash
The nvidia/cuda:10.1-cudnn7-devel-centos7 image is very large; the rest of the process continues in much the same way.
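Since the same docker run line is repeated with only the container name and image changed, here is a small sketch of a dry-run helper that prints the command instead of running it. The function name gpu_container_cmd is hypothetical (not part of Docker); the flags and paths are the ones used above.

```shell
# Hypothetical dry-run helper: print (do not run) the docker command
# used above to start a GPU-enabled container.
gpu_container_cmd() {
    name=$1    # container name, e.g. andp_buck3
    image=$2   # image tag, e.g. nvidia/cuda:10.0-base-centos7
    # -t allocates a TTY, -d runs detached, -i keeps stdin open,
    # --gpus all exposes every host GPU to the container,
    # -v bind-mounts the host project directory into the container.
    printf 'docker run -tdi --gpus all -v /data/projects:/run/projects --name=%s %s /bin/bash\n' "$name" "$image"
}

gpu_container_cmd andp_buck3 nvidia/cuda:10.0-base-centos7
```

Printing the command first makes it easy to review before pasting it into the shell.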
4. Enter the container just created (nvidia-smi runs correctly inside the container):
docker exec -it andp_buck3 /bin/bash
5. Install the Python environment, following my previous blog post. This assumes the required files are already present, in my tools directory:
cd /run/projects/tools/
cd openssl-1.1.1
./config --prefix=/usr/local/openssl shared zlib
make && make install
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openssl/lib' >> $HOME/.bash_profile
source $HOME/.bash_profile
openssl version
cd ../Python-3.7.4
./configure --prefix=/usr/local/python374 --enable-optimizations --enable-shared --with-openssl=/usr/local/openssl
make && make install
ln -s /usr/local/python374/bin/pip3 /usr/bin/
ln -s /usr/local/python374/bin/python3 /usr/bin/
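The export line appended to .bash_profile above gets added again every time the steps are rerun. A minimal sketch of an idempotent append follows; append_line_once is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_line_once() {
    file=$1
    line=$2
    touch "$file"   # make sure the file exists before grepping it
    # -F fixed string, -x whole-line match, -q quiet
    grep -F -x -q -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (shown as a comment so that pasting this block changes nothing):
# append_line_once "$HOME/.bash_profile" 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openssl/lib'
```

The single quotes in the usage line keep $LD_LIBRARY_PATH literal, so it is expanded when the profile is sourced rather than when the line is written.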
6. Try installing tensorflow-gpu:
pip3 install tensorflow-gpu==1.14 -i https://mirrors.aliyun.com/pypi/simple/
7. The install completes without problems, but import tensorflow still fails with:
ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found
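So, go through the later steps of the previous blog post again; here is a smoother, consolidated version of them:
8. Since I had already done the gcc 9.2 build before, first try linking directly to the .so built outside the container: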
ln /run/projects/tools/glibc-2.30/build/math/libm.so.6 /lib64/libm.so.6 -s
After that, import tensorflow reports a different error: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20'
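Both errors mean that a shared library does not carry the version tag TensorFlow asks for. Here is a sketch of a check that can be run before swapping any library. The function names are hypothetical, and it scans the file for embedded tag strings with GNU grep (an assumption) instead of strings, so it is a quick heuristic, not a full ELF inspection.

```shell
# Hypothetical helper: list the version tags with a given prefix
# (GLIBC, GLIBCXX, CXXABI, ...) embedded in a library file.
list_symbol_versions() {
    # $1 = path to the library, $2 = tag prefix
    # -a treats the binary as text, -o prints only the matching parts
    grep -a -o -E "${2}_[0-9]+(\.[0-9]+)*" "$1" | sort -u
}

# Hypothetical helper: succeed if the library carries exactly the given tag.
has_symbol_version() {
    # $1 = path to the library, $2 = full tag, e.g. GLIBC_2.23
    prefix=${2%%_*}   # GLIBC_2.23 -> GLIBC
    list_symbol_versions "$1" "$prefix" | grep -F -x -q -- "$2"
}

# Usage (paths as used in this post):
# has_symbol_version /lib64/libm.so.6 GLIBC_2.23 || echo "GLIBC_2.23 missing"
# has_symbol_version /usr/lib64/libstdc++.so.6 GLIBCXX_3.4.20 || echo "GLIBCXX_3.4.20 missing"
```

Matching the whole tag avoids the false positive a plain grep for GLIBC_2.2 would give on GLIBC_2.23.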
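9. gcc 9.2.0 had only been installed outside the container, so the gcc 9.2.0 installation should be run through once more inside the container: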
cd /make-4.2.1
./configure --prefix=$HOME/local
make -v
make && make install
/root/local/bin/make -v
mv /usr/bin/make /usr/bin/make3
ln -s /root/local/bin/make /usr/bin/make
make -v
gmake
gmake -v
cd ../gcc-9.2.0
./contrib/download_prerequisites
mkdir build
cd build/
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib
This fails with a missing-dependency error: configure: error: Building GCC requires GMP 4.2+, MPFR 2.4.0+ and MPC 0.8.0+.
yum install wget bzip2 gcc gcc-c++ glibc-headers  # may not be strictly required
yum install autoconf  # may not be strictly required
yum install gmp
yum install mpfr
yum install libmpc-devel bison
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib  # this time it succeeds
make && make install  # takes a long time
gcc -v
echo -e '\nexport PATH=/usr/local/gcc-9.2.0/bin:$PATH\n' >> ~/.bash_profile
source ~/.bash_profile
gcc -v
ln -sv /usr/local/gcc-9.2.0/include/ /usr/include/gcc
ldconfig -v
ldconfig -p | grep gcc  # print the cache to verify
gcc -v
cd ../../glibc-2.30/build/
LD_LIBRARY_PATH=''
../configure --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make
make install
find / -name 'glibc*'
strings math/libm.so.6 | grep GLIBC_2.23
mv /lib64/libm.so.6 /lib64/libm.so.6.old
cp math/libm.so.6 /lib64/libm.so.6
find / -name 'libstdc++.so.6*'
strings /usr/lib64/libstdc++.so.6.0.19 | grep CXXABI_1.3
strings /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 | grep CXXABI_1.3
mv /usr/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6.old1
ln -s /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
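The steps above check tool versions by eye with make -v and gcc -v. The sketch below shows how that check could be scripted. version_ge is a hypothetical helper, and it assumes GNU sort with the -V option.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal
# to version $2.
version_ge() {
    # sort -V orders version strings numerically; if the required
    # version sorts first (or both are equal), $1 is new enough.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: report whether the gcc found on PATH is the newly built one.
if command -v gcc >/dev/null 2>&1; then
    if version_ge "$(gcc -dumpversion)" 9.2.0; then
        echo "gcc on PATH is at least 9.2.0"
    else
        echo "gcc on PATH is older than 9.2.0; check PATH" >&2
    fi
fi
```

Some distribution builds of gcc print only the major version for -dumpversion, so the example is most reliable with a compiler built from source as above.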
10. Testing shows that import tensorflow now works; install a few more packages to round off the environment:
pip3 install pandas ipython sqlalchemy pymysql psycopg2-binary pyhive scipy numpy -i https://mirrors.aliyun.com/pypi/simple/
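Now TensorFlow projects can happily be trained on the GPU inside the container.
11. After confirming that a training project runs to completion, commit the container and push the image: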
[[email protected] ~]# docker commit -m 'for tensorflow-gpu-py374' -a='antony314' 3ff2d3cfa0ba antony314/centos76:v2.2
sha256:21f3b71f9939226f1d817c19ed88f14fa0c2ff5e76eed7b5b17b9fa9463801cf
[[email protected] ~]# docker images
REPOSITORY              TAG                 IMAGE ID       CREATED         SIZE
antony314/centos76      v2.2                21f3b71f9939   2 minutes ago   4.29GB
antony314/centos76      v2.1                52427f8da2c5   2 months ago    1.84GB
antony314/centos76      v2                  c65e5d82b7d4   2 months ago    1.84GB
antony314/centos76      v1                  241bcf6311b7   2 months ago    611MB
tensorflow/tensorflow   latest              d64a95598d6c   2 months ago    1.03GB
nvidia/cuda             10.0-base-centos7   e9f670f1d5b9   3 months ago    254MB
nvidia/cuda             9.0-base            1443caa429f9   3 months ago    137MB
nvidia/cuda             10.0-base           5026b20f9c3d   3 months ago    110MB
antony314/centos76      7.6init             2cf0fa81ce78   4 months ago    202MB
[[email protected] ~]# docker push antony314/centos76:v2.2
The push refers to repository [docker.io/antony314/centos76]
711e037a5568: Pushed
74f64c7f6830: Mounted from nvidia/cuda
ccbc602e5359: Mounted from nvidia/cuda
a71b7655dacc: Mounted from nvidia/cuda
5d01beb4238f: Mounted from nvidia/cuda
877b494a9f30: Mounted from nvidia/cuda
v2.2: digest: sha256:a742553910d749b1d1a2ab22d85e2f0145af301c6dbca4b89becf1c3b6266129 size: 1577
Finally, one long-standing headache is shelved for now: the container image keeps getting bigger.