1. Install nvidia-docker; see https://github.com/NVIDIA/nvidia-docker/ for details.
2. Once it is installed, test that CUDA is usable:
docker run --gpus all nvidia/cuda:10.0-base-centos7 nvidia-smi
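3. If it works, you will see the output of the nvidia-smi command. Then start on your own container: first create one to use like a virtual machine. Here I chose the nvidia/cuda:10.0-base-centos7 image.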
docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck3' nvidia/cuda:10.0-base-centos7 /bin/bash
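3.1. Later I found that nvidia/cuda:10.0-base-centos7 is probably a fairly minimal image and not quite enough: when python-tensorflow set up the GPU at the end, it reported an error like: Could not dlopen library 'libcublas.so.10.0'. So I started trying a more complete base image: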
docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck4' nvidia/cuda:10.1-cudnn7-devel-centos7 /bin/bash
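The nvidia/cuda:10.1-cudnn7-devel-centos7 image is very large; the rest of the process for it is similar to what follows.
4. Enter the container just created (nvidia-smi runs without error inside the container):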
docker exec -it andp_buck3 /bin/bash
5. Install the Python environment, following my previous blog post. I assume the files needed there are already available, in my tools directory:
cd /run/projects/tools/
cd openssl-1.1.1
./config --prefix=/usr/local/openssl shared zlib
make && make install
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openssl/lib" >> $HOME/.bash_profile
source $HOME/.bash_profile
openssl version
cd ../Python-3.7.4
./configure --prefix=/usr/local/python374 --enable-optimizations --enable-shared --with-openssl=/usr/local/openssl
make && make install
ln -s /usr/local/python374/bin/pip3 /usr/bin/
ln -s /usr/local/python374/bin/python3 /usr/bin/
6. Try installing tensorflow-gpu:
pip3 install tensorflow-gpu==1.14 -i https://mirrors.aliyun.com/pypi/simple/
7. The installation completes without problems, but import tensorflow still fails with:
ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found
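So I went through the relevant steps from the later part of the previous blog post again; here is a smoother, consolidated version:
8. Since gcc 9.2 had been built earlier, first try linking directly to the .so built outside the container: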
ln /run/projects/tools/glibc-2.30/build/math/libm.so.6 /lib64/libm.so.6 -s
After that, import tensorflow fails with a different error: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20'
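Both ImportError messages name a symbol version that the system library does not provide. A quick way to see how far a given library goes is to list its version tags and take the highest one. The following is a minimal sketch; highest_version is a hypothetical helper name of mine, and it assumes GNU sort (for -V) and the strings tool from binutils are available in the container.

```shell
# Hypothetical helper (not part of binutils or glibc): read strings on stdin
# and print the highest version tag that starts with the given prefix.
highest_version() {
    prefix="$1"
    # Keep only purely numeric tags such as GLIBC_2.17, then version-sort them.
    grep -E "^${prefix}[0-9]+(\.[0-9]+)*$" | sort -u -V | tail -n 1
}

# Usage inside the container (paths taken from the error messages):
#   strings /lib64/libm.so.6      | highest_version GLIBC_
#   strings /lib64/libstdc++.so.6 | highest_version GLIBCXX_
```

If the tag printed is lower than the one named in the ImportError, the library is too old for the installed TensorFlow wheel and has to be replaced or upgraded.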
9. gcc 9.2.0 had been installed outside the container, so it still has to be installed once more inside the container:
cd /run/projects/tools/make-4.2.1
./configure --prefix=$HOME/local
make -v
make && make install
/root/local/bin/make -v
mv /usr/bin/make /usr/bin/make3
ln -s /root/local/bin/make /usr/bin/make
make -v
gmake
gmake -v
cd ../gcc-9.2.0
./contrib/download_prerequisites
mkdir build
cd build/
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib
The configure step fails with a missing-dependency error: configure: error: Building GCC requires GMP 4.2+, MPFR 2.4.0+ and MPC 0.8.0+.
yum install wget bzip2 gcc gcc-c++ glibc-headers  # may not be required
yum install autoconf  # may not be required
yum install gmp
yum install mpfr
yum install libmpc-devel bison
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib  # this time it succeeded
make && make install  # takes a long time
gcc -v
echo -e '\nexport PATH=/usr/local/gcc-9.2.0/bin:$PATH\n' >> ~/.bash_profile
source ~/.bash_profile
gcc -v
ln -sv /usr/local/gcc-9.2.0/include/ /usr/include/gcc
ldconfig -v
ldconfig -p | grep gcc  # verify the exported libraries
gcc -v
cd ../../glibc-2.30/build/
LD_LIBRARY_PATH=''
../configure --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make
make install
find / -name 'glibc*'
strings math/libm.so.6 | grep GLIBC_2.23
mv /lib64/libm.so.6 /lib64/libm.so.6.old
cp math/libm.so.6 /lib64/libm.so.6
find / -name 'libstdc++.so.6*'
strings /usr/lib64/libstdc++.so.6.0.19 | grep CXXABI_1.3
strings /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 | grep CXXABI_1.3
mv /usr/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6.old1  # ln -s fails while the old file is still in place, so move it away first
ln -s /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
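The back-up-then-link pattern used for libm.so.6 and libstdc++.so.6 can be wrapped in a small function so the order of operations is never wrong. This is only a sketch under my own assumptions: replace_with_symlink is a hypothetical name, and the .old suffix is an arbitrary choice for the backup.

```shell
# Hypothetical helper: back up an existing library and point its path at a
# newer build through a symlink. Refuses to act if the new file is missing.
replace_with_symlink() {
    new="$1"
    target="$2"
    if [ ! -e "$new" ]; then
        echo "missing: $new" >&2
        return 1
    fi
    # Move the current file (or symlink) out of the way so ln -s cannot fail.
    if [ -e "$target" ] || [ -L "$target" ]; then
        mv "$target" "$target.old" || return 1
    fi
    ln -s "$new" "$target"
}

# Usage inside the container:
#   replace_with_symlink /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
```

Keeping the .old copy makes it possible to restore the original library if the new one breaks other tools in the container.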
10. Testing shows that import tensorflow now works. Finish off the environment by installing a few more packages:
pip3 install pandas ipython sqlalchemy pymysql psycopg2-binary pyhive scipy numpy -i https://mirrors.aliyun.com/pypi/simple/
Now you can happily train TensorFlow projects on the GPU inside the container.
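Before starting a long training run, it can help to confirm that TensorFlow really sees the GPU rather than silently falling back to the CPU. Here is a minimal sketch, assuming the tensorflow-gpu 1.14 install from step 6; check_tf_gpu is a hypothetical helper name, and the interpreter argument is my own addition for convenience.

```shell
# Hypothetical helper: print the TensorFlow version and whether a GPU is usable.
# The optional first argument is the Python interpreter (default: python3).
check_tf_gpu() {
    py="${1:-python3}"
    "$py" -c 'import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())'
}

# Usage inside the container:
#   check_tf_gpu
```

If the second value printed is False, the log lines TensorFlow emits just before it usually name the CUDA library that could not be loaded.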
11. After confirming that a training project runs, commit the container and push the image:
[root@localhost ~]# docker commit -m 'for tensorflow-gpu-py374' -a='antony314' 3ff2d3cfa0ba antony314/centos76:v2.2
sha256:21f3b71f9939226f1d817c19ed88f14fa0c2ff5e76eed7b5b17b9fa9463801cf
[root@localhost ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
antony314/centos76 v2.2 21f3b71f9939 2 minutes ago 4.29GB
antony314/centos76 v2.1 52427f8da2c5 2 months ago 1.84GB
antony314/centos76 v2 c65e5d82b7d4 2 months ago 1.84GB
antony314/centos76 v1 241bcf6311b7 2 months ago 611MB
tensorflow/tensorflow latest d64a95598d6c 2 months ago 1.03GB
nvidia/cuda 10.0-base-centos7 e9f670f1d5b9 3 months ago 254MB
nvidia/cuda 9.0-base 1443caa429f9 3 months ago 137MB
nvidia/cuda 10.0-base 5026b20f9c3d 3 months ago 110MB
antony314/centos76 7.6init 2cf0fa81ce78 4 months ago 202MB
[root@localhost ~]# docker push antony314/centos76:v2.2
The push refers to repository [docker.io/antony314/centos76]
711e037a5568: Pushed
74f64c7f6830: Mounted from nvidia/cuda
ccbc602e5359: Mounted from nvidia/cuda
a71b7655dacc: Mounted from nvidia/cuda
5d01beb4238f: Mounted from nvidia/cuda
877b494a9f30: Mounted from nvidia/cuda
v2.2: digest: sha256:a742553910d749b1d1a2ab22d85e2f0145af301c6dbca4b89becf1c3b6266129 size: 1577
Finally, I am shelving for now a problem that has long been a headache: the container keeps getting bigger.