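A GPU cloud host is a cloud host service built on UCloud's mature cloud computing technology, with dedicated high-performance GPU hardware. It substantially improves graphics and image processing and high-performance computing capability, while remaining elastic, low-cost, and easy to use. It raises processing efficiency in fields such as graphics processing and scientific computing and lowers IT costs. This article collects the driver installation guides for UCloud GPU cloud hosts.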
I. CentOS 7 Environment Setup
1. Check that the GPU device is recognized.
$ yum install pciutils
$ sudo lspci | grep NVIDIA
3D controller: NVIDIA Corporation GK210GL [Tesla K80]  (recognized as a K80)
3D controller: NVIDIA Corporation Device 1b38 (rev a1)  (recognized as a P40)
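As a rough illustration of the check above, the model strings can be matched in a script. The function name `gpu_model` is hypothetical, and the patterns cover only the lspci strings quoted in this guide for the K80 and P40; other cards or PCI IDs would need their own entries.

```shell
# Map one line of `lspci` output to a GPU model name.
# Patterns are limited to the strings quoted in this guide.
gpu_model() {
    case "$1" in
        *"Tesla K80"*)                 echo "K80" ;;
        *"Tesla P40"*|*"Device 1b38"*) echo "P40" ;;
        *)                             echo "unknown" ;;
    esac
}

# Example with a captured line instead of live hardware:
gpu_model "3D controller: NVIDIA Corporation GK210GL [Tesla K80]"
```

On a real host the input line would come from `lspci | grep NVIDIA`.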
2. Fetch and configure the CUDA network repository
NVIDIA's official repository: http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
$ wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
$ rpm -Uvh cuda-repo-rhel7-8.0.61-1.x86_64.rpm
Note: installing the NVIDIA driver requires the kernel-devel package, which can be installed as follows:
$ wget http://vault.centos.org/7.0.1406/updates/x86_64/Packages/kernel-devel-3.10.0-123.4.4.el7.x86_64.rpm
$ wget http://vault.centos.org/7.0.1406/updates/x86_64/Packages/kernel-headers-3.10.0-123.4.4.el7.x86_64.rpm
$ rpm -Uvh kernel-devel-3.10.0-123.4.4.el7.x86_64.rpm
$ rpm -Uvh kernel-headers-3.10.0-123.4.4.el7.x86_64.rpm
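The kernel-devel and kernel-headers packages have to match the running kernel. A minimal sketch of deriving the package names from a kernel release string follows; `kernel_pkgs` is a hypothetical helper, and whether your configured repositories still carry that exact version is an assumption you would need to check.

```shell
# Print the kernel-devel and kernel-headers package names
# that correspond to a given kernel release string.
kernel_pkgs() {
    printf 'kernel-devel-%s\nkernel-headers-%s\n' "$1" "$1"
}

# Example with the release used in this guide; on a live system
# pass "$(uname -r)" instead.
kernel_pkgs "3.10.0-123.4.4.el7.x86_64"
```

The printed names could then be handed to `yum install` if the repository provides them.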
3. Install CUDA 8.0
$ yum install cuda-8-0
3.1 Check the driver status
$ sudo nvidia-smi
If the command prints a status table listing the GPU, the driver is working.
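For scripted checks, the exit status of nvidia-smi is usually enough. The sketch below wraps that idea; `driver_ok` is a hypothetical name, and it takes the command as an argument so the logic can be tried on a machine without a GPU.

```shell
# Return 0 if the given command exists and exits successfully.
# Defaults to nvidia-smi when no argument is given.
driver_ok() {
    cmd="${1:-nvidia-smi}"
    command -v "$cmd" >/dev/null 2>&1 || return 1
    "$cmd" >/dev/null 2>&1
}

# Dry run with stand-in commands (no GPU required):
driver_ok true  && echo "stand-in driver check passed"
driver_ok false || echo "stand-in driver check failed"
```

On a GPU host, calling `driver_ok` with no argument runs the real nvidia-smi.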
4. Test basic GPU functionality (optional)
4.1 Extend the library search path
$ export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:/usr/lib64/:$LD_LIBRARY_PATH"
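Re-running the export in every new shell can leave duplicate entries in LD_LIBRARY_PATH. One way to avoid that, sketched here with a hypothetical helper `ld_path_prepend`, is to add a directory only if it is not already listed. The example assumes /usr/local/cuda points at the installed toolkit.

```shell
# Print a search path with DIR ($1) prepended to CURRENT ($2),
# unless DIR is already one of its colon-separated entries.
ld_path_prepend() {
    dir="$1"
    current="$2"
    case ":$current:" in
        *":$dir:"*) echo "$current" ;;
        *)          echo "$dir${current:+:$current}" ;;
    esac
}

# Example: the second call leaves the path unchanged.
p=$(ld_path_prepend /usr/local/cuda/lib64 "/usr/lib64")
echo "$p"
ld_path_prepend /usr/local/cuda/lib64 "$p"
```

In a profile script this would be used as `export LD_LIBRARY_PATH="$(ld_path_prepend /usr/local/cuda/lib64 "$LD_LIBRARY_PATH")"`.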
4.2 Install the CUDA samples
$ cd /usr/local/cuda/bin
$ sh cuda-install-samples-8.0.sh ~/cuda-test/
$ cd ~/cuda-test/NVIDIA_CUDA-8.0_Samples
$ make
$ ./bin/x86_64/linux/release/deviceQuery      # query the device status
$ ./bin/x86_64/linux/release/bandwidthTest    # measure the device bandwidth
Note: if the build fails with an lnvcuvid error, run:
$ find . -type f -execdir sed -i 's/UBUNTU_PKG_NAME = "nvidia-367"/UBUNTU_PKG_NAME = "nvidia-375"/g' '{}' \;
Here nvidia-375 is the version of the currently installed driver.
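To see what the substitution does before touching the samples tree, it can be tried on a scratch copy. This is a sketch only: `patch_pkg_name` is a hypothetical wrapper around the same find/sed idea, it assumes GNU sed (for `-i`), and the scratch Makefile is invented for the demonstration.

```shell
# Replace the driver package name in every regular file under DIR.
# Usage: patch_pkg_name DIR OLD NEW   (e.g. nvidia-367 nvidia-375)
patch_pkg_name() {
    find "$1" -type f -exec sed -i \
        "s/UBUNTU_PKG_NAME = \"$2\"/UBUNTU_PKG_NAME = \"$3\"/g" {} +
}

# Demonstration on a throwaway directory (not the real samples tree):
scratch=$(mktemp -d)
echo 'UBUNTU_PKG_NAME = "nvidia-367"' > "$scratch/Makefile"
patch_pkg_name "$scratch" nvidia-367 nvidia-375
cat "$scratch/Makefile"
rm -rf "$scratch"
```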
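5. Install cuDNN
Optional. Note: different AI frameworks support different cuDNN versions.
5.1 Download the cuDNN package
Downloading requires a registered NVIDIA account.
Note: on CentOS, download "cuDNN v5.1 Library for Linux".
5.2 Install
This example uses cuDNN 5.1 because, at the time of writing, TensorFlow supports only 5.1.
$ tar -zxf cudnn-8.0-linux-x64-v5.1.tgz
The archive can be extracted anywhere, typically under /usr/lib; put the extraction path in front of the colon below:
$ export LD_LIBRARY_PATH=:$LD_LIBRARY_PATH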
II. Ubuntu 14.04 Environment Setup
1. Check that the GPU device is recognized.
$ sudo lspci | grep NVIDIA
3D controller: NVIDIA Corporation GK210GL [Tesla K80]  (recognized as a K80)
3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)  (recognized as a P40)
2. Fetch and configure the CUDA network repository:
NVIDIA's official repository: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/
$ wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_8.0.44-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1404_8.0.44-1_amd64.deb
$ sudo apt-get update
3. Install CUDA 8.0
Before installing, run uname -a to check the current kernel version, then make sure the matching kernel header package is installed; otherwise the driver cannot be compiled.
$ uname -a
Linux X-X-X-X 3.13.0-123-generic #172-Ubuntu SMP Mon
$ sudo apt search 3.13.0-123-generic
p linux-cloud-tools-3.13.0-123-generic - Linux kernel version specific cloud tools for version 3.13.0-123
p linux-headers-3.13.0-123-generic - Linux kernel headers for version 3.13.0 on 64 bit x86 SMP
p linux-headers-3.13.0-123-generic:i386 - Linux kernel headers for version 3.13.0 on 32 bit x86 SMP
$ sudo apt-get install linux-headers-3.13.0-123-generic
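The header package name follows directly from the kernel release reported by uname. A small sketch of that mapping, using a hypothetical helper `headers_pkg` that reads a `uname -a` line:

```shell
# Extract the kernel release (third field of `uname -a`)
# and print the matching linux-headers package name.
headers_pkg() {
    release=$(printf '%s\n' "$1" | awk '{print $3}')
    echo "linux-headers-$release"
}

# Example with the sample line from this step:
headers_pkg "Linux X-X-X-X 3.13.0-123-generic #172-Ubuntu SMP Mon"
```

On a live host, `uname -r` prints the release directly, so `linux-headers-$(uname -r)` yields the same name.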
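Install CUDA:
$ sudo apt-get install cuda-8-0
3.1 Check the driver status
$ sudo nvidia-smi
If the command prints a status table listing the GPU, the driver is working.
4. Test basic GPU functionality (optional)
4.1 Extend the library search path
$ export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:/usr/lib64/:$LD_LIBRARY_PATH"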
4.2 Install the CUDA samples
$ cd /usr/local/cuda/bin
$ sh cuda-install-samples-8.0.sh ~/cuda-test/
$ cd ~/cuda-test/NVIDIA_CUDA-8.0_Samples
$ make
$ ./bin/x86_64/linux/release/deviceQuery      # query the device status
$ ./bin/x86_64/linux/release/bandwidthTest    # measure the device bandwidth
If the build fails with an lnvcuvid error, run:
$ find . -type f -execdir sed -i 's/UBUNTU_PKG_NAME = "nvidia-367"/UBUNTU_PKG_NAME = "nvidia-375"/g' '{}' \;
Here nvidia-375 is the version of the currently installed driver.
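5. Install cuDNN
Optional. Note: different AI frameworks support different cuDNN versions.
5.1 Download the cuDNN package
https://developer.nvidia.com/cudnn (downloading requires a registered NVIDIA account)
5.2 Install
This example uses cuDNN 5.1 because, at the time of writing, TensorFlow supports only 5.1.
On Ubuntu, choose "cuDNN v5.1 Runtime Library for Ubuntu14.04 (Deb)":
$ sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64.deb
6. Disable automatic updates of the Ubuntu kernel and NVIDIA tools
Recommended:
$ sudo vim /etc/apt/apt.conf.d/10periodic
Change APT::Periodic::Update-Package-Lists "1"; to APT::Periodic::Update-Package-Lists "0"; to stop Ubuntu from updating packages automatically.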
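The same change can be made non-interactively. A sketch, assuming GNU sed and the setting format quoted in this step; `disable_apt_periodic` is a hypothetical helper and is demonstrated on a scratch file rather than the real /etc/apt/apt.conf.d/10periodic.

```shell
# Set APT::Periodic::Update-Package-Lists to "0" in the given file.
disable_apt_periodic() {
    sed -i 's/^\(APT::Periodic::Update-Package-Lists\)[[:space:]]*"1";/\1 "0";/' "$1"
}

# Demonstration on a temporary copy:
conf=$(mktemp)
echo 'APT::Periodic::Update-Package-Lists "1";' > "$conf"
disable_apt_periodic "$conf"
cat "$conf"
rm -f "$conf"
```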
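III. Ubuntu 16.04 Environment Setup
1. Check that the GPU device is recognized.
$ sudo lspci | grep NVIDIA
3D controller: NVIDIA Corporation GK210GL [Tesla K80]  (recognized as a K80)
3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)  (recognized as a P40)
2. Disable the open-source nouveau driver
Edit the following file:
sudo vim /etc/modprobe.d/blacklist-nouveau.conf
Add the following lines: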
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Regenerate the initramfs and reboot:
sudo update-initramfs -u
sudo reboot
After the reboot, install the build tools and the headers for the running kernel:
sudo apt-get install build-essential pkg-config linux-headers-`uname -r`
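Instead of editing blacklist-nouveau.conf by hand, the same five lines can be written from a script. A minimal sketch; `write_nouveau_blacklist` is a hypothetical helper that takes the target path as an argument, so it can be tried on a scratch file before being pointed (with root privileges) at /etc/modprobe.d/blacklist-nouveau.conf.

```shell
# Write the nouveau blacklist entries to the file given as $1.
write_nouveau_blacklist() {
    cat > "$1" <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
}

# Demonstration on a temporary file:
out=$(mktemp)
write_nouveau_blacklist "$out"
cat "$out"
rm -f "$out"
```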
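3. Install the NVIDIA driver
3.1 Download
Download a suitable driver from the NVIDIA website (version 418.126.02 at the time of writing): https://www.nvidia.com/Download/index.aspx?lang=en-us
It can also be downloaded from UFile, which is faster: http://gpu.cn-bj.ufileos.com/NVIDIA-Linux-x86_64-418.126.02.run
3.2 Install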
sudo chmod +x NVIDIA-Linux-x86_64-418.126.02.run
sudo ./NVIDIA-Linux-x86_64-418.126.02.run
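When several installers are kept side by side, the driver version can be read back from the file name. A sketch using only shell parameter expansion; `runfile_version` is a hypothetical helper and assumes the NVIDIA-Linux-x86_64-VERSION.run naming seen in this step.

```shell
# Strip the directory, the fixed prefix, and the .run suffix
# from an installer file name, leaving the version string.
runfile_version() {
    name=${1##*/}
    name=${name#NVIDIA-Linux-x86_64-}
    echo "${name%.run}"
}

runfile_version "NVIDIA-Linux-x86_64-418.126.02.run"
```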
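3.3 Check the driver status
$ sudo nvidia-smi
If the command prints a status table listing the GPU, the driver is working.
4. Install the CUDA libraries
4.1 Network installation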
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-ubuntu1604.pin
sudo mv cuda-ubuntu1604.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
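IV. Ubuntu 18.04 Environment Setup
1. Check that the GPU device is recognized.
$ sudo lspci | grep NVIDIA
3D controller: NVIDIA Corporation GK210GL [Tesla K80]  (recognized as a K80)
3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)  (recognized as a P40)
2. Disable the open-source nouveau driver
Edit the following file:
sudo vim /etc/modprobe.d/blacklist-nouveau.conf
Add the following lines: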
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Regenerate the initramfs and reboot:
sudo update-initramfs -u
sudo reboot
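After the reboot, install the build tools:
sudo apt-get install build-essential pkg-config
The Ubuntu 18.04 image offered in the console ships kernel 4.15.0-68-generic. The matching linux-headers-4.15.0-68 package, which is required to install the driver, can no longer be downloaded from Ubuntu (its status is "deleted"), so upgrade the kernel to a later version first.
Kernels can be downloaded from the official site https://kernel.ubuntu.com/~kernel-ppa/mainline/, for example 4.15.1.
They can also be downloaded from UFile, which is faster: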
http://gpu.cn-bj.ufileos.com/linux-headers-4.15.1-041501-generic_4.15.1-041501.201802031831_amd64.deb
http://gpu.cn-bj.ufileos.com/linux-headers-4.15.1-041501_4.15.1-041501.201802031831_all.deb
http://gpu.cn-bj.ufileos.com/linux-image-4.15.1-041501-generic_4.15.1-041501.201802031831_amd64.deb
Install the kernel, reboot, and check the version:
sudo dpkg -i *.deb
sudo reboot
uname -r
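After the reboot, a script can confirm that the running kernel is at least the version that was installed. A sketch, assuming GNU sort with the `-V` (version sort) option; `version_ge` is a hypothetical helper.

```shell
# Return 0 if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with the two kernel versions mentioned in this step;
# on a live host the first argument would be "$(uname -r)".
if version_ge "4.15.1-041501-generic" "4.15.0-68-generic"; then
    echo "kernel is new enough"
fi
```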
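3. Install the NVIDIA driver
3.1 Download
Download a suitable driver from the NVIDIA website (version 418.126.02 at the time of writing): https://www.nvidia.com/Download/index.aspx?lang=en-us
It can also be downloaded from UFile, which is faster: http://gpu.cn-bj.ufileos.com/NVIDIA-Linux-x86_64-418.126.02.run
3.2 Install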
sudo chmod +x NVIDIA-Linux-x86_64-418.126.02.run
sudo ./NVIDIA-Linux-x86_64-418.126.02.run
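3.3 Check the driver status
$ sudo nvidia-smi
If the command prints a status table listing the GPU, the driver is working.
4. Install the CUDA libraries
4.1 Network installation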
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
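The 16.04 and 18.04 repository URLs in this guide differ only in the release tag. A sketch of building the URL from a release number such as the one printed by `lsb_release -rs`; `cuda_repo_url` is a hypothetical helper, and it only reproduces the URL pattern used here without checking that a repository exists for a given release.

```shell
# Build the CUDA apt repository URL for an Ubuntu release like "18.04".
cuda_repo_url() {
    tag=$(printf '%s' "$1" | tr -d '.')
    echo "http://developer.download.nvidia.com/compute/cuda/repos/ubuntu${tag}/x86_64/"
}

cuda_repo_url "18.04"
```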
4.2 Local installation
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
sudo sh cuda_10.2.89_440.33.01_linux.run
V. Rocky Linux 8 Environment Setup
1. Check that the GPU device is recognized.
# yum install pciutils
# sudo lspci | grep NVIDIA
3D controller: NVIDIA Corporation GV100GL [Tesla V100S PCIe 32GB] (rev a1)  (recognized as a V100S)
2. Add the NVIDIA driver repository to the package manager
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
3. Install and configure the NVIDIA driver
sudo dnf install nvidia-driver nvidia-settings
4. Install CUDA
sudo dnf install cuda-driver
5. Reboot and verify
After rebooting the operating system, run nvidia-smi to check that the GPU is working properly.
[root@10-13-47-75 ~]# nvidia-smi
Thu Apr 6 17:10:52 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100S-PCIE-32GB           Off| 00000000:00:03.0 Off |                    0 |
| N/A   26C    P0               35W / 250W|      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
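For automated checks it is often enough to pull the driver and CUDA versions out of the header line. A sketch using sed on captured text; the helper names `smi_driver_version` and `smi_cuda_version` are hypothetical, and the parsing assumes the header layout shown in this step.

```shell
# Extract version numbers from an nvidia-smi header line read on stdin.
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p'
}
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p'
}

# Example with the header line captured above:
header='| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |'
echo "$header" | smi_driver_version
echo "$header" | smi_cuda_version
```

On a live host the input would be piped from `nvidia-smi` itself.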
VI. Red Hat 6.6 Environment Setup
1. Check that the GPU device is recognized.
# yum install pciutils
# sudo lspci | grep NVIDIA
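3D controller: NVIDIA Corporation Device 1df6 (rev a1)  (recognized as a V100S)
If "yum install pciutils" reports "This system is not registered with an entitlement server. You can use subscription-manager to register.",
run the following command to log in with a Red Hat account:
subscription-manager register
Enter the Red Hat account user name and password when prompted.
Confirm that the system is registered and the subscription is enabled:
subscription-manager list --consumed
Then update the system: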
yum update
2. Download the GPU driver
wget https://cn.download.nvidia.com/tesla/460.106.00/NVIDIA-Linux-x86_64-460.106.00.run
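Other driver versions can be downloaded from the official NVIDIA site as needed: https://www.nvidia.cn/Download/index.aspx?lang=cn
3. Disable nouveau
The nouveau driver installed on some Linux systems conflicts with the NVIDIA driver, so it must be disabled first. Run "lsmod | grep nouveau"; if it returns anything, disable nouveau by blacklisting it, as follows: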
# lsmod | grep nouveau
nouveau 1514531 0
ttm 89568 1 nouveau
drm_kms_helper 127731 1 nouveau
drm 355270 3 nouveau,ttm,drm_kms_helper
i2c_algo_bit 5903 1 nouveau
i2c_core 29164 5 i2c_piix4,nouveau,drm_kms_helper,drm,i2c_algo_bit
mxm_wmi 1967 1 nouveau
video 21686 1 nouveau
wmi 6287 2 nouveau,mxm_wmi
# tail -1 /etc/modprobe.d/blacklist.conf
blacklist nouveau
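The lsmod check can be scripted by looking at the first column only, so that modules which merely depend on nouveau do not count. A sketch; `nouveau_loaded` is a hypothetical helper that reads lsmod-style text on stdin.

```shell
# Return 0 if a module named exactly "nouveau" appears in column 1.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Example with two captured lines instead of live `lsmod` output:
if printf 'nouveau 1514531 0\nttm 89568 1 nouveau\n' | nouveau_loaded; then
    echo "nouveau is loaded and should be disabled"
fi
```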
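4. Install the driver
# sudo sh <driver_installer>.run --kernel-source-path=/usr/src/kernels/<kernel_version>
Here "driver_installer" is the file name of the driver installer and "kernel_version" is the kernel version detected on the current system. To check the version of the running kernel: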
uname -r
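The kernel source path is normally the directory under /usr/src/kernels/ named after the running kernel. A sketch that builds and checks that path; `kernel_source_path` is a hypothetical helper, and its optional second argument (a different root directory) exists only so the logic can be tried without the kernel sources installed.

```shell
# Print the kernel source directory for release $1 under root $2
# (default /usr/src/kernels); fail if the directory does not exist.
kernel_source_path() {
    path="${2:-/usr/src/kernels}/$1"
    [ -d "$path" ] || { echo "missing: $path" >&2; return 1; }
    echo "$path"
}

# Demonstration against a scratch directory with a made-up release name:
root=$(mktemp -d)
mkdir "$root/2.6.32-example"
kernel_source_path "2.6.32-example" "$root"
rm -rf "$root"
```

On a real host, `kernel_source_path "$(uname -r)"` would print the value to pass to the installer, or report that the matching kernel-devel package is missing.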
5. Verify that the GPU works
Use nvidia-smi to verify; if the card model is displayed correctly, the GPU is ready for use.
# nvidia-smi
Fri Apr 7 15:02:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:00:03.0 Off |                    0 |
| N/A   26C    P0    35W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |