Preparation before driver installation
Check the PCIe link bandwidth of the GPU card
lspci -vvv | grep -A 25 1e3e | grep -E "1e3e|LnkSta|Memory"
In the LnkSta line of the output, both Speed and Width should be followed by (ok).
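The LnkSta check can also be scripted. The following is a minimal sketch, not part of any Iluvatar tool: check_link_status is a hypothetical helper that reads lspci output on stdin and flags any LnkSta line that lacks (ok) after Speed or Width. It assumes an lspci version that prints the (ok) / (downgraded) annotations.

```shell
# check_link_status: hypothetical helper, not part of lspci or any vendor tool.
# Reads lspci output on stdin and checks that every "LnkSta:" line carries
# "(ok)" after both Speed and Width.
check_link_status() {
    local line seen=0 bad=0
    while IFS= read -r line; do
        case "$line" in
            *LnkSta:*)
                seen=1
                case "$line" in
                    *"Speed "*"(ok),"*"Width "*"(ok)"*) ;;  # healthy link
                    *) echo "NOT OK: $line"; bad=1 ;;
                esac
                ;;
        esac
    done
    if [ "$seen" -eq 0 ]; then
        echo "no LnkSta line found"
        return 2
    fi
    if [ "$bad" -ne 0 ]; then
        return 1
    fi
    echo "all links ok"
}
```

Usage, run as root so that lspci can read the link status: lspci -vvv | grep -A 25 1e3e | check_link_status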
Privileges
Switching to the root user is recommended.
sudo su
Required packages
gcc make unzip tar
To install the packages:
Ubuntu:
apt-get update
apt-get install -y make gcc unzip tar
Kylin, CentOS, RHEL, and similar systems:
yum install -y make gcc unzip tar
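To confirm that the required tools are present after installation, a small check along these lines may help. missing_tools is a hypothetical helper name; it only looks the commands up on PATH and installs nothing.

```shell
# missing_tools: hypothetical helper that prints each named command that is
# not found on PATH, one per line. Prints nothing when all are present.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report which of the required tools still need to be installed.
missing_tools gcc make unzip tar
```

An empty result means all four tools are available.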
Disable automatic kernel upgrades
Ubuntu:
apt-mark hold linux-image-generic linux-headers-generic linux-headers-$(uname -r) linux-image-$(uname -r) linux-modules-$(uname -r) linux-modules-extra-$(uname -r)
Kylin, CentOS, RHEL, and similar systems:
echo "exclude=kernel*" >>/etc/yum.conf
Install GCC 12
Required on Ubuntu when the kernel version is higher than 5.19, or when installing the Iluvatar driver fails with error code 32.
apt install software-properties-common
apt-get update
add-apt-repository ppa:ubuntu-toolchain-r/test
apt install gcc-12
Verify that GCC 12 was installed:
gcc-12 --version
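Whether the running kernel is newer than 5.19 can be checked with a short script. This is a sketch under two assumptions: "higher than 5.19" is read as a MAJOR.MINOR strictly greater than 5.19, and GNU sort with the -V option is available. version_gt and needs_gcc12 are hypothetical helper names.

```shell
# version_gt A B: succeeds when version A sorts strictly after version B.
# Relies on GNU sort -V (version sort).
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# needs_gcc12 [KERNEL_RELEASE]: succeeds when the kernel's MAJOR.MINOR is
# greater than 5.19. Defaults to the running kernel (uname -r).
needs_gcc12() {
    local kver mm
    kver="${1:-$(uname -r)}"
    # Keep only MAJOR.MINOR, e.g. "6.2" from "6.2.0-39-generic".
    mm=$(printf '%s' "$kver" | cut -d. -f1,2)
    version_gt "$mm" "5.19"
}

if needs_gcc12; then
    echo "kernel $(uname -r): GCC 12 is needed"
else
    echo "kernel $(uname -r): GCC 12 is only needed if the installer reports error code 32"
fi
```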
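Note:
To run the sample models on the host or in a conda environment, you also need the Iluvatar framework packages, so create a directory for them first (skip this step if you only need to install the driver).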
cd /home
mkdir whl310
Download the required packages over SFTP: Zhikai (智铠) products (MR-50, MR-100)
Files required for driver installation:
sftp -P 29880 iluvatar_mr@iftp.iluvatar.com.cn
SFTP password: contact an Iluvatar engineer
CUDA headers: get /partial_install_cuda_header.tar.gz /home
Firmware:
MR-50: get /MR_4.2.0/x86/firmware/fw_tool/mr50/ixfw-mr_50-2.0.20_x86_64.run /home
MR-100: get /MR_4.2.0/x86/firmware/fw_tool/mr100/ixfw-mr_100-2.0.20_x86_64.run /home
SDK:
get /MR_4.2.0/x86/sdk/corex-installer-linux64-4.2.0_x86_64_10.2.run /home
get /MR_4.2.0/x86/sdk/corex-samples-4.2.0_x86_64.run /home
Test tool:
get /client_tmp/support/easy_check.sh /home
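Notes:
1. To run the sample models on the host or in a conda environment, download the Iluvatar framework packages to /home/whl310 and the automatic framework install script to /home: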
get /client_tmp/support/installwhl.sh /home
get /MR_4.2.0/x86/apps/py3.10/* /home/whl310
get /MR_4.2.0/x86/add-on/py3.10/* /home/whl310
2. To create a conda environment, download the conda installer to /home:
get /client_tmp/support/Miniconda3-latest-Linux-x86_64.sh /home
3. To create a Docker environment, download the offline Docker installation packages to /home:
get /client_tmp/support/docker-20.10.14.tgz /home
get /client_tmp/support/docker.service /home
get /client_tmp/support/daemon.json /home
Ubuntu: get /MR_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run /home
Kylin, CentOS, RHEL, and similar systems: get /MR_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run /home
4. To run the sample model tests, download the sample model package and the dataset:
get /MR_4.2.0/x86/sdk/inferencesamples-4.2.0.tgz /home
get /dataset/corex-inference-data-3.3.0.tar /home
Download the required packages over SFTP: Tiangai (天垓) products (BI-100, BI-150, BI-150s)
Files required for driver installation:
sftp -P 29880 iluvatar_corex@iftp.iluvatar.com.cn
SFTP password: contact an Iluvatar engineer
CUDA headers: get /partial_install_cuda_header.zip /home
Firmware:
BI-100:
get /BI_100/3.2.0_BI100/x86/fw_tool/bi100/ixfw-bi_v100-5.2.1_x86_64.run /home
BI-150:
get /BI_150/BI_4.2.0/x86/firmware/fw_tool/bi150/ixfw-bi_v150-2.0.0_x86_64.run /home
BI-150s:
get /BI_150/BI_4.2.0/x86/firmware/fw_tool/bi150s/ixfw-bi_v150s-1.0.3_x86_64.run /home
SDK:
BI-100: get /BI_100/3.2.1_BI100/x86/corex-installer-linux64-3.2.1_x86_64_10.2.run /home
BI-150, BI-150s:
get /BI_150/BI_4.2.0/x86/sdk/corex-installer-linux64-4.2.0_x86_64_10.2.run /home
get /BI_150/BI_4.2.0/x86/sdk/deeplearningsamples-4.2.0.tgz /home
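Note:
To run the sample models on the host or in a conda environment, download the Iluvatar framework packages to /home/whl310 and the automatic framework install script to /home: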
get /client_tmp/support/installwhl.sh /home
BI-100:
get /BI_100/3.2.1_BI100/x86/apps-py3.10/* /home/whl310
BI-150:
get /BI_150/BI_4.2.0/x86/apps/py3.10/* /home/whl310
get /BI_150/BI_4.2.0/x86/add-on/py3.10/*.whl /home/whl310
Test tool:
get /client_tmp/support/easy_check.sh /home
Notes:
1. To create a conda environment, download the conda installer to /home:
get /client_tmp/support/Miniconda3-latest-Linux-x86_64.sh /home
2. To create a Docker environment, download the offline Docker installation packages and the Docker image package to /home:
get /client_tmp/support/docker-20.10.14.tgz /home
get /client_tmp/support/docker.service /home
get /client_tmp/support/daemon.json /home
Choose the file that matches your operating system and accelerator card model:
BI-100, Ubuntu:
get /BI_100/3.2.1_BI100/x86/corex-docker-installer-3.2.1-10.2-ubuntu20.04-py3.10-x86_64.run /home
BI-100, Kylin:
get /BI_100/3.2.1_BI100/x86/corex-docker-installer-3.2.1-10.2-centos7.8.2003-py3.10-x86_64.run /home
BI-150, Ubuntu:
get /BI_150/BI_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run /home
BI-150, Kylin:
get /BI_150/BI_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run /home
3. To run the sample model tests, download the dataset:
get /data/corex-test-tools-data-mini-2.1.0.tar /home
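Installing the driver and SDK
Install the CUDA headers
cd /home
BI products (downloaded as a .zip archive):
unzip partial_install_cuda_header.zip
MR products (downloaded as a .tar.gz archive):
tar -zxvf partial_install_cuda_header.tar.gz
Then, for either product line:
cd partial_install_cuda_header
bash install-cuda-header.sh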
Upgrade the firmware
MR-50:
bash /home/ixfw-mr_50-2.0.20_x86_64.run
MR-100:
bash /home/ixfw-mr_100-2.0.20_x86_64.run
BI-100:
bash /home/ixfw-bi_v100-5.2.1_x86_64.run
BI-150:
bash /home/ixfw-bi_v150-2.0.0_x86_64.run
BI-150s:
bash /home/ixfw-bi_v150s-1.0.3_x86_64.run
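When the same procedure is scripted for several machines, the card-to-firmware mapping above can be captured in one place. The sketch below only restates the file names listed in this guide; firmware_for is a hypothetical helper name.

```shell
# firmware_for MODEL: hypothetical helper that prints the firmware installer
# file name this guide lists for the given card model.
firmware_for() {
    case "$1" in
        MR-50)   echo "ixfw-mr_50-2.0.20_x86_64.run" ;;
        MR-100)  echo "ixfw-mr_100-2.0.20_x86_64.run" ;;
        BI-100)  echo "ixfw-bi_v100-5.2.1_x86_64.run" ;;
        BI-150)  echo "ixfw-bi_v150-2.0.0_x86_64.run" ;;
        BI-150s) echo "ixfw-bi_v150s-1.0.3_x86_64.run" ;;
        *)       echo "unknown card model: $1" >&2; return 1 ;;
    esac
}
```

Usage, assuming the file was downloaded to /home as described earlier: bash "/home/$(firmware_for BI-150)"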
Install the driver and software stack on the host
MR-50, MR-100, BI-150:
bash /home/corex-installer-linux64-4.2.0_x86_64_10.2.run --silent --driver --toolkit
BI-100:
bash /home/corex-installer-linux64-3.2.1_x86_64_10.2.run --silent --driver --toolkit
Set the environment variables
vi /root/.bashrc
Add the following lines:
export LD_LIBRARY_PATH=/usr/local/corex/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/corex/bin:$PATH
Apply the environment variables:
source /root/.bashrc
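As an alternative to editing the file by hand, the two export lines can be appended from a script without creating duplicates on repeated runs. This is a sketch only; append_once is a hypothetical helper name.

```shell
# append_once LINE FILE: hypothetical helper that appends LINE to FILE only
# if FILE does not already contain exactly that line.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $LD_LIBRARY_PATH and $PATH unexpanded in the file):
#   append_once 'export LD_LIBRARY_PATH=/usr/local/corex/lib:$LD_LIBRARY_PATH' /root/.bashrc
#   append_once 'export PATH=/usr/local/corex/bin:$PATH' /root/.bashrc
#   source /root/.bashrc
```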
Note: some older SDK versions need a symbolic link for the SDK directory, for example:
ln -s /usr/local/corex-3.2.1 /usr/local/corex
Verify the installation
ixsmi
If accelerator card information is printed, the installation succeeded.
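In an unattended script the same check might look like the sketch below. It assumes that ixsmi returns a non-zero exit status when it cannot talk to the driver, which this guide does not state; check_ixsmi is a hypothetical helper name.

```shell
# check_ixsmi: hypothetical helper. Confirms that ixsmi is on PATH and that it
# runs without error (assumption: ixsmi exits non-zero on failure).
check_ixsmi() {
    if ! command -v ixsmi >/dev/null 2>&1; then
        echo "ixsmi not found on PATH"
        return 1
    fi
    if ixsmi >/dev/null 2>&1; then
        echo "ixsmi ran successfully"
    else
        echo "ixsmi exited with an error"
        return 1
    fi
}
```

If ixsmi is not found, check the PATH and LD_LIBRARY_PATH settings from the previous step first.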
Run the bandwidth, compute, and P2P tests
cd /home
sh easy_check.sh | tee -a test.txt
Sample model tests (conda environment)
Install Miniconda3
bash /home/Miniconda3-latest-Linux-x86_64.sh -b -u -p /root/miniconda3
/root/miniconda3/bin/conda init
Reload the shell configuration so that conda is available
source /root/.bashrc
Create the conda environment
conda create --name py310 python=3.10
conda env list
conda activate py310
Install the model dependencies
Ubuntu:
apt-get install -y libncursesw5 libjpeg-dev zlib1g-dev rustc cargo libmpich-dev libopenmpi-dev libgirepository1.0-dev cmake libcairo2-dev pkg-config
Kylin, CentOS, and similar systems:
yum install -y ncurses-compat-libs libjpeg-turbo-devel zlib-devel rustc cargo mpich-devel openmpi-devel gobject-introspection-devel cmake cairo-devel pkg-config
conda install mpi4py
Install the deep learning frameworks for the sample models (this takes a while; please be patient)
cd /home
bash installwhl.sh /home/whl310
Install the sample scripts
MR-50, MR-100:
tar -zxvf /home/inferencesamples-4.2.0.tgz
BI-100:
bash /home/corex-samples-3.2.1_x86_64.run
BI-150:
tar -zxvf /home/deeplearningsamples-4.2.0.tgz
Inference sample model tests: MR-50 and MR-100
Set up the deep learning inference sample scripts and dataset
cp /home/corex-inference-data-3.3.0.tar /home/inferencesamples/
cd /home/inferencesamples/
tar -xvf corex-inference-data-3.3.0.tar
Prepare the inference environment
cd /home/inferencesamples
bash quick_build_environment.sh
Run the init.sh script for the model under test. For resnet:
cd /home/inferencesamples/executables/resnet
bash init.sh
Run the resnet50 inference tests
bash infer_resnet50_int8_accuracy_ixrt.sh
bash infer_resnet50_int8_accuracy_igie.sh
bash infer_resnet50_int8_performance_ixrt.sh
bash infer_resnet50_int8_performance_igie.sh
bash infer_resnet50_fp16_accuracy_ixrt.sh
bash infer_resnet50_fp16_accuracy_igie.sh
bash infer_resnet50_fp16_performance_igie.sh
bash infer_resnet50_fp16_performance_ixrt.sh
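The eight resnet50 scripts follow a regular naming pattern (precision, test type, backend), so they can also be driven from a loop. The sketch below is one possible wrapper, not something shipped with the samples; list_resnet50_scripts and run_resnet50_suite are hypothetical names, and the run order differs slightly from the list above.

```shell
# list_resnet50_scripts: prints the eight script names from the list above.
list_resnet50_scripts() {
    local precision mode backend
    for precision in int8 fp16; do
        for mode in accuracy performance; do
            for backend in ixrt igie; do
                echo "infer_resnet50_${precision}_${mode}_${backend}.sh"
            done
        done
    done
}

# run_resnet50_suite: runs each script in turn from the current directory and
# stops at the first failure.
run_resnet50_suite() {
    local script
    for script in $(list_resnet50_scripts); do
        echo "== $script"
        bash "$script" || { echo "FAILED: $script"; return 1; }
    done
}
```

Usage: cd /home/inferencesamples/executables/resnet, then run_resnet50_suite 2>&1 | tee resnet50.log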
Run the init.sh script for the model under test. For yolov5s:
cd /home/inferencesamples/executables/yolov5s
bash init.sh
Run the yolov5s inference tests
bash infer_yolov5s_fp16_accuracy_igie.sh
bash infer_yolov5s_fp16_accuracy_ixrt.sh
bash infer_yolov5s_fp16_performance_igie.sh
bash infer_yolov5s_fp16_performance_ixrt.sh
Sample models: BI-100
Set up the deep learning framework sample scripts and dataset
cp /home/corex-test-tools-data-mini-2.1.0.tar /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples
bash quick_build_environment.sh
PyTorch usage examples
Single card
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
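The gpus variable is not set anywhere in this guide, so it has to be defined before the command above is run. A minimal sketch follows; the value 0,1,2,3 is only an example, and count_devices is a hypothetical helper for a quick sanity check.

```shell
# Example value: the first four devices. Adjust to the devices you want to use.
gpus=0,1,2,3

# count_devices LIST: hypothetical helper that counts comma-separated entries.
count_devices() {
    local IFS=,
    set -- $1
    echo "$#"
}

echo "using $(count_devices "$gpus") device(s): $gpus"
# Then, from the resnet directory:
#   CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
```

The same variable applies to the other multi-device commands in this guide that use ${gpus}.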
AMP mixed precision
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Apex acceleration
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
DALI acceleration library
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh
Sample models: BI-150 and BI-150s
Set up the deep learning framework sample scripts and dataset
cp /home/corex-test-tools-data-mini-2.1.0.tar /home/deeplearningsamples/
cd /home/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /home/deeplearningsamples
bash quick_build_environment.sh
PyTorch usage examples
Single card
cd /home/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards, four chips)
cd /home/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
AMP mixed precision
cd /home/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards, four chips)
cd /home/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Apex acceleration
cd /home/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
DALI acceleration library
cd /home/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh
Sample model tests (Docker environment)
Install Docker
cd /home
tar -zxvf docker-20.10.14.tgz
cp docker/* /usr/bin/
cp /home/docker.service /etc/systemd/system/
chmod +x /etc/systemd/system/docker.service
systemctl daemon-reload
systemctl start docker
systemctl enable docker.service
cp /home/daemon.json /etc/docker/
systemctl daemon-reload
systemctl restart docker
Import the Iluvatar base image
(This step takes a while; please be patient.)
MR-50, MR-100:
bash /home/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run --silent --disable-dkms  # run this one for the Ubuntu image
bash /home/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run --silent --disable-dkms  # run this one for the CentOS image
BI-100:
bash /home/corex-docker-installer-3.2.1-10.2-ubuntu20.04-py3.10-x86_64.run --silent --disable-dkms  # run this one for the Ubuntu image
bash /home/corex-docker-installer-3.2.1-10.2-centos7.8.2003-py3.10-x86_64.run --silent --disable-dkms  # run this one for the CentOS image
BI-150:
bash /home/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run --silent --disable-dkms  # run this one for the Ubuntu image
bash /home/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run --silent --disable-dkms  # run this one for the CentOS image
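The installer file name depends only on the product and the image operating system, so a script can derive it. The sketch below restates the file names used in this guide; docker_installer_for is a hypothetical helper, and mapping kylin and rhel to the centos7.8.2003 installer follows the download instructions earlier in this guide.

```shell
# docker_installer_for PRODUCT OS: hypothetical helper that prints the
# corex-docker-installer file name this guide lists for that combination.
docker_installer_for() {
    local product="$1" os="$2" version base
    case "$product" in
        MR-50|MR-100|BI-150) version=4.2.0 ;;
        BI-100)              version=3.2.1 ;;
        *) echo "unknown product: $product" >&2; return 1 ;;
    esac
    case "$os" in
        ubuntu)            base=ubuntu20.04 ;;
        centos|kylin|rhel) base=centos7.8.2003 ;;
        *) echo "unknown OS: $os" >&2; return 1 ;;
    esac
    echo "corex-docker-installer-${version}-10.2-${base}-py3.10-x86_64.run"
}
```

Usage: bash "/home/$(docker_installer_for BI-100 ubuntu)" --silent --disable-dkms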
Create a container from the base image
MR-50, MR-100:
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:4.2.0
BI-100:
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:3.2.1
BI-150:
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:4.2.0
Install the sample scripts
MR-50, MR-100:
bash /home/corex-samples-4.2.0_x86_64.run
BI-100:
bash /home/corex-samples-3.2.1_x86_64.run
BI-150:
bash /home/corex-samples-4.2.0_x86_64.run
Inference sample model tests: MR-50 and MR-100
Set up the deep learning inference sample scripts and dataset
cp /home/corex-inference-data-3.3.0.tar /home/inferencesamples/
cd /home/inferencesamples/
tar -xvf corex-inference-data-3.3.0.tar
Prepare the inference environment
cd /home/inferencesamples
bash quick_build_environment.sh
pip3 install /home/whl310/pycuda-2022.2.2+corex.4.2.0-cp310-cp310-linux_x86_64.whl
Run the init.sh script for the model under test. For resnet:
cd /home/inferencesamples/executables/resnet
bash init.sh
Run the resnet50 inference tests
bash infer_resnet50_int8_accuracy_ixrt.sh
bash infer_resnet50_int8_accuracy_igie.sh
bash infer_resnet50_int8_performance_ixrt.sh
bash infer_resnet50_int8_performance_igie.sh
bash infer_resnet50_fp16_accuracy_ixrt.sh
bash infer_resnet50_fp16_accuracy_igie.sh
bash infer_resnet50_fp16_performance_igie.sh
bash infer_resnet50_fp16_performance_ixrt.sh
Run the init.sh script for the model under test. For yolov5s:
cd /home/inferencesamples/executables/yolov5s
bash init.sh
Run the yolov5s inference tests
bash infer_yolov5s_fp16_accuracy_igie.sh
bash infer_yolov5s_fp16_accuracy_ixrt.sh
bash infer_yolov5s_fp16_performance_igie.sh
bash infer_yolov5s_fp16_performance_ixrt.sh
Sample models: BI-100
Set up the deep learning framework sample scripts and dataset
cp /home/corex-test-tools-data-mini-2.1.0.tar /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples
bash quick_build_environment.sh
PyTorch usage examples
Single card
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
AMP mixed precision
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Apex acceleration
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
DALI acceleration library
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh
Sample models: BI-150
Set up the deep learning framework sample scripts and dataset
cp /home/corex-test-tools-data-mini-2.1.0.tar /home/deeplearningsamples/
cd /home/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /home/deeplearningsamples
bash quick_build_environment.sh
PyTorch usage examples
Single card
cd /home/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards, four chips)
cd /home/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
AMP mixed precision
cd /home/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards, four chips)
cd /home/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Apex acceleration
cd /home/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
DALI acceleration library
cd /home/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh
Troubleshooting
Error:
While preparing the environment, bash quick_build_environment.sh fails with:
ERROR: Could not find a version that satisfies the requirement scikit-build (from versions: none)
ERROR: No matching distribution found for scikit-build
Fix:
Install the package from the Tsinghua PyPI mirror: pip install scikit-build -i https://pypi.tuna.tsinghua.edu.cn/simple
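If other packages fail for the same reason, the mirror can be set for the whole shell session rather than per command. The sketch below uses PIP_INDEX_URL, pip's environment-variable form of --index-url. It assumes that quick_build_environment.sh calls pip without its own -i option, which this guide does not confirm; use_tuna_mirror is a hypothetical helper name.

```shell
# use_tuna_mirror: hypothetical helper that points pip at the Tsinghua mirror
# for the current shell session via pip's PIP_INDEX_URL environment variable.
use_tuna_mirror() {
    export PIP_INDEX_URL="https://pypi.tuna.tsinghua.edu.cn/simple"
}

use_tuna_mirror
echo "pip index: $PIP_INDEX_URL"
# Then rerun the failing step:
#   bash quick_build_environment.sh
```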