Query code: 00000163
SDK, conda, and docker installation on x86, with sample model tests (detailed steps, combined BI + MR edition)
Expert author: 宋美霞, written 2025-01-08, edited 2025-05-14

Preparation before driver installation

Check the GPU card's bandwidth

lspci -vvv | grep -A 25 1e3e | grep -E "1e3e|LnkSta|Memory"

On the LnkSta lines of the output, both Speed and Width should be followed by (ok).
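The check above can also be scripted. The following is a minimal sketch, assuming lspci prints "(ok)" after both Speed and Width on healthy LnkSta lines, as described above. The function name count_bad_links is hypothetical and not part of any Iluvatar tool.

```shell
# Hypothetical helper: read `lspci -vvv` text on stdin and print the number of
# LnkSta lines where Speed or Width is not followed by "(ok)".
count_bad_links() {
    grep 'LnkSta:' | grep -vc 'Speed [^,]*(ok), Width [^,]*(ok)'
}

# Usage on a real host (1e3e is the vendor ID used in the command above):
#   lspci -vvv | grep -A 25 1e3e | count_bad_links
```

A result of 0 means every link reported (ok) for both fields. Any other number is the count of links worth inspecting by hand.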


Run permissions

Switching to the root user is recommended:

sudo su


Required dependency packages

gcc make unzip tar


Installing the dependency packages:

Ubuntu:

apt-get update

apt-get install -y make gcc unzip tar

Kylin, CentOS, RHEL, and similar systems:
yum install -y make gcc unzip tar
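After installing, it can help to confirm that all four required tools are actually on the PATH. This is a small sketch that works on either distribution family. The function name missing_tools is hypothetical.

```shell
# Hypothetical helper: print each named command that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the four required packages listed above.
missing=$(missing_tools gcc make unzip tar)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools are installed."
fi
```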


Prevent automatic kernel upgrades

Ubuntu:

apt-mark hold linux-image-generic linux-headers-generic linux-headers-$(uname -r) linux-image-$(uname -r) linux-modules-$(uname -r) linux-modules-extra-$(uname -r)

Kylin, CentOS, RHEL, and similar systems:
echo "exclude=kernel*" >>/etc/yum.conf
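Running the echo command above more than once appends a duplicate line each time. A possible alternative, sketched here with a hypothetical helper named append_once, adds the line only when it is not already present.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already there.
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage on a yum-based host:
#   append_once 'exclude=kernel*' /etc/yum.conf
```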


Install GCC 12

Required on Ubuntu when the kernel version is higher than 5.19, or when installing the Iluvatar driver fails with error code 32.

apt install software-properties-common

apt-get update

add-apt-repository ppa:ubuntu-toolchain-r/test

apt install gcc-12

Check that gcc-12 installed successfully

gcc-12 --version
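Whether the kernel-version condition applies can be tested from the output of uname -r. The sketch below compares only the major and minor numbers, so it assumes a release string of the usual form such as 5.15.0-91-generic. The function name kernel_above_5_19 is hypothetical.

```shell
# Hypothetical helper: succeed when a kernel release string (as printed by
# `uname -r`) has a major.minor version higher than 5.19.
kernel_above_5_19() {
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -gt 19 ]; }
}

if kernel_above_5_19 "$(uname -r)"; then
    echo "Kernel is newer than 5.19: install gcc-12 before the driver."
else
    echo "Kernel is 5.19 or older: gcc-12 is only needed if the driver install reports error code 32."
fi
```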


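Note:
To run the sample models on the host or in a conda environment, you need the Iluvatar framework packages; create a directory for them first (skip this step if you only need to install the driver).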

cd /home

mkdir whl310

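Download the required packages over sftp: Zhikai (智铠) products (MR-50, MR-100)

Files needed to install the driver:
sftp -P 29880 iluvatar_mr@iftp.iluvatar.com.cn

sftp password: contact an Iluvatar engineer
CUDA header files: get /partial_install_cuda_header.tar.gz /home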


Firmware:

MR-50 card: get /MR_4.2.0/x86/firmware/fw_tool/mr50/ixfw-mr_50-2.0.20_x86_64.run /home
MR-100 card: get /MR_4.2.0/x86/firmware/fw_tool/mr100/ixfw-mr_100-2.0.20_x86_64.run /home

SDK:

get /MR_4.2.0/x86/sdk/corex-installer-linux64-4.2.0_x86_64_10.2.run /home

get /MR_4.2.0/x86/sdk/corex-samples-4.2.0_x86_64.run /home


Test tool download

get /client_tmp/support/easy_check.sh /home


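Notes:

1. To run the sample models on the host or in a conda environment, download the Iluvatar framework packages to /home/whl310 and the automatic framework install script to /home: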
get /client_tmp/support/installwhl.sh /home
get /MR_4.2.0/x86/apps/py3.10/* /home/whl310
get /MR_4.2.0/x86/add-on/py3.10/* /home/whl310


2. To create a conda environment, download the conda installer to /home:
get /client_tmp/support/Miniconda3-latest-Linux-x86_64.sh /home


3. To create a docker environment, download the docker offline packages and the docker image installer to /home:
get /client_tmp/support/docker-20.10.14.tgz /home
get /client_tmp/support/docker.service /home
get /client_tmp/support/daemon.json /home
Ubuntu: get /MR_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run /home
Kylin, CentOS, RHEL, and similar systems: get /MR_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run /home


4. To run the sample model tests, download the sample model package and the dataset:
get /MR_4.2.0/x86/sdk/inferencesamples-4.2.0.tgz /home
get /dataset/corex-inference-data-3.3.0.tar /home

Download the required packages over sftp: Tiangai (天垓) products (BI-100, BI-150, BI-150s)

Files needed to install the driver:

sftp -P 29880 iluvatar_corex@iftp.iluvatar.com.cn

sftp password: contact an Iluvatar engineer

CUDA header files: get /partial_install_cuda_header.zip /home


Firmware:

BI-100:
get /BI_100/3.2.0_BI100/x86/fw_tool/bi100/ixfw-bi_v100-5.2.1_x86_64.run /home 

BI-150:
get /BI_150/BI_4.2.0/x86/firmware/fw_tool/bi150/ixfw-bi_v150-2.0.0_x86_64.run /home 

BI-150s:
get /BI_150/BI_4.2.0/x86/firmware/fw_tool/bi150s/ixfw-bi_v150s-1.0.3_x86_64.run /home 


SDK:

BI-100: get /BI_100/3.2.1_BI100/x86/corex-installer-linux64-3.2.1_x86_64_10.2.run /home

BI-150, BI-150s:

get /BI_150/BI_4.2.0/x86/sdk/corex-installer-linux64-4.2.0_x86_64_10.2.run /home 

get /BI_150/BI_4.2.0/x86/sdk/deeplearningsamples-4.2.0.tgz /home 


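Note:
To run the sample models on the host or in a conda environment, download the Iluvatar framework packages to /home/whl310 and the automatic framework install script to /home.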

get /client_tmp/support/installwhl.sh /home

BI-100:

get /BI_100/3.2.1_BI100/x86/apps-py3.10/* /home/whl310 


BI-150:

get /BI_150/BI_4.2.0/x86/apps/py3.10/* /home/whl310 

get /BI_150/BI_4.2.0/x86/add-on/py3.10/*.whl /home/whl310



Test tool download

get /client_tmp/support/easy_check.sh /home



Notes:

1. To create a conda environment, download the conda installer to /home:

get /client_tmp/support/Miniconda3-latest-Linux-x86_64.sh /home


2. To create a docker environment, download the docker offline packages and the docker image installer to /home:

get /client_tmp/support/docker-20.10.14.tgz /home

get /client_tmp/support/docker.service /home

get /client_tmp/support/daemon.json /home


Pick the package that matches your operating system and accelerator card model:

BI-100, Ubuntu:

get /BI_100/3.2.1_BI100/x86/corex-docker-installer-3.2.1-10.2-ubuntu20.04-py3.10-x86_64.run /home 

BI-100, Kylin:

get /BI_100/3.2.1_BI100/x86/corex-docker-installer-3.2.1-10.2-centos7.8.2003-py3.10-x86_64.run /home 


BI-150, Ubuntu:

get /BI_150/BI_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run /home

BI-150, Kylin:

get /BI_150/BI_4.2.0/x86/sdk/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run /home 


3. To run the sample model tests, download the dataset:

get /data/corex-test-tools-data-mini-2.1.0.tar /home


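Driver and SDK installation steps

Install the CUDA headers
cd /home

BI products (zip archive):
unzip partial_install_cuda_header.zip

MR products (tar.gz archive; the extracted directory name is assumed to be the same):
tar -zxvf partial_install_cuda_header.tar.gz

cd partial_install_cuda_header

bash install-cuda-header.sh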


Upgrade the firmware
MR-50:

bash /home/ixfw-mr_50-2.0.20_x86_64.run

MR-100:

bash /home/ixfw-mr_100-2.0.20_x86_64.run

BI-100:

bash /home/ixfw-bi_v100-5.2.1_x86_64.run

BI-150:

bash /home/ixfw-bi_v150-2.0.0_x86_64.run

BI-150s:

bash /home/ixfw-bi_v150s-1.0.3_x86_64.run
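When the same procedure is scripted for several machines, the card model can be mapped to its firmware package in one place. This is only a sketch: the function name firmware_for is hypothetical, and the file names are the ones listed in this guide.

```shell
# Hypothetical helper: map a card model to the firmware package named in this guide.
firmware_for() {
    case "$1" in
        MR-50)   echo "ixfw-mr_50-2.0.20_x86_64.run" ;;
        MR-100)  echo "ixfw-mr_100-2.0.20_x86_64.run" ;;
        BI-100)  echo "ixfw-bi_v100-5.2.1_x86_64.run" ;;
        BI-150)  echo "ixfw-bi_v150-2.0.0_x86_64.run" ;;
        BI-150s) echo "ixfw-bi_v150s-1.0.3_x86_64.run" ;;
        *)       echo "unknown model: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   MODEL=BI-150
#   bash "/home/$(firmware_for "$MODEL")"
```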


Install the driver and software stack on the host
MR-50, MR-100, BI-150, BI-150s:

bash /home/corex-installer-linux64-4.2.0_x86_64_10.2.run --silent --driver --toolkit

BI-100:

bash /home/corex-installer-linux64-3.2.1_x86_64_10.2.run --silent --driver --toolkit


Set the environment variables

vi /root/.bashrc

Add the following lines:

export LD_LIBRARY_PATH=/usr/local/corex/lib:$LD_LIBRARY_PATH

export PATH=/usr/local/corex/bin:$PATH

Apply the environment variables

source /root/.bashrc
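To confirm that both variables took effect in the current shell, the colon-separated lists can be searched for the CoreX directories. A minimal sketch follows; the function name path_has is hypothetical.

```shell
# Hypothetical helper: succeed when directory $1 is an entry of the
# colon-separated list $2.
path_has() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

path_has /usr/local/corex/bin "$PATH" \
    || echo "PATH is missing /usr/local/corex/bin"
path_has /usr/local/corex/lib "${LD_LIBRARY_PATH:-}" \
    || echo "LD_LIBRARY_PATH is missing /usr/local/corex/lib"
```

No output means both entries are present.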

Note: Some older SDK versions need a symlink for the SDK directory, for example:

ln -s /usr/local/corex-3.2.1 /usr/local/corex


Check that the installation succeeded
ixsmi

If accelerator card information is printed, the installation is working.


Run the bandwidth, compute, and P2P tests
cd /home

sh easy_check.sh  | tee -a test.txt

Sample model tests (conda environment)

Install Miniconda3
 

bash /home/Miniconda3-latest-Linux-x86_64.sh -b -u -p /root/miniconda3

/root/miniconda3/bin/conda init

Reload the shell configuration so that conda is available
source /root/.bashrc  

Create the conda environment
conda create --name py310 python=3.10

conda env list

conda activate py310
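The framework packages in this guide come from py3.10 directories, so it is worth confirming that the activated environment really runs Python 3.10 before installing them. A small sketch, with the hypothetical function name is_python_310:

```shell
# Hypothetical helper: succeed when a `python --version` string reports 3.10.x.
is_python_310() {
    case "$1" in
        "Python 3.10"|"Python 3.10."*) return 0 ;;
        *)                             return 1 ;;
    esac
}

# Usage inside the activated environment:
#   is_python_310 "$(python --version 2>&1)" && echo "Python 3.10 is active"
```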


Install the model dependency packages
Ubuntu:
apt-get install -y libncursesw5 libjpeg-dev zlib1g-dev rustc cargo libmpich-dev libopenmpi-dev libgirepository1.0-dev cmake libcairo2-dev pkg-config

Kylin, CentOS, and similar systems:
yum install -y ncurses-compat-libs libjpeg-turbo-devel zlib-devel rustc cargo mpich-devel openmpi-devel gobject-introspection-devel cmake cairo-devel pkg-config


conda install  mpi4py


Install the deep learning frameworks for the sample models
cd /home

(This takes a while; please be patient.)
bash installwhl.sh /home/whl310


Install the sample scripts
MR-50, MR-100:

tar -zxvf /home/inferencesamples-4.2.0.tgz

BI-100:

bash /home/corex-samples-3.2.1_x86_64.run

BI-150:

tar -zxvf /home/deeplearningsamples-4.2.0.tgz


Inference sample model tests: MR-50 and MR-100 products

Install the deep learning inference framework sample scripts
cp /home/corex-inference-data-3.3.0.tar /home/inferencesamples/

cd /home/inferencesamples/

tar -xvf corex-inference-data-3.3.0.tar

Prepare the environment for inference
cd /home/inferencesamples

bash quick_build_environment.sh


Run the initialization script init.sh for the model you want to test. For resnet:
cd /home/inferencesamples/executables/resnet

bash init.sh

Run the resnet50 inference tests
bash infer_resnet50_int8_accuracy_ixrt.sh

bash infer_resnet50_int8_accuracy_igie.sh

bash infer_resnet50_int8_performance_ixrt.sh

bash infer_resnet50_int8_performance_igie.sh

bash infer_resnet50_fp16_accuracy_ixrt.sh

bash infer_resnet50_fp16_accuracy_igie.sh

bash infer_resnet50_fp16_performance_igie.sh

bash infer_resnet50_fp16_performance_ixrt.sh
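Running the eight scripts one by one makes it easy to lose track of which ones passed. The loop below is one way to run them in sequence and keep a log per script. It is a sketch only: the function name run_scripts is hypothetical, and it assumes each script signals failure with a non-zero exit status.

```shell
# Hypothetical helper: run every script in directory $1 whose name matches the
# glob $2, from inside that directory, and print one PASS/FAIL line per script.
# Each script's full output is saved next to it as <script>.log.
run_scripts() {
    dir=$1
    pattern=$2
    for script in "$dir"/$pattern; do
        [ -f "$script" ] || continue
        name=$(basename "$script")
        if (cd "$dir" && bash "./$name" > "$name.log" 2>&1); then
            echo "PASS $name"
        else
            echo "FAIL $name"
        fi
    done
}

# Usage:
#   run_scripts /home/inferencesamples/executables/resnet 'infer_resnet50_*.sh'
```

The same call with a different directory and pattern covers the other models in this guide.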


Run the initialization script init.sh for the model you want to test. For yolov5s:
cd /home/inferencesamples/executables/yolov5s

bash init.sh

Run the yolov5s inference tests
bash infer_yolov5s_fp16_accuracy_igie.sh

bash infer_yolov5s_fp16_accuracy_ixrt.sh

bash infer_yolov5s_fp16_performance_igie.sh

bash infer_yolov5s_fp16_performance_ixrt.sh

Sample models: BI-100 products

Install the deep learning framework sample scripts
cp /home/corex-test-tools-data-mini-2.1.0.tar /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/

cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples

tar -xvf corex-test-tools-data-mini-2.1.0.tar

Prepare the environment
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples

bash quick_build_environment.sh


PyTorch framework demo
Single card
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet

bash init_torch.sh

bash train_resnet50_torch.sh

Single-node multi-card distributed training (set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
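The ${gpus} variable in the command above is not defined anywhere in this guide, so it has to be set before the command runs. One way to build the comma-separated list from a device count is sketched below. The function name gpu_list is hypothetical.

```shell
# Hypothetical helper: print "0,1,...,N-1" for N devices.
gpu_list() {
    n=$1
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        list="${list:+$list,}$i"   # add a comma only between entries
        i=$((i + 1))
    done
    echo "$list"
}

gpus=$(gpu_list 4)   # gpus is now 0,1,2,3
echo "CUDA_VISIBLE_DEVICES=$gpus"
```

With gpus set this way, the distributed and horovod commands in this guide can be run unchanged.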

Using AMP mixed precision
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet

bash train_resnet50_amp_torch.sh

Using horovod distributed training (set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/horovod

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh

Using Apex acceleration
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/apex

bash init_torch.sh

bash train_bert_pretrain_apex_torch.sh

Using the DALI acceleration library
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/dali

bash init_torch.sh

bash train_resnet50_dali_torch.sh


Sample models: BI-150 and BI-150s products

Install the deep learning framework sample scripts
cp /home/corex-test-tools-data-mini-2.1.0.tar /home/deeplearningsamples/

cd /home/deeplearningsamples

tar -xvf corex-test-tools-data-mini-2.1.0.tar

Prepare the environment
cd /home/deeplearningsamples

bash quick_build_environment.sh


PyTorch framework demo
Single card
cd /home/deeplearningsamples/executables/resnet

bash init_torch.sh

bash train_resnet50_torch.sh

Single-node multi-card distributed training (set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)

cd /home/deeplearningsamples/executables/resnet

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh

Using AMP mixed precision
cd /home/deeplearningsamples/executables/resnet

bash train_resnet50_amp_torch.sh

Using horovod distributed training (set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)
cd /home/deeplearningsamples/executables/horovod

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh

Using Apex acceleration
cd /home/deeplearningsamples/executables/apex

bash init_torch.sh

bash train_bert_pretrain_apex_torch.sh

Using the DALI acceleration library
cd /home/deeplearningsamples/executables/dali

bash init_torch.sh

bash train_resnet50_dali_torch.sh


Sample model tests (docker environment)

Install docker
cd /home

tar -zxvf docker-20.10.14.tgz

cp docker/* /usr/bin/

cp /home/docker.service /etc/systemd/system/

chmod +x /etc/systemd/system/docker.service

systemctl daemon-reload

systemctl start docker

systemctl enable docker.service

cp /home/daemon.json /etc/docker/

systemctl daemon-reload

systemctl restart docker


Import the Iluvatar base image
(This step takes a long time; please be patient.)

MR-50, MR-100 products
bash /home/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run --silent --disable-dkms  # run this for the Ubuntu image

bash /home/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run --silent --disable-dkms  # run this for the CentOS image

BI-100 products
bash /home/corex-docker-installer-3.2.1-10.2-ubuntu20.04-py3.10-x86_64.run --silent --disable-dkms  # run this for the Ubuntu image

bash /home/corex-docker-installer-3.2.1-10.2-centos7.8.2003-py3.10-x86_64.run --silent --disable-dkms  # run this for the CentOS image

BI-150 products

bash /home/corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run --silent --disable-dkms  # run this for the Ubuntu image

bash /home/corex-docker-installer-4.2.0-10.2-centos7.8.2003-py3.10-x86_64.run --silent --disable-dkms  # run this for the CentOS image


Create a container from the base image
MR-50, MR-100 products
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:4.2.0

BI-100 products
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:3.2.1

BI-150 products
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:4.2.0


Install the sample scripts
MR-50, MR-100 products
bash /home/corex-samples-4.2.0_x86_64.run

BI-100 products
bash /home/corex-samples-3.2.1_x86_64.run

BI-150 products
bash /home/corex-samples-4.2.0_x86_64.run


Inference sample model tests: MR-50 and MR-100 products

Install the deep learning inference framework sample scripts
cp /home/corex-inference-data-3.3.0.tar /home/inferencesamples/

cd /home/inferencesamples/

tar -xvf corex-inference-data-3.3.0.tar

Prepare the environment for inference
cd /home/inferencesamples

bash quick_build_environment.sh

pip3 install /home/whl310/pycuda-2022.2.2+corex.4.2.0-cp310-cp310-linux_x86_64.whl


Run the initialization script init.sh for the model you want to test. For resnet:
cd /home/inferencesamples/executables/resnet

bash init.sh

Run the resnet50 inference tests
bash infer_resnet50_int8_accuracy_ixrt.sh

bash infer_resnet50_int8_accuracy_igie.sh

bash infer_resnet50_int8_performance_ixrt.sh

bash infer_resnet50_int8_performance_igie.sh

bash infer_resnet50_fp16_accuracy_ixrt.sh

bash infer_resnet50_fp16_accuracy_igie.sh

bash infer_resnet50_fp16_performance_igie.sh

bash infer_resnet50_fp16_performance_ixrt.sh


Run the initialization script init.sh for the model you want to test. For yolov5s:
cd /home/inferencesamples/executables/yolov5s

bash init.sh

Run the yolov5s inference tests
bash infer_yolov5s_fp16_accuracy_igie.sh

bash infer_yolov5s_fp16_accuracy_ixrt.sh

bash infer_yolov5s_fp16_performance_igie.sh

bash infer_yolov5s_fp16_performance_ixrt.sh


Sample models: BI-100 products

Install the deep learning framework sample scripts
cp /home/corex-test-tools-data-mini-2.1.0.tar /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/

cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples

tar -xvf corex-test-tools-data-mini-2.1.0.tar

Prepare the environment
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples

bash quick_build_environment.sh


PyTorch framework demo
Single card
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet

bash init_torch.sh

bash train_resnet50_torch.sh

Single-node multi-card distributed training
(set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)

cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh

Using AMP mixed precision
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/resnet

bash train_resnet50_amp_torch.sh

Using horovod distributed training
(set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)

cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/horovod

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh

Using Apex acceleration
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/apex

bash init_torch.sh

bash train_bert_pretrain_apex_torch.sh

Using the DALI acceleration library
cd /root/corex-samples-3.2.1_x86_64/samples/deeplearningsamples/executables/dali

bash init_torch.sh

bash train_resnet50_dali_torch.sh


Sample models: BI-150 products

Install the deep learning framework sample scripts
cp /home/corex-test-tools-data-mini-2.1.0.tar /home/deeplearningsamples/

cd /home/deeplearningsamples

tar -xvf corex-test-tools-data-mini-2.1.0.tar

Prepare the environment
cd /home/deeplearningsamples

bash quick_build_environment.sh


PyTorch framework demo
Single card
cd /home/deeplearningsamples/executables/resnet

bash init_torch.sh

bash train_resnet50_torch.sh

Single-node multi-card distributed training
(set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)

cd /home/deeplearningsamples/executables/resnet

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh

Using AMP mixed precision
cd /home/deeplearningsamples/executables/resnet

bash train_resnet50_amp_torch.sh

Using horovod distributed training
(set gpus to the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)

cd /home/deeplearningsamples/executables/horovod

CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh

Using Apex acceleration
cd /home/deeplearningsamples/executables/apex

bash init_torch.sh

bash train_bert_pretrain_apex_torch.sh

Using the DALI acceleration library
cd /home/deeplearningsamples/executables/dali

bash init_torch.sh

bash train_resnet50_dali_torch.sh


Troubleshooting

Error:

While preparing the environment, bash quick_build_environment.sh fails with:

ERROR: Could not find a version that satisfies the requirement scikit-build (from versions: none) ERROR: No matching distribution found for scikit-build

Fix:

Install the package from the Tsinghua pip mirror: pip install scikit-build -i https://pypi.tuna.tsinghua.edu.cn/simple
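If other packages fail in the same way, the mirror can be made the default index so that -i does not have to be repeated. The sketch below writes a per-user pip configuration file. The function name write_pip_conf is hypothetical, and the path in the usage line assumes pip's standard per-user location on Linux.

```shell
# Hypothetical helper: write a pip configuration file at path $1 that sets
# the default package index to URL $2. Overwrites any existing file.
write_pip_conf() {
    conf=$1
    url=$2
    mkdir -p "$(dirname "$conf")"
    printf '[global]\nindex-url = %s\n' "$url" > "$conf"
}

# Usage:
#   write_pip_conf "$HOME/.config/pip/pip.conf" https://pypi.tuna.tsinghua.edu.cn/simple
```

Because the helper overwrites the target file, back up any existing pip.conf first.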


