Query code: 00000194
SDK, conda, and Docker installation plus sample model testing on ARM platforms with BI cards (detailed steps)
Expert. Author: 宋美霞, February 19, 2025; edited May 14, 2025

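Preparation before driver installation

After Ubuntu or Kylin is installed, perform the following steps:
Switching to the root user is recommended
sudo su -
Install system packages (only the ones that are missing)
Ubuntu: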
apt-get update
apt-get install -y make gcc unzip tar
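Kylin:
yum install -y make unzip tar
Disable automatic kernel upgrades (run this on Ubuntu)
apt-mark hold linux-image-generic linux-headers-generic linux-headers-$(uname -r) linux-image-$(uname -r) linux-modules-$(uname -r) linux-modules-extra-$(uname -r)
Disable automatic kernel upgrades (run this on yum-based systems such as Kylin or CentOS)
echo "exclude=kernel*" >>/etc/yum.conf
Install GCC 12 (needed on Ubuntu when the kernel version is higher than 5.19, or when the Iluvatar driver installation fails with errorcode32)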
apt install software-properties-common
apt-get update
add-apt-repository ppa:ubuntu-toolchain-r/test
apt install gcc-12
Verify that GCC 12 was installed successfully
gcc-12 --version
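The kernel condition for the GCC 12 step can be checked from the shell. The following is a minimal sketch, assuming GNU coreutils (`sort -V`); the helper names `kernel_major_minor` and `version_gt` are hypothetical and not part of any Iluvatar tooling.

```shell
# Reduce a kernel release string such as 5.15.0-91-generic to major.minor (5.15).
kernel_major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed only if version $1 is strictly greater than version $2.
# Relies on the version-aware ordering of GNU `sort -V`.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Usage on the host:
#   if version_gt "$(kernel_major_minor "$(uname -r)")" 5.19; then
#       echo "kernel is newer than 5.19: install GCC 12 before the driver"
#   fi
```

Comparing only major.minor keeps a release such as 5.19.0 from being treated as newer than 5.19.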
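To run the sample models on the host or in a conda environment, you need the Iluvatar framework packages; create the directory /home/whl310 to hold the downloads. Skip this step if you only need to install the driver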
cd /home
mkdir whl310

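Download the required installation packages over SFTP: Tiangai (天垓) products (BI100, BI150)

SFTP password for Tiangai products (BI100, BI150): contact an Iluvatar engineer
If you only need to install the driver, the following four or five files are enough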
sftp -P 29880 iluvatar_corex@iftp.iluvatar.com.cn
get /partial_install_cuda_header.zip /home 
get /client_tmp/support/arm/ixfw-bi_v100-5.2.1_aarch64.run /home (for Tiangai 100)
get /BI_150/BI150_4.1.3/arm/firmware/fw_tool/bi150/ixfw-bi_v150-2.0.0_aarch64.run /home (for Tiangai 150)
get /BI_100/BI-100-3.2.1-0124/corex-installer-linux64-3.2.1_arm64_10.2.run /home (for Tiangai 100)
get /BI_150/BI150_4.1.3/arm/sdk/corex-installer-linux64-4.1.3_aarch64_10.2.run /home (for Tiangai 150)
get /BI_100/BI-100-3.2.1-0124/corex-samples-3.2.1_arm64.run /home (for Tiangai 100)
To run the sample models on the host or in a conda environment, also download the Iluvatar framework packages to /home/whl310 and the automatic framework installation script to /home
get /client_tmp/support/installwhl.sh /home
get /BI_100/3.2.1_BI100/arm/apps-py3.10/* /home/whl310 (for Tiangai 100)
get /BI_150/BI150_4.1.3/arm/apps/py3.10/* /home/whl310 (for Tiangai 150)
get /BI_150/BI150_4.1.3/arm/add-on/py3.10/* /home/whl310 (for Tiangai 150)
To create a conda environment, download the conda installer to /home
get /client_tmp/support/arm/Miniconda3-latest-Linux-aarch64.sh /home
To create a Docker environment, download the offline Docker installation packages to /home
get /client_tmp/support/arm/docker-27.3.0.tgz /home
get /client_tmp/support/docker.service /home
get /client_tmp/support/daemon.json /home
get /BI_100/3.2.1_BI100/arm/docker-installer/corex-docker-installer-3.2.1-10.2-ubuntu20.04-py3.10-arm64.run /home (for Tiangai 100 on Ubuntu)
get /BI_100/3.2.1_BI100/arm/docker-installer/corex-docker-installer-3.2.1-10.2-centos7.9.2009-py3.10-arm64.run /home (for Tiangai 100 on Kylin)
get /BI_150/BI150_4.1.3/arm/sdk/corex-docker-installer-4.1.3-10.2-ubuntu20.04-py3.10-aarch64.run /home (for Tiangai 150 on Ubuntu)
get /BI_150/BI150_4.1.3/arm/sdk/corex-docker-installer-4.1.3-10.2-centos7.9.2009-py3.10-aarch64.run /home (for Tiangai 150 on Kylin)
To run the sample model tests, download the sample model package and the dataset
get /BI_150/BI150_4.1.3/arm/sdk/deeplearningsamples-4.1.3.tgz /home (for Tiangai 150)
get /data/corex-test-tools-data-mini-2.1.0.tar /home
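Before moving on, it can help to confirm that every file you meant to download actually arrived and is not empty. The following is a minimal sketch; `missing_files` is a hypothetical helper, and the file names in the usage comment are the BI150 driver-only set taken from the list above.

```shell
# Print the name of each expected file that is missing or empty in the given directory.
# First argument: directory; remaining arguments: file names to look for.
missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -s "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Usage (BI150, driver only):
#   missing_files /home \
#       partial_install_cuda_header.zip \
#       ixfw-bi_v150-2.0.0_aarch64.run \
#       corex-installer-linux64-4.1.3_aarch64_10.2.run
```

No output means all listed files are present; any printed name should be downloaded again.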

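Driver and SDK installation steps

Install the CUDA header
Tiangai products
cd /home
unzip partial_install_cuda_header.zip
Then run the following installation steps for all products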
cd partial_install_cuda_header
bash install-cuda-header.sh
Upgrade the firmware before installing the driver
BI100 product
bash /home/ixfw-bi_v100-5.2.1_aarch64.run
BI150 product
bash /home/ixfw-bi_v150-2.0.0_aarch64.run
Install the driver and software stack on the host
BI100 product
bash /home/corex-installer-linux64-3.2.1_arm64_10.2.run --silent --driver --toolkit
BI150 product
bash /home/corex-installer-linux64-4.1.3_aarch64_10.2.run --silent --driver --toolkit
Set environment variables
BI100 product (open /root/.bashrc and add the two export lines below)
vi /root/.bashrc
export LD_LIBRARY_PATH=/usr/local/corex-3.2.1/lib
export PATH=/usr/local/corex-3.2.1/bin:$PATH
Apply the environment variables
source /root/.bashrc
BI150 product (open /root/.bashrc and add the two export lines below)
vi /root/.bashrc
export LD_LIBRARY_PATH=/usr/local/corex-4.1.3/lib
export PATH=/usr/local/corex-4.1.3/bin:$PATH
Apply the environment variables
source /root/.bashrc
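If the environment setup is scripted rather than done by hand in vi, the export lines can be appended so that re-running the script does not duplicate them. The following is a minimal sketch; `add_line_once` is a hypothetical helper, and the paths in the usage comment are the BI150 values from the step above.

```shell
# Append a line to a file only if that exact line is not already present.
add_line_once() {
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines, -F treats the text literally, -- guards a leading dash.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (BI150):
#   add_line_once /root/.bashrc 'export LD_LIBRARY_PATH=/usr/local/corex-4.1.3/lib'
#   add_line_once /root/.bashrc 'export PATH=/usr/local/corex-4.1.3/bin:$PATH'
#   source /root/.bashrc
```

The single quotes in the usage lines keep `$PATH` unexpanded, so it is evaluated each time the shell reads .bashrc.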
Check that the installation succeeded
ixsmi
Check that every accelerator card reports a PCIe link width of x16
lspci -vvv | grep -A 25 1e3e | grep -E "1e3e|LnkSta|Memory"
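With many cards the LnkSta lines are easy to misread, so a filter that prints only links not running at x16 can help. The following is a minimal sketch, assuming the usual `lspci -vvv` wording "LnkSta: Speed ..., Width xN"; `non_x16_links` is a hypothetical helper.

```shell
# Read `lspci -vvv` text on stdin and print every LnkSta line whose
# negotiated link width is anything other than x16.
non_x16_links() {
    awk '/LnkSta:/ && $0 !~ /Width x16([^0-9]|$)/'
}

# Usage, reusing the device filter from the command above:
#   lspci -vvv | grep -A 25 1e3e | non_x16_links
```

Empty output means every reported link negotiated x16; any printed line identifies a link worth checking (slot, riser, or BIOS setting).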

Sample model testing (conda environment)

Install Miniconda3
bash /home/Miniconda3-latest-Linux-aarch64.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init
Reload the shell configuration so that conda becomes available
source ~/.bashrc  
Create and activate a conda environment
conda create --name py310 python=3.10
conda env list
conda activate py310
Install the packages the models depend on
Ubuntu:
apt-get install -y libncursesw5 libjpeg-dev zlib1g-dev rustc cargo libmpich-dev libopenmpi-dev libgirepository1.0-dev cmake libcairo2-dev pkg-config
Kylin:
yum install -y ncurses-compat-libs libjpeg-turbo-devel zlib-devel rustc cargo mpich-devel openmpi-devel gobject-introspection-devel cmake cairo-devel pkg-config
conda install mpi4py
Install the deep learning frameworks for the sample models (this takes a while; please be patient)
cd /home
bash installwhl.sh /home/whl310
Install the sample scripts
BI100 product
bash /home/corex-samples-3.2.1_arm64.run
BI150 product
tar -zxvf /home/deeplearningsamples-4.1.3.tgz

Sample models: BI100 product

Set up the deep learning framework sample scripts (unpack the mini dataset into the samples directory)
cp /home/corex-test-tools-data-mini-2.1.0.tar /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples
bash quick_build_environment.sh
PyTorch framework demos
Single card
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
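The `${gpus}` variable in the multi-card commands is not set by any earlier step, so it has to be defined before the command runs. The following is a minimal sketch, assuming GNU `seq` as shipped on Linux; `device_list` is a hypothetical helper that builds the comma-separated list for the first N devices.

```shell
# Print a comma-separated device index list for the first N devices,
# for example N=4 gives 0,1,2,3.
device_list() {
    seq -s, 0 $(( $1 - 1 ))
}

# Usage:
#   gpus=$(device_list 4)
#   CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
```

Setting `gpus=0,1,2,3` by hand has the same effect; the helper only saves typing when the device count varies between machines.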
Use AMP mixed precision
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Use Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Use Apex acceleration
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
Use the DALI acceleration library
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh

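Sample models: BI150 product

Set up the deep learning framework sample scripts (unpack the mini dataset into the samples directory)
cp /home/corex-test-tools-data-mini-2.1.0.tar /home/deeplearningsamples/
cd /home/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /home/deeplearningsamples
bash quick_build_environment.sh
PyTorch framework demos
Single card
cd /home/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)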
cd /home/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
Use AMP mixed precision
cd /home/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Use Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)
cd /home/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Use Apex acceleration
cd /home/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
Use the DALI acceleration library
cd /home/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh

Sample model testing (Docker environment)

Install Docker
cd /home
tar -zxvf docker-27.3.0.tgz 
cp docker/* /usr/bin/
cp /home/docker.service /etc/systemd/system/
chmod +x /etc/systemd/system/docker.service
systemctl daemon-reload
systemctl start docker
systemctl enable docker.service
cp /home/daemon.json /etc/docker/
systemctl daemon-reload
systemctl restart docker
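After the restart, the daemon may need a few seconds before it accepts commands, which matters if the next step runs from a script. The following is a minimal sketch of a generic retry loop; `wait_until` is a hypothetical helper, and `docker info` in the usage comment is used only because it exits non-zero while the daemon is unreachable.

```shell
# Run a command up to N times, one second apart, until it succeeds.
# First argument: maximum number of attempts; remaining arguments: the command.
wait_until() {
    attempts=$1
    shift
    i=1
    while :; do
        "$@" && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep 1
    done
}

# Usage:
#   wait_until 30 docker info >/dev/null && echo "docker is ready"
```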
Import the Iluvatar base image (this step takes quite a while; please be patient)
BI100 product
bash /home/corex-docker-installer-3.2.1-10.2-ubuntu20.04-py3.10-arm64.run --silent --disable-dkms  # use this for the Ubuntu image
bash /home/corex-docker-installer-3.2.1-10.2-centos7.9.2009-py3.10-arm64.run --silent --disable-dkms  # use this for the CentOS image
BI150 product
bash /home/corex-docker-installer-4.1.3-10.2-ubuntu20.04-py3.10-aarch64.run --silent --disable-dkms  # use this for the Ubuntu image
bash /home/corex-docker-installer-4.1.3-10.2-centos7.9.2009-py3.10-aarch64.run --silent --disable-dkms  # use this for the CentOS image
Create a container from the base image
BI100 product
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:3.2.1
BI150 product
docker run --shm-size="32g" -it -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home:/home --network=host --name=test --privileged --cap-add=ALL --pid=host corex:4.1.3
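The four Docker installer file names used in the image import step differ only in version, base OS, and architecture suffix, so a script can derive the right one from two inputs. The following is a minimal sketch that covers only the file names listed in this article; `docker_installer_name` is a hypothetical helper.

```shell
# Print the Docker installer file name for a product (bi100 or bi150)
# and an image base OS (ubuntu or centos), as listed in this article.
docker_installer_name() {
    product=$1
    os=$2
    case "$product" in
        bi100) version=3.2.1; arch=arm64 ;;
        bi150) version=4.1.3; arch=aarch64 ;;
        *) echo "unknown product: $product" >&2; return 1 ;;
    esac
    case "$os" in
        ubuntu) base=ubuntu20.04 ;;
        centos) base=centos7.9.2009 ;;
        *) echo "unknown OS: $os" >&2; return 1 ;;
    esac
    printf 'corex-docker-installer-%s-10.2-%s-py3.10-%s.run\n' "$version" "$base" "$arch"
}

# Usage:
#   bash "/home/$(docker_installer_name bi150 ubuntu)" --silent --disable-dkms
```

Other SDK versions may not follow the same naming, so check the result against the actual files on the SFTP server.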
Install the sample scripts (run inside the container started above)
BI100 product
bash /home/corex-samples-3.2.1_arm64.run
BI150 product
tar -zxvf /home/deeplearningsamples-4.1.3.tgz

Sample models: BI100 product

Set up the deep learning framework sample scripts (unpack the mini dataset into the samples directory)
cp /home/corex-test-tools-data-mini-2.1.0.tar /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples
bash quick_build_environment.sh
PyTorch framework demos
Single card
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
Use AMP mixed precision
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Use Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses four cards)
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Use Apex acceleration
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
Use the DALI acceleration library
cd /root/corex-samples-3.2.1_arm64/samples/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh

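Sample models: BI150 product

Set up the deep learning framework sample scripts (unpack the mini dataset into the samples directory)
cp /home/corex-test-tools-data-mini-2.1.0.tar /home/deeplearningsamples/
cd /home/deeplearningsamples
tar -xvf corex-test-tools-data-mini-2.1.0.tar
Prepare the environment
cd /home/deeplearningsamples
bash quick_build_environment.sh
PyTorch framework demos
Single card
cd /home/deeplearningsamples/executables/resnet
bash init_torch.sh
bash train_resnet50_torch.sh
Single-node multi-card distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)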
cd /home/deeplearningsamples/executables/resnet
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_dist_torch.sh
Use AMP mixed precision
cd /home/deeplearningsamples/executables/resnet
bash train_resnet50_amp_torch.sh
Use Horovod distributed training (replace ${gpus} with the device list; CUDA_VISIBLE_DEVICES=0,1,2,3 uses two cards with four chips)
cd /home/deeplearningsamples/executables/horovod
CUDA_VISIBLE_DEVICES=${gpus} bash train_resnet50_horovod_torch.sh
Use Apex acceleration
cd /home/deeplearningsamples/executables/apex
bash init_torch.sh
bash train_bert_pretrain_apex_torch.sh
Use the DALI acceleration library
cd /home/deeplearningsamples/executables/dali
bash init_torch.sh
bash train_resnet50_dali_torch.sh
