Query code: 00000265
Scheduling Iluvatar (天数) GPUs with Slurm 22.05.8
Author: 柳玉明. Created 2025-03-24; last edited 2025-05-14.


Installing Slurm

1. Download the Slurm source code:

You can download the Slurm 22.05.8 source from the official Slurm website or from a mirror site.

Example command for downloading from the official site:

wget https://download.schedmd.com/slurm/slurm-22.05.8.tar.bz2

2. Build and install Slurm (see the sketch below).
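
A minimal build-and-install sketch, assuming an install prefix of /usr/local/slurm (matching the PATH in step 3) and the /etc/slurm-llnl configuration directory used later; adjust to your environment:

tar -xjf slurm-22.05.8.tar.bz2
cd slurm-22.05.8
./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm-llnl
make -j$(nproc)
make install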

3. export PATH=$PATH:/usr/local/slurm/bin:/usr/local/slurm/sbin

4. In the release SDK image environment, run the following in order:

apt-get update

apt-get install -y slurm-wlm

apt-get install -y slurmctld

apt-get install -y systemctl

mkdir -p /etc/slurm-llnl/
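
Munge is normally pulled in as a dependency of the Slurm packages; if it is missing from the image (an assumption, check with dpkg -l munge), install and start it as well:

apt-get install -y munge
systemctl restart munge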

Start the slurmctld service:

systemctl restart slurmctld

Start the slurmd service:

systemctl restart slurmd

Create slurm.conf and gres.conf under /etc/slurm-llnl/.

The hostname of the Slurm control node needs to be defined in slurm.conf; it can be obtained with:

hostname -s

The hardware configuration of the Slurm compute node (slurmd) can be printed with:

slurmd -C
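
slurmd -C prints a NodeName= line that can be copied into slurm.conf, followed by an UpTime line that should not be copied. On the node used in this note it looks roughly like this (values taken from the slurm.conf below; your node will differ):

NodeName=f3eb0d61ffe6 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=31968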

slurm.conf is as follows:

ClusterName=cool

ControlMachine=f3eb0d61ffe6

#ControlAddr=

#BackupController=

#BackupAddr=

#

MailProg=/usr/bin/s-nail

SlurmUser=root

#SlurmdUser=root

SlurmctldPort=6817


SlurmdPort=6818

AuthType=auth/munge

#JobCredentialPrivateKey=

#JobCredentialPublicCertificate=

StateSaveLocation=/var/spool/slurmctld

SlurmdSpoolDir=/var/spool/slurmd

SwitchType=switch/none

MpiDefault=none

SlurmctldPidFile=/var/run/slurmctld.pid

SlurmdPidFile=/var/run/slurmd.pid

ProctrackType=proctrack/pgid

#PluginDir=

#FirstJobId=

ReturnToService=0

#MaxJobCount=

#PlugStackConfig=

#PropagatePrioProcess=

#PropagateResourceLimits=

#PropagateResourceLimitsExcept=

#Prolog=

#Epilog=

#SrunProlog=

#SrunEpilog=

#TaskProlog=

#TaskEpilog=

#TaskPlugin=

#TrackWCKey=no

#TreeWidth=50

#TmpFS=

#UsePAM=

#

# TIMERS

SlurmctldTimeout=300

SlurmdTimeout=300

InactiveLimit=0

MinJobAge=300

KillWait=30

Waittime=0

#

# SCHEDULING

SchedulerType=sched/backfill

#SchedulerAuth=

#SelectType=select/linear

#PriorityType=priority/multifactor

#PriorityDecayHalfLife=14-0

#PriorityUsageResetPeriod=14-0

#PriorityWeightFairshare=100000

#PriorityWeightAge=1000

#PriorityWeightPartition=10000

#PriorityWeightJobSize=1000

#PriorityMaxAge=1-0

#

# LOGGING


SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log

SlurmdDebug=info

SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

JobCompType=jobcomp/none

#JobCompLoc=

#

# ACCOUNTING

#JobAcctGatherType=jobacct_gather/linux

#JobAcctGatherFrequency=30

#

#AccountingStorageType=accounting_storage/slurmdbd

#AccountingStorageHost=

#AccountingStorageLoc=

#AccountingStoragePass=

#AccountingStorageUser=

#

# COMPUTE NODES

GresTypes=gpu

PartitionName=test_gpu Default=NO MaxTime=INFINITE MaxNodes=16 MinNodes=1 Nodes=f3eb0d61ffe6 State=UP

NodeName=f3eb0d61ffe6 Gres=gpu:2 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=31968 State=IDLE

#NodeName=master State=UNKNOWN

#NodeName=master Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN

gres.conf is as follows:

NodeName=f3eb0d61ffe6 Name=gpu File=/dev/iluvatar0

NodeName=f3eb0d61ffe6 Name=gpu File=/dev/iluvatar1

Start the slurmctld service:

systemctl start slurmctld

Start the slurmd service:

systemctl restart slurmd
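
As an additional check (not in the original note), confirm that the GPU GRES defined in gres.conf are registered; with the configuration above the node should report Gres=gpu:2:

scontrol show node f3eb0d61ffe6 | grep -i gres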

Run sinfo to display the partition and node status.

If you encounter the following errors:

sinfo: error: If munged is up, restart with --num-threads=10

sinfo: error: Munge encode failed: Invalid file type for socket "/run/munge/munge.socket.2"

sinfo: error: authentication: Invalid authentication credential

slurm_load_partitions: Protocol authentication error

Check whether the Munge socket file exists under /run/munge/. If the directory is empty or does not exist, try the following steps:

ls -l /run/munge/

If the directory does not exist, create it and set the correct ownership:

mkdir -p /run/munge/

chown munge:munge /run/munge/

Check the Munge key file: make sure /etc/munge/munge.key exists and has the correct permissions (400, readable only by the munge user):

ls -l /etc/munge/munge.key

chmod 400 /etc/munge/munge.key

chown munge:munge /etc/munge/munge.key

Restart the Munge service: after confirming the key file and directory permissions are correct, restart Munge:

systemctl restart munge
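
A quick local self-test (a standard Munge check, not part of the original note) confirms that credentials can be encoded and decoded:

munge -n | unmunge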

Normal sinfo output:
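
The original screenshot is not reproduced here; with the configuration above, healthy output should look roughly like this (exact values will vary):

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test_gpu     up   infinite      1   idle f3eb0d61ffe6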

Run:

srun --partition=test_gpu --gres=gpu:1 bash -c 'ixsmi -i ${CUDA_VISIBLE_DEVICES}; sleep 1800'

As the output shows, with the device files specified via File=/dev/iluvatar[0-1], Slurm can schedule Iluvatar GPUs.
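
While the job above is running, it can be inspected with standard Slurm commands (an additional check, not from the original note); replace <JOBID> with the id reported by squeue:

squeue --partition=test_gpu
scontrol show job <JOBID>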


If you encounter other problems, check the Slurm log files /var/log/slurm-llnl/slurmctld.log and slurmd.log for more detailed error messages and troubleshoot based on them.


Problems encountered during testing are recorded here.

The node state was drain, with error messages reported in /var/log/slurm-llnl/slurmctld.log.

Restarting the services a few times resolved it:

systemctl restart slurmctld

systemctl restart slurmd
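
If the node stays in the drain state after the restarts, the standard way to return it to service (not mentioned in the original note) is:

scontrol update NodeName=f3eb0d61ffe6 State=RESUME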

