1. Download the Slurm source:
You can download the Slurm 22.05.8 source from the official Slurm website or from a mirror. An example command for downloading from the official site:
wget https://download.schedmd.com/slurm/slurm-22.05.8.tar.bz2
2. Compile and install Slurm (a typical sequence is sketched below):
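A minimal source-build sketch, assuming the /usr/local/slurm prefix implied by the PATH setting in step 3 (adjust the configure options to your environment):
tar -xjf slurm-22.05.8.tar.bz2
cd slurm-22.05.8
./configure --prefix=/usr/local/slurm   # install under /usr/local/slurm
make -j$(nproc)                         # parallel build
make install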
3. Add Slurm's bin and sbin directories to PATH:
export PATH=$PATH:/usr/local/slurm/bin:/usr/local/slurm/sbin
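To make the PATH change persist across shells, you could append it to a profile file (a minimal sketch; pick the file that fits your environment):
echo 'export PATH=$PATH:/usr/local/slurm/bin:/usr/local/slurm/sbin' >> ~/.bashrc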
4. In the release sdk image environment, run the following in order:
apt-get update
apt-get install -y slurm-wlm
apt-get install -y slurmctld
apt-get install -y systemctl
mkdir -p /etc/slurm-llnl/
Start the slurmctld service:
systemctl restart slurmctld
Start the slurmd service:
systemctl restart slurmd
Create slurm.conf and gres.conf under /etc/slurm-llnl/.
slurm.conf must define the hostname of the Slurm control node, which you can get with:
hostname -s
The configuration line for the Slurm compute node (slurmd) can be printed with:
slurmd -C
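On this host it printed a line along the following lines (an illustration reconstructed from the NodeName entry used in slurm.conf below; slurmd -C also appends an UpTime line):
NodeName=f3eb0d61ffe6 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=31968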
slurm.conf is as follows:
ClusterName=cool
ControlMachine=f3eb0d61ffe6
#ControlAddr=
#BackupController=
#BackupAddr=
#
MailProg=/usr/bin/s-nail
SlurmUser=root
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
GresTypes=gpu
PartitionName=test_gpu Default=NO MaxTime=INFINITE MaxNodes=16 MinNodes=1 Nodes=f3eb0d61ffe6 State=UP
NodeName=f3eb0d61ffe6 Gres=gpu:2 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=31968 State=IDLE
#NodeName=master State=UNKNOWN
#NodeName=master Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
gres.conf is as follows:
NodeName=f3eb0d61ffe6 Name=gpu File=/dev/iluvatar0
NodeName=f3eb0d61ffe6 Name=gpu File=/dev/iluvatar1
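Each File= line maps one Iluvatar device node to a gres/gpu resource. Once both daemons are running, you can verify that the GPUs were registered (a quick check, not from the original notes):
scontrol show node f3eb0d61ffe6 | grep -i gres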
Start the slurmctld service:
systemctl start slurmctld
Start the slurmd service:
systemctl restart slurmd
sinfo shows the partition and node status.
If you encounter the following errors:
sinfo: error: If munged is up, restart with --num-threads=10
sinfo: error: Munge encode failed: Invalid file type for socket "/run/munge/munge.socket.2"
sinfo: error: authentication: Invalid authentication credential
slurm_load_partitions: Protocol authentication error
Check whether the Munge socket file exists under /run/munge/:
ls -l /run/munge/
If the directory is empty or does not exist, create it and set the correct ownership:
mkdir -p /run/munge/
chown munge:munge /run/munge/
Check the Munge key file: make sure /etc/munge/munge.key exists and has the correct permissions (mode 400, readable only by the munge user):
ls -l /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
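If munge.key is missing entirely, it has to be generated first (an extra step not in the original notes; on Debian/Ubuntu the munge package ships a helper for this):
/usr/sbin/create-munge-key
# or create a 1 KB random key by hand, then re-apply the chmod/chown above:
dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key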
Restart the Munge service: once the key file and directory permissions are correct, restart it:
systemctl restart munge
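To confirm that Munge itself works before retrying sinfo, you can round-trip a credential (a standard Munge self-test, not part of the original notes):
munge -n | unmunge   # should report STATUS: Success (0)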
Normal sinfo output looks like the following:
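A sketch of the expected output, reconstructed from the slurm.conf above (one idle node in the test_gpu partition):
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test_gpu     up   infinite      1   idle f3eb0d61ffe6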
Run:
srun --partition=test_gpu --gres=gpu:1 bash -c 'ixsmi -i ${CUDA_VISIBLE_DEVICES}; sleep 1800'
As the File=/dev/iluvatar[0-1] device mapping shows, Slurm supports scheduling Iluvatar (Tianshu) GPUs.
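To see which device index Slurm assigns to a job, you can print the variable it sets (a small check, not in the original notes):
srun --partition=test_gpu --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'
# requesting --gres=gpu:2 should print both indices, e.g. 0,1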
If you run into other problems, check the Slurm log files under /var/log/slurm-llnl/ (slurmctld.log and slurmd.log) for more detailed error information, and troubleshoot based on what they report.
Problems encountered during testing are recorded below.
A node's state showed as drain, with the error message in /var/log/slurm-llnl/slurmctld.log.
The workaround was to run the following a few more times:
systemctl restart slurmctld
systemctl restart slurmd
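If the node remains drained after the restarts, the usual way to return it to service is scontrol (a common recovery step, not from the original notes):
scontrol update NodeName=f3eb0d61ffe6 State=RESUME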