| - | ====== | + | ====== |
| - | ==== 使用环境 ==== | + | <code bash> |
| - | + | srun -N 1 -n 4 --gres=gpu:4 --nodelist g01 hostname | |
| - | 用户注册, | + | |
| - | + | ||
| - | * 测试 slurm 资源调度 | + | |
| - | + | ||
| - | <code bash: | + | |
| - | srun -p 3080ti --mem=1 --time=1 | + | |
| </ | </ | ||
| - | |||
| - | <code bash: | ||
| - | sbatch -p 3080ti --mem=1 --time=1 --gres=gpu: | ||
| - | </ | ||
| - | |||
| - | > –mem=1 请求 1 MB 内存 | ||
| - | > | ||
| - | > –time=1 作业总的运行时间限制为 1 分钟 | ||
| - | > | ||
| - | > –gres=gpu: | ||
| - | > | ||
| - | > hostname 为要执行的作业命令 | ||
| - | > | ||
| - | > –output=%j.out 为作业的输出 | ||
| - | > | ||
| - | > –error=%j.err 为作业的错误输出 | ||
| - | |||

  * Check storage space

<code bash>
[pengg@login ~]$ mmlsquota --block-size auto
                      Block Limits
Filesystem type    blocks    quota    limit   ...
hpc        USR     17.95M    ...      ...
</code>

In the output, the soft limit is the ''quota'' column and the hard limit is the ''limit'' column: the soft limit may be exceeded temporarily during a grace period, while the hard limit cannot be exceeded.

  * Software environment

A large amount of software is preinstalled on the cluster and managed with ''module''; users can also install other software under their own home directory.

<code bash>
[pengge@login ~]$ module avail

---------------------------------------------- /... ----------------------------------------------
atat/              Auger_Decay_Rate/        bandup/              Boltzman-NAMD/          ...

---------------------------------------------- /... ----------------------------------------------
advisor/           ccl/                     clck/                compiler/               ...
</code>
| - | |||
| - | ==== slurm 使用 ==== | ||
| - | |||
| - | === 资源请求 === | ||
| - | |||
| - | <code bash: | ||
| - | |||
| - | --ntasks-per-node 或 -N # Slurm Node = Physical node | ||
| - | |||
| - | --ntasks-per-socket | ||
| - | |||
| - | -c, --cpus-per-task | ||
| - | OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK: | ||
| - | |||
| - | 资源一: | ||
| - | --nodes=2 --gres=gpu: | ||
| - | |||
| - | 资源二: | ||
| - | #SBATCH --partition=cpu | ||
| - | #SBATCH --nodes=2 | ||
| - | #SBATCH --ntasks=4 | ||
| - | #SBATCH --ntasks-per-node=2 | ||
| - | |||
| - | 资源三: | ||
| - | #SBATCH --partition=cpu | ||
| - | #SBATCH --nodes=2 | ||
| - | #SBATCH --ntasks=4 | ||
| - | #SBATCH --ntasks-per-node=2 | ||
| - | #SBATCH --cpus-per-task=32 | ||
| - | OMP_NUM_THREADS=32 | ||
| - | |||
| - | 资源四: | ||
| - | #!/bin/sh | ||
| - | #SBATCH --partition=3090 | ||
| - | #SBATCH --job-name=pwmat | ||
| - | #SBATCH --nodes=1 | ||
| - | #SBATCH --ntasks-per-node=4 | ||
| - | #SBATCH --gres=gpu: | ||
| - | #SBATCH --gpus-per-task=1 | ||
| - | |||
| - | module load intel/2020 | ||
| - | module load cuda/11.6 | ||
| - | module load pwmat/ | ||
| - | |||
| - | mpirun -np $SLURM_NPROCS -iface ib0 PWmat | tee output | ||
| - | |||
| - | 资源五: | ||
| - | #!/bin/bash | ||
| - | #SBATCH -N 1 | ||
| - | #SBATCH -n 96 | ||
| - | #SBATCH --ntasks-per-node=96 | ||
| - | #SBATCH --partition=9242 | ||
| - | #SBATCH --output=%j.out | ||
| - | #SBATCH --error=%j.err | ||
| - | source / | ||
| - | ulimit -s unlimited | ||
| - | export PATH=/ | ||
| - | </ | ||
| - | |||
| - | <code bash: | ||
| - | #!/bin/bash | ||
| - | |||
| - | #SBATCH --job-name=sim_1 | ||
| - | #SBATCH --output=log.%x.job_%j | ||
| - | #SBATCH --time=1: | ||
| - | #SBATCH --partition=gpXY | ||
| - | #SBATCH --exclusive | ||
| - | #SBATCH --mem=20G | ||
| - | #SBATCH --threads-per-core=1 | ||
| - | #SBATCH --cpus-per-task=4 | ||
| - | |||
| - | ## nodes allocation | ||
| - | #SBATCH --nodes=2 | ||
| - | #SBATCH --ntasks-per-node=2 | ||
| - | |||
| - | ## GPU allocation - variant A | ||
| - | #SBATCH --gres=gpu: | ||
| - | |||
| - | ## GPU allocation - variant B | ||
| - | ## #SBATCH --gpus-per-task=1 | ||
| - | ## #SBATCH --gpu-bind=single: | ||
| - | |||
| - | # start the job in the directory it was submitted from | ||
| - | cd " | ||
| - | |||
| - | # program execution - variant 1 | ||
| - | mpirun ./sim | ||
| - | |||
| - | # program execution - variant 2 | ||
| - | #srun ./sim | ||
| - | </ | ||
| - | |||
| - | === 交互式作业 === | ||
| - | |||
| - | 方法一: | ||
| - | |||
| - | <code bash: | ||
| - | srun --time=00: | ||
| - | echo $SLURM_NODELIST | ||
| - | </ | ||
| - | |||
| - | 方法二: | ||
| - | |||
| - | <code bash: | ||
| - | salloc -p cpu -N 1 -n 6 -t 2:00:00 # salloc | ||
| - | # | ||
| - | ssh cn1 # 直接登录到刚刚申请到的节点 cn1 调试作业 | ||
| - | scancel 12667 # 计算资源使用完后取消作业 | ||
| - | squeue -j 12667 # 查看作业是否还在运行,确保作业已经退出 | ||
| - | |||
| - | 或: | ||
| - | |||
| - | salloc | ||
| - | srun --pty /bin/bash | ||
| - | scancel JOBID | ||
| - | </ | ||
| - | |||
| - | === 批处理作业 === | ||
| - | |||
| - | 编写作业脚本 '' | ||
| - | |||
| - | <code bash: | ||
| - | #!/bin/sh | ||
| - | #SBATCH --partition=3090 | ||
| - | #SBATCH --job-name=pwmat | ||
| - | #SBATCH --nodes=1 | ||
| - | #SBATCH --ntasks-per-node=4 | ||
| - | #SBATCH --gres=gpu: | ||
| - | #SBATCH --gpus-per-task=1 | ||
| - | |||
| - | module load intel/2020 | ||
| - | module load cuda/11.6 | ||
| - | module load pwmat/ | ||
| - | |||
| - | mpirun -np $SLURM_NPROCS -iface ib0 PWmat | tee output | ||
| - | </ | ||
| - | |||
| - | 提交作业 | ||
| - | |||
| - | <code bash: | ||
| - | sbatch pwmat.sh | ||
| - | </ | ||
| - | |||
| - | === 作业监控 === | ||
| - | |||
| - | <code bash: | ||
| - | squeue | ||
| - | scontrol show --detail jobid=< | ||
| - | </ | ||
| - | |||
| - | 查看完成的作业 | ||
| - | |||
| - | <code bash: | ||
| - | sacct | ||
| - | sacct -j < | ||
| - | |||
| - | sacct --starttime=2022-01-1 --endtime=2022-05-1 | ||
| - | sacct --starttime=2022-01-1 --endtime=2022-05-1 --format=User, | ||
| - | sacct --help | ||
| - | </ | ||
| - | |||
| - | === 作业修改 === | ||
| - | |||
| - | * 更改作业属性 | ||
| - | |||
| - | <code shell: | ||
| - | # | ||
| - | scontrol update JobId=$JobID timelimit=< | ||
| - | |||
| - | # | ||
| - | scontrol update JobId=$JobID_1 dependency=afterany: | ||
| - | </ | ||
| - | |||
| - | * 控制作业 | ||
| - | |||
| - | <code shell: | ||
| - | scontrol hold < | ||
| - | scontrol release < | ||
| - | scontrol requeue < | ||
| - | </ | ||
| - | |||
| - | * 取消作业 | ||
| - | |||
| - | <code shell: | ||
| - | scancel < | ||
| - | scancel -u < | ||
| - | scancel -u < | ||
| - | </ | ||
| - | |||
| - | === slurm 环境变量 === | ||
| - | |||
| - | ^变量 | ||
| - | |$SLURM_JOB_ID | ||
| - | |$SLURM_SUBMIT_DIR | ||
| - | |$SLURM_SUBMIT_HOST | ||
| - | |$SLURM_JOB_NODELIST | ||
| - | |$SLURM_GPUS | ||
| - | |$SLURM_MEM_PER_GPU | ||
| - | |$SLURM_MEM_PER_NODE | ||
| - | |$SLURM_NTASKS | ||
| - | |$SLURM_NTASKS_PER_GPU | ||
| - | |$SLURM_NTASKS_PER_NODE |Number of tasks requested per node. | | ||
| - | |$SLURM_NTASKS_PER_CORE |Number of tasks requested per core. | | ||
| - | |$SLURM_NPROCS | ||
| - | |$SLURM_NNODES | ||
| - | |$SLURM_TASKS_PER_NODE | ||
| - | |$SLURM_ARRAY_JOB_ID | ||
| - | |$SLURM_ARRAY_TASK_ID | ||
| - | |$SLURM_ARRAY_TASK_COUNT|Total number of tasks in a job array. | ||
| - | |$SLURM_ARRAY_TASK_MAX | ||
| - | |$SLURM_ARRAY_TASK_MIN | ||
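
A quick way to see what these variables contain is to print them from a small test job; a minimal sketch:

<code bash>
#!/bin/bash
#SBATCH --job-name=env_test
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --output=%j.out

echo "Job ID:         $SLURM_JOB_ID"
echo "Submit dir:     $SLURM_SUBMIT_DIR"
echo "Node list:      $SLURM_JOB_NODELIST"
echo "Tasks:          $SLURM_NTASKS"
echo "Tasks per node: $SLURM_TASKS_PER_NODE"
</code>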