Differences

This shows the differences between the selected revision and the current version of the page.

--- mstation:slurm [2024/01/18 15:19] – created pengge
+++ mstation:slurm [2024/01/19 14:00] (current) – created pengge
@@ Line 1: / Line 1: @@
-====== Using Slurm ======
+====== Testing GPU scheduling ======
-==== Environment ====
+<code bash>
+srun -N 1 -n 4 --gres=gpu:4 --nodelist g01 hostname
-After registering and logging in to mcloud:
-  * Test Slurm resource scheduling
-<code bash:no-line-numbers>
-srun -p 3080ti --mem=1 --time=1 --gres=gpu:1 hostname
 </code>
-<code bash:no-line-numbers>
-sbatch -p 3080ti --mem=1 --time=1 --gres=gpu:1 --output=%j.out --error=%j.err --wrap="hostname"
-</code>
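-> ''%%--mem=1%%'' requests 1 MB of memory
->
-> ''%%--time=1%%'' limits the total run time of the job to 1 minute
->
-> ''%%--gres=gpu:1%%'' requests 1 GPU
->
-> ''hostname'' is the command the job runs
->
-> ''%%--output=%j.out%%'' is the file for the job's standard output
->
-> ''%%--error=%j.err%%'' is the file for the job's standard error
-  * Check storage quota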
-<code bash:no-line-numbers>
-[pengg@login ~]$ mmlsquota --block-size auto
-                         Block Limits                                    |     File Limits
-Filesystem type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  Remarks
-hpc        USR          17.95M       500G       800G          0     none |       45       0        0        0     none
-</code>
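-Here the soft limit (quota) is 500G and the hard limit (limit) is 800G.
-  * Software environment
-The cluster has a large amount of software preinstalled, managed with module. Users can also install other software in their own home directories.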
-<code bash:no-line-numbers>
-[pengge@login ~]$ module ava
-------------------------------------------------------------------------------------------------------------------------- /share/app/modulefiles -------------------------------------------------------------------------------------------------------------------------
-atat/3.36                   calypso_pwmat_interface/2.0 cuda/11.3                   emc/1.0                     intel/2016                  nvhpc-byo-compiler/22.2     pwmat/2022.01.30            pwmat/2022.03.29            python/3.8.3
-Auger_Decay_Rate/latest     Cross_Section/latest        cuda/11.6                   gui/2022.03.03              intel/2020                  nvhpc-nompi/22.2            pwmat/2022.02.28            pwmat/2022.04.22
-bandup/latest               cuda/10.1                   disorder/latest             gui/2022.04.08              lammps/29Sep2021            openmpi/2.1.0               pwmat/2022.03.02            pypwmat/1.0.9               yambo/4.3.0
-Boltzman-NAMD/latest        cuda/11.0                   ELPWmat/1.0.0               gui/test                    nvhpc/22.2                  plot_interp_2nd/latest      pwmat/2022.03.25            python/2.7.15
----------------------------------------------------------------------------------------------------------------- /share/app/intel/oneapi/2022/modulefiles ----------------------------------------------------------------------------------------------------------------
-advisor/2022.0.0             compiler32/2022.0.2          debugger/2021.5.0            dnnl-cpu-iomp/2022.0.2       icc/2022.0.2                 intel_ippcp_ia32/2021.5.1    itac/2021.5.0                oclfpga/2022.0.2             vtune/2022.0.0
-advisor/latest               compiler32/latest            debugger/latest              dnnl-cpu-iomp/latest         icc/latest                   intel_ippcp_ia32/latest      itac/latest                  oclfpga/latest               vtune/latest
-ccl/2021.5.1                 compiler-rt/2022.0.2         dev-utilities/2021.5.2       dnnl-cpu-tbb/2022.0.2        icc32/2022.0.2               intel_ippcp_intel64/2021.5.1 mkl/2022.0.2                 tbb/2021.5.1
-ccl/latest                   compiler-rt/latest           dev-utilities/latest         dnnl-cpu-tbb/latest          icc32/latest                 intel_ippcp_intel64/latest   mkl/latest                   tbb/latest
-clck/2021.5.0                compiler-rt32/2022.0.2       dnnl/2022.0.2                dpct/2022.0.0                init_opencl/2022.0.2         intel_ipp_ia32/2021.5.2      mkl32/2022.0.2               tbb32/2021.5.1
-clck/latest                  compiler-rt32/latest         dnnl/latest                  dpct/latest                  init_opencl/latest           intel_ipp_ia32/latest        mkl32/latest                 tbb32/latest
-compiler/2022.0.2            dal/2021.5.3                 dnnl-cpu-gomp/2022.0.2       dpl/2021.6.0                 inspector/2022.0.0           intel_ipp_intel64/2021.5.2   mpi/2021.5.1                 vpl/2022.0.0
-compiler/latest              dal/latest                   dnnl-cpu-gomp/latest         dpl/latest                   inspector/latest             intel_ipp_intel64/latest     mpi/latest                   vpl/latest
-</code>
-==== Using Slurm ====
-=== Resource requests ===
-<code bash:no-line-numbers>
---nodes or -N            # Slurm Node = Physical node
---ntasks-per-socket      # Slurm Socket = Physical Socket/CPU/Processor
--c, --cpus-per-task      # Slurm CPU = Physical CORE
-OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Default to 1 if SLURM_CPUS_PER_TASK not set
-# Request 1:
---nodes=2 --gres=gpu:2   # request 2 nodes with 2 GPUs each, 4 GPUs in total
-# Request 2:
-#SBATCH --partition=cpu      # use the cpu partition (queue)
-#SBATCH --nodes=2            # 2 nodes
-#SBATCH --ntasks=4           # 4 tasks (processes) in total; 1 core per task by default, so 4 CPU cores
-#SBATCH --ntasks-per-node=2  # 2 tasks per node
-# Request 3:
-#SBATCH --partition=cpu      # use the cpu partition (queue)
-#SBATCH --nodes=2            # 2 nodes
-#SBATCH --ntasks=4           # 4 tasks in total, 32 cores per task, so 128 CPU cores
-#SBATCH --ntasks-per-node=2  # 2 tasks per node
-#SBATCH --cpus-per-task=32   # 32 cores per task
-export OMP_NUM_THREADS=32    # OpenMP thread count, exported so the program can see it
-# Request 4:
-#!/bin/sh
-#SBATCH --partition=3090     # use the 3090 partition (queue)
-#SBATCH --job-name=pwmat
-#SBATCH --nodes=1            # 1 node
-#SBATCH --ntasks-per-node=4  # 4 tasks per node (1 core per task by default)
-#SBATCH --gres=gpu:4         # 4 GPUs per node
-#SBATCH --gpus-per-task=1    # 1 GPU per task
-module load intel/2020
-module load cuda/11.6
-module load pwmat/2022.01.30
-mpirun -np $SLURM_NPROCS -iface ib0 PWmat | tee output
-# Request 5:
-#!/bin/bash
-#SBATCH -N 1
-#SBATCH -n  96
-#SBATCH --ntasks-per-node=96
-#SBATCH --partition=9242
-#SBATCH --output=%j.out
-#SBATCH --error=%j.err
-source /data/app/intel/bin/compilervars.sh intel64
-ulimit -s unlimited
-export PATH=/data/app/vasp.5.4.4/bin:$PATH
-</code>
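As a quick check of the core counts quoted in the comments of the second and third requests above, the arithmetic can be sketched in bash. ''total_cores'' is a hypothetical helper written for this note, not a Slurm command, and it assumes every node runs the same number of tasks.

```bash
#!/bin/bash
# Hypothetical helper (not part of Slurm): total CPU cores implied by a request.
# Arguments: <nodes> <ntasks-per-node> [cpus-per-task, default 1]
total_cores() {
    local nodes=$1 tasks_per_node=$2 cpus_per_task=${3:-1}
    echo $(( nodes * tasks_per_node * cpus_per_task ))
}

total_cores 2 2      # second request: 2 nodes x 2 tasks x 1 core   -> prints 4
total_cores 2 2 32   # third request:  2 nodes x 2 tasks x 32 cores -> prints 128
```

The third argument defaults to 1 because Slurm allocates one core per task unless ''%%--cpus-per-task%%'' is given.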
-<code bash:no-line-numbers>
-#!/bin/bash
-#SBATCH --job-name=sim_1        # job name (default is the name of this file)
-#SBATCH --output=log.%x.job_%j  # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
-#SBATCH --time=1:00:00          # maximum wall time allocated for the job (D-H:MM:SS)
-#SBATCH --partition=gpXY        # put the job into the gpu partition
-#SBATCH --exclusive             # request exclusive allocation of resources
-#SBATCH --mem=20G               # RAM per node
-#SBATCH --threads-per-core=1    # do not use hyperthreads (i.e. CPUs = physical cores below)
-#SBATCH --cpus-per-task=4       # number of CPUs per process
-## nodes allocation
-#SBATCH --nodes=2               # number of nodes
-#SBATCH --ntasks-per-node=2     # MPI processes per node
-## GPU allocation - variant A
-#SBATCH --gres=gpu:2            # number of GPUs per node (gres=gpu:N)
-## GPU allocation - variant B
-## #SBATCH --gpus-per-task=1       # number of GPUs per process
-## #SBATCH --gpu-bind=single:1     # bind each process to its own GPU (single:<tasks_per_gpu>)
-# start the job in the directory it was submitted from
-cd "$SLURM_SUBMIT_DIR"
-# program execution - variant 1
-mpirun ./sim
-# program execution - variant 2
-#srun ./sim
-</code>
-=== Interactive jobs ===
-Method 1:
-<code bash:no-line-numbers>
-srun  --time=00:10:00 --mem=200 --gres=gpu:1 --pty /bin/bash
-echo $SLURM_NODELIST
-</code>
-Method 2:
-<code bash:no-line-numbers>
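-salloc -p cpu -N 1 -n 6 -t 2:00:00   # request an allocation
-# On success, salloc reports the allocated node(s) and the job ID; suppose the node is cn1 and the job ID is 12667
-ssh cn1           # log in directly to the allocated node cn1 to debug the job
-scancel 12667     # cancel the job once the resources are no longer needed
-squeue -j 12667   # check whether the job is still running, to make sure it has exited
-# or: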
-salloc  --time=01:00:00 --mem=500 --gres=gpu:2
-srun --pty /bin/bash
-scancel JOBID
-</code>
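One way to make sure a cancelled job has really exited is to poll ''squeue'' until it stops reporting an active state. The sketch below assumes a single, non-array job and treats only the listed states as active; ''wait_for_job'' is a hypothetical helper name, not a Slurm command.

```bash
#!/bin/bash
# Hypothetical helper: block until a job is no longer active according to squeue.
# Arguments: <job_id> [poll interval in seconds, default 10]
wait_for_job() {
    local job_id=$1 interval=${2:-10} state
    while true; do
        # -h drops the header, -o %T prints only the job state (e.g. RUNNING).
        state=$(squeue -h -j "$job_id" -o %T 2>/dev/null)
        case "$state" in
            PENDING|CONFIGURING|RUNNING|SUSPENDED|COMPLETING) sleep "$interval" ;;
            *) break ;;   # empty output or any other state: treat the job as finished
        esac
    done
    echo "job $job_id is no longer active"
}

# Usage on a login node, after "scancel 12667":
#   wait_for_job 12667 5
```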
-=== Batch jobs ===
-Write the job script ''%%pwmat.sh%%''
-<code bash:no-line-numbers>
-#!/bin/sh
-#SBATCH --partition=3090
-#SBATCH --job-name=pwmat
-#SBATCH --nodes=1
-#SBATCH --ntasks-per-node=4
-#SBATCH --gres=gpu:4
-#SBATCH --gpus-per-task=1
-module load intel/2020
-module load cuda/11.6
-module load pwmat/2022.01.30
-mpirun -np $SLURM_NPROCS -iface ib0 PWmat | tee output
-</code>
-Submit the job
-<code bash:no-line-numbers>
-sbatch pwmat.sh
-</code>
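When a later command needs the job ID (for ''squeue -j'' or ''scancel'', say), it can be captured at submission time with ''sbatch %%--parsable%%'', which prints the ID instead of a sentence. A minimal sketch follows; ''submit_job'' is a hypothetical wrapper name chosen for illustration.

```bash
#!/bin/bash
# Hypothetical wrapper: submit a script and print only the numeric job ID.
submit_job() {
    local out
    # --parsable makes sbatch print "jobid" or "jobid;cluster" on stdout.
    out=$(sbatch --parsable "$@") || return 1
    # Strip the optional ";cluster" suffix.
    echo "${out%%;*}"
}

# Usage on a login node:
#   job_id=$(submit_job pwmat.sh)
#   squeue -j "$job_id"
```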
-=== Job monitoring ===
-<code bash:no-line-numbers>
-squeue
-scontrol show --detail jobid=<JobID>
-</code>
-View completed jobs
-<code bash:no-line-numbers>
-sacct
-sacct -j <JobID> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize
-sacct  --starttime=2022-01-01 --endtime=2022-05-01
-sacct  --starttime=2022-01-01 --endtime=2022-05-01 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize
-sacct --help
-</code>
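-=== Modifying jobs ===
-  * Change job attributes
-<code shell:no-line-numbers>
-# Increase the job time limit (in Slurm this normally requires administrator privileges)
-scontrol update JobId=$JobID timelimit=<new timelimit>
-# Change the job dependency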
-scontrol update JobId=$JobID_1 dependency=afterany:$JobID_2
-</code>
-  * Control jobs
-<code shell:no-line-numbers>
-scontrol hold <job_id>     # hold a pending job so that it is not scheduled to start
-scontrol release <job_id>  # release a queued job from the 'hold' state
-scontrol requeue <job_id>  # cancel the job and put it back in the queue
-</code>
-  * Cancel jobs
-<code shell:no-line-numbers>
-scancel <job_id>   # cancel a running or pending job
-scancel -u <user>  # cancel all of the user's jobs, including running ones
-scancel -u <user> --state=PENDING # cancel all of the user's pending jobs
-</code>
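-=== Slurm environment variables ===
-^Variable                ^Description                                            ^
-|$SLURM_JOB_ID          |Job ID of this job.                                    |
-|$SLURM_SUBMIT_DIR      |Path of the directory the job was submitted from.      |
-|$SLURM_SUBMIT_HOST     |Hostname of the node the job was submitted from.       |
-|$SLURM_JOB_NODELIST    |List of nodes allocated to the job.                    |
-|$SLURM_GPUS            |Number of GPUs allocated.                              |
-|$SLURM_MEM_PER_GPU     |Memory per GPU.                                        |
-|$SLURM_MEM_PER_NODE    |Memory per node. Same as %%--mem%%.                    |
-|$SLURM_NTASKS          |Same as %%--ntasks%%. The number of tasks.             |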
-|$SLURM_NTASKS_PER_GPU  |Number of tasks requested per GPU.                     |
-|$SLURM_NTASKS_PER_NODE |Number of tasks requested per node.                    |
-|$SLURM_NTASKS_PER_CORE |Number of tasks requested per core.                    |
-|$SLURM_NPROCS          |Same as %%--ntasks%%. See $SLURM_NTASKS.               |
-|$SLURM_NNODES          |Total number of nodes in the job’s resource allocation.|
-|$SLURM_TASKS_PER_NODE  |Number of tasks to be initiated on each node.          |
-|$SLURM_ARRAY_JOB_ID    |Job array’s master job ID number.                      |
-|$SLURM_ARRAY_TASK_ID   |Job array ID (index) number.                           |
-|$SLURM_ARRAY_TASK_COUNT|Total number of tasks in a job array.                  |
-|$SLURM_ARRAY_TASK_MAX  |Job array’s maximum ID (index) number.                 |
-|$SLURM_ARRAY_TASK_MIN  |Job array’s minimum ID (index) number.                 |
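To illustrate how a job script might read these variables, here is a small sketch that prints a one-line summary and falls back to defaults when run outside a Slurm job, in the same ''${VAR:-default}'' style used for ''OMP_NUM_THREADS'' earlier on this page. ''slurm_summary'' and the fallback values are assumptions made for this example.

```bash
#!/bin/bash
# Hypothetical helper: one-line summary of the current Slurm allocation.
# Each variable falls back to a default so the script also runs outside a job.
slurm_summary() {
    local job_id=${SLURM_JOB_ID:-none}
    local nodes=${SLURM_JOB_NODELIST:-localhost}
    local ntasks=${SLURM_NTASKS:-1}
    local cpus=${SLURM_CPUS_PER_TASK:-1}
    echo "job=$job_id nodes=$nodes tasks=$ntasks cpus_per_task=$cpus"
}

slurm_summary
```

Putting a call like this at the top of a batch script records in the job's output file what was actually allocated.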