fqa:mstation:q201
Differences
This shows the differences between the revision you selected and the current version.
| fqa:mstation:q201 [2024/03/08 10:04] – pengge | fqa:mstation:q201 [2024/03/22 09:51] (current version) – pengge | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ~~NOTOC~~ | ~~NOTOC~~ | ||
| - | ===== q201 slurm submission error ===== | + | ===== q201 slurm submission error |
| <WRAP em> | <WRAP em> | ||
| Line 11: | Line 11: | ||
| scontrol update node=mstation state=idle | scontrol update node=mstation state=idle | ||
| </ | </ | ||
| + | |||
| + | <WRAP em> | ||
| + | ==== root cause ==== | ||
| + | </ | ||
| + | |||
| + | <code bash> | ||
| + | scontrol show node mstation | ||
| + | |||
| + | # Look for the Reason field in the output | ||
| + | |||
| + | Reason=Not responding [slurm@2024-03-21T14: | ||
| + | </ | ||
| + | |||
| + | > When the node is in a state such as drain or down, first use the command above to find the reason. | ||
| + | |||
| + | 1. '' | ||
| + | |||
| + | <code bash> | ||
| + | systemctl restart slurmd | ||
| + | scontrol update node=mstation state=idle | ||
| + | </ | ||
| + | |||
| + | |||
| + | <WRAP lo> | ||
| + | ==== mstation: submitting jobs to specific GPUs ==== | ||
| + | </ | ||
| + | |||
| + | <code bash> | ||
| + | # Add the following environment variable to the submission script | ||
| + | export CUDA_VISIBLE_DEVICES=0, | ||
| + | </ | ||
| + | |||
| + | > <wrap hi> | ||
| + | |||
| + | <code bash> | ||
| + | export CUDA_VISIBLE_DEVICES=0, | ||
| + | </ | ||
| + | |||
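
The diagnosis step in the root cause section can be sketched as a small shell helper. This is a minimal sketch and not part of the wiki page itself: `extract_reason` is a hypothetical name, and it assumes the `Reason=` field sits on its own line in the `scontrol show node` output, optionally followed by a bracketed `[user@timestamp]` suffix as in the example above.

```bash
#!/usr/bin/env bash
# Sketch: print the Reason field from `scontrol show node` output read on stdin.
# extract_reason is a hypothetical helper name, not a Slurm command.
extract_reason() {
    # Capture everything after "Reason=" up to the bracketed suffix (or end of line),
    # then strip trailing whitespace left before the bracket.
    sed -n 's/^[[:space:]]*Reason=\([^[]*\).*$/\1/p' | sed 's/[[:space:]]*$//'
}

# Usage on the cluster (node name taken from the page above):
#   scontrol show node mstation | extract_reason
```

If the helper prints nothing, the output contained no `Reason=` line, which usually means the node is not currently marked drain or down.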
fqa/mstation/q201.1709863446.txt.gz · Last modified: 2024/03/08 10:04 by pengge
