====== Locating the Physical Slot of a Faulty GPU ======
===== Running a Job on the Suspected Faulty GPU =====
1. Download the job package {{ :mstation:512_si_pbe_md.tgz | 512_si_pbe_md.tgz}}
2. Upload it to the server
3. Extract the archive and edit ''run.sh'', changing export CUDA_VISIBLE_DEVICES=3 to the index of the GPU under test
The GPU index can be found with nvidia-smi (see the next section)
<code>
[pengge@mstation ok]$ tar -zxf 512_si_pbe_md.tgz
[pengge@mstation ok]$ cd 512_si_pbe_md
[pengge@mstation 512_si_pbe_md]$ vim run.sh
</code>
Contents of ''run.sh'':
<code bash>
#!/bin/sh
module load mkl mpi
module load cuda/12.1
module load pwmat
export CUDA_VISIBLE_DEVICES=3
mpirun -np 1 PWmat | tee output
</code>
4. Run the script with ./run.sh; to stop the job, press Ctrl + C
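While the job runs, it is worth confirming from a second terminal that the load actually lands on the card you selected. A minimal sketch, assuming the GPU index and the ''output'' log name from the example ''run.sh'' above:
<code bash>
# Refresh the nvidia-smi view every 2 seconds; the selected GPU should show high utilization
watch -n 2 nvidia-smi

# In another terminal: follow the PWmat log written by "| tee output" in run.sh
tail -f output
</code>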
===== Locating the Physical Slot of the Faulty GPU =====
1. After logging in to the system, run: nvidia-smi
<code>
Fri Aug 16 15:47:10 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.76                 Driver Version: 550.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:16:00.0 Off |                  Off |
|  0%   45C    P8             25W /  425W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:34:00.0 Off |                  Off |
|  0%   37C    P8             20W /  425W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      On  |   00000000:52:00.0 Off |                  Off |
|  0%   40C    P8             17W /  425W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      On  |   00000000:CA:00.0 Off |                  Off |
| 30%   44C    P2            223W /  425W |   16108MiB /  24564MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    3   N/A  N/A     15863      C   PWmat                                       16100MiB |
+-----------------------------------------------------------------------------------------+
</code>
2. Taking GPU 3 as an example, record its Bus-Id, 00000000:CA:00.0 (the index-to-Bus-Id mapping can also be queried directly, as sketched below)
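A minimal sketch using nvidia-smi's query mode, which prints the mapping without the full table:
<code bash>
# One line per GPU: index, name and PCI bus id (CSV output)
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
</code>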
3. Log in as ''root'' and run dmidecode -t slot, here filtered with grep around the Bus-Id recorded above:
<code>
[root@mstation ~]# dmidecode -t slot | grep -i -10 CA:00.0
Handle 0x000D, DMI type 9, 17 bytes
System Slot Information
        Designation: CPU SLOT1 PCIe 5.0 X16
        Type: x16
        Current Usage: In Use
        Length: Long
        Characteristics:
                3.3 V is provided
                Opening is shared
                PME signal is supported
        Bus Address: 0000:ca:00.0

Handle 0x000E, DMI type 9, 17 bytes
System Slot Information
        Designation: CPU SLOT3 PCIe 5.0 X16
        Type: x16
        Current Usage: In Use
        Length: Long
        Characteristics:
                3.3 V is provided
                Opening is shared
</code>
GPU 3 therefore sits in the slot labelled Designation: CPU SLOT1 PCIe 5.0 X16
The slot number is printed on the server motherboard next to each PCIe slot; find the matching slot there
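If the machine exposes many slots, a grep pipeline can pull out just the Designation of the matching slot; a minimal sketch tuned to the output above (-B 10 is enough context here because Designation sits a few lines above Bus Address):
<code bash>
# Print only the Designation of the slot whose Bus Address matches GPU 3
dmidecode -t slot | grep -i -B 10 "bus address: 0000:ca:00.0" | grep -i designation
</code>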
Note that the two tools print the same address in slightly different forms (only the letter case and the domain padding differ):
- nvidia-smi reports the Bus-Id as 00000000:CA:00.0
- dmidecode -t slot reports the Bus Address as 0000:ca:00.0
{{:mstation:busid.png?700|}}
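As an optional cross-check, lspci shows which device actually sits at that bus address; a sketch using the address from the example above (the leading 0000 domain can be omitted):
<code bash>
# Identify the device at PCI address ca:00.0
lspci -s ca:00.0

# Verbose view, including the kernel driver bound to the device
lspci -vs ca:00.0
</code>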