Contents

Locating the physical slot of a faulty GPU

Running a job on a suspected faulty GPU

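1. Download the job package 512_si_pbe_md.tgz

2. Upload it to the server

3. Extract the archive, then edit run.sh and change export CUDA_VISIBLE_DEVICES=3 to the index of the GPU you want to test

The GPU index is the number shown in the GPU column of the nvidia-smi output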

[pengge@mstation ok]$ tar -zxf 512_si_pbe_md.tgz
[pengge@mstation ok]$ cd 512_si_pbe_md
[pengge@mstation 512_si_pbe_md]$ vim run.sh
#!/bin/sh
 
module load mkl mpi
module load cuda/12.1
module load pwmat
 
export CUDA_VISIBLE_DEVICES=3
 
mpirun -np 1 PWmat | tee output

4. Run the script with ./run.sh; press Ctrl+C to stop it
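
The edit in step 3 can also be made without opening an editor. The following is a minimal sketch, assuming run.sh contains a line that begins with export CUDA_VISIBLE_DEVICES=; the function name set_visible_gpu is hypothetical and is not part of the job package.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the CUDA_VISIBLE_DEVICES line in a job script.
#   $1: path to the script (e.g. run.sh)
#   $2: GPU index as shown by nvidia-smi
set_visible_gpu() {
    script=$1
    index=$2
    tmp=$(mktemp) || return 1
    # Replace the value on the export line, leaving every other line untouched.
    sed "s/^export CUDA_VISIBLE_DEVICES=.*/export CUDA_VISIBLE_DEVICES=$index/" "$script" > "$tmp" &&
        cat "$tmp" > "$script"   # write back in place so the file keeps its permissions
    rm -f "$tmp"
}
```

For example, set_visible_gpu run.sh 1 selects GPU 1 before ./run.sh is started. CUDA may enumerate devices in a different order than nvidia-smi unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set, so it is worth confirming in nvidia-smi that the job actually landed on the intended card.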

Locating the physical slot of a faulty GPU

1. Log in to the system and run: nvidia-smi

Fri Aug 16 15:47:10 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.76                 Driver Version: 550.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:16:00.0 Off |                  Off |
|  0%   45C    P8             25W /  425W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:34:00.0 Off |                  Off |
|  0%   37C    P8             20W /  425W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      On  |   00000000:52:00.0 Off |                  Off |
|  0%   40C    P8             17W /  425W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      On  |   00000000:CA:00.0 Off |                  Off |
| 30%   44C    P2            223W /  425W |   16108MiB /  24564MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
 
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    3   N/A  N/A     15863      C   PWmat                                       16100MiB |
+-----------------------------------------------------------------------------------------+

2. Taking index 3 as an example, note the Bus-Id of GPU 3: 00000000:CA:00.0

3. Log in as root, run dmidecode -t slot, and search its output for the bus address (CA:00.0)

[root@mstation ~]# dmidecode -t slot | grep -i -10 CA:00.0
Handle 0x000D, DMI type 9, 17 bytes
System Slot Information
	Designation: CPU SLOT1 PCIe 5.0 X16
	Type: x16 <OUT OF SPEC>
	Current Usage: In Use
	Length: Long
	Characteristics:
		3.3 V is provided
		Opening is shared
		PME signal is supported
	Bus Address: 0000:ca:00.0
 
Handle 0x000E, DMI type 9, 17 bytes
System Slot Information
	Designation: CPU SLOT3 PCIe 5.0 X16
	Type: x16 <OUT OF SPEC>
	Current Usage: In Use
	Length: Long
	Characteristics:
		3.3 V is provided
		Opening is shared

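The block containing Bus Address: 0000:ca:00.0 gives the slot for GPU 3, which is Designation: CPU SLOT1 PCIe 5.0 X16

On the server motherboard, each PCIe slot has its slot number printed next to it; find the slot with the matching number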
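
Reading the grep context by eye is easy to get wrong, because neighbouring slot blocks are printed as well. The lookup can be sketched as a small filter, assuming the dmidecode -t slot output has the Designation line before the Bus Address line within each block, as in the transcript above; the function name slot_for_bus is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the slot Designation for a given bus address.
#   stdin: output of `dmidecode -t slot`
#   $1:    bus address in dmidecode form, e.g. 0000:ca:00.0
slot_for_bus() {
    awk -v bus="$1" '
        # Remember the most recent Designation seen.
        /Designation:/ {
            sub(/^[ \t]*Designation:[ \t]*/, "")
            slot = $0
        }
        # When the Bus Address matches (case-insensitively), print that Designation.
        /Bus Address:/ {
            sub(/^[ \t]*Bus Address:[ \t]*/, "")
            if (tolower($0) == tolower(bus)) print slot
        }
    '
}
```

As root, it would be used as: dmidecode -t slot | slot_for_bus 0000:ca:00.0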

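  The two tools print the same PCI address in different forms (the domain width and letter case differ):

  1. Bus-Id reported by nvidia-smi: 00000000:CA:00.0
  2. Bus Address reported by dmidecode -t slot: 0000:ca:00.0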
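
To compare the two forms exactly, the nvidia-smi Bus-Id can be rewritten into the dmidecode form. This is a minimal sketch, assuming the upper four hex digits of the 8-digit domain are zero, as they are in the output above; the function name to_dmi_bus is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert an nvidia-smi Bus-Id to the dmidecode form.
#   $1: Bus-Id such as 00000000:CA:00.0
# Keeps the last four digits of the domain and lowercases the hex letters.
to_dmi_bus() {
    printf '%s\n' "$1" | awk -F: '{
        domain = tolower(substr($1, length($1) - 3))
        printf "%s:%s:%s\n", domain, tolower($2), tolower($3)
    }'
}
```

For example, to_dmi_bus 00000000:CA:00.0 prints 0000:ca:00.0, which can then be used as the search string for the dmidecode -t slot output. The Bus-Id itself can also be listed per GPU with nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader instead of being read from the table.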