Nova has more than 200 GPUs available for research, spread across ten GPU models. A full list of hardware is available here.
Requesting a GPU
The syntax for requesting a GPU is --gres=gpu:<TYPE>:<COUNT>, where TYPE is one of the available GPU types and COUNT is the number of GPUs to request. For a complete list of available GPU types, see here under "SLURM Code".
For example, to request two a100 GPUs for an interactive session:
salloc -n 8 -N 1 --mem=16G --gres=gpu:a100:2 --time=2:00:00
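The same request can also be placed in a batch script. The following is a minimal sketch that mirrors the interactive example above. The job name and the report_gpus helper are illustrative, and the script assumes Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is its usual behavior but depends on site configuration.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example   # hypothetical job name
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --mem=16G
#SBATCH --gres=gpu:a100:2
#SBATCH --time=2:00:00

# Illustrative helper: print the GPU indices visible to this job.
# Assumes Slurm sets CUDA_VISIBLE_DEVICES; prints "none" otherwise.
report_gpus() {
    echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}

report_gpus

# List the visible GPUs if nvidia-smi is present on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not list GPUs"
fi
```

Submitted with sbatch, the script should print the GPU indices assigned to the job before any real workload starts.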
Account Limits
Each account is limited in the number of GPUs it can use. Because of this, GPUs tend to be a contended resource. To view the limits for your account, use sacctmgr:
sacctmgr show -s account name=<ACCOUNT> format=grptres%40,grptresrunmins%50
Where ACCOUNT is the account you wish to view information for.
The output looks like this:
GrpTRES GrpTRESRunMins
---------------------------------------- --------------------------------------------------
cpu=7200,gres/gpu=11,mem=38000000M cpu=222851268,gres/gpu=275517,mem=989036690043M
Here, under GrpTRES, we see that this account has 11 GPUs available.
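To pull just the GPU limit out of a GrpTRES value, the comma-separated string can be split and filtered. This is a small sketch; gpu_limit is a hypothetical helper name, and the input is the GrpTRES value from the sample output above.

```shell
# Illustrative helper: print the gres/gpu value from a GrpTRES string.
gpu_limit() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= '$1 == "gres/gpu" { print $2 }'
}

# GrpTRES value copied from the sample output above.
gpu_limit "cpu=7200,gres/gpu=11,mem=38000000M"
```

For the sample value this prints 11. It prints nothing if the string contains no gres/gpu entry.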
To see how many GPUs are actually in use for an account:
squeue -A <ACCOUNT> -O StateCompact:3,JobID:10,UserName,tres-alloc:70,ReasonList:20,TimeLimit,TimeLeft | grep gres\/gpu
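To turn that listing into a single number, the gres/gpu counts can be summed. The sketch below assumes each line carries a TRES string containing an entry of the form gres/gpu=N, and ignores typed entries such as gres/gpu:a100=N so that GPUs are not counted twice. The helper name sum_gpus is hypothetical.

```shell
# Illustrative helper: read squeue lines on stdin and sum the generic
# gres/gpu=N entries. Typed entries (gres/gpu:a100=N) are skipped.
sum_gpus() {
    awk '{
        n = split($0, parts, /[ ,]+/)
        for (i = 1; i <= n; i++) {
            if (parts[i] ~ /^gres\/gpu=[0-9]+$/) {
                split(parts[i], kv, "=")
                total += kv[2]
            }
        }
    }
    END { print total + 0 }'
}
```

It can be fed from the command above, for example: squeue -A <ACCOUNT> -O tres-alloc:70 | sum_gpus. Note that this counts pending jobs as well unless the listing is first restricted to running jobs.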
If you run squeue --me and see a job reason code of "AssocGrpGRES" or "AssocGrpGRESMinutes", then your account is using all of its available GPUs.
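A quick way to spot affected jobs is to filter on those two reason codes. This is a sketch only; gpu_limited_jobs is a hypothetical helper that expects one "JOBID REASON" pair per line.

```shell
# Illustrative helper: read "JOBID REASON" lines on stdin and print the
# IDs of jobs waiting on the account's GPU limits.
gpu_limited_jobs() {
    awk '$2 == "AssocGrpGRES" || $2 == "AssocGrpGRESMinutes" { print $1 }'
}
```

Input in that shape can be produced with squeue --me -h -o "%i %r" and piped into gpu_limited_jobs.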
nvidia-smi
All the compute nodes with GPUs have the nvidia-smi program installed. An example of the output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   67C    P0            501W /  500W |   72629MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2714847      C   ./gpu_burn                            72618MiB |
+-----------------------------------------------------------------------------------------+
Using nvidia-smi, users can track the GPU utilization of their processes.
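For scripted monitoring, nvidia-smi can also emit CSV through its --query-gpu and --format options. The sketch below summarizes that CSV; summarize_gpus is a hypothetical helper, and it assumes the five fields are queried in the order index, name, utilization.gpu, memory.used, memory.total.

```shell
# Illustrative helper: read nvidia-smi CSV lines (no header, no units)
# on stdin and print utilization and memory use per GPU.
summarize_gpus() {
    awk -F', ' '{
        printf "GPU %s: %s%% util, %.0f%% memory\n", $1, $3, 100 * $4 / $5
    }'
}
```

On a GPU node it can be fed with: nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | summarize_gpus. For the A100 shown in the example output, this would report 100% utilization and about 89% memory use.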