GPUs

Nova has more than 200 GPUs available for research, in ten varieties.  A full list of hardware is available here.

Requesting a GPU

The syntax for requesting a GPU is --gres=gpu:<TYPE>:<COUNT>, where TYPE is one of the available GPU types and COUNT is the number of GPUs to request.  For a complete list of available GPU types, see here under "SLURM Code".

For example, to request two a100 GPUs for an interactive session:

salloc -n 8 -N 1 --mem=16G --gres=gpu:a100:2 --time=2:00:00
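
The same --gres request can also be placed in a batch script. The following is a minimal sketch rather than a site-provided template: the #SBATCH values simply mirror the salloc example above, and it assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs that request GPUs, which depends on how the cluster is configured.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --mem=16G
#SBATCH --gres=gpu:a100:2
#SBATCH --time=2:00:00

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs that request GPUs
# through --gres. Fall back to "none" if the variable is not set.
gpu_list="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpu_list}"

# List the allocated GPUs if nvidia-smi is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not query a GPU"
fi
```

Saved under a hypothetical name such as gpu_job.sh, the script would be submitted with sbatch gpu_job.sh.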

Account Limits

Each account has a limited number of GPUs available to it.  Because of this, GPUs tend to be a contended resource. To view the limits for your account, use sacctmgr:

sacctmgr show -s account name=<ACCOUNT> format=grptres%40,grptresrunmins%50

where ACCOUNT is the account you want to view information for.

The output looks like this:

                                 GrpTRES                                     GrpTRESRunMins
---------------------------------------- --------------------------------------------------
      cpu=7200,gres/gpu=11,mem=38000000M    cpu=222851268,gres/gpu=275517,mem=989036690043M

Here, under GrpTRES, the entry gres/gpu=11 shows that this account has 11 GPUs available.
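
If the GPU limit is needed in a script, the gres/gpu entry can be pulled out of the GrpTRES string. This is a small sketch: gpu_limit_from_tres is a hypothetical helper name, and the input is the sample GrpTRES value shown above.

```shell
# Hypothetical helper: print the gres/gpu value from a comma-separated
# TRES string such as the GrpTRES column of sacctmgr.
gpu_limit_from_tres() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= '$1 == "gres/gpu" { print $2 }'
}

# Sample GrpTRES value taken from the output above; prints 11.
gpu_limit_from_tres "cpu=7200,gres/gpu=11,mem=38000000M"
```

To feed it live data, one option is to pass it the output of sacctmgr with the -n (no header) and -P (parsable) options and format=grptres. If the account has several associations, sacctmgr may return more than one line, and the helper prints one value per line that contains a gres/gpu entry.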

To see how many GPUs are actually in use for an account:

squeue -A <ACCOUNT> -O StateCompact:3,JobID:10,UserName,tres-alloc:70,ReasonList:20,TimeLimit,TimeLeft | grep gres\/gpu
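
To get a single total rather than a list of jobs, the gres/gpu counts can be summed. This is a sketch under assumptions: sum_gpus_in_use is a hypothetical helper, the two sample lines are invented for illustration, and the exact contents of the tres-alloc field may differ on Nova.

```shell
# Hypothetical helper: read squeue output on stdin and add up every
# "gres/gpu=N" entry. Typed entries such as "gres/gpu:a100=2" are not
# matched, so a job listing both forms is counted only once.
sum_gpus_in_use() {
    grep -o 'gres/gpu=[0-9]*' | awk -F= '{ total += $2 } END { print total + 0 }'
}

# Invented sample lines, for illustration only; prints 3.
printf '%s\n' \
    'R  1001  alice  cpu=8,mem=16G,node=1,billing=8,gres/gpu=2,gres/gpu:a100=2' \
    'R  1002  bob    cpu=4,mem=8G,node=1,billing=4,gres/gpu=1' \
    | sum_gpus_in_use

# Possible live usage (replace ACCOUNT with your account name):
#   squeue -A ACCOUNT -t RUNNING -h -O tres-alloc:70 | sum_gpus_in_use
```

Restricting the query to running jobs with -t RUNNING keeps pending jobs, which hold no GPUs yet, out of the total.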

If you run squeue --me and see a job reason code of "AssocGrpGRES" or "AssocGrpGRESMinutes", then your account is using all of its available GPUs.

nvidia-smi

All the compute nodes with GPUs have the nvidia-smi program installed.  An example of the output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   67C    P0            501W /  500W |   72629MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2714847      C   ./gpu_burn                            72618MiB |
+-----------------------------------------------------------------------------------------+

Using nvidia-smi, users can track the GPU utilization of their processes.
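
For scripted monitoring, nvidia-smi also has a CSV query mode that is easier to parse than the table above. The sketch below assumes that mode is used; summarize_gpus is a hypothetical helper, and the sample line reuses the utilization and memory figures from the example output.

```shell
# Hypothetical helper: read lines of "index, util, mem_used, mem_total"
# (the CSV form produced by the query in the comment below) and print a
# one-line summary per GPU.
summarize_gpus() {
    awk -F', *' '{ printf "GPU %s: %s%% util, %.1f%% memory\n", $1, $2, 100 * $3 / $4 }'
}

# Sample line built from the figures in the example output above;
# prints "GPU 0: 100% util, 88.7% memory".
echo "0, 100, 72629, 81920" | summarize_gpus

# Possible live usage on a GPU node:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#       --format=csv,noheader,nounits | summarize_gpus
```

Adding -l 5 to the nvidia-smi command repeats the query every five seconds, which can be handy for watching a running job.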