GPUs

Nova has more than 200 GPUs available for research purposes, in ten varieties.  A full list of hardware is available here.

Requesting a GPU

The syntax for requesting a GPU is --gres=gpu:<TYPE>:<COUNT>, where TYPE is one of the available GPU types and COUNT is the number of GPUs to request.  For a complete list of available GPU types, see here under "SLURM Code".  In general, it is in your best interest to make the most generic request possible: a job that can accept any available GPU usually leaves the queue sooner.  To request a single GPU of any type:

salloc -n 8 -N 1 --mem=16G --gres=gpu:1 --time=2:00:00

To request a specific card, for example two a100s, it would look like this:

salloc -n 8 -N 1 --mem=16G --gres=gpu:a100:2 --time=2:00:00
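
The same kind of request can also be made from a batch script submitted with sbatch.  The following is a minimal sketch rather than a site-provided template: the #SBATCH lines mirror the salloc example for a single GPU of any type, and count_visible_gpus is a hypothetical helper that assumes Slurm exports a comma-separated CUDA_VISIBLE_DEVICES list to the job, which depends on how the cluster is configured.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00

# Hypothetical helper: count the GPUs exposed to this job.
# Assumption: Slurm sets CUDA_VISIBLE_DEVICES to a comma-separated list
# (for example "0,1"); if it is unset or empty, report 0.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Changing the --gres line to gpu:a100:2 would request two a100s instead, at the cost of a potentially longer wait in the queue.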

Account Limits

Each account is limited in the number of GPUs it can use, so GPUs tend to be a contended resource.  To view the limits for your account, use sacctmgr:

sacctmgr show -s account name=<ACCOUNT> format=grptres%40,grptresrunmins%50

where ACCOUNT is the account you wish to view information for.

The output looks like this:

                                 GrpTRES                                     GrpTRESRunMins
---------------------------------------- --------------------------------------------------
      cpu=7200,gres/gpu=11,mem=38000000M    cpu=222851268,gres/gpu=275517,mem=989036690043M

Here, under GrpTRES, the entry gres/gpu=11 shows that this account has 11 GPUs available.
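
If only the GPU limit is of interest, it can be pulled out of that output with standard text tools.  This is a rough sketch: gpu_limit is a hypothetical helper name, and it assumes the first gres/gpu=<N> entry in the output is the GrpTRES value for the account, as in the sample above.  Typed entries such as gres/gpu:a100=<N>, if your account has any, are not matched.

```shell
# Hypothetical helper: read sacctmgr output on stdin and print the first
# untyped gres/gpu=<N> value, which in the sample output is the GrpTRES limit.
gpu_limit() {
    grep -o 'gres/gpu=[0-9]*' | head -n 1 | cut -d= -f2
}

# Possible usage on the cluster (replace <ACCOUNT> with your account name):
#   sacctmgr show -s account name=<ACCOUNT> format=grptres%40,grptresrunmins%50 | gpu_limit
```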

To see how many GPUs are actually in use for an account:

squeue -A <ACCOUNT> -O StateCompact:3,JobID:10,UserName,tres-alloc:70,ReasonList:20,TimeLimit,TimeLeft | grep gres\/gpu

If you run squeue --me and see a job reason code of "AssocGrpGRES" or "AssocGrpGRESMinutes", then your account is using all its available GPUs.
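
To compare usage against the account limit, the per-job GPU counts in the squeue output can be added up.  The sketch below makes several assumptions: gpus_in_use is a hypothetical helper, the first column is the compact job state (as produced by StateCompact:3), each job's allocation contains an untyped gres/gpu=<N> entry, and only running jobs (state R) should count.

```shell
# Hypothetical helper: read squeue output on stdin and print the total
# number of GPUs held by running jobs.
# Assumptions: column 1 is the compact state, and each job line carries
# a "gres/gpu=<N>" entry in its tres-alloc field.
gpus_in_use() {
    awk '$1 == "R" && match($0, /gres\/gpu=[0-9]+/) {
        # "gres/gpu=" is 9 characters long; the count follows it
        total += substr($0, RSTART + 9, RLENGTH - 9)
    } END { print total + 0 }'
}

# Possible usage on the cluster (replace <ACCOUNT> with your account name):
#   squeue -A <ACCOUNT> -O StateCompact:3,JobID:10,UserName,tres-alloc:70 | gpus_in_use
```

Pending jobs are skipped on purpose, since they do not hold any GPUs yet.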

nvidia-smi

All the compute nodes with GPUs have the nvidia-smi program installed.  An example of the output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   67C    P0            501W /  500W |   72629MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2714847      C   ./gpu_burn                            72618MiB |
+-----------------------------------------------------------------------------------------+

Using nvidia-smi, users can track the GPU utilization of their processes.
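
For scripted monitoring, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options.  The following is a small sketch, with gpu_summary as a hypothetical helper that turns one CSV line per GPU into a short summary; the field list used here (index, utilization.gpu, memory.used, memory.total) is one reasonable choice, not the only one.

```shell
# Hypothetical helper: read "index, util, mem_used, mem_total" CSV lines on
# stdin (as produced by the nvidia-smi command below) and print one summary
# line per GPU.
gpu_summary() {
    awk -F', *' '{ printf "GPU %s: util %s%%, mem %s/%s MiB\n", $1, $2, $3, $4 }'
}

# Possible usage on a GPU node:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | gpu_summary
```

With the values from the example output above, the CSV line would be "0, 100, 72629, 81920".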