Nova has more than 200 GPUs available for research, spread across ten GPU types. A full list of hardware is available here.
Requesting a GPU
The syntax for requesting a GPU is --gres=gpu:<TYPE>:<COUNT>, where TYPE is one of the available GPU types and COUNT is the number of GPUs to request. For a complete list of available GPU types, see here under "SLURM Code". In general, it is in your best interest to make the request as generic as possible: a job that accepts any available GPU will usually leave the queue sooner. To request a single GPU of any type:
salloc -n 8 -N 1 --mem=16G --gres=gpu:1 --time=2:00:00
To request a specific card type, for example two A100s (type a100):
salloc -n 8 -N 1 --mem=16G --gres=gpu:a100:2 --time=2:00:00
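The same request can also be placed in a batch script. The sketch below assumes the job is submitted with sbatch and that Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is typical but depends on how the cluster is configured. The echo line is only a placeholder payload.

```shell
#!/bin/bash
# Same resources as the salloc examples above, expressed as sbatch directives.
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=16G
#SBATCH --time=2:00:00
# Any GPU type; replace with --gres=gpu:a100:2 to ask for two A100s.
#SBATCH --gres=gpu:1

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs that were granted a GPU.
# Outside of a GPU job the variable is unset and "none" is reported.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpus}"
```

Because the #SBATCH lines are ordinary comments to the shell, the script also runs outside of Slurm, where it simply reports whatever CUDA_VISIBLE_DEVICES holds.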
Account Limits
Each account has a limit on the number of GPUs it can use, so GPUs tend to be a contended resource. To view the limits for your account, use sacctmgr:
sacctmgr show -s account name=<ACCOUNT> format=grptres%40,grptresrunmins%50
where ACCOUNT is the account you wish to view information for.
The output looks like this:
GrpTRES GrpTRESRunMins
---------------------------------------- --------------------------------------------------
cpu=7200,gres/gpu=11,mem=38000000M cpu=222851268,gres/gpu=275517,mem=989036690043M
Here, under GrpTRES, we see we have 11 GPUs available for this account.
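To pull the GPU limit out of a GrpTRES string in a script, something like the following can work. This is a minimal sketch: gpu_limit is a hypothetical helper name, and the sacctmgr invocation in the comment assumes the -n (no header) and -P (parsable) options and that the first line of output is the association of interest.

```shell
# Print the gres/gpu value from a comma-separated TRES string.
# gpu_limit is a hypothetical helper, not part of Slurm.
gpu_limit() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= '$1 == "gres/gpu" { print $2 }'
}

# Assumed usage on the cluster:
#   gpu_limit "$(sacctmgr -nP show -s account name=<ACCOUNT> format=grptres | head -n 1)"
gpu_limit "cpu=7200,gres/gpu=11,mem=38000000M"
```

For the sample string from the output above, this prints 11.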
To see how many GPUs are actually in use for an account:
squeue -A <ACCOUNT> -O StateCompact:3,JobID:10,UserName,tres-alloc:70,ReasonList:20,TimeLimit,TimeLeft | grep gres\/gpu
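The command above lists one line per job; to get a single total, the per-job counts can be summed. The sketch below is one way to do that, assuming each line carries a generic gres/gpu=N entry in its allocated TRES. sum_gpus is a hypothetical helper name, and the squeue options in the comment (-t RUNNING, -h) are an assumed way to restrict the input to running jobs without a header.

```shell
# Sum the generic gres/gpu=N entries found on standard input.
# Typed entries such as gres/gpu:a100=2 are skipped so GPUs are not counted twice.
sum_gpus() {
    grep -oE 'gres/gpu=[0-9]+' | awk -F= '{ total += $2 } END { print total + 0 }'
}

# Assumed usage on the cluster:
#   squeue -A <ACCOUNT> -t RUNNING -h -O tres-alloc:70 | sum_gpus
printf '%s\n' \
    'cpu=8,mem=16G,node=1,billing=8,gres/gpu=2,gres/gpu:a100=2' \
    'cpu=8,mem=16G,node=1,billing=8,gres/gpu=1' | sum_gpus
```

With the two sample lines shown, the result is 3. Comparing that number with the gres/gpu value under GrpTRES shows how close the account is to its limit.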
If you run squeue --me and see a job reason code of "AssocGrpGRES" or "AssocGrpGRESMinutes", your account is using all of its available GPUs.
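To check for these reason codes without scanning the full squeue output by eye, the job ID and reason columns can be filtered. This is a sketch only: gpu_limited_jobs is a hypothetical helper name, and the squeue format string in the comment assumes %i (job ID) and %r (reason).

```shell
# Read "JOBID REASON" lines on standard input and print the IDs of jobs
# whose reason starts with AssocGrpGRES (this also matches AssocGrpGRESMinutes).
gpu_limited_jobs() {
    awk '$2 ~ /^AssocGrpGRES/ { print $1 }'
}

# Assumed usage on the cluster:
#   squeue --me -h -o '%i %r' | gpu_limited_jobs
printf '%s\n' '101 AssocGrpGRES' '102 Priority' '103 AssocGrpGRESMinutes' | gpu_limited_jobs
```

For the sample input, jobs 101 and 103 are reported; job 102 is waiting for a different reason.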
nvidia-smi
All the compute nodes with GPUs have the nvidia-smi program installed. An example of the output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   67C    P0            501W /  500W |   72629MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2714847      C   ./gpu_burn                            72618MiB |
+-----------------------------------------------------------------------------------------+
Using nvidia-smi, users can track the GPU utilization of their processes.
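For scripted monitoring, nvidia-smi can also print selected fields as CSV, which is easier to parse than the table. The sketch below assumes the --query-gpu and --format=csv,noheader,nounits options and the field names shown in the comment. gpu_summary is a hypothetical helper name.

```shell
# Summarize CSV lines of the form "index, util %, used MiB, total MiB".
gpu_summary() {
    awk -F', *' '{
        printf "GPU %s: %s%% util, %s/%s MiB (%.1f%% of memory)\n", $1, $2, $3, $4, 100 * $3 / $4
    }'
}

# Assumed usage on a GPU node:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits | gpu_summary
echo '0, 100, 72629, 81920' | gpu_summary
```

The sample line mirrors the values in the table above (100% utilization, 72629 MiB of 81920 MiB in use).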