How Jobs are Scheduled
What is a job?
In the context of high-performance computing (HPC) and Slurm (Simple Linux Utility for Resource Management), a job is a unit of work submitted to the computing cluster. It consists of tasks that require computational resources, such as CPU time, memory, and storage. A job can be as simple as running a single executable or as complex as a set of interdependent tasks.
What is a scheduler?
A scheduler is a system software component that manages the allocation of computational resources to various jobs in an HPC environment. It determines when and where jobs are executed based on resource availability, job priority, and scheduling policies. Slurm acts as both a job scheduler and a resource manager.
How do they work together?
Slurm manages jobs by queuing them and allocating the required resources based on the cluster's scheduling policies and the job's resource requests. When a user submits a job, Slurm places it in a queue. The scheduler then decides the order and the nodes on which the jobs will run. This decision is influenced by factors like job priority, resource availability, and scheduling policies (e.g., FIFO, backfilling, fair share).
Serial versus Parallel Jobs
What is a serial job?
A serial job is a type of job that runs on a single processor core and executes a single task at a time. It does not utilize multiple cores or processors simultaneously. Serial jobs are often simpler to program and are suitable for tasks that do not require extensive computational power.
What is a parallel job?
A parallel job, on the other hand, is designed to run across multiple processor cores or nodes simultaneously. It divides the task into smaller sub-tasks that can be processed concurrently, thus significantly reducing the overall execution time for large-scale computations. Parallel jobs require careful design and programming to ensure efficient communication and synchronization between processes.
Serial and Parallel Jobs – Pros and Cons
Serial Jobs:
- Pros:
  - Simplicity in programming and debugging.
  - No need for complex synchronization mechanisms.
  - Lower resource requirements.
- Cons:
  - Limited by the performance of a single core.
  - Inefficient for large-scale computations or data-intensive tasks.
Parallel Jobs:
- Pros:
  - Can handle large-scale computations efficiently.
  - Reduced execution time by leveraging multiple cores or nodes.
  - Suitable for data-intensive tasks and simulations.
- Cons:
  - More complex to program and debug.
  - Requires careful management of inter-process communication and synchronization.
  - Higher resource requirements and potential for resource contention.
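The serial/parallel distinction can be sketched in plain bash with no scheduler involved: the same four toy tasks run one after another, then as concurrent background processes. This is an illustrative sketch only; on a real cluster, parallelism would come from multiple Slurm tasks, MPI ranks, or threads rather than shell background jobs.

```shell
#!/bin/bash
# Stand-in for real work; prints which task finished.
task() { echo "task $1 done"; }

# Serial: run the tasks strictly one after another.
serial_out=$(for i in 1 2 3 4; do task "$i"; done)

# Parallel: launch each task as a background process inside the
# command substitution; 'wait' blocks until all have finished.
# Note the completion order is no longer guaranteed.
parallel_out=$(for i in 1 2 3 4; do task "$i" & done; wait)

echo "$serial_out"
echo "$parallel_out"
```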
CPUs versus GPUs: What is the Difference?
What is a CPU?
A CPU (Central Processing Unit) is the primary processor of a computer, responsible for executing most general-purpose instructions. It is designed to handle a wide range of tasks and is optimized for single-threaded performance and low-latency execution of instructions.
Background on CPU:
CPUs have a few cores (typically 2 to 64) that are optimized for sequential task execution and rapid context switching. They have sophisticated control units capable of executing complex instruction sets and are designed to handle a variety of workloads, from simple calculations to running operating systems and applications.
What is a GPU?
A GPU (Graphics Processing Unit) is a specialized processor designed to accelerate graphics rendering and parallel processing tasks. It contains thousands of smaller cores designed for handling multiple tasks simultaneously, making it highly efficient for parallel processing.
Background on GPU:
GPUs were originally designed to handle the intense computational demands of graphics rendering, such as those required in video games and simulations. Over time, their architecture, which consists of thousands of lightweight cores, has proven to be highly effective for parallel computation tasks beyond graphics, such as scientific simulations, machine learning, and data analysis. This makes GPUs particularly well-suited for tasks that can be broken down into many parallel operations.
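In Slurm, a job typically asks for a GPU through the --gres (generic resource) option. The sketch below is only indicative: the exact gres syntax, the available GPU types, and whether a dedicated GPU partition must also be named are all cluster-specific, so check the local documentation for Nova or Pronto.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:1   # request one GPU; count/type syntax varies by cluster

# On clusters using the gres/gpu plugin, Slurm restricts the job to its
# allocated device(s), typically by setting CUDA_VISIBLE_DEVICES.
nvidia-smi
```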
Queues on Nova & Pronto
What is a queue?
In the context of high-performance computing (HPC) and job scheduling, a queue is a waiting line where jobs are held until the necessary resources (e.g., CPU, memory, nodes) become available. Queues manage job submissions by holding jobs in order until they can be executed based on priority, resource availability, and scheduling policies.
Types of queues
Different types of queues can exist in an HPC environment, often tailored to specific types of jobs or users. Common types include:
- Default queue: The standard queue for general-purpose jobs.
- Priority queue: For jobs that need to be executed with higher priority, often for urgent tasks.
- Short queue: For jobs that require a short amount of computation time.
- Long queue: For jobs that require a longer amount of computation time.
- GPU queue: For jobs that require GPU resources.
- Test or development queue: For testing and development purposes, usually with a short runtime and limited resources.
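In Slurm terminology, these queues are called partitions. A hedged sketch of how partitions are listed and selected follows; the partition name "short" is hypothetical, so run sinfo first to see the partitions actually configured on Nova or Pronto.

```shell
# List the partitions (queues) available on the cluster,
# including their time limits and node states:
sinfo

# Submit a job to a specific partition ("short" is a placeholder name):
sbatch --partition=short myjobscript.sh
```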
A Typical Job
The Job Script
What is a job script?
A job script is a text file that contains a series of commands and resource specifications used to submit a job to the scheduler. It provides the necessary instructions for the scheduler to allocate resources and execute the job on the cluster.
What does a job script consist of?
A job script typically includes:
- Shebang line: Specifies the script interpreter (e.g., #!/bin/bash).
- Slurm directives: Lines beginning with #SBATCH that specify job parameters such as the job name, number of nodes, CPUs, memory, and walltime.
- Environment setup: Commands to load necessary modules and set environment variables.
- Job commands: The actual commands to run the job, such as executing a program or script.
Example of a simple job script:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G
module load mysoftware
srun myprogram arg1 arg2
Running the Job
What do you need to run a job script?
To run a job script, you need:
- Access to the HPC cluster.
- A properly formatted job script.
- Appropriate permissions to submit jobs to the cluster.
- The sbatch command to submit the job script to Slurm.
Example command to submit a job script:
sbatch myjobscript.sh
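When submitting from a script, it helps to capture the job ID that sbatch assigns. The --parsable flag makes sbatch print only the job ID, which can then be fed to other Slurm commands; the usage below is a sketch of that pattern.

```shell
# --parsable makes sbatch print just the job ID (no surrounding text).
jobid=$(sbatch --parsable myjobscript.sh)

# The captured ID can then be used to monitor or cancel the job:
squeue -j "$jobid"
scancel "$jobid"
```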
Checking the Status
When to check the status of your job?
You should check the status of your job periodically after submission to monitor its progress and ensure it is running as expected.
How to check the status?
You can use the squeue command to check the status of your jobs.
Example command to check job status:
squeue -u yourusername
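The default squeue output can be customized with an output format string. A sketch using standard squeue format specifiers (job ID, partition, name, state, elapsed time):

```shell
# Show only your jobs, with job ID, partition, name, state, and elapsed time.
# %t prints a short state code: PD = pending, R = running, CG = completing.
squeue -u "$USER" -o "%.10i %.9P %.20j %.2t %.10M"
```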
Checking the Output
When to check the output of your job?
You should check the output of your job after it has completed or if it has been running for a significant amount of time to ensure it is producing the expected results.
What to look for:
- Output files: Check the specified output files (e.g., myjob.out) for results.
- Error files: Look at the error files (e.g., myjob.err) for any issues or messages.
- Log files: Review any log files generated by your job for additional information.
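Once a job finishes it disappears from the squeue listing, so the accounting command sacct is the usual way to confirm how it ended. A sketch, where the job ID 123456 is a placeholder:

```shell
# Report the job's final state, exit code, runtime, and peak memory use.
sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

# Quickly scan the error file for obvious problems:
grep -iE "error|fail" myjob.err
```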