Resource Utilization and Monitoring¶
Now that we can check our job history, let's learn how to check that we are utilizing our resources (CPUs, GPUs, memory) efficiently. There is no special trick to inferring usage; we have to measure it directly.
Checking Usage Directly¶
You can use the jobinfo program with the job ID to see which node your job is on, then ssh directly to that node. Let's run our job again:
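For example (a sketch assuming the submission script from earlier is named example.sh; `<jobid>` is a placeholder for the ID that sbatch prints):

```bash
sbatch --exclusive example.sh   # resubmit the job, taking the whole node
jobinfo <jobid>                 # shows the job's state and the node it was assigned
```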
Note
The use of --exclusive (or similar) is important here so that our measurements are not polluted by other users' jobs on the node. --exclusive is simply a constraint that ensures we are the only person running a job on the node that gets allocated.
Once you've run jobinfo to determine which node your job has landed on, you can ssh directly to the node. This is something you can do only while you have a Slurm job running on that node. Once on the node, use a tool like htop to inspect CPU and memory activity (press q to quit).
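For instance (the node name below is a placeholder; use the one jobinfo reported):

```bash
ssh <nodename>   # only works while you have a job running on that node
htop             # interactive CPU/memory view; press q to quit
```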
Note
When using ssh within a cluster's subnet, there's no need to specify the full hostname or the username. Further, we don't allow users to ssh onto a node unless they have a job running there.
Monitor¶
However, just sshing onto the node isn't real telemetry; we want to collect and store the data. To do this, we can use the monitor utility (which is RCAC-specific) to gather CPU and GPU metrics.
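For example, the following streams CPU utilization to the terminal (the `cpu percent` subcommand here is an assumption based on common usage of the RCAC monitor tool; confirm the exact syntax with `monitor --help`):

```bash
# Stream CPU utilization to the terminal; runs until interrupted
monitor cpu percent
```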
This will run indefinitely, so we need to stop it with a keyboard interrupt. Windows and Linux users can press Ctrl+C; Mac users can press Cmd+. (the period key).
Note
Use --help or man monitor to check for usage details. You can also check our user guides for more recommendations.
Now let's do this for an actual Slurm job. Let's edit our example.sh submission script to look like this:
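A minimal sketch of the edited script, assuming the `monitor` subcommands and module name above (the account, walltime, and `./my_application` workload are placeholders for your own values):

```bash
#!/bin/bash
#SBATCH --account=myaccount   # placeholder: use your own allocation
#SBATCH --nodes=1
#SBATCH --exclusive           # take the whole node so the telemetry is ours alone
#SBATCH --time=00:10:00       # placeholder walltime

# Load the monitor utility (module name assumed; check your cluster's docs)
module load monitor

# Start each monitoring task in the background (&) before the application
monitor cpu percent > cpu-percent.log &
CPU_PID=$!
monitor cpu memory > cpu-memory.log &
MEM_PID=$!

# Run the actual workload (placeholder application)
./my_application

# Stop the monitors once the workload finishes
kill -s INT $CPU_PID $MEM_PID
```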
Be sure to ask for all the resources on the node (with --exclusive) so you don't collect data on your neighbor's job! Start each monitoring task before starting your application.
Why do we need to use the & on each of these commands in the script?
The & puts the process into the background! If we didn't, the node would be stuck on the monitor command until the walltime ran out. Check out Managing Processes from Week 3 if you need a refresher.
Now, let's run the new monitored submission file:
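Assuming the script above is saved as example.sh:

```bash
sbatch example.sh      # submit the monitored job
squeue -u $USER        # watch its progress in the queue
```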
Once it's done, let's look at the output of the files:
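For example, using the log file names from the script above (the actual contents will vary from job to job):

```bash
head cpu-percent.log cpu-memory.log
```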
Why is the CPU utilization so low, at around 0.5%?
Because our workflow can currently use only a single CPU at a time, the rest of the CPUs allocated to our job sit idle.
Next Section: Managing Workloads and Jobs