Resource Utilization and Monitoring¶
Previous Section: Job History
Now that we can submit multi-node jobs, we should check that they actually use the resources they request. There is no way to infer utilization other than to measure it directly on the node.
You can use the jobinfo program with your job ID to see which node your job is running on, then ssh directly to that node.
Note
The use of --exclusive (or a similar flag) is important here so that your measurements aren't polluted by other users' jobs sharing the node.
Once you've run jobinfo to determine which node your job landed on, you can ssh directly to that node. This is only possible while you have a Slurm job running there. Once on the node, use a tool like htop to inspect CPU and memory activity (press q to quit).
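A typical session might look like the following. The job ID and node name here are illustrative assumptions; yours will differ, and the exact jobinfo output format depends on your cluster:

```bash
# Look up which node the job was assigned (job ID is a placeholder)
jobinfo 1234567

# ssh to that node (hypothetical node name taken from the jobinfo output)
ssh a123

# On the node, inspect CPU and memory activity interactively; press q to quit
htop
```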
Note
When using ssh within a cluster's subnet, you don't need to specify the full domain name or your username.
However, this isn't real telemetry: we want to collect and store the data. To do this, use the monitor utility (which is RCAC-specific) to gather CPU and GPU metrics.
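For example, you can stream utilization samples straight to your terminal. The subcommand names below are assumptions based on RCAC's monitor tool; verify them with monitor --help on your cluster:

```bash
# Stream CPU utilization samples to the terminal
# (subcommand names are assumptions; check `monitor --help`)
monitor cpu percent

# Similarly, on a GPU node, stream GPU utilization
monitor gpu percent
```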
This will run indefinitely, so we need to stop it with a keyboard interrupt: press Ctrl + C on Windows and Linux, or Cmd + . on a Mac.
Note
Use monitor --help or man monitor to check the usage details. You can also check our user guides for more recommendations.
Now let's do this for an actual Slurm job. Edit your example.sh submission script to look like this:
Numpy Again
If you ran into the "no module named numpy" error earlier, you will need to activate your conda environment inside your job script as well.
Be sure to request all the resources on the node so you don't collect data on your neighbors' jobs! Start each monitoring task in the background before starting your application.
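A sketch of what example.sh could look like, assuming the monitor subcommands shown earlier; the walltime, log file names, and the application command are placeholders to adapt to your own job:

```bash
#!/bin/bash
#SBATCH --job-name=monitored-job
#SBATCH --nodes=1
#SBATCH --exclusive          # take the whole node so the telemetry only reflects our job
#SBATCH --time=00:30:00      # placeholder walltime

# Start each monitor in the background (&) and redirect its samples to a log file
# (subcommand and file names are assumptions; adapt to your cluster)
monitor cpu percent > cpu-percent.log &
monitor cpu memory  > cpu-memory.log &

# Remember the monitors' process IDs so we can stop them later
MONITOR_PIDS="$(jobs -p)"

# Run the actual application (placeholder command)
python my_script.py

# Stop the monitors once the application finishes
kill $MONITOR_PIDS
```

The kill at the end is optional (Slurm cleans up leftover processes when the job ends), but stopping the monitors explicitly keeps the logs bounded to your application's runtime.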
Quiz: Why do we need the & after each of the monitor commands in the script?
Answer
If we didn't, the script would block on the first monitor command until the walltime ran out, and our actual application would never start. The & runs each monitor in the background so the script can continue.
Now, let's run the new monitored submission file:
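Submission works the same as before:

```bash
sbatch example.sh      # submit the monitored job
squeue -u $USER        # check that it is queued or running
```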
Once it's done, let's look at the contents of the log files:
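For example, assuming the monitors wrote to cpu-percent.log and cpu-memory.log (the file names are whatever you chose in your submission script):

```bash
# Peek at the most recent samples in each log
tail cpu-percent.log
tail cpu-memory.log
```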
Quiz: Why is the CPU utilization so low, around 0.5%?
Answer
Because our workflow currently uses only a single CPU core at a time, so the rest of the cores allocated to our job sit idle.
Next Section: Managing Workloads and Jobs