Jan 30, 2020
Hi Russell —
There are a few things to check here:
- Try restarting all the nodes, starting with the master node. Ensure that the slurmd and slurmctld services are running on the master node and the appropriate ports are open.
- Ensure that the munge service is running on the master node.
- Ensure that the slurmd services are running on the client nodes. Ensure that the munge service is running on the client nodes.
- Test that munge authentication works between each of the nodes and the master node.
- Make sure that the IPs in the slurm.conf are correct, and you can ping the nodes from the master node.
- Using scontrol, try manually resetting the node states (e.g. scontrol update NodeName=node10 State=RESUME).
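If it helps, the checks above can be sketched as a small shell script to run from the master node. This is only a rough sketch, not something tested on your cluster: it assumes the services are managed by systemd and that you have passwordless ssh to the nodes. The function names (run, check_master, check_node, resume_node) and the DRY_RUN and RESUME variables are made up for this example, and node10 is just the example node name from above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: prints (or, with DRY_RUN=0, runs) the checks from the list above.
# Assumptions: systemd-managed services, passwordless ssh from the master
# to each node, node names passed as arguments (e.g. node10).

DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode print the command, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Services that should be active on the master node.
check_master() {
    run systemctl is-active slurmctld slurmd munge
}

# Reachability, services, and munge authentication for one client node.
check_node() {
    node="$1"
    run ping -c 1 "$node"
    run ssh "$node" systemctl is-active slurmd munge
    # Encode a credential on the master and decode it on the node.
    run sh -c "munge -n | ssh $node unmunge"
    # Show the node's current state and the reason it was marked down, if any.
    run scontrol show node "$node"
}

# Manually return a node to service.
resume_node() {
    run scontrol update NodeName="$1" State=RESUME
}

check_master
for n in "$@"; do
    check_node "$n"
    # Only resume when explicitly asked to (RESUME=1).
    if [ "${RESUME:-0}" = "1" ]; then
        resume_node "$n"
    fi
done
```

Saved as, say, check_nodes.sh, running "sh check_nodes.sh node10" prints the commands so you can review them first, and "DRY_RUN=0 RESUME=1 sh check_nodes.sh node10" actually runs them and resumes the node. For the open ports, slurmctld listens on 6817 and slurmd on 6818 unless SlurmctldPort or SlurmdPort is set differently in slurm.conf.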