Garrett Mills
1 min read · Jan 30, 2020

Hi Russell,

There are a few things to check here:

  1. Try restarting all the nodes, starting with the master node. Ensure that the slurmd and slurmctld services are running on the master node and that the appropriate ports are open (by default, 6817 for slurmctld and 6818 for slurmd).
  2. Ensure that the munge service is running on the master node.
  3. Ensure that the slurmd and munge services are running on the client nodes.
  4. Test that munge authentication works between the master node and each of the client nodes.
  5. Make sure that the IPs in slurm.conf are correct and that you can ping the nodes from the master node.
  6. Using scontrol, try manually resetting the node states (e.g. scontrol update NodeName=node10 State=RESUME).
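
If it helps, the checks above could be scripted roughly as follows. This is only a sketch, not a definitive diagnostic: it assumes the services are systemd units named munge, slurmctld, and slurmd, that the master node can SSH into each client node, and that your ping accepts the Linux-style -W timeout. The function names (report, check_services, check_node, resume_node) are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the checks in the list above, meant to be run on the master node.
# Assumptions: systemd units named munge, slurmctld, and slurmd; SSH access
# from the master node to each client node. All function names are illustrative.

# Print "ok" or "FAIL" for a labelled check, given its exit status.
report() {
  local label=$1 status=$2
  if [ "$status" -eq 0 ]; then
    echo "ok   $label"
  else
    echo "FAIL $label"
  fi
}

# Steps 1-3: check that each named service is active on this machine.
check_services() {
  local svc
  for svc in "$@"; do
    systemctl is-active --quiet "$svc"
    report "service $svc" $?
  done
}

# Steps 4-5: check that a node answers ping and can decode a munge
# credential created on this machine.
check_node() {
  local node=$1
  ping -c 1 -W 2 "$node" >/dev/null 2>&1
  report "ping $node" $?
  munge -n | ssh "$node" unmunge >/dev/null 2>&1
  report "munge $node" $?
}

# Step 6: ask Slurm to return a node to service.
resume_node() {
  scontrol update NodeName="$1" State=RESUME
}
```

After sourcing the file on the master node, check_services munge slurmctld slurmd covers the local services, check_node node10 covers one client node, and resume_node node10 applies step 6. Running sinfo -N -l afterwards should show whether the node has left the down state.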

Written by Garrett Mills

Hi, there. I’m a software developer and speaker who likes to make things: https://garrettmills.dev/