FAQ
Because I'm tired of explaining the same things over and over again
Using the Cluster
How do I get my code on the cluster?
From a workstation with your code on it:
Your code is now on infosphere
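As a minimal sketch of the copy step, assuming your code sits in a directory named my_project (a made-up name) and that scp to infosphere works from your workstation, something like the following would put it in your home directory there. The choice of scp is an assumption; the helper only prints the command unless DRY_RUN=0 is set.

```shell
# Sketch only: "my_project" is a hypothetical directory; scp is an assumed tool.
push_code() {
    src="$1"
    host="${2:-infosphere}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "scp -r $src $host:"      # default: show the command, copy nothing
    else
        scp -r "$src" "$host:"         # recursive copy into your remote home directory
    fi
}

push_code my_project
```

Run it as `DRY_RUN=0 push_code my_project` once the printed command looks right.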
Where do I go to run code on the cluster?
From RAS or a workstation:
How do I compile code on the cluster?
If using a single C/C++ file like a pleb:
If using CMake like a boss:
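A rough sketch of both cases, assuming an MPI C source file called my_program.c and a CMake project in the current directory (both names are made up). mpicc is an assumption too: use mpicxx for C++, or plain gcc/g++ for non-MPI code. The run helper prints each command unless DRY_RUN=0 is set.

```shell
# Sketch only: mpicc and the file/directory names are assumptions.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Single source file: the binary is named after the source file.
build_single() {
    src="$1"
    run mpicc -O2 -o "${src%.*}" "$src"
}

# CMake project: configure and build in a separate build/ directory.
# The function body is a subshell, so the caller's directory is unchanged.
build_cmake() (
    run mkdir -p build &&
    run cd build &&
    run cmake .. &&
    run make
)

build_single my_program.c
build_cmake
```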
Do I need to compile my code on the cluster?
YES, there is no other option; binaries built on a workstation will never be compatible. This is not a challenge, this is a statement of fact. Do not complain if your a.out from the workstation doesn't work.
Now that I did all that, how do I run code on the cluster?
If using an MPI binary (most parallel computing classes), or just running a non-MPI binary n number of times:
Alternative (non-preferred) method of running an MPI binary:
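A sketch of what the two ways of launching could look like, assuming a binary called ./my_program and 4 tasks (both made up), and assuming the alternative method means Open MPI's mpirun inside a Slurm allocation. srun -n asks Slurm to start that many tasks. The run helper prints each command unless DRY_RUN=0 is set.

```shell
# Sketch only: the task count and binary name are hypothetical.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Let Slurm launch n copies: n MPI ranks, or n independent runs of a non-MPI binary.
run_on_cluster() {
    n="$1"; shift
    run srun -n "$n" "$@"
}

# Assumed form of the alternative: get an allocation, then let mpirun start the ranks.
run_with_mpirun() {
    n="$1"; shift
    run salloc -n "$n" mpirun "$@"
}

run_on_cluster 4 ./my_program
run_with_mpirun 4 ./my_program
```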
That gives me some error about not having an account! What gives?
Ask a sysadmin if they can make you an account on the cluster. Point them to this document if they don't know how.
My job just hangs forever!
Either someone is inconsiderately using the entire cluster, or the cluster is broken.
To check the former, use the sinfo and squeue commands.
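To see who is holding the cluster, one option (a sketch, not the only way) is to count running jobs per user. The jobs_per_user helper is a made-up name; the squeue options used are -h (no header), -t RUNNING (only running jobs) and -o '%u' (print just the user name).

```shell
# Count lines per user name read from stdin, busiest user first.
jobs_per_user() {
    sort | uniq -c | sort -rn
}

# Guarded so it does nothing harmful on a machine without the Slurm tools.
if command -v sinfo >/dev/null 2>&1 && command -v squeue >/dev/null 2>&1; then
    sinfo                                           # partitions and node states
    squeue -h -t RUNNING -o '%u' | jobs_per_user    # running jobs per user
else
    echo "Slurm client tools not found; run this on infosphere" >&2
fi
```

If every node shows as allocated and one user owns most of the jobs, your job is probably just waiting in the queue.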
Managing the Cluster
Or, "I was put in charge of your mess, Jack. You better have documented everything." Don't worry, I didn't. Keeps you on your toes.
Note: everything here expects you have root access on infosphere, and are running all these commands after getting root with Kerberos kinit + ksu.
What are the basic troubleshooting steps?
sinfo to view the state of the cluster
squeue to see what jobs are holding things up
Checking files in /var/log/slurm on infosphere and the affected nodes
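For the log-checking step, a small sketch that prints the tail of every readable file in the log directory. The function name is made up, and the file names inside /var/log/slurm depend on the Slurm configuration, so it globs the whole directory instead of naming files.

```shell
# Show the last 20 lines of each readable regular file in a log directory.
tail_slurm_logs() {
    dir="${1:-/var/log/slurm}"
    for f in "$dir"/*; do
        # Skips directories, unreadable files, and the unmatched glob itself.
        [ -f "$f" ] && [ -r "$f" ] || continue
        echo "== $f"
        tail -n 20 "$f"
    done
}

tail_slurm_logs
```

Run it on infosphere and again on whichever node is misbehaving. sinfo -R also lists the reason Slurm recorded for each down or drained node.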
Someone wants me to make an account for them!
Run this in /root on infosphere:
How do I run the ansible play?
From infosphere (this is important):
The above command basically runs the default, updating-only play on all the cluster nodes, including infosphere.
I want to upgrade a specific part of the cluster, how do?
Run the corresponding ansible .yml playbooks from infosphere.
I would like to install a new version of something on the cluster, how do?
The current software names and versions are under [fullcluster:vars] in the hosts file in the ansible repository. Change the version number to the latest available, and run the following command, replacing <software_name> with something like abinit or openmpi.
The plays should automatically uninstall the old version for you when updating.
IMPORTANT NOTE: If you update pmix, you also need to reinstall slurm and openmpi.
After a slurm upgrade, everything went down. How fix?
Check that the slurm version is the same everywhere
ansible fullcluster -i hosts -m command -a "bash -c 'systemctl disable slurmd; systemctl enable slurmd; systemctl restart slurmd'"
(disable slurmd on infosphere afterwards)
Really, just check the logs yourself, listen to error messages, Google your way to victory.
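For the version check, one possible approach (a sketch, reusing the fullcluster group and hosts inventory from the restart command) is to ask every node for slurmd -V and count the distinct answers. It assumes ansible's one-line output (-o) places the command output after the marker "(stdout) "; adjust the sed pattern if yours looks different. summarize_versions is a made-up helper name.

```shell
# Count how many nodes report each version, given `ansible -o` output on stdin.
# Assumption: each line contains "(stdout) " followed by the command's output.
summarize_versions() {
    sed -n 's/.*(stdout) //p' | sort | uniq -c
}

# Only runs from a directory that holds the inventory file, on a host with ansible.
if command -v ansible >/dev/null 2>&1 && [ -f hosts ]; then
    ansible fullcluster -i hosts -o -m command -a "slurmd -V" | summarize_versions
fi
```

More than one line of output means at least one node is running a different slurm version from the rest.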
How to recreate the slurmdbd database (last resort)?
systemctl stop slurmdbd slurmctld && systemctl restart mariadb && mysql
CREATE DATABASE slurm;
If this command fails, skip to the end of this list
USE slurm;
SHOW TABLES;
For every table in the database, DROP TABLE <table_name>;
exit
systemctl start slurmdbd slurmctld
sacctmgr add cluster hpc
find /cluster -type d | /root/make_new_users.sh
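Dropping every table by hand gets tedious, so here is a sketch of generating the DROP TABLE statements from a list of table names. The helper name and the example table names are made up; the mysql options in the comments (-N to skip column names, -B for batch output, -e to run a statement) are standard client flags. Nothing here touches a database unless you pipe the output into mysql yourself.

```shell
# Turn table names (one per line on stdin) into DROP TABLE statements.
drop_statements() {
    while IFS= read -r table; do
        [ -n "$table" ] || continue            # ignore blank lines
        printf 'DROP TABLE `%s`;\n' "$table"
    done
}

# Review the generated statements first, then apply them only if you are sure:
#   mysql -N -B -e 'SHOW TABLES' slurm | drop_statements
#   mysql -N -B -e 'SHOW TABLES' slurm | drop_statements | mysql slurm
printf 'example_table_a\nexample_table_b\n' | drop_statements
```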