Because I'm tired of explaining the same things over and over again
How do I get my code on the cluster?
From a workstation with your code on it:
scp -rp folder_with_your_code/ infosphere:~
Your code is now on infosphere
How do I compile code on the cluster?
If using a single C/C++ file like a pleb:
mpicc your_file.c # or mpicxx if using C++
If using CMake like a boss:
cmake . -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
make
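If you would rather keep build files out of your source tree, here is a minimal sketch of the same CMake route as an out-of-source build. Assumptions: build_mpi is a made-up helper name, "build" is an arbitrary directory name, and the -S/-B flags need CMake 3.13 or newer (the cluster's CMake version is not stated here). Set DRY_RUN=1 to print the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: build_mpi is a hypothetical helper, not something installed on the cluster.
# Usage: build_mpi [source_dir] [build_dir]
build_mpi() {
    local src_dir="${1:-.}"
    local build_dir="${2:-build}"
    local run=""
    # With DRY_RUN set, prefix every command with echo so nothing is executed.
    if [ -n "${DRY_RUN:-}" ]; then run="echo"; fi

    # Configure with the MPI compiler wrappers (assumes CMake >= 3.13 for -S/-B).
    $run cmake -S "$src_dir" -B "$build_dir" \
        -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
    # Build whatever generator CMake picked (make by default on Linux).
    $run cmake --build "$build_dir"
}
```

Run it on infosphere from the folder you copied over, e.g. build_mpi . build.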
Do I need to compile my code on the cluster?
YES. There is no other option: binaries built elsewhere will never be compatible. This is not a challenge, this is a statement of fact. Do not complain if your a.out from the workstation doesn't work.
Now that I did all that, how do I run code on the cluster?
If using an MPI binary (most parallel computing classes), or just running a non-MPI binary n number of times:
srun -N <number_of_machines> -n <number_of_processes> --mpi=pmix_v2 ./your_binary --your-binary-flags
Alternative (non-preferred) method of running an MPI binary:
salloc -N <number_of_machines> -n <number_of_processes> mpiexec ./your_binary --your-binary-flags
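If you get tired of retyping the preferred srun line, a small wrapper along these lines might help. This is a sketch, not something installed on the cluster: run_mpi is a made-up name, and the only thing it does is fill in the srun command shown above. Set DRY_RUN=1 to see the command without submitting anything.

```shell
#!/usr/bin/env bash
# Sketch only: run_mpi is a hypothetical wrapper around the srun line from this document.
# Usage: run_mpi <number_of_machines> <number_of_processes> ./your_binary [binary flags...]
run_mpi() {
    if [ "$#" -lt 3 ]; then
        echo "usage: run_mpi <machines> <processes> <binary> [binary flags...]" >&2
        return 1
    fi
    local machines="$1" processes="$2"
    shift 2   # everything left is the binary and its flags
    local run=""
    # With DRY_RUN set, print the command instead of running it.
    if [ -n "${DRY_RUN:-}" ]; then run="echo"; fi

    $run srun -N "$machines" -n "$processes" --mpi=pmix_v2 "$@"
}
```

For example, DRY_RUN=1 run_mpi 2 8 ./your_binary --your-binary-flags prints the srun command it would have run.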
That gives me some error about not having an account! What gives?
Ask a sysadmin if they can make you an account on the cluster. Point them to this document if they don't know how.
My job just hangs forever!
Either someone is inconsiderately using the entire cluster, or the cluster is broken.
To check the former, use the
Or, "I was put in charge of your mess Jack you better have documented everything." Don't worry, I didn't. Keeps you on your toes.
Note: everything here assumes you have root access on infosphere and are running all these commands after getting root with Kerberos.
What are the basic troubleshooting steps?
sinfo to view the state of the cluster
squeue to see what jobs are holding things up
Checking files in /var/log/slurm on infosphere and the affected nodes
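The three steps above can be strung together into one first-look function. This is a sketch under assumptions: cluster_triage is a made-up name, sinfo -R (list the reasons nodes are down or drained) is a standard Slurm flag but not one this document mentions, and the log file names inside /var/log/slurm are not known here, so it only lists the most recently changed files. It covers the machine you run it on; repeat on the affected nodes.

```shell
#!/usr/bin/env bash
# Sketch only: cluster_triage is a hypothetical helper name.
# Usage: cluster_triage [log_dir]   (log_dir defaults to /var/log/slurm)
cluster_triage() {
    local log_dir="${1:-/var/log/slurm}"

    echo "== sinfo: state of the cluster =="
    sinfo
    echo "== sinfo -R: reasons nodes are down or drained =="
    sinfo -R
    echo "== squeue: jobs holding things up =="
    squeue
    echo "== most recently changed files in $log_dir =="
    # Newest first; read the top ones with less or tail.
    ls -lt "$log_dir" | head -n 5
}
```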
Someone wants me to make an account for them!
Run this in /root on infosphere:
cat your_user_list.txt | ./make_new_users.sh
How do I run the ansible play?
From infosphere (this is important):
cd ansible
ansible-playbook -i hosts -f 50 --ask-vault-pass cluster.yml --skip-tags=install
The above command basically runs the default, updating-only play on all the cluster nodes, including infosphere.
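Before letting a play loose on every node, it may be worth asking ansible what it would touch. A sketch, assuming you are in the ansible directory on infosphere: preview_play is a made-up name, and --list-hosts and --list-tags are standard ansible-playbook options that only print information and run no tasks. Whether the listing also prompts for the vault password depends on how the repository is set up, so the flag is kept here.

```shell
#!/usr/bin/env bash
# Sketch only: preview_play is a hypothetical helper; run it from the ansible directory.
preview_play() {
    local run=""
    # With DRY_RUN set, print the commands instead of running them.
    if [ -n "${DRY_RUN:-}" ]; then run="echo"; fi

    # Which hosts would the default, updating-only play run on?
    $run ansible-playbook -i hosts --ask-vault-pass cluster.yml --skip-tags=install --list-hosts
    # Which tags does cluster.yml define (useful for --tags and --skip-tags)?
    $run ansible-playbook -i hosts --ask-vault-pass cluster.yml --list-tags
}
```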
I want to upgrade a specific part of the cluster, how do?
Run the corresponding ansible .yml playbooks from infosphere
I would like to install a new version of something on the cluster, how do?
The current software names and versions are under [fullcluster:vars] in the hosts file in the ansible repository.
Change the version number to the latest available, and run the following command replacing <software_name> with something like
ansible-playbook -i hosts -f 50 cluster.yml --tags=install_<software_name>
The plays should automatically uninstall the old version for you when updating.
IMPORTANT NOTE: If you update pmix, you also need to reinstall
After a slurm upgrade, everything went down. How fix?
Check that the slurm version is the same everywhere
ansible fullcluster -i hosts -m command -a "bash -c 'systemctl disable slurmd; systemctl enable slurmd; systemctl restart slurmd'" (disable slurmd on infosphere afterwards)
Really, just check the logs yourself, listen to error messages, Google your way to victory.
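For the "same version everywhere" check, eyeballing a long list of nodes is error-prone, so here is a sketch of a comparison helper. Assumptions: check_same_version is a made-up name, it expects lines of "<host> <version string>" on stdin, and how you collect those lines is up to you. The commented usage relies on slurmd -V (which prints the Slurm version) and a hypothetical nodes.txt with one hostname per line.

```shell
#!/usr/bin/env bash
# Sketch only: check_same_version is a hypothetical helper.
# Reads "<host> <version string>" lines on stdin; returns non-zero unless
# every host reports exactly the same version string.
check_same_version() {
    local versions count
    # Drop the host column, then keep each distinct version string once.
    versions="$(awk '{ $1 = ""; sub(/^ +/, ""); print }' | sort -u)"
    count="$(printf '%s\n' "$versions" | awk 'NF { n++ } END { print n + 0 }')"

    if [ "$count" -eq 1 ]; then
        echo "OK: every host reports: $versions"
    else
        echo "MISMATCH: hosts report different versions (or no input):"
        printf '%s\n' "$versions"
        return 1
    fi
}

# Hypothetical usage (nodes.txt is an assumption, one hostname per line):
#   while read -r host; do
#       echo "$host $(ssh -n "$host" slurmd -V)"
#   done < nodes.txt | check_same_version
```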
How to recreate the slurmdbd database (last resort)?
systemctl stop slurmdbd slurmctld && systemctl restart mariadb && mysql
CREATE DATABASE slurm;
If this command fails, skip to the end of this list
For every table in the database,
DROP TABLE <table_name>;
systemctl start slurmdbd slurmctld
sacctmgr add cluster hpc
find /cluster -type d | /root/make_new_users.sh
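The "for every table" step in the list above gets tedious by hand. A sketch of generating the statements instead: make_drop_statements is a made-up name, it reads table names on stdin and prints one DROP TABLE per name. The commented usage assumes the database is called slurm (as in the CREATE DATABASE step) and uses standard mysql client flags (-N no column names, -B batch output, -e run a statement). Read the generated statements before feeding them back in; this is the last resort, after all.

```shell
#!/usr/bin/env bash
# Sketch only: make_drop_statements is a hypothetical helper.
# Reads table names on stdin (one per line) and prints a DROP TABLE statement for each.
make_drop_statements() {
    local table
    while IFS= read -r table; do
        # Skip blank lines.
        [ -n "$table" ] || continue
        # Backticks quote the identifier for MySQL/MariaDB.
        printf 'DROP TABLE `%s`;\n' "$table"
    done
}

# Hypothetical usage, as root on infosphere with slurmdbd and slurmctld stopped:
#   mysql -N -B -e 'SHOW TABLES' slurm | make_drop_statements > drop_tables.sql
#   less drop_tables.sql          # review first
#   mysql slurm < drop_tables.sql
```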