FAQ
Because I'm tired of explaining the same things over and over again
Using the Cluster
How do I get my code on the cluster?
From a workstation with your code on it:
scp -rp folder_with_your_code/ infosphere:~
Your code is now on infosphere.
Where do I go to run code on the cluster?
From RAS or a workstation:
ssh infosphere
How do I compile code on the cluster?
If using a single C/C++ file like a pleb:
mpicc your_file.c # or mpicxx if using C++
If using CMake like a boss:
cmake . -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
make
Do I need to compile my code on the cluster?
YES. There is no other option; binaries built elsewhere will never be compatible. This is not a challenge, this is a statement of fact. Do not complain if your a.out from the workstation doesn't work.
Now that I did all that, how do I run code on the cluster?
If using an MPI binary (most parallel computing classes), or just running a non-MPI binary n number of times:
srun -N <number_of_machines> -n <number_of_processes> --mpi=pmix_v2 ./your_binary --your-binary-flags
Alternative (non-preferred) method of running an MPI binary:
salloc -N <number_of_machines> -n <number_of_processes> mpiexec ./your_binary --your-binary-flags
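The srun line is easy to mistype, so here is a minimal sketch of a wrapper that only prints it. The name srun_cmd is a hypothetical helper, not something installed on the cluster; it just fills in the preferred srun form from the answer above so you can look at it before running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not installed on the cluster): prints the preferred
# srun command for a given machine count and process count.
srun_cmd() {
    local machines="$1" processes="$2"
    shift 2
    # Everything after the first two arguments is the binary and its flags.
    echo "srun -N ${machines} -n ${processes} --mpi=pmix_v2 $*"
}

# Example: 2 machines, 8 processes.
srun_cmd 2 8 ./your_binary --your-binary-flags
```

Copy the printed line into your shell once it looks right.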
That gives me some error about not having an account! What gives?
Ask a sysadmin if they can make you an account on the cluster. Point them to this document if they don't know how.
My job just hangs forever!
Either someone is inconsiderately using the entire cluster, or the cluster is broken.
To check the former, use the sinfo and squeue commands.
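To see who is actually holding the nodes, the squeue output can be summed per user. This is a rough sketch, not an official tool: nodes_by_user is a hypothetical function name, and it assumes squeue is called with the format fields %u (user), %t (compact job state) and %D (node count).

```shell
#!/usr/bin/env bash
# Hypothetical helper: sums the nodes held by running jobs, per user.
# Expects "user state nodes" lines on stdin, for example from:
#   squeue -h -o "%u %t %D" | nodes_by_user
nodes_by_user() {
    # Only count jobs in the running state ("R"); pending jobs hold nothing.
    awk '$2 == "R" { nodes[$1] += $3 }
         END { for (u in nodes) print u, nodes[u] }' | sort -k2,2nr
}
```

The biggest node hog ends up on the first line of the output.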
Managing the Cluster
Or, "I was put in charge of your mess Jack you better have documented everything." Don't worry, I didn't. Keeps you on your toes.
Note: everything here expects you have root access on infosphere, and are running all these commands after getting root with Kerberos kinit + ksu.
What are the basic troubleshooting steps?
sinfo to view the state of the cluster
squeue to see what jobs are holding things up
Checking files in /var/log/slurm on infosphere and the affected nodes
Someone wants me to make an account for them!
Run this in /root on infosphere:
cat your_user_list.txt | ./make_new_users.sh
How do I run the ansible play?
From infosphere (this is important):
cd ansible
ansible-playbook -i hosts -f 50 --ask-vault-pass cluster.yml --skip-tags=install
The above command basically runs the default, updating-only play on all the cluster nodes, including infosphere.
I want to upgrade a specific part of the cluster, how do?
Run the corresponding ansible .yml playbooks from infosphere.
I would like to install a new version of something on the cluster, how do?
The current software names and versions are under [fullcluster:vars] in the hosts file in the ansible repository.
Change the version number to the latest available, and run the following command, replacing <software_name> with something like abinit or openmpi.
ansible-playbook -i hosts -f 50 cluster.yml --tags=install_<software_name>
The plays should automatically uninstall the old version for you when updating.
IMPORTANT NOTE: If you update pmix, you also need to reinstall slurm and openmpi.
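A pmix bump therefore means three install runs. The sketch below only prints them and runs nothing. It assumes the tags are named install_pmix, install_slurm and install_openmpi, following the install_<software_name> pattern; the function name pmix_upgrade_cmds and the slurm-before-openmpi ordering are my own assumptions, so check the playbooks before trusting it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the install plays to run after
# changing the pmix version in the hosts file.
# Assumption: the tag names follow the install_<software_name> pattern.
pmix_upgrade_cmds() {
    local name
    # pmix goes first, then the two packages that need reinstalling after it.
    for name in pmix slurm openmpi; do
        echo "ansible-playbook -i hosts -f 50 cluster.yml --tags=install_${name}"
    done
}

pmix_upgrade_cmds
```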
After a slurm upgrade, everything went down. How fix?
Check that the slurm version is the same everywhere.
ansible fullcluster -i hosts -m command -a "bash -c 'systemctl disable slurmd; systemctl enable slurmd; systemctl restart slurmd'"
(disable slurmd on infosphere afterwards)
Really, just check the logs yourself, listen to error messages, Google your way to victory.
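For the version check, one possible approach is to ask every node for its slurmd version and count the distinct answers. This is a sketch under assumptions: slurm_version_counts is a hypothetical name, and it expects one line per host with the version as the last word, which is roughly what ansible's one-line output (-o) of slurmd -V looks like.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts how many hosts report each slurm version.
# Expects one line per host on stdin with the version as the last field,
# for example from:
#   ansible fullcluster -i hosts -m command -a "slurmd -V" -o | slurm_version_counts
# Unreachable hosts will show up as junk entries, which is also worth knowing.
slurm_version_counts() {
    awk '{ count[$NF]++ } END { for (v in count) print count[v], v }' | sort -rn
}
```

More than one line of output means the versions differ somewhere.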
How to recreate the slurmdbd database (last resort)?
systemctl stop slurmdbd slurmctld && systemctl restart mariadb && mysql
CREATE DATABASE slurm;
If this command fails, skip to the end of this list
USE slurm;
SHOW TABLES;
For every table in the database, DROP TABLE <table_name>;
exit
systemctl start slurmdbd slurmctld
sacctmgr add cluster hpc
find /cluster -type d | /root/make_new_users.sh
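Typing DROP TABLE for every table by hand gets old fast. Here is a minimal sketch that generates the statements instead. drop_statements is a hypothetical helper, and it only prints SQL; nothing is dropped until you feed the output to mysql yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns table names on stdin into DROP TABLE statements.
# Nothing is executed here; review the output before piping it into mysql.
# Possible use, with slurmdbd and slurmctld stopped:
#   mysql -N -e 'SHOW TABLES' slurm | drop_statements
drop_statements() {
    local table
    while IFS= read -r table; do
        # Skip blank lines; quote the name with backticks for MySQL/MariaDB.
        [ -n "$table" ] && printf 'DROP TABLE `%s`;\n' "$table"
    done
    return 0
}
```

If the printed statements look sane, append "| mysql slurm" to the pipeline to actually run them.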