Because I'm tired of explaining the same things over and over again

Using the Cluster

  • How do I get my code on the cluster?

    1. From a workstation with your code on it:

      scp -rp folder_with_your_code/ infosphere:~
    2. Your code is now on infosphere

  • Where do I go to run code on the cluster?

    • Log in to infosphere (ssh infosphere), where your code now lives. The commands below are run from there.

  • How do I compile code on the cluster?

    • If using a single C/C++ file like a pleb:

      mpicc your_file.c # or mpicxx if using C++
    • If using CMake like a boss:

      cmake . -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
  • Do I need to compile my code on the cluster?

    • YES. There is no other option; binaries built elsewhere will never be compatible. This is not a challenge, this is a statement of fact. Do not complain if your a.out from the workstation doesn't work.

  • Now that I did all that, how do I run code on the cluster?

    • If using an MPI binary (most parallel computing classes), or just running a non-MPI binary n number of times:

      srun -N <number_of_machines> -n <number_of_processes> --mpi=pmix_v2 ./your_binary --your-binary-flags
    • Alternative (non-preferred) method of running an MPI binary:

      salloc -N <number_of_machines> -n <number_of_processes> mpiexec ./your_binary --your-binary-flags
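
A minimal sketch of wrapping the preferred srun line in a shell function, so the numbers are checked before anything is submitted. The name cluster_run and the DRY_RUN switch are made up for illustration; the srun flags are the ones from the answer above.

```shell
# Hypothetical helper: build the srun line from the answer above.
# Set DRY_RUN=1 to print the command instead of running it.
cluster_run() {
    if [ "$#" -lt 3 ]; then
        echo "usage: cluster_run <machines> <processes> <binary> [flags...]" >&2
        return 2
    fi
    machines=$1
    processes=$2
    shift 2
    # More machines than processes leaves machines with nothing to run,
    # so treat that as a mistake and stop early.
    if [ "$processes" -lt "$machines" ]; then
        echo "need at least one process per machine" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo srun -N "$machines" -n "$processes" --mpi=pmix_v2 "$@"
    else
        srun -N "$machines" -n "$processes" --mpi=pmix_v2 "$@"
    fi
}
```

For example, DRY_RUN=1 cluster_run 2 8 ./your_binary --your-binary-flags prints the command for 8 processes over 2 machines without submitting it.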
  • That gives me some error about not having an account! What gives?

    • Ask a sysadmin if they can make you an account on the cluster. Point them to this document if they don't know how.

  • My job just hangs forever!

    • Either someone is inconsiderately using the entire cluster, or the cluster is broken.

    • To check the former, use the sinfo and squeue commands.
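
To see at a glance who is holding the nodes, one option is to total squeue's per-job node counts by user. This is a sketch, not an official tool: nodes_per_user is a made-up name, and it only assumes input lines of the form "user node_count".

```shell
# Sum node counts per user from "user node_count" lines on stdin,
# biggest consumer first.
nodes_per_user() {
    awk '{ nodes[$1] += $2 } END { for (u in nodes) print u, nodes[u] }' |
        sort -k2,2nr -k1,1
}

# On infosphere (not run here), feed it the running jobs:
#   squeue -h -t RUNNING -o '%u %D' | nodes_per_user
```

Here -h drops the header, -t RUNNING keeps only running jobs, and the -o format prints the user name (%u) and node count (%D) for each job.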

Managing the Cluster

Or, "I was put in charge of your mess, Jack, you better have documented everything." Don't worry, I didn't. Keeps you on your toes.

Note: everything here expects that you have root access on infosphere and are running these commands after getting root with Kerberos kinit + ksu.

  • What are the basic troubleshooting steps?

    • sinfo to view the state of the cluster

    • squeue to see what jobs are holding things up

    • Checking files in /var/log/slurm on infosphere and the affected nodes
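
As a first pass over the sinfo output, a small filter can pull out the nodes worth looking at. A sketch under assumptions: unhealthy_nodes is a made-up name, and it treats any state starting with down, drain, or fail as unhealthy.

```shell
# Read "node state" lines on stdin and print the nodes that look unhealthy.
# States can carry suffixes (for example "down*"), so match on the prefix.
unhealthy_nodes() {
    awk '$2 ~ /^(down|drain|fail)/ { print $1 }' | sort -u
}

# On infosphere (not run here):
#   sinfo -N -h -o '%N %T' | unhealthy_nodes
```

The sort -u is there because a node that belongs to several partitions shows up once per partition in sinfo -N output. Those nodes are where to start reading /var/log/slurm.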

  • Someone wants me to make an account for them!

    • Run this in /root on infosphere:

      cat your_user_list.txt | ./make_new_users.sh
  • How do I run the ansible play?

    • From infosphere (this is important):

      cd ansible
      ansible-playbook -i hosts -f 50 --ask-vault-pass cluster.yml --skip-tags=install
    • The above command basically runs the default, updating-only play on all the cluster nodes, including infosphere.

  • I want to upgrade a specific part of the cluster, how do?

    • There are several partitions in ansible:

      • hpc: All the HPC nodes

      • hpcgpu: Just Zoidberg

      • ibm: All the Borg nodes

      • ibmgpu: All the Borg nodes with GPUs in them

      • clustermaster: infosphere itself

    • Run the corresponding ansible .yml playbooks from infosphere
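
The per-partition playbook file names aren't listed here, so the sketch below swaps in a different technique: running the default play with ansible-playbook's --limit option. It assumes the partition names above are host groups in the hosts inventory, which this document doesn't confirm. partition_play is a made-up name, and it only prints the command.

```shell
# Hypothetical helper: print the update-only play restricted to one partition.
partition_play() {
    case "$1" in
        hpc|hpcgpu|ibm|ibmgpu|clustermaster)
            echo "ansible-playbook -i hosts -f 50 --ask-vault-pass cluster.yml --skip-tags=install --limit $1"
            ;;
        *)
            echo "unknown partition: $1" >&2
            return 1
            ;;
    esac
}
```

Running partition_play ibmgpu prints the command to paste from the ansible directory on infosphere. Unknown names are rejected rather than passed through.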

  • I would like to install a new version of something on the cluster, how do?

    • The current software names and versions are under [fullcluster:vars] in the hosts file in the ansible repository.

    • Change the version number to the latest available, then run the following command, replacing <software_name> with something like abinit or openmpi:

      ansible-playbook -i hosts -f 50 cluster.yml --tags=install_<software_name>
    • The plays should automatically uninstall the old version for you when updating.

    • IMPORTANT NOTE: If you update pmix, you also need to reinstall slurm and openmpi.
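
The pmix rule is easy to forget, so here is a sketch that builds it into the tag list. install_tags and install_command are made-up names. It assumes the tags follow the install_<software_name> pattern above and that a comma-separated --tags list is acceptable for this play. The order in which the play runs the tagged tasks comes from the playbook itself, not from the tag order.

```shell
# Hypothetical helper: expand a software name into the install tags it needs.
install_tags() {
    case "$1" in
        # Updating pmix means slurm and openmpi must be reinstalled too.
        pmix) echo "install_pmix,install_slurm,install_openmpi" ;;
        "")   echo "usage: install_tags <software_name>" >&2; return 2 ;;
        *)    echo "install_$1" ;;
    esac
}

# Print (not run) the ansible-playbook command for that software.
install_command() {
    tags=$(install_tags "$1") || return
    echo "ansible-playbook -i hosts -f 50 cluster.yml --tags=$tags"
}
```

install_command abinit prints the single-tag command shown above, while install_command pmix prints one command with all three tags.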

  • After a slurm upgrade, everything went down. How fix?

    • Check that the slurm version is the same everywhere

    • Re-enable and restart slurmd on every node (disable slurmd on infosphere afterwards):

      ansible fullcluster -i hosts -m command -a "bash -c 'systemctl disable slurmd; systemctl enable slurmd; systemctl restart slurmd'"

    • Really, just check the logs yourself, listen to error messages, Google your way to victory.
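
For the version check, a sketch of a comparison step. It assumes you have already collected one "host version" line per node, for instance from the output of slurmd -V on each machine; how you collect them is up to you. same_version_everywhere is a made-up name.

```shell
# Read "host version" lines on stdin. Succeed if every host reports the same
# version; otherwise print each version with its hosts and fail.
same_version_everywhere() {
    awk '
        { count[$2]++; hosts[$2] = hosts[$2] " " $1 }
        END {
            n = 0
            for (v in count) n++
            if (n > 1) {
                for (v in count) print v ":" hosts[v]
                exit 1
            }
        }
    '
}
```

A non-zero exit status means at least two versions are in play, and the printed lines say which hosts are on which one.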

  • How to recreate the slurmdbd database (last resort)?

    • systemctl stop slurmdbd slurmctld && systemctl restart mariadb && mysql

      1. CREATE DATABASE slurm;

        • If this command succeeds, the database did not exist before and is now empty, so skip to the end of this list

      2. USE slurm;

      3. SHOW TABLES;

      4. For every table in the database, DROP TABLE <table_name>;

      5. exit

    • systemctl start slurmdbd slurmctld

    • sacctmgr add cluster hpc

    • find /cluster -type d | /root/make_new_users.sh
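
Step 4 of the mysql list above gets tedious by hand. A sketch of generating the DROP TABLE statements from a list of table names, one per line; drop_table_statements is a made-up name. It only prints SQL, so you can read it before running anything.

```shell
# Turn table names on stdin (one per line) into DROP TABLE statements.
# Blank lines are skipped. Names are backtick-quoted for MySQL/MariaDB.
drop_table_statements() {
    awk 'NF { printf "DROP TABLE `%s`;\n", $1 }'
}

# On infosphere (not run here), with slurmdbd and slurmctld stopped:
#   mysql -N -B -e 'SHOW TABLES' slurm | drop_table_statements
```

The -N and -B options make mysql print bare table names without a header or box drawing. Review the generated statements, then pipe them into mysql slurm if they look right.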
