FAQ


Last updated 5 years ago

Because I'm tired of explaining the same things over and over again

Using the Cluster

  • How do I get my code on the cluster?

    1. From a workstation with your code on it:

      scp -rp folder_with_your_code/ infosphere:~
    2. Your code is now on infosphere

  • Where do I go to run code on the cluster?

    • From RAS or a workstation:

      ssh infosphere
  • How do I compile code on the cluster?

    • If using a single C/C++ file like a pleb:

      mpicc your_file.c -o your_binary # or mpicxx if using C++
    • If using CMake like a boss:

      cmake . -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
      make
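If you take the CMake route, a minimal CMakeLists.txt for the example above might look like this (project and file names are placeholders; MPI is supplied by the compiler wrappers passed on the command line, so no find_package(MPI) is needed):

```cmake
cmake_minimum_required(VERSION 3.10)
project(your_project C CXX)

# mpicc/mpicxx come in via -DCMAKE_C_COMPILER / -DCMAKE_CXX_COMPILER,
# so the target needs no MPI-specific configuration here.
add_executable(your_binary your_file.c)
```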
  • Do I need to compile my code on the cluster?

    • YES, there is no other option; binaries compiled elsewhere will never be compatible. This is not a challenge, this is a statement of fact. Do not complain if your a.out from the workstation doesn't work.

  • Now that I did all that, how do I run code on the cluster?

    • If using an MPI binary (most parallel computing classes), or just running a non-MPI binary n number of times:

      srun -N <number_of_machines> -n <number_of_processes> --mpi=pmix_v2 ./your_binary --your-binary-flags
    • Alternative (non-preferred) method of running an MPI binary:

      salloc -N <number_of_machines> -n <number_of_processes> mpiexec ./your_binary --your-binary-flags
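srun and salloc run jobs interactively; if you would rather queue a long job and log out, standard Slurm batch submission also works. A sketch (the #SBATCH values and binary name are placeholders):

```shell
#!/bin/bash
# job.sh -- submit with: sbatch job.sh ; watch it with: squeue
#SBATCH -N 2        # number of machines
#SBATCH -n 8        # number of processes
srun --mpi=pmix_v2 ./your_binary --your-binary-flags
```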
  • That gives me some error about not having an account! What gives?

    • Ask a sysadmin if they can make you an account on the cluster. Point them to this document if they don't know how.

  • My job just hangs forever!

    • Either someone is inconsiderately using the entire cluster, or the cluster is broken.

    • To check for the former, use the sinfo and squeue commands.
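When reading sinfo's STATE column: idle nodes are free, alloc nodes are taken, and down/drain nodes are broken. A small sketch of counting free nodes (the sample output below is made up; on the cluster you would pipe in `sinfo -h -o "%T %D"` instead):

```shell
# Sample "state count" output in the shape produced by: sinfo -h -o "%T %D"
sample='idle 10
alloc 4
down 2'

# Sum the node counts for lines whose state is "idle"
printf '%s\n' "$sample" | awk '$1 == "idle" {n += $2} END {print n+0}'   # prints 10
```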

Managing the Cluster

Or, "I was put in charge of your mess, Jack, you better have documented everything." Don't worry, I didn't. Keeps you on your toes.

  • What are the basic troubleshooting steps?

    • sinfo to view the state of the cluster

    • squeue to see what jobs are holding things up

    • Checking files in /var/log/slurm on infosphere and the affected nodes

  • Someone wants me to make an account for them!

    • Run this in /root on infosphere:

      cat your_user_list.txt | ./make_new_users.sh
    • From infosphere (this is important):

      cd ansible
      ansible-playbook -i hosts -f 50 --ask-vault-pass cluster.yml --skip-tags=install
    • The above command basically runs the default, updating-only play on all the cluster nodes, including infosphere.

  • I want to upgrade a specific part of the cluster, how do?

    • There are a couple partitions in ansible:

      • hpc: All the HPC nodes

      • hpcgpu: Just Zoidberg

      • ibm: All the Borg nodes

      • ibmgpu: All the Borg nodes with GPUs in them

      • clustermaster: infosphere itself

    • How do I run the play? Run the corresponding ansible .yml playbooks from infosphere.

  • I would like to install a new version of something on the cluster, how do?

    • Change the version number to the latest available and run the following command, replacing <software_name> with something like abinit or openmpi. The current software names and versions are under [fullcluster:vars] in the hosts file in the ansible repository.

      ansible-playbook -i hosts -f 50 cluster.yml --tags=install_<software_name>
    • The plays should automatically uninstall the old version for you when updating.

    • IMPORTANT NOTE: If you update pmix, you also need to reinstall slurm and openmpi.
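Following the install_<software_name> tag pattern above, a pmix bump therefore means three plays, in this order (same command shape as before; this assumes install tags exist for all three):

```shell
ansible-playbook -i hosts -f 50 cluster.yml --tags=install_pmix
ansible-playbook -i hosts -f 50 cluster.yml --tags=install_slurm    # built against pmix
ansible-playbook -i hosts -f 50 cluster.yml --tags=install_openmpi  # built against pmix
```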

  • After a slurm upgrade, everything went down. How fix?

    • Check that the slurm version is the same everywhere

    • ansible fullcluster -i hosts -m command -a "bash -c 'systemctl disable slurmd; systemctl enable slurmd; systemctl restart slurmd'" (disable slurmd on infosphere afterwards)

    • Really, just check the logs yourself, listen to error messages, Google your way to victory.
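One way to check the version everywhere at once is to reuse the ad-hoc ansible pattern from the bullet above (slurmd -V prints the installed Slurm version):

```shell
ansible fullcluster -i hosts -m command -a "slurmd -V"
```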

  • How to recreate the slurmdbd database (last resort)?

    • systemctl stop slurmdbd slurmctld && systemctl restart mariadb && mysql

      1. CREATE DATABASE slurm;

        • If this command fails, skip to the end of this list

      2. USE slurm;

      3. SHOW TABLES;

      4. For every table in the database, DROP TABLE <table_name>;

      5. exit

    • systemctl start slurmdbd slurmctld

    • sacctmgr add cluster hpc

    • find /cluster -type d | /root/make_new_users.sh
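If you are comfortable nuking the whole database rather than dropping tables one by one, steps 1-5 above collapse into a single statement (same effect, assuming nothing else lives in that database):

```shell
mysql -e 'DROP DATABASE IF EXISTS slurm; CREATE DATABASE slurm;'
```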

Note: everything here assumes you have root access on infosphere, and that you are running all these commands after getting root with Kerberos (kinit + ksu).
