2018 Cephpocalypse


The Cephpocalypse was an incident in the fall of the 2018-2019 school year in which the Ceph cluster that hosted our central network storage went completely offline. The incident demonstrated the capability of the Sysadmin team and prompted us to start thinking about ways to remove that single point of failure (say, through a backup system).

The purpose of this document is to record our mistakes and remedial actions so that future generations may learn from them.

Background

After delays in obtaining approval for the new storage servers, we finally received the new G10 servers. In anticipation of eventually migrating our Ceph cluster to them, we mounted and prepared the new machines. One of those preparations was a needed upgrade to our production Ceph cluster, which went horribly wrong.

Cause

On a Sunday in mid-September of 2018, the Storage Lead began upgrading the component servers of our production Ceph cluster to the latest major version. We had been running jewel and needed to get to mimic. The Storage Lead upgraded these servers through two major releases in quick succession. A later independent review suggested that this rapid two-release upgrade was to blame for the Cephpocalypse.
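In hindsight, a simple guard between release hops might have caught the problem earlier. The sketch below is hypothetical (no such script existed at the time); it assumes the stock ceph CLI and an admin keyring on an admin host running luminous or newer (where ceph health and ceph versions are available), and it refuses to start the next single-release upgrade until the cluster is healthy and every daemon reports the same release.

```python
#!/usr/bin/env python3
# Hypothetical pre-flight check before the next single-release upgrade.
# Assumes the stock `ceph` CLI and an admin keyring on this host
# (luminous or newer, where `ceph versions` exists).
import json
import subprocess
import sys


def run(*args: str) -> str:
    """Run a command and return its stdout as stripped text."""
    return subprocess.check_output(args, text=True).strip()


def main() -> None:
    # `ceph health` prints HEALTH_OK / HEALTH_WARN / HEALTH_ERR.
    health = run("ceph", "health")
    if not health.startswith("HEALTH_OK"):
        sys.exit(f"Refusing to upgrade: cluster reports {health!r}")

    # `ceph versions` prints JSON listing the release string each running
    # daemon reports, e.g. "ceph version 12.2.x (...) luminous (stable)".
    versions = json.loads(run("ceph", "versions"))
    releases = {line.split()[-2] for line in versions.get("overall", {})}
    if len(releases) != 1:
        sys.exit(f"Mixed releases still running ({sorted(releases)}); "
                 "finish the current upgrade before starting the next one.")

    print(f"Cluster healthy and uniformly on {releases.pop()}; "
          "safe to begin the next single-release upgrade.")


if __name__ == "__main__":
    main()
```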

Reactions

From Sysadmins

When we received UptimeRobot notifications, we initially thought that the Storage Lead would be able to fix his mistake fairly quickly. When a fix did not materialize,

From other students

We were featured on tjToday. For most students not in the SysLab,

From Staff

Most TJ staff did not notice much disruption in our services since we were able to quickly restore our most public service, Ion.

Remedial Actions

Trying to fix Ceph

Un-Cephing everything

Moving things back to new Ceph

What we learned

  • It is important to keep off-site backups.

  • It is nice to have multiple people know how Ceph operates.

  • We lack contingency plans.

  • Teamwork is important.

  • We lack documentation.

Results

After more than two excruciating weeks, the Storage Lead recovered the data from the old cluster (albeit partially corrupted), and we began the process of moving everything back onto Ceph. This process continued over the following months.
