# Coder Disaster Recovery Plan

## Overview

This document outlines disaster recovery procedures for a Coder deployment, focusing on its critical components: the PostgreSQL database and persistent volume claims (PVCs). Following these procedures helps minimize downtime and data loss after system failures or catastrophic events.

## Critical Components

Coder relies on two primary components that require backup and recovery procedures:

1. **PostgreSQL Database**: Stores all user data, workspace state, template definitions, audit logs, and configuration
2. **Persistent Volume Claims (PVCs)**: Store workspace data, user home directories, and development artifacts

## Backup Procedures

### PostgreSQL Database Backup

#### Scheduled Automated Backups

1. **Configure Daily Database Backups**

   ```bash
   # Example cron job for daily backups at 2:00 AM
   0 2 * * * /path/to/backup-script.sh
   ```

2. **Backup Script Contents**

   ```bash
   #!/bin/bash
   # PostgreSQL backup script for Coder

   # Abort on any error, including a failed pg_dump inside the pipeline,
   # so a broken dump is never logged as complete or used to prune old backups
   set -euo pipefail

   # Configuration
   BACKUP_DIR="/path/to/backups"
   RETENTION_DAYS=30
   PG_USER="coder"
   PG_DB="coder"
   TIMESTAMP=$(date +%Y%m%d-%H%M%S)
   BACKUP_FILE="${BACKUP_DIR}/coder-db-${TIMESTAMP}.sql.gz"

   # Create backup directory if it doesn't exist
   mkdir -p "$BACKUP_DIR"

   # For external PostgreSQL
   pg_dump -U "$PG_USER" "$PG_DB" | gzip > "$BACKUP_FILE"

   # For managed PostgreSQL services (e.g., AWS RDS), use their native backup tools
   # aws rds create-db-snapshot --db-instance-identifier coder-instance --db-snapshot-identifier coder-snapshot-${TIMESTAMP}

   # For Kubernetes-hosted PostgreSQL
   # kubectl exec -n coder postgres-pod -- pg_dump -U "$PG_USER" "$PG_DB" | gzip > "$BACKUP_FILE"

   # Remove backups older than retention period
   find "$BACKUP_DIR" -name "coder-db-*.sql.gz" -type f -mtime "+${RETENTION_DAYS}" -delete

   # Log backup completion
   echo "Backup completed: $BACKUP_FILE" >> "${BACKUP_DIR}/backup.log"
   ```

3. 
   **Pre-Update Backups**

   Always take a full database backup before upgrading Coder versions:

   ```bash
   # For external PostgreSQL
   pg_dump -U coder coder | gzip > coder-db-pre-update-$(date +%Y%m%d).sql.gz

   # For Kubernetes-hosted PostgreSQL
   kubectl exec -n coder postgres-pod -- pg_dump -U coder coder | gzip > coder-db-pre-update-$(date +%Y%m%d).sql.gz
   ```

#### Backup Verification

1. **Validate Backups Regularly**

   ```bash
   # Create a test database
   createdb -U coder coder_test

   # Restore backup to test database
   gunzip -c latest-backup.sql.gz | psql -U coder coder_test

   # Verify data integrity with sample queries
   psql -U coder coder_test -c "SELECT COUNT(*) FROM users;"
   psql -U coder coder_test -c "SELECT COUNT(*) FROM workspaces;"

   # Drop test database after verification
   dropdb -U coder coder_test
   ```

2. **Store Backups Off-Site**

   ```bash
   # Example for copying to a remote backup server
   rsync -avz --delete /path/to/backups/ backup-server:/backup/coder/database/

   # Example for copying to S3
   aws s3 sync /path/to/backups/ s3://coder-backups/database/
   ```

### Persistent Volume Claims Backup

1. **Identify Critical PVCs**

   ```bash
   # List all PVCs used by Coder workspaces
   kubectl get pvc -n coder-workspaces
   ```

2. **Configure Snapshot Schedule**

   For cloud-based Kubernetes clusters, use CSI volume snapshots:

   ```yaml
   # Example VolumeSnapshotClass configuration
   apiVersion: snapshot.storage.k8s.io/v1
   kind: VolumeSnapshotClass
   metadata:
     name: coder-snapshot-class
   driver:
   deletionPolicy: Retain
   parameters:
     # Driver-specific parameters
   ```

3. 
   **Create Automated PVC Snapshots**

   ```yaml
   # Example CronJob for PVC snapshots
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: pvc-snapshots
     namespace: coder
   spec:
     schedule: "0 3 * * *"  # Daily at 3:00 AM
     jobTemplate:
       spec:
         template:
           spec:
             serviceAccountName: snapshot-creator
             restartPolicy: OnFailure  # Job pods must use OnFailure or Never
             containers:
             - name: snapshot-creator
               image: bitnami/kubectl:latest
               command:
               - /bin/sh
               - -c
               - |
                 for pvc in $(kubectl get pvc -n coder-workspaces -o jsonpath='{.items[*].metadata.name}'); do
                   timestamp=$(date +%Y%m%d-%H%M%S)
                   kubectl create -f - <<EOF
                 apiVersion: snapshot.storage.k8s.io/v1
                 kind: VolumeSnapshot
                 metadata:
                   name: snapshot-${pvc}-${timestamp}
                   namespace: coder-workspaces
                 spec:
                   volumeSnapshotClassName: coder-snapshot-class
                   source:
                     persistentVolumeClaimName: ${pvc}
                 EOF
                 done
   ```

## Recovery Procedures

### PostgreSQL Database Recovery

1. **Stop Coder Services**

   Stop Coder before restoring so that nothing writes to the database mid-restore, for example with `kubectl scale deployment coder --replicas=0 -n coder` on Kubernetes or `systemctl stop coder` on other deployments.

2. **Restore Database**

   For external PostgreSQL:

   ```bash
   # Restore from backup
   gunzip -c /path/to/backups/coder-db-<timestamp>.sql.gz | psql -U postgres coder
   ```

   For Kubernetes-hosted PostgreSQL:

   ```bash
   # Copy backup file to pod
   kubectl cp /path/to/backups/coder-db-<timestamp>.sql.gz coder/postgres-pod:/tmp/

   # Create empty database (if needed)
   kubectl exec -n coder postgres-pod -- createdb -U postgres coder

   # Restore from backup
   kubectl exec -n coder postgres-pod -- bash -c "gunzip -c /tmp/coder-db-<timestamp>.sql.gz | psql -U postgres coder"
   ```

3. **Verify Database Integrity**

   ```bash
   # Run basic checks
   psql -U postgres coder -c "SELECT COUNT(*) FROM users;"
   psql -U postgres coder -c "SELECT COUNT(*) FROM workspaces;"
   ```

4. **Restart Coder Services**

   ```bash
   # For Kubernetes deployments
   kubectl scale deployment coder --replicas=3 -n coder

   # For other deployments
   systemctl start coder
   ```

#### Point-in-Time Recovery

For managed PostgreSQL services that support point-in-time recovery:

```bash
# Example for AWS RDS
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier coder-production \
  --target-db-instance-identifier coder-recovery \
  --restore-time 2023-06-01T13:15:00Z
```

### Persistent Volume Claims Recovery

1. 
   **Restore Volumes from Snapshots**

   ```yaml
   # Example for restoring a PVC from snapshot
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: restored-workspace-home
     namespace: coder-workspaces
   spec:
     dataSource:
       name: snapshot-workspace-home-20230601-235959
       kind: VolumeSnapshot
       apiGroup: snapshot.storage.k8s.io
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 50Gi
   ```

2. **Re-associate PVCs with Workspaces**

   Update workspace manifests to use the restored PVCs:

   ```bash
   # Example command to modify workspace deployment
   kubectl patch deployment workspace-deployment -n coder-workspaces --patch '
   {
     "spec": {
       "template": {
         "spec": {
           "volumes": [
             {
               "name": "home",
               "persistentVolumeClaim": {
                 "claimName": "restored-workspace-home"
               }
             }
           ]
         }
       }
     }
   }'
   ```

## Full Cluster Recovery

In case of complete cluster failure, follow these steps:

1. **Re-deploy Kubernetes Cluster**

   Use infrastructure-as-code tools (e.g., Terraform) to recreate the cluster:

   ```bash
   terraform apply -var-file=production.tfvars
   ```

2. **Install Coder**

   ```bash
   # Using Helm (the /v2 path serves the Coder v2 chart)
   helm repo add coder https://helm.coder.com/v2
   helm repo update
   helm install coder coder/coder -n coder --create-namespace -f values.yaml
   ```

3. **Restore PostgreSQL Database**

   Follow the PostgreSQL database recovery procedure above.

4. **Restore PVCs**

   Follow the PVC recovery procedure above.

5. **Verify System Integrity**

   ```bash
   # Check component health
   coder health

   # Verify template availability
   coder templates ls

   # Verify workspace functionality
   coder workspaces ls
   ```

6. **Perform User Acceptance Testing**

   Validate system functionality with sample user workflows:

   - Creating new workspaces
   - Connecting to existing workspaces
   - Running applications in workspaces
   - Accessing workspace file system

## Disaster Recovery Testing

Schedule regular disaster recovery testing to ensure the procedures work as expected:

1. 
   **Quarterly Recovery Simulations**

   - Simulate database failures
   - Practice full database restoration
   - Validate PVC recovery processes

2. **Annual Full-Scale DR Test**

   - Stand up separate cluster
   - Perform full recovery
   - Validate all system functionality

## Disaster Recovery Process Improvement

1. **Post-Incident Reviews**

   - Document all recovery actions taken
   - Identify areas for improvement
   - Update recovery procedures

2. **Recovery Process Updates**

   - Maintain this documentation
   - Update after major Coder version changes
   - Test procedures after significant infrastructure changes

## Additional Recommendations

1. **Database Encryption Key Backup**

   - If using [database encryption](https://coder.com/docs/v2/latest/admin/security/database-encryption), securely back up encryption keys
   - Store keys separately from database backups

2. **High Availability Configuration**

   - Deploy Coder with multiple replicas
   - Use managed PostgreSQL with high availability
   - Consider multi-region deployments for critical environments

3. **Monitoring and Alerting**

   - Configure alerts for backup failures
   - Monitor database and PVC storage usage
   - Set up proactive monitoring for system failures

4. **Documentation**

   - Maintain detailed environment configurations
   - Document provider-specific backup/restore procedures
   - Keep recovery contact list updated

Remember to adapt these procedures to your specific environment, cloud provider, and infrastructure setup.
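As one way to act on the "Configure alerts for backup failures" recommendation above, the following is a minimal sketch of a freshness check for the files written by the backup script. The function name `check_latest_backup` and the 26-hour default window are assumptions made for illustration, not part of Coder or of the original procedures. It only confirms that a recent `coder-db-*.sql.gz` file exists and passes `gzip -t`; it does not replace the restore-based verification described earlier.

```bash
#!/bin/bash
# Hypothetical helper: verify that a recent, intact database backup exists.
# Usage: check_latest_backup BACKUP_DIR [MAX_AGE_HOURS]
# Returns 0 when a backup newer than MAX_AGE_HOURS exists and is a valid
# gzip archive; returns 1 (with a message on stderr) otherwise.
check_latest_backup() {
  local backup_dir="$1"
  local max_age_hours="${2:-26}"   # assumed default: daily backup plus slack
  local max_age_minutes=$((max_age_hours * 60))
  local latest

  # Pick the last backup (by name) modified within the allowed window
  latest=$(find "$backup_dir" -maxdepth 1 -type f -name 'coder-db-*.sql.gz' \
    -mmin "-${max_age_minutes}" | sort | tail -n 1)

  if [ -z "$latest" ]; then
    echo "ALERT: no backup newer than ${max_age_hours}h in ${backup_dir}" >&2
    return 1
  fi

  # A truncated or corrupt archive fails the gzip integrity test
  if ! gzip -t "$latest" 2>/dev/null; then
    echo "ALERT: ${latest} failed the gzip integrity check" >&2
    return 1
  fi

  echo "OK: ${latest}"
}
```

A script like this could run from cron shortly after the 2:00 AM backup, with a non-zero exit status routed to whatever alerting channel the environment already uses.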