# Coder Disaster Recovery Plan

## Overview

This document outlines procedures for disaster recovery of a Coder deployment, focusing on recovering the system's critical components: the PostgreSQL database and persistent volume claims (PVCs). Following these procedures will help minimize downtime and data loss in case of system failures or catastrophic events.

## Critical Components

Coder relies on two primary components that require backup and recovery procedures:

1. **PostgreSQL Database**: Stores all user data, workspace state, template definitions, audit logs, and configuration
2. **Persistent Volume Claims (PVCs)**: Store workspace data, user home directories, and development artifacts

## Backup Procedures

### PostgreSQL Database Backup

#### Scheduled Automated Backups

1. **Configure Daily Database Backups**

   ```bash
   # Example cron job for daily backups at 2:00 AM
   0 2 * * * /path/to/backup-script.sh
   ```

2. **Backup Script Contents**

   ```bash
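   #!/bin/bash
   # PostgreSQL backup script for Coder
   # Abort on any error, including a failed pg_dump inside a pipeline, so that
   # a failed run never logs "Backup completed".
   set -euo pipefail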
   
   # Configuration
   BACKUP_DIR="/path/to/backups"
   RETENTION_DAYS=30
   PG_USER="coder"
   PG_DB="coder"
   TIMESTAMP=$(date +%Y%m%d-%H%M%S)
   BACKUP_FILE="${BACKUP_DIR}/coder-db-${TIMESTAMP}.sql.gz"
   
   # Create backup directory if it doesn't exist
   mkdir -p "$BACKUP_DIR"
   
   # For external PostgreSQL
   pg_dump -U "$PG_USER" "$PG_DB" | gzip > "$BACKUP_FILE"
   
   # For managed PostgreSQL services (e.g., AWS RDS), use their native backup tools
   # aws rds create-db-snapshot --db-instance-identifier coder-instance --db-snapshot-identifier coder-snapshot-${TIMESTAMP}
   
   # For Kubernetes-hosted PostgreSQL
   # kubectl exec -n coder postgres-pod -- pg_dump -U "$PG_USER" "$PG_DB" | gzip > "$BACKUP_FILE"
   
   # Remove backups older than retention period
   find "$BACKUP_DIR" -name "coder-db-*.sql.gz" -type f -mtime +"$RETENTION_DAYS" -delete
   
   # Log backup completion
   echo "Backup completed: $BACKUP_FILE" >> "${BACKUP_DIR}/backup.log"
   ```

3. **Pre-Update Backups**

   Always take a full database backup before upgrading Coder, because an upgrade may apply database migrations that are hard to undo without one:

   ```bash
   # For external PostgreSQL
   pg_dump -U coder coder | gzip > coder-db-pre-update-$(date +%Y%m%d).sql.gz
   
   # For Kubernetes-hosted PostgreSQL
   kubectl exec -n coder postgres-pod -- pg_dump -U coder coder | gzip > coder-db-pre-update-$(date +%Y%m%d).sql.gz
   ```
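The backup script above writes its completion message without inspecting the file it produced. A minimal sketch of a sanity check follows; the function name `verify_backup` is hypothetical and not part of the original script. It only confirms that the file is non-empty and a valid gzip stream, so it cannot prove that the dump itself is complete.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a compressed dump before trusting it.

# verify_backup FILE  (hypothetical helper)
# Succeeds only if FILE exists, is non-empty, and is a valid gzip stream.
verify_backup() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
}
```

One way to use it is to gate the final log line of the backup script, for example `verify_backup "$BACKUP_FILE" && echo "Backup completed: $BACKUP_FILE"`. A restore into a test database remains the stronger check.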

#### Backup Verification

1. **Validate Backups Regularly**

   ```bash
   # Create a test database
   createdb -U coder coder_test
   
   # Restore backup to test database
   gunzip -c latest-backup.sql.gz | psql -U coder coder_test
   
   # Verify data integrity with sample queries
   psql -U coder coder_test -c "SELECT COUNT(*) FROM users;"
   psql -U coder coder_test -c "SELECT COUNT(*) FROM workspaces;"
   
   # Drop test database after verification
   dropdb -U coder coder_test
   ```

2. **Store Backups Off-Site**

   ```bash
   # Example for copying to a remote backup server
   # (no --delete: losing the local directory must not wipe the off-site copy)
   rsync -avz /path/to/backups/ backup-server:/backup/coder/database/
   
   # Example for copying to S3
   aws s3 sync /path/to/backups/ s3://coder-backups/database/
   ```
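The validation step above uses the placeholder `latest-backup.sql.gz`. A small sketch of how the newest scheduled backup could be located follows. It assumes the `coder-db-YYYYMMDD-HHMMSS.sql.gz` naming used by the backup script, and the helper name `latest_backup` is hypothetical.

```bash
#!/usr/bin/env bash
# latest_backup DIR  (hypothetical helper)
# Prints the newest scheduled backup in DIR. The timestamp embedded in the
# file name sorts lexicographically, so a plain sort is enough. The [0-9]
# in the pattern skips differently named files such as pre-update backups.
latest_backup() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f -name 'coder-db-[0-9]*.sql.gz' | sort | tail -n 1
}
```

The result can be fed to the restore test, for example `gunzip -c "$(latest_backup /path/to/backups)" | psql -U coder coder_test`.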

### Persistent Volume Claims Backup

1. **Identify Critical PVCs**

   ```bash
   # List all PVCs used by Coder workspaces
   kubectl get pvc -n coder-workspaces
   ```

2. **Configure Snapshot Schedule**

   On clusters whose CSI driver supports snapshots, and where the snapshot CRDs and snapshot controller are installed, use CSI volume snapshots:

   ```yaml
   # Example VolumeSnapshotClass configuration
   apiVersion: snapshot.storage.k8s.io/v1
   kind: VolumeSnapshotClass
   metadata:
     name: coder-snapshot-class
   driver: <your-csi-driver>
   deletionPolicy: Retain
   parameters:
     # Driver-specific parameters
   ```

3. **Create Automated PVC Snapshots**

   ```yaml
   # Example CronJob for PVC snapshots
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: pvc-snapshots
     namespace: coder
   spec:
     schedule: "0 3 * * *"  # Daily at 3:00 AM
     jobTemplate:
       spec:
         template:
           spec:
             serviceAccountName: snapshot-creator
             containers:
             - name: snapshot-creator
               image: bitnami/kubectl:latest
               command:
               - /bin/sh
               - -c
               - |
                  # The heredoc body and its EOF terminator stay at the block's base
                  # indentation so that EOF starts in column 0 once YAML strips it;
                  # an indented EOF would never terminate the heredoc.
                  timestamp=$(date +%Y%m%d-%H%M%S)
                  for pvc in $(kubectl get pvc -n coder-workspaces -o jsonpath='{.items[*].metadata.name}'); do
                  kubectl create -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: snapshot-${pvc}-${timestamp}
                    namespace: coder-workspaces
                  spec:
                    volumeSnapshotClassName: coder-snapshot-class
                    source:
                      persistentVolumeClaimName: ${pvc}
                  EOF
                  done
             restartPolicy: OnFailure
   ```

4. **Backup Critical User Data**

   For additional protection, configure workspaces to periodically back up important user data:

   ```bash
   #!/bin/bash
   # Example script to run inside workspaces for backing up important data to external storage
   
   # Configuration
   BACKUP_DEST="s3://coder-user-backups/${USER}/"
   
   # Backup important directories
   tar czf /tmp/workspace-backup.tar.gz ~/projects ~/important-configs
   
   # Upload to external storage
   aws s3 cp /tmp/workspace-backup.tar.gz ${BACKUP_DEST}
   
   # Cleanup
   rm /tmp/workspace-backup.tar.gz
   ```
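The snapshot CronJob above creates a new `VolumeSnapshot` per PVC every day and never removes old ones. A minimal sketch of selecting snapshots past a retention window follows. It assumes the `snapshot-<pvc>-YYYYMMDD-HHMMSS` naming from the CronJob; the helper name `snapshots_older_than` and the 14-day window in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash
# snapshots_older_than CUTOFF_YYYYMMDD  (hypothetical helper)
# Reads snapshot names on stdin and prints those whose embedded date is
# strictly older than the cutoff. Names without a timestamp suffix are skipped.
snapshots_older_than() {
  local cutoff="$1" name stamp
  while IFS= read -r name; do
    # Extract YYYYMMDD from a trailing -YYYYMMDD-HHMMSS suffix.
    stamp=$(printf '%s\n' "$name" | sed -n 's/.*-\([0-9]\{8\}\)-[0-9]\{6\}$/\1/p')
    [ -n "$stamp" ] && [ "$stamp" -lt "$cutoff" ] && printf '%s\n' "$name"
  done
  return 0
}

# Possible usage (GNU date and xargs assumed; not executed here):
#   kubectl get volumesnapshot -n coder-workspaces \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
#     | snapshots_older_than "$(date -d '14 days ago' +%Y%m%d)" \
#     | xargs -r kubectl delete volumesnapshot -n coder-workspaces
```

With `deletionPolicy: Retain`, deleting a `VolumeSnapshot` object leaves the underlying snapshot in the storage backend, so storage-side cleanup has to be handled separately.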

## Recovery Procedures

### PostgreSQL Database Recovery

#### Full Database Restore

1. **Stop Coder Services**

   ```bash
   # For Kubernetes deployments
   kubectl scale deployment coder --replicas=0 -n coder
   
   # For other deployments
   systemctl stop coder
   ```

2. **Restore Database**

   For external PostgreSQL:

   ```bash
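   # Create an empty database owned by the application role (if needed)
   createdb -U postgres -O coder coder
   
   # Restore from backup, stopping at the first error instead of continuing
   gunzip -c /path/to/backups/coder-db-<timestamp>.sql.gz | psql -U postgres -v ON_ERROR_STOP=1 coder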
   ```

   For Kubernetes-hosted PostgreSQL:

   ```bash
   # Copy backup file to pod
   kubectl cp /path/to/backups/coder-db-<timestamp>.sql.gz coder/postgres-pod:/tmp/
   
   # Create an empty database owned by the application role (if needed)
   kubectl exec -n coder postgres-pod -- createdb -U postgres -O coder coder
   
   # Restore from backup, stopping at the first error instead of continuing
   kubectl exec -n coder postgres-pod -- bash -c "gunzip -c /tmp/coder-db-<timestamp>.sql.gz | psql -U postgres -v ON_ERROR_STOP=1 coder"
   ```

3. **Verify Database Integrity**

   ```bash
   # Run basic checks
   psql -U postgres coder -c "SELECT COUNT(*) FROM users;"
   psql -U postgres coder -c "SELECT COUNT(*) FROM workspaces;"
   ```

4. **Restart Coder Services**

   ```bash
   # For Kubernetes deployments (3 is an example; use your normal replica count)
   kubectl scale deployment coder --replicas=3 -n coder
   
   # For other deployments
   systemctl start coder
   ```
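A replica count typed in by hand at restart can drift from what the deployment actually ran with. A minimal sketch of recording the count before scaling down and reusing it afterwards follows. The function names and the `REPLICA_FILE` location are hypothetical; the deployment name `coder` and namespace `coder` are taken from the steps above.

```bash
#!/usr/bin/env bash
# Hypothetical location for remembering the replica count across the restore.
REPLICA_FILE="${REPLICA_FILE:-/tmp/coder-replicas}"

# Record the current replica count, then scale Coder down to zero.
stop_coder() {
  kubectl get deployment coder -n coder -o jsonpath='{.spec.replicas}' > "$REPLICA_FILE" || return 1
  kubectl scale deployment coder --replicas=0 -n coder
}

# Scale Coder back up to the recorded replica count.
start_coder() {
  local replicas
  replicas=$(cat "$REPLICA_FILE") || return 1
  kubectl scale deployment coder --replicas="$replicas" -n coder
}
```

`stop_coder` would run before the database restore and `start_coder` after the integrity checks pass.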

#### Point-in-Time Recovery

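For managed PostgreSQL services that support point-in-time recovery, restore to a moment just before the incident. On AWS RDS this creates a new database instance rather than rewinding the existing one, so Coder's PostgreSQL connection URL must be pointed at the new instance once it is available: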

```bash
# Example for AWS RDS
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier coder-production \
  --target-db-instance-identifier coder-recovery \
  --restore-time 2023-06-01T13:15:00Z
```
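After a point-in-time restore on RDS, the recovered data lives on a new instance with a new endpoint. The sketch below shows one way to assemble a connection URL for it. The helper name `pg_url`, port 5432, and `sslmode=require` are assumptions, and the password is assumed to contain no characters that need URL encoding.

```bash
#!/usr/bin/env bash
# pg_url USER PASSWORD HOST DBNAME  (hypothetical helper)
# Prints a PostgreSQL connection URL; port and sslmode are assumed defaults.
pg_url() {
  printf 'postgres://%s:%s@%s:5432/%s?sslmode=require\n' "$1" "$2" "$3" "$4"
}

# Possible usage once the restore has been started (not executed here):
#   aws rds wait db-instance-available --db-instance-identifier coder-recovery
#   host=$(aws rds describe-db-instances --db-instance-identifier coder-recovery \
#     --query 'DBInstances[0].Endpoint.Address' --output text)
#   pg_url coder "$PGPASSWORD" "$host" coder
```

The resulting URL would then replace the value Coder currently uses for its database connection, for example in the secret referenced by the Helm values.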

### Persistent Volume Claims Recovery

1. **Restore Volumes from Snapshots**

   ```yaml
   # Example for restoring a PVC from snapshot
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: restored-workspace-home
     namespace: coder-workspaces
   spec:
     dataSource:
       name: snapshot-workspace-home-20230601-235959
       kind: VolumeSnapshot
       apiGroup: snapshot.storage.k8s.io
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 50Gi
   ```

2. **Re-associate PVCs with Workspaces**

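   Update workspace manifests to use the restored PVCs. If the workspace resources are managed by a Coder template, make the same change in the template as well; otherwise the next workspace build can revert a manual patch: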

   ```bash
   # Example command to modify workspace deployment
   kubectl patch deployment workspace-deployment -n coder-workspaces --patch '
   {
     "spec": {
       "template": {
         "spec": {
           "volumes": [
             {
               "name": "home",
               "persistentVolumeClaim": {
                 "claimName": "restored-workspace-home"
               }
             }
           ]
         }
       }
     }
   }'
   ```
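When many workspaces need to be restored, writing one PVC manifest per snapshot by hand is error-prone. A minimal sketch that renders the restore manifest shown above from three arguments follows. The helper name `render_restore_pvc` is hypothetical, and the `coder-workspaces` namespace and `ReadWriteOnce` access mode are copied from the example.

```bash
#!/usr/bin/env bash
# render_restore_pvc PVC_NAME SNAPSHOT_NAME SIZE  (hypothetical helper)
# Prints a PersistentVolumeClaim manifest that restores from the snapshot.
render_restore_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: coder-workspaces
spec:
  dataSource:
    name: $2
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $3
EOF
}

# Possible usage (not executed here); the requested size must be at least
# the snapshot's restore size:
#   size=$(kubectl get volumesnapshot "$snap" -n coder-workspaces \
#     -o jsonpath='{.status.restoreSize}')
#   render_restore_pvc restored-workspace-home "$snap" "$size" | kubectl apply -f -
```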

## Full Cluster Recovery

In case of complete cluster failure, follow these steps:

1. **Re-deploy Kubernetes Cluster**

   Use infrastructure as code tools (e.g., Terraform) to recreate the cluster:

   ```bash
   terraform apply -var-file=production.tfvars
   ```

2. **Install Coder**

   ```bash
   # Using Helm; pin --version to the release that produced the database backup
   helm repo add coder-v2 https://helm.coder.com/v2
   helm repo update
   helm install coder coder-v2/coder -n coder --create-namespace -f values.yaml --version <chart-version>
   ```

3. **Restore PostgreSQL Database**

   Follow the PostgreSQL database recovery procedure above.

4. **Restore PVCs**

   Follow the PVC recovery procedure above.

5. **Verify System Integrity**

   ```bash
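   # Check that the Coder server responds (replace the URL with your access URL)
   curl -fsS https://coder.example.com/healthz

   # Verify template availability
   coder templates list

   # Verify that workspaces are listed for all users
   coder list --all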
   ```

6. **Perform User Acceptance Testing**

   Validate system functionality with sample user workflows:
   - Creating new workspaces
   - Connecting to existing workspaces
   - Running applications in workspaces
   - Accessing workspace file system
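Right after a full recovery, the Coder pods may need some time before they answer requests, so a single health check can fail spuriously. A minimal sketch of a generic retry wrapper follows. The helper name `retry` is hypothetical, and the `/healthz` path and `CODER_URL` variable in the usage comment are assumptions about the deployment.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Possible usage (not executed here):
#   retry 30 10 curl -fsS "${CODER_URL}/healthz"
```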

## Disaster Recovery Testing

Schedule regular disaster recovery testing to ensure the procedures work as expected:

1. **Quarterly Recovery Simulations**
   - Simulate database failures
   - Practice full database restoration
   - Validate PVC recovery processes

2. **Annual Full-Scale DR Test**
   - Stand up separate cluster
   - Perform full recovery
   - Validate all system functionality

## Disaster Recovery Process Improvement

1. **Post-Incident Reviews**
   - Document all recovery actions taken
   - Identify areas for improvement
   - Update recovery procedures

2. **Recovery Process Updates**
   - Maintain this documentation
   - Update after major Coder version changes
   - Test procedures after significant infrastructure changes

## Additional Recommendations

1. **Database Encryption Key Backup**
   - If using [database encryption](https://coder.com/docs/v2/latest/admin/security/database-encryption), securely back up encryption keys
   - Store keys separately from database backups

2. **High Availability Configuration**
   - Deploy Coder with multiple replicas
   - Use managed PostgreSQL with high availability
   - Consider multi-region deployments for critical environments

3. **Monitoring and Alerting**
   - Configure alerts for backup failures
   - Monitor database and PVC storage usage
   - Set up proactive monitoring for system failures

4. **Documentation**
   - Maintain detailed environment configurations
   - Document provider-specific backup/restore procedures
   - Keep recovery contact list updated
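An alert for backup failures needs a signal to fire on. One simple signal is the age of the newest backup file. A minimal sketch follows; the helper name `backup_is_fresh` is hypothetical, the file pattern matches the backup script in this document, and the alerting command is left to your monitoring stack.

```bash
#!/usr/bin/env bash
# backup_is_fresh DIR MAX_AGE_MINUTES  (hypothetical helper)
# Succeeds if DIR contains at least one coder-db-*.sql.gz file modified
# within the last MAX_AGE_MINUTES minutes.
backup_is_fresh() {
  local dir="$1" max_age="$2"
  [ -n "$(find "$dir" -maxdepth 1 -type f -name 'coder-db-*.sql.gz' -mmin "-$max_age" | head -n 1)" ]
}

# Possible usage from cron, allowing 26 hours for a daily backup
# (not executed here; "send-alert" stands in for your alerting command):
#   backup_is_fresh /path/to/backups 1560 || send-alert "Coder DB backup is stale"
```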

Remember to adapt these procedures to your specific environment, cloud provider, and infrastructure setup.