Runbooks - Incident Response

This document consolidates operational procedures for handling common production incidents.

1. Service Down

Symptoms

Clients show error messages, or monitoring reports a 'Red' status.

Actions

  1. Connect to the server via SSH.
  2. Check container status: docker ps (use docker ps -a to also list stopped containers, which plain docker ps hides).
  3. If a container is down, check logs: docker logs <container_name> --tail 100.
  4. Restart the service: docker-compose restart <service_name>.
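The steps above can be sketched as a pair of shell helpers. This is a minimal sketch, not the team's actual tooling: the function names and the COMPOSE_CMD variable are hypothetical, and the container and service names are whatever your deployment uses.

```shell
#!/usr/bin/env bash
# Sketch only: function names and COMPOSE_CMD are illustrative assumptions.

# Print the names of containers that have exited.
list_stopped_containers() {
  docker ps -a --filter "status=exited" --format '{{.Names}}'
}

# Show the last 100 log lines of a container, then restart its compose service.
# $1 = container name, $2 = compose service name
inspect_and_restart() {
  local container="$1" service="$2"
  docker logs --tail 100 "$container"
  # Defaults to docker-compose as used in this runbook; set
  # COMPOSE_CMD="docker compose" if the host uses the Compose v2 plugin.
  ${COMPOSE_CMD:-docker-compose} restart "$service"
}
```

Source the file and call, for example, inspect_and_restart <container_name> <service_name>. Reading the logs before restarting matters because a restart can hide the original failure.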

2. High Database Load

Symptoms

System-wide slowness, API request timeouts.

Investigation

  1. Check active connections in MongoDB (via Mongo Express or CLI).
  2. Search for slow queries in the logs.
  3. Ensure appropriate indexes exist for all common queries.
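The investigation steps above could look roughly like this from the CLI. This is a sketch under several assumptions: mongosh is installed, MONGO_URI is a placeholder for the real connection string, the account may run serverStatus and currentOp, and the server writes MongoDB 4.4+ structured logs (where slow operations are logged with the message "Slow query"). All function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: MONGO_URI and all function names are illustrative assumptions.

# Print current and available connection counts.
mongo_connections() {
  mongosh "$MONGO_URI" --quiet --eval 'printjson(db.serverStatus().connections)'
}

# List active operations running for at least $1 seconds (default 5).
mongo_slow_ops() {
  local secs="${1:-5}"
  mongosh "$MONGO_URI" --quiet --eval "printjson(db.currentOp({ active: true, secs_running: { \$gte: ${secs} } }).inprog)"
}

# Count "Slow query" entries in a MongoDB log file.
# $1 = path to the log file (assumes 4.4+ structured logging)
slow_query_count() {
  grep -c 'Slow query' "$1"
}

# List the indexes defined on a collection.
# $1 = collection name
mongo_indexes() {
  mongosh "$MONGO_URI" --quiet --eval "printjson(db.getCollection('$1').getIndexes())"
}
```

Comparing the query shapes from the slow-query entries against the mongo_indexes output shows which queries have no supporting index.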

3. Devices Not Updating (MQTT Issue)

Symptoms

Changes to prayer times are not reflected on screens in real-time.

Actions

  1. Check if the Mosquitto service is running.
  2. Publish a test message to a specific device's topic with a third-party tool (for example, MQTT Explorer) and check whether the device receives it.
  3. If the broker appears stuck, restart Mosquitto. Devices will reconnect automatically.
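As a command-line alternative to MQTT Explorer, the checks above can be sketched with the mosquitto_pub and mosquitto_sub clients. The function names, broker host, topic, and payload here are hypothetical; the real topic layout and message format are not specified in this runbook. The uptime check assumes the broker publishes its $SYS topics and that your credentials may read them.

```shell
#!/usr/bin/env bash
# Sketch only: function names, host, topic and payload are illustrative assumptions.

# Ask the broker for its uptime; waits up to 5 seconds for one message.
# $1 = broker host
broker_uptime() {
  mosquitto_sub -h "$1" -t '$SYS/broker/uptime' -C 1 -W 5
}

# Wait up to 10 seconds for one message on a topic.
# $1 = broker host, $2 = topic
listen_once() {
  mosquitto_sub -h "$1" -t "$2" -C 1 -W 10
}

# Publish a single test message to a topic.
# $1 = broker host, $2 = topic, $3 = payload
publish_test() {
  mosquitto_pub -h "$1" -t "$2" -m "$3"
}
```

Run listen_once in one terminal and publish_test in another. If the message arrives at the subscriber but not at the device, the broker is likely healthy and the problem is on the device side. A timeout from broker_uptime suggests the broker itself is unresponsive.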

4. Database Restore from Backup

Scenario

Accidental data deletion or database corruption.

Procedure

  1. Locate the latest backup file (/backups/automated/...).
  2. Stop the NestJS server to prevent writes during restoration.
  3. Run the restore command:
    mongorestore --uri="mongodb://..." --drop --archive=<backup_file>
  4. Restart the server and verify data integrity.
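Steps 1 and 3 of the procedure above can be sketched as shell helpers. This is a sketch under assumptions: the function names are hypothetical, the backup directory and connection URI are passed in by the operator because this runbook elides them, and backup file names are assumed to contain no newlines. Stopping and restarting the NestJS server is left out because the runbook does not say how it is deployed.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative; directory and URI come from the operator.

# Print the path of the most recently modified entry in a backup directory.
# $1 = backup directory
latest_backup() {
  local dir="$1" newest
  newest="$(ls -1t -- "$dir" | head -n 1)"
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Restore an archive, dropping each collection before it is restored.
# $1 = MongoDB connection URI, $2 = path to the archive file
restore_from_archive() {
  local uri="$1" archive="$2"
  if [ ! -f "$archive" ]; then
    echo "backup not found: $archive" >&2
    return 1
  fi
  mongorestore --uri="$uri" --drop --archive="$archive"
}
```

A typical call would be restore_from_archive "<uri>" "$(latest_backup <backup_dir>)". If the backups were created with mongodump --gzip, mongorestore needs the --gzip flag as well; whether that applies here is not stated in this runbook.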