
Runbooks - Incident Response

This document consolidates operational procedures for handling common production incidents.

1. Service Down

Symptoms

The client shows error messages, or monitoring reports a 'Red' status.

Actions

Connect to the server via SSH.
Check container status: docker ps -a (plain docker ps lists only running containers, so a stopped container will not appear).
If a container is down, check logs: docker logs <container_name> --tail 100.
Restart the service: docker-compose restart <service_name>.
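
The log check above can be scripted when several containers are down at once. The following is a minimal sketch, assuming the Docker CLI is on the PATH and the current user is allowed to run it; the function name show_exited_logs is made up for illustration.

```shell
# Print the last 100 log lines of every container in the "exited" state.
# Assumption: the Docker CLI is installed and usable by the current user.
show_exited_logs() {
  for name in $(docker ps --all --filter "status=exited" --format '{{.Names}}'); do
    echo "=== $name ==="
    docker logs "$name" --tail 100
  done
}
```

Call show_exited_logs after connecting over SSH, read the output to find the cause, and only then restart the affected service with docker-compose restart <service_name>.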

2. High Database Load

Symptoms

System-wide slowness, API request timeouts.

Investigation

Check active connections in MongoDB (via Mongo Express or CLI).
Search the MongoDB logs for slow queries (by default, operations slower than 100 ms are logged).
Ensure appropriate indexes exist for all common queries.
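
For the CLI route, the checks above might look like the sketch below. It assumes the mongosh shell is installed (older deployments may still use the legacy mongo shell) and that MONGO_URI holds your connection string. The function names and the 5-second threshold are illustrative choices, not values taken from this runbook.

```shell
# Assumption: MONGO_URI is exported by the operator, e.g. from a secrets file.
mongo_eval() {
  mongosh "${MONGO_URI:?set MONGO_URI to your connection string}" --quiet --eval "$1"
}

# Current, available, and total created connections.
show_connections() {
  mongo_eval 'printjson(db.serverStatus().connections)'
}

# Active operations that have been running for more than 5 seconds.
show_slow_ops() {
  mongo_eval 'printjson(db.currentOp({ active: true, secs_running: { $gt: 5 } }))'
}

# Indexes defined on one collection; $1 is the collection name.
show_indexes() {
  mongo_eval "printjson(db.getCollection('$1').getIndexes())"
}
```

Run show_connections and show_slow_ops first. If a long-running operation points at a particular collection, show_indexes <collection_name> lists the indexes that exist on it.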

3. Devices Not Updating (MQTT Issue)

Symptoms

Changes to prayer times are not reflected on screens in real time.

Actions

Check if the Mosquitto service is running.
Manually publish a test message to a specific device's topic using a third-party tool (such as MQTT Explorer) and check whether the device receives it.
If the broker is unresponsive, restart Mosquitto. Devices will reconnect automatically.
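
MQTT Explorer is a desktop tool. On a host reachable only over SSH, the command-line clients that ship with Mosquitto (mosquitto_pub and mosquitto_sub) can run the same test. This is a sketch under assumptions: the clients are installed, the broker listens on the default port without authentication, and MQTT_HOST and the function names are placeholders introduced here.

```shell
# Assumption: the broker is reachable at MQTT_HOST (defaults to localhost).
MQTT_HOST="${MQTT_HOST:-localhost}"

# Publish one test payload. $1 = topic, $2 = message body.
publish_test() {
  mosquitto_pub -h "$MQTT_HOST" -t "$1" -m "$2"
}

# Wait up to 10 seconds for a single message on a topic, printing topic and payload.
listen_once() {
  mosquitto_sub -h "$MQTT_HOST" -t "$1" -C 1 -W 10 -v
}
```

Start listen_once <device_topic> in one terminal, then run publish_test <device_topic> "ping" in another. If the subscriber prints the message, the broker is routing traffic and the problem is more likely on the device side. If it times out, suspect the broker.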

4. Database Restore from Backup

Scenario

Accidental data deletion or database corruption.

Procedure

Locate the latest backup file (/backups/automated/...).
Stop the NestJS server to prevent writes during restoration.

Run the restore command (--drop removes each collection from the target database before restoring it from the backup):

mongorestore --uri="mongodb://..." --drop --archive=<backup_file>

Restart the server and verify data integrity.
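
The procedure above can be wrapped in one guarded function so that no step is skipped under pressure. This is a sketch, not the project's actual tooling. It assumes the NestJS server runs as a Compose service, so it is stopped and started with docker-compose. The function name restore_db, the COMPOSE_CMD override, and the service argument are hypothetical. The connection string and backup path are passed in by the operator.

```shell
# restore_db <mongo_uri> <backup_file> <service_name>
# Assumption: the server is a Compose service; COMPOSE_CMD can override the binary.
restore_db() {
  uri="$1"; backup="$2"; service="$3"
  if [ -z "$uri" ] || [ -z "$backup" ] || [ -z "$service" ]; then
    echo "usage: restore_db <mongo_uri> <backup_file> <service_name>" >&2
    return 2
  fi
  # Refuse to stop anything if the backup cannot be read.
  if [ ! -r "$backup" ]; then
    echo "backup file not readable: $backup" >&2
    return 1
  fi
  compose="${COMPOSE_CMD:-docker-compose}"

  # Stop the server so nothing writes during the restore.
  "$compose" stop "$service" || return 1

  # Same command as in the procedure; --drop replaces existing collections.
  if ! mongorestore --uri="$uri" --drop --archive="$backup"; then
    echo "restore failed; $service left stopped for inspection" >&2
    return 1
  fi

  "$compose" start "$service"
}
```

The server is deliberately left stopped when the restore fails, because a partially restored database should be inspected before it serves traffic again. If the archive was created with --gzip, mongorestore needs the same flag. Verifying data integrity remains a manual step.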
