Runbooks - Incident Response
This document consolidates operational procedures for handling common production incidents.
1. Service Down
Symptoms
Clients show error messages, or monitoring reports a 'Red' status.
Actions
- Connect to the server via SSH.
- Check container status:
  docker ps
- If a container is down, check logs:
  docker logs <container_name> --tail 100
- Restart the service:
  docker-compose restart <service_name>
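The checks in this section can be sketched as a small shell helper. This is a minimal sketch, not the team's actual tooling: the function names `list_down_containers` and `triage_container` are hypothetical, and the `|` separator in the `docker ps` format string is chosen here only to make parsing easy.

```shell
#!/bin/sh
# Hypothetical helpers for the "Service Down" steps; names are illustrative.

# Read "name|status" lines (as printed by the docker ps command shown below)
# and print the names of containers whose status does not start with "Up".
list_down_containers() {
    awk -F '|' '$2 !~ /^Up/ { print $1 }'
}

# Show recent logs for a container, then restart its compose service.
# $1 = container name, $2 = compose service name (both operator-supplied).
triage_container() {
    docker logs --tail 100 "$1"
    docker-compose restart "$2"
}

# Example (run on the server):
#   docker ps -a --format '{{.Names}}|{{.Status}}' | list_down_containers
#   triage_container <container_name> <service_name>
```

Reading the logs before restarting matters: it helps show why the container stopped, which a restart alone would not.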
2. High Database Load
Symptoms
System-wide slowness, API request timeouts.
Investigation
- Check active connections in MongoDB (via Mongo Express or CLI).
- Search for slow queries in the logs.
- Ensure appropriate indexes exist for all common queries.
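Parts of this investigation can be scripted. The sketch below rests on several assumptions not stated in this runbook: `mongosh` is installed, a `MONGO_URI` environment variable holds the connection string, and the server runs MongoDB 4.4 or later, whose structured JSON logs label slow operations with the message "Slow query". The function names and the log file path argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for the "High Database Load" investigation.

# Print current and available connection counts.
# Assumes mongosh is installed and MONGO_URI holds the connection string.
show_connections() {
    mongosh "$MONGO_URI" --quiet --eval 'printjson(db.serverStatus().connections)'
}

# Count slow-operation entries in a MongoDB log file.
# Assumes MongoDB 4.4+ structured logs, which tag these entries "Slow query".
# $1 = path to the log file (deployment-specific).
# Note: grep exits non-zero when the count is 0, but still prints "0".
count_slow_queries() {
    grep -c '"msg":"Slow query"' "$1"
}
```

To check whether a particular query uses an index, running `db.<collection>.find(<filter>).explain("executionStats")` in `mongosh` shows the winning plan; an `IXSCAN` stage indicates an index scan, while `COLLSCAN` indicates a full collection scan.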
3. Devices Not Updating (MQTT Issue)
Symptoms
Changes to prayer times are not reflected on screens in real time.
Actions
- Check if the Mosquitto service is running.
- Manually publish a test message to a specific device's topic using a third-party tool (MQTT Explorer) and check whether the device receives it.
- If the Broker is "stuck", restart Mosquitto. Devices will reconnect automatically.
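As a command-line alternative to a GUI client, the publish test can be sketched with the `mosquitto_pub` and `mosquitto_sub` clients that ship with Mosquitto. This assumes those clients are installed and that the broker accepts unauthenticated, non-TLS connections. The function names are hypothetical, and the host and topic are left as arguments because the real topic layout is not described here.

```shell
#!/bin/sh
# Hypothetical helpers for the MQTT checks; host and topic are operator-supplied.

# Publish a test message to a device topic.
# $1 = broker host, $2 = topic, $3 = message body
publish_test_message() {
    mosquitto_pub -h "$1" -t "$2" -m "$3"
}

# Wait for one message on a topic, giving up after a timeout.
# $1 = broker host, $2 = topic, $3 = timeout in seconds
wait_for_message() {
    mosquitto_sub -h "$1" -t "$2" -C 1 -W "$3"
}
```

Running `wait_for_message` in one terminal and `publish_test_message` in another shows whether the broker is routing messages at all. If the subscriber prints the message but the device does not react, the problem is more likely on the device side than in the broker.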
4. Database Restore from Backup
Scenario
Accidental data deletion or database corruption.
Procedure
- Locate the latest backup file (/backups/automated/...).
- Stop the NestJS server to prevent writes during restoration.
- Run the restore command:
  mongorestore --uri="mongodb://..." --drop --archive=<backup_file>
- Restart the server and verify data integrity.
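The locate and restore steps can be sketched as two shell helpers. This is a sketch under stated assumptions: the function names are hypothetical, the backup directory and connection string are passed in by the operator rather than hard-coded, and "latest" is taken to mean the most recently modified file in that directory.

```shell
#!/bin/sh
# Hypothetical helpers for the restore procedure.

# Print the most recently modified file in a backup directory.
# $1 = backup directory (operator-supplied).
# Assumes backup file names contain no newlines.
latest_backup() {
    newest=$(ls -1t -- "$1" | head -n 1)
    [ -n "$newest" ] && printf '%s/%s\n' "$1" "$newest"
}

# Restore from an archive file; --drop removes each collection
# before it is restored from the archive.
# $1 = connection string, $2 = archive file
restore_from_archive() {
    mongorestore --uri="$1" --drop --archive="$2"
}
```

With the server stopped, a call might look like `restore_from_archive "<connection_string>" "$(latest_backup <backup_dir>)"`. If the archive was created with compression, `mongorestore` also needs the `--gzip` flag. Because `--drop` is destructive, confirm the chosen file before running the restore.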