It turns out the mon & app servers failed to reboot for some reason (no traces in syslog) a few days before crashing for good. The mon server failed to reboot June 27 and crashed a few days later; the app server failed to reboot June 29 and also crashed a few days later. There is not much to go on for forensic analysis, unfortunately. The machines were powered off and on again and are back to their normal routine. I’ll log in daily to verify they rebooted the day before, hoping to get more information about the cause of the reboot failure before the next crash.
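The daily check can be scripted instead of eyeballed. A minimal sketch of my own (not existing SecureDrop tooling), assuming GNU date and procps uptime are available on the servers:

```shell
#!/bin/sh
# Compare the last boot time against midnight yesterday to confirm
# the nightly reboot actually happened.
boot_epoch=$(date -d "$(uptime -s)" +%s)
cutoff_epoch=$(date -d "yesterday 00:00" +%s)
if [ "$boot_epoch" -ge "$cutoff_epoch" ]; then
    echo "OK: rebooted since yesterday"
else
    echo "WARNING: no reboot since yesterday"
fi
```

Run from a monitoring host over ssh, this would flag a skipped reboot the morning after, rather than days later when the machine crashes.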
Here is a theory:
- reboot runs
- kills processes including syslog
- blocks on something that prevents reboot
- system keeps running for a few days despite the missing services
- the missing services eventually create cumulative problems that crash the system
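If a machine is caught in that half-shutdown state again, some quick triage before power-cycling could confirm or refute the theory. These are hypothetical commands of my own, not part of any SecureDrop tooling:

```shell
# Is a reboot/shutdown process still alive and blocked?
ps -eo pid,stat,comm | grep -E 'reboot|shutdown' || echo "no reboot process found"
# Did reboot already kill syslog, as the theory predicts?
pgrep -a syslog || echo "syslog is not running"
# Anything stuck in D (uninterruptible sleep) that could block shutdown?
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```

Finding a live reboot process alongside a dead syslog would match the theory; a process stuck in D state would point at what reboot is blocking on.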
A possible mitigation could be to change
install_files/ansible-base/roles/common/templates/cron-apt-cron-job.j2 to use something like
timeout 1800 reboot || reboot --force --no-sync
So if reboot is stuck contacting init services for more than half an hour, it falls back to a brutal reboot. But since the SecureDrop instance is not in production yet, it is worth taking the time to investigate first.
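The fallback relies on timeout(1) exiting non-zero when it has to kill the command, which makes the || branch run. A quick illustration with sleep standing in for a stuck reboot:

```shell
# timeout kills sleep after 2 seconds and exits with status 124
# (non-zero), so the command after || runs.
timeout 2 sleep 10 || echo "fallback triggered"
# prints "fallback triggered"
```

Note that if reboot exits quickly but with an error, the fallback also fires, which is arguably the desired behavior here.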
To be continued!