An experimental SecureDrop instance based on two BRIX was running fine for over a month, in a stable electrical / network environment. Yesterday both machines just stopped, presumably not rebooting properly. This happened once before. I’m not sure what the cause is but I will update this topic when I get more information.
It turns out the mon & app servers failed to reboot for some reason (no traces in syslog) a few days before crashing for good. The mon server failed to reboot June 27 and crashed a few days later. The app server failed to reboot June 29 and also crashed a few days later. There is not much to go on for forensic analysis, unfortunately. The machines were powered off and on again, and are back to their normal routine. I’ll log in daily to verify they rebooted the day before. I’m hoping to get more information about the cause of the reboot failure before they crash again.
Here is a theory:
- reboot runs
- kills processes including syslog
- blocks on something that prevents reboot
- system keeps running for a few days despite missing a few services
- system eventually crashes because one or more of the missing services creates cumulative problems that lead to a system crash
A possible mitigation could be to change
install_files/ansible-base/roles/common/templates/cron-apt-cron-job.j2 to use something like
timeout 1800 reboot || reboot --force --no-sync
So if reboot is stuck contacting init services for more than half an hour, it falls back to a brutal reboot. But since the SecureDrop instance is not in production yet, it is worth taking the time to investigate.
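As a rough sketch, the same fallback could be written as a small shell function that also leaves a trace on stderr when the forced path is taken. The function name reboot_with_fallback and its arguments are hypothetical and not part of SecureDrop. It only helps if the reboot command itself blocks, which is what the theory above assumes.

```shell
#!/bin/sh
# Sketch only: reboot_with_fallback and its arguments are hypothetical names.
# Usage: reboot_with_fallback SECONDS GRACEFUL_CMD FORCED_CMD
reboot_with_fallback() {
    limit="$1"
    graceful="$2"
    forced="$3"

    # timeout(1) kills the command and exits with 124 if it runs past the limit.
    # $graceful is left unquoted on purpose so it splits into command + arguments.
    if timeout "$limit" $graceful; then
        return 0
    fi

    # Any non-zero exit (including the 124 from timeout) ends up here,
    # which mirrors the behaviour of the one-liner using ||.
    echo "graceful reboot did not complete within ${limit}s, forcing" >&2
    $forced
}

# On the server the call would look like this (commented out so the
# sketch can be sourced safely):
# reboot_with_fallback 1800 "reboot" "reboot --force --no-sync"
```

The commands are passed in as arguments so the function can be tried with harmless stand-ins such as "sleep 5" and "echo forced" before it goes anywhere near a real machine.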
To be continued!
Odd. Thanks for reporting this @dachary. I need to gather some funds and get a BRIX myself. I’ve heard from someone else that they have had issues with the BRIX too, but the one in my office seems to be doing fine. I’ll have to double check.
And this is running on the latest kernel I assume?
Yes, running grsec-4.4.135. I have a hunch that it is a transient and rare problem, most likely driver related. I’ve set an hourly check today to verify the reboot happens as it should. I’ll get more info when it fails.
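For anyone who wants a similar check, one possible approach (a sketch, not necessarily the check running on these machines) is to compare the uptime against the expected reboot interval. The function names and the 25 hour threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: function names and the 25-hour threshold are assumptions.

# Succeeds (exit 0) when the uptime is longer than the allowed maximum,
# meaning the daily reboot most likely did not happen.
reboot_overdue() {
    uptime_seconds="$1"
    max_seconds="$2"
    [ "$uptime_seconds" -gt "$max_seconds" ]
}

# /proc/uptime holds "<seconds up> <seconds idle>" as decimals;
# keep only the integer part of the first field.
current_uptime() {
    cut -d. -f1 /proc/uptime
}

# Example body for an hourly cron job (commented out):
# if reboot_overdue "$(current_uptime)" 90000; then
#     echo "no reboot in the last 25 hours on $(hostname)" >&2
# fi
```

The threshold is a little over 24 hours so that a reboot running slightly late does not trigger a false alarm.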