Gitlab recovery


#1

This morning, 9:16 AM in Paris, gitlab-host was down. It seems it happened this morning, 8:30 europe/paris.

The vm was SHUTOFF; I started it. The host recovered progressively.

Gitlab was still down.

The logs
$ sudo docker logs b800e75c8def| tail -n 100
Missing Rails.application.secrets.openid_connect_signing_key for production environment. The secret will be generated and stored in config/secrets.yml.
2018-05-14 22:40:16,672 INFO reaped unknown pid 18169
2018-05-14 22:40:17,328 INFO reaped unknown pid 18173
2018-05-14 22:40:17,987 INFO reaped unknown pid 18175
2018-05-14 22:40:18,652 INFO reaped unknown pid 18179
2018-05-14 22:40:19,366 INFO reaped unknown pid 18181
2018-05-14 23:25:03,757 INFO reaped unknown pid 18830
2018-05-14 23:25:04,411 INFO reaped unknown pid 18832
2018-05-14 23:25:05,101 INFO reaped unknown pid 18838
2018-05-15 01:50:04,069 INFO reaped unknown pid 20802
2018-05-15 01:50:04,764 INFO reaped unknown pid 20806
2018-05-15 01:50:05,467 INFO reaped unknown pid 20812
2018-05-15 02:10:07,006 INFO reaped unknown pid 21286
2018-05-15 02:10:07,764 INFO reaped unknown pid 21288
2018-05-15 02:10:08,427 INFO reaped unknown pid 21294
2018-05-15 03:20:04,067 INFO reaped unknown pid 22298
2018-05-15 03:20:04,696 INFO reaped unknown pid 22300
2018-05-15 03:20:05,356 INFO reaped unknown pid 22302
Initializing logdir...
Initializing datadir...
Installing configuration templates...
SSL Key, SSL Certificate and DHParam were not found.
Assuming that the container is running behind a HTTPS enabled load balancer.
Configuring gitlab...
Configuring gitlab::database.....
Configuring gitlab::redis
Configuring gitlab::secrets...
Configuring gitlab::sidekiq...
Configuring gitlab::gitaly...
Configuring gitlab::monitoring...
Configuring gitlab::gitlab-workhorse...
Configuring gitlab::unicorn...
Configuring gitlab::timezone...
Configuring gitlab::rack_attack...
Configuring gitlab::ci...
Configuring gitlab::artifacts...
Configuring gitlab::lfs...
Configuring gitlab::uploads...
Configuring gitlab::mattermost...
Configuring gitlab::project_features...
Configuring gitlab::smtp_settings...
Configuring gitlab::oauth...
Configuring gitlab::oauth::github...
Configuring gitlab::ldap...
Configuring gitlab::cron_jobs...
Configuring gitlab::backups...
Configuring gitlab::registry...
Configuring gitlab::pages...
Configuring gitlab-shell...
Configuring nginx...
Configuring nginx::gitlab...
Configuring nginx::gitlab::hsts...
2018-05-15 07:22:09,926 CRIT Supervisor running as root (no user in config file)
2018-05-15 07:22:09,929 WARN Included extra file "/etc/supervisor/conf.d/cron.conf" during parsing
2018-05-15 07:22:09,929 WARN Included extra file "/etc/supervisor/conf.d/gitaly.conf" during parsing
2018-05-15 07:22:09,929 WARN Included extra file "/etc/supervisor/conf.d/gitlab-workhorse.conf" during parsing
2018-05-15 07:22:09,929 WARN Included extra file "/etc/supervisor/conf.d/mail_room.conf" during parsing
2018-05-15 07:22:09,930 WARN Included extra file "/etc/supervisor/conf.d/nginx.conf" during parsing
2018-05-15 07:22:09,930 WARN Included extra file "/etc/supervisor/conf.d/sidekiq.conf" during parsing
2018-05-15 07:22:09,930 WARN Included extra file "/etc/supervisor/conf.d/sshd.conf" during parsing
2018-05-15 07:22:09,930 WARN Included extra file "/etc/supervisor/conf.d/unicorn.conf" during parsing
2018-05-15 07:22:09,953 INFO RPC interface 'supervisor' initialized
2018-05-15 07:22:09,953 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2018-05-15 07:22:09,954 INFO supervisord started with pid 1
2018-05-15 07:22:10,958 INFO spawned: 'gitaly' with pid 615
2018-05-15 07:22:10,977 INFO spawned: 'sidekiq' with pid 616
2018-05-15 07:22:10,981 INFO spawned: 'unicorn' with pid 617
2018-05-15 07:22:11,000 INFO spawned: 'gitlab-workhorse' with pid 618
2018-05-15 07:22:11,017 INFO spawned: 'cron' with pid 619
2018-05-15 07:22:11,021 INFO spawned: 'nginx' with pid 620
2018-05-15 07:22:11,024 INFO spawned: 'sshd' with pid 621
2018-05-15 07:22:12,497 INFO success: gitaly entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:12,497 INFO success: sidekiq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:12,497 INFO success: unicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:12,497 INFO success: gitlab-workhorse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:12,497 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:12,497 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:12,497 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:13,997 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:14,809 INFO spawned: 'unicorn' with pid 654
2018-05-15 07:22:15,812 INFO success: unicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:16,170 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:17,186 INFO spawned: 'unicorn' with pid 729
2018-05-15 07:22:17,930 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:18,934 INFO spawned: 'unicorn' with pid 734
2018-05-15 07:22:20,713 INFO success: unicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:20,713 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:21,716 INFO spawned: 'unicorn' with pid 737
2018-05-15 07:22:22,615 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:23,993 INFO spawned: 'unicorn' with pid 740
2018-05-15 07:22:25,003 INFO success: unicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-15 07:22:25,067 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:26,071 INFO spawned: 'unicorn' with pid 743
2018-05-15 07:22:26,924 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:28,788 INFO spawned: 'unicorn' with pid 748
2018-05-15 07:22:29,738 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:32,268 INFO spawned: 'unicorn' with pid 751
2018-05-15 07:22:33,204 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:36,735 INFO spawned: 'unicorn' with pid 758
2018-05-15 07:22:37,596 INFO exited: unicorn (exit status 1; not expected)
2018-05-15 07:22:38,598 INFO gave up: unicorn entered FATAL state, too many start retries too quickly

I entered in the container via

sudo docker exec -it b800e75c8def bash

After some struggling, I finally found /var/log/gitlab/gitlab/unicorn.stderr.log which complained in this way:

 /home/git/gitlab/vendor/bundle/ruby/2.3.0/gems/unicorn-5.1.0/lib/unicorn/http_server.rb:195:in `pid=': Already running on PID:652 (or pid=/home/git/gitlab/tmp/pids/unicorn.pid is stale) (ArgumentError)

So I removed all pids files in /home/git/gitlab/tmp/pids and I rebooted the VM.

Gitlab was back.


#2

Thanks for the quick repair. It’s not good that .pid files block restarting the server. Did you research why that is? If not I’d be happy to do it.


#3

I suppose it is a mix between buggy launchers and unexpected failure (host toughly stopped?)

Did you research why that is? If not I’d be happy to do it.

I looked at it a little bit, but no idea about what happened; may it be an ovh failure?


#4

That’s most probably what happened, yes. Fortunately this is rare.