Healing the sick CI with GitLab CI

Bonjour,

In the past six months our CI (i.e. whatever is triggered when someone submits a pull request) has rarely been stable. I remember it ran smoothly for two or three weeks, a few months ago. The rest of the time it failed in various ways, daily. It has never been completely down, but its instability is a recurring roadblock for developers:

  • It is no longer trusted. When false negatives are too frequent, they are discarded and legitimate errors go unnoticed. That’s what’s happening with Travis at the moment. When the CI requires unstable tests it becomes a burden and makes people angry. Then the unstable test is made optional. And ignored. And removed.
  • It is difficult to diagnose. When the CI fails for reasons unrelated to the code change, only experienced developers are able to figure it out. The occasional developer will always be confused.
  • It requires special permissions. When the error is transient, the only recourse for a regular developer is to rebase and repush, which is a little-known trick.

There are two sources of problems: racy tests and environmental failures. We have had our fair share of racy tests but we managed to reduce them to almost nothing. The last occurrence was this week, but it was so rare that I don’t think anyone noticed, because the environmental failures make so much noise that they effectively hide them :slight_smile: The environmental failures we have seen include:

  • AWS provisioning failing
  • Travis failing
  • Code coverage reporting services failing
  • CircleCI network issues
  • Ubuntu, tor, or FPF repository failures
  • GitHub repositories unavailable

Despite our best efforts these potential failures, some of them very rare, combine with each other and lead to a CI that breaks daily. We need to take a hard look at this problem and rethink how we approach it. In my opinion the two root problems to be fixed are:

  • Too many moving parts.
  • The developers do not fix the CI.

I think we can fix this by:

  • Reducing the number of moving parts so that the failure probabilities of all parts sum to one expected failure per week or less (see the short calculation after this list)
  • Turning the CI infrastructure into code so the developers are responsible for fixing it
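
To make that target precise: if part $i$ of the pipeline fails during a given week with probability $p_i$, then by linearity of expectation the expected number of environment-caused breakages per week is simply the sum of those probabilities, which is what we want to keep at or below one:

$$\mathbb{E}[\text{breakages per week}] = \sum_i p_i \le 1$$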

The concrete implementation could be:

  • Setting up GitLab and GitLab CI, with VMs provisioned on the same machine (via libvirt for instance), and using coverage reports to visually compare the previous and current coverage (a sketch of what such a pipeline could look like follows this list), so the sources of environmental failure become:
    • GitLab machine unavailable
    • Ubuntu, tor, or FPF repository failures
  • The GitLab & GitLab CI setup is in an ansible playbook running on the securedrop.club infrastructure:
    • Developers are empowered to fix the CI ansible playbook
    • Developers with merge permissions can run the fixed ansible playbook, reboot machines etc.
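
Here is a minimal sketch of what the .gitlab-ci.yml for such a setup could look like. The job names, scripts and the libvirt runner tag are illustrative assumptions, not an existing design; the coverage keyword and artifacts are standard GitLab CI features we could use to keep and compare reports:

```yaml
# Sketch only: job names, scripts and the "libvirt" runner tag are assumptions.
stages:
  - provision
  - test

staging:
  stage: provision
  tags:
    - libvirt                     # a runner colocated with GitLab, booting local VMs
  script:
    - ./devops/provision.sh       # hypothetical wrapper around the existing ansible playbooks

tests:
  stage: test
  tags:
    - libvirt
  script:
    - ./devops/run-tests.sh       # hypothetical test entry point
  coverage: '/^TOTAL.+?(\d+%)$/'  # GitLab extracts the figure so runs can be compared
  artifacts:
    paths:
      - coverage-html/            # keep the HTML report for visual comparison
```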

Getting there is going to be a significant amount of work but I’m confident we’ll end up with something stable. And the total amount of work required to set this up is eventually going to be much less than constantly fighting an unstable infrastructure. This is a textbook example of how using exclusively Free Software and running a self-hosted infrastructure is the better choice. If the web services we’re currently using were Free Software, we could solve the problem by self-hosting all of them and thereby significantly reduce the failure rate we’re experiencing.

What do you think?

Hey @dachary!

I agree with all your points regarding current CI pains – I understand it’s been especially frustrating with CircleCI and the inability for non-repository owners to restart failed jobs. We can definitely reduce failures even more to make it a less frustrating experience. I do have a bunch of things to add, though; I think we should have a separate video conference to hash some of these aspects out.

  • The current CI workflow is in code; it’s completely reproducible using ansible. The caveat is that you’ll need an AWS account to reproduce the AWS bits. If we move that to libvirt/kvm, we’d need physical hosts, right? I don’t know any cloud providers that currently allow nested virtualization (a quick way to check a given host is sketched after this list). What’s the provider you currently use, and does it allow this (AWS certainly doesn’t)? The problem I see with moving to libvirt, though, is that all of FPF is eventually planning to use Qubes for their daily workstations, so we wouldn’t be able to spin up local VMs and would rely on the public cloud spin-up workflow. We could still have both points of testing so we don’t require developers to have cloud accounts, but obviously I’d prefer to have developers test exactly what’s in CI if possible.

  • Travis is on its way out the door – that’s our really old legacy CI flow and it sucks. It’s technically not required for merge right now but there is a PR to rip it out completely that’s slowly snaking its way through. I realize you don’t like seeing the red in GitHub; that’s why it’s getting ripped out :slight_smile:

  • CircleCI – I’ll be honest, I don’t love CircleCI but it’s also relatively stable. I see issues with networking sometimes but I believe there are things we can do with the provisioning process to make that more resilient. The entire SD provisioning process is very frail if you think about it. This is one of the reasons I was really pushing for us to move towards slinging a file-system rather than doing configuration on each workstation. That’s another fight for another day though :wink:
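
Regarding the nested virtualization question above, a quick way to find out what a given provider’s machines support would be something like the ansible play below; the "ci-runners" group name is hypothetical and the nested check is Intel-only (AMD hosts expose kvm_amd instead):

```yaml
# Sketch: checks whether a host could run KVM guests for a libvirt executor.
- hosts: ci-runners
  tasks:
    - name: Count hardware virtualization flags (vmx/svm) visible on the host
      command: grep -E -c '(vmx|svm)' /proc/cpuinfo
      register: virt_flags
      failed_when: false
      changed_when: false

    - name: Check whether nested KVM is enabled (Intel hosts)
      command: cat /sys/module/kvm_intel/parameters/nested
      register: nested_intel
      failed_when: false
      changed_when: false

    - name: Report what we found
      debug:
        msg: "vmx/svm flag count: {{ virt_flags.stdout }}, kvm_intel nested: {{ nested_intel.stdout | default('n/a') }}"
```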

My big hang-up with GitLab CI though (please correct me if I’m mistaken) is that:

The fork issue is the one I see as the biggest; it’s the main reason we are using CircleCI now instead of Jenkins. I previously had Jenkins pretty far into development but had issues automatically building from forks with the plugins. So I guess my questions for you are:

  • Can we overcome each of these concerns while using GitLab CI?
  • Do we have to maintain a sync between GitLab and GitHub, effectively having two code bases? I really don’t want two places where our code lives that we have to keep in sync – that seems to run counter to reducing moving pieces in CI. So then the discussion becomes: do we want to consider moving to GitLab completely?
  • Can you create jobs in GitLab CI independent of GitLab/GitHub repositories? Just say: run X code at X time?

I also know you hate Jenkins – but do you have some time to look over what I’ve set up for our Jenkins? I use the pipeline and Jenkins DSL features, which means you don’t use the UI to create jobs; instead all the logic is in a Jenkinsfile that sits in the repo (see devops/jenkins/TorNightlyPipeline in the current securedrop repo for an example). When was the last time you used Jenkins? They’ve really made a lot of headway in the last year, especially with the Blue Ocean UI redesign and their introduction of pipelines (their build-logic-as-code feature). I’ve had a lot of trouble in the past with Jenkins but a lot of community members I respect continue to use it and have built stable CI with it – so I’m not ready to dismiss it completely from the equation.

Here’s the biggest elephant in the room I see though… The entire installation process takes too long and has too many moving pieces. It takes about 20 minutes for the provisioning to get through before we even find out whether tests are failing. Does that entire process need to be run for every PR? I would say no – we should move towards running basic application tests on each PR, and then regularly run a large test suite every hour, make that more resilient, and make it very loud when there are problems. Where a problem in CI means: hey, stop what you are doing and help fix the build. At least that’s where I envision the long-term goal.
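
For what it’s worth, GitLab CI could express that split fairly naturally with pipeline schedules. The job names and scripts below are assumptions, and the hourly schedule itself would be created in the GitLab UI:

```yaml
# Sketch: fast application tests on every push/PR, the heavy staging run
# only on a scheduled (e.g. hourly) pipeline. Scripts are hypothetical.
app-tests:
  stage: test
  script:
    - ./devops/run-app-tests.sh    # quick application test suite
  except:
    - schedules                    # skip on the scheduled pipeline

full-staging:
  stage: test
  script:
    - ./devops/provision.sh        # full staging provisioning
    - ./devops/run-full-tests.sh   # complete test suite
  only:
    - schedules                    # run only on the scheduled pipeline
```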

I think that should be a parallel goal here for reducing the chance of CI failure, in addition to whatever CI fixes we all come to an agreement on.

Sorry, I swear I’m not trying to shoot down this idea, and I really want to have a larger discussion about how to solve it. CI issues are the worssssssstttt for a new developer, so anything we can do to squash them I say we fully investigate.

–SHeiny!

Yes! And it’s really good. I meant that the infrastructure as a whole is not reproducible: we can’t self-host GitHub, CircleCI, or codecov.

I tried maintaining GitLab and GitHub repositories in parallel and it’s a pain. GitHub has an API but it’s unstable like you would not believe. I have snippets of code trying to cope with it, but in the end it’s never going to be sustainable. Not only does GitHub change undocumented API behavior without notice (i.e. what causes errors because of internal factors such as permissions, throttling, …), they also deprecate APIs every few years and force you to rewrite things to keep up: not something sustainable for a relatively small project.

The short answer is that moving to GitLab entirely is the right move. Doing so without disrupting SecureDrop development is non-trivial.

I gave up on November 5, 2015. You can browse the commit where I deleted my best attempt at sanitizing Jenkins, and I still have PTSD from implementing a script to create credentials.

Right … but it is stable and does not frustrate us. At least not in the past few months :wink:

I did not mention Jenkins as an additional part of the CI because I did not realize it was now part of the release process, in the context of maintaining the Tor mirror. I did not follow this development and missed the fact that Jenkins was added. It is not triggered on every pull request because the maintenance of the Tor repo is not so tightly coupled. But the Jenkins run depends on the SecureDrop repository (devops/jenkins/TorNightlyPipeline) and we may end up breaking it, or getting bug reports in the SecureDrop repository because of the Jenkins pipeline.

I will go ahead and implement a GitLab-based CI pipeline, highly motivated by the daily breakage of the CI available on GitHub.