The Pain That Is GitHub Actions

Gerd Zellweger, Head of Engineering / Co-Founder | March 17, 2025

For the past two weeks, I’ve been spending most of my time rewriting our CI scripts in GitHub Actions. This is the third time we’ve had to redo our CI setup—first GitHub Actions, then Earthly (which we moved away from because it was discontinued), and now, reluctantly, back to GitHub Actions.

Our CI is complex: merge queues, multiple runners (self-hosted, blacksmith.sh, GitHub-hosted), Rust builds, Docker images, and heavy integration tests. Every PR we merge burns through an hour of CI time, running across multiple parallel runners.

There are a few things we want from CI. None of it is unheard of; we just consider it good software practice:

  1. Everything that goes into `main` must pass all tests.
  2. Trivial mistakes (formatting, unused deps, lint issues) should be fixed automatically, not cause failures.
  3. The artifacts we test with in CI should be the exact ones we release.
  4. CI should complete quickly (to keep developers happy).

GitHub Actions technically allows all of this—but setting it up is a frustrating mess, full of hidden gotchas, inconsistent behavior, and a debugging experience that makes me question my choices.

Strange Way to Enforce Status Checks with Merge Queue

The key to enforcing a clean main branch is GitHub’s merge queue, which rebases a PR onto main before running CI. Sounds great. But here’s the fun part:

  • We need CI to run before entering the queue to auto-fix trivial issues.
  • We need CI to run again inside the queue to verify the final merge.
  • GitHub Actions makes it weirdly hard to require both runs to pass.

The solution? Name the jobs identically in both phases. That's it. GitHub treats them as the same check, so both runs must succeed. I found this in a Stack Overflow answer after a few hours of debugging. Every other approach leads to one of two failure modes: either the status checks are awaited before the PR enters the queue (so the job never starts), or worse, the PR gets merged even though the job you wanted to gate the merge queue on failed.
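A minimal sketch of the pattern (workflow and job names are hypothetical): one workflow triggered on both `pull_request` and `merge_group`, so the identically named check has to pass in both phases.

```yaml
# ci.yml (hypothetical): runs on PRs and again inside the merge queue.
# Because the job name ("tests") is identical for both events, GitHub
# treats the two runs as the same required status check.
name: CI
on:
  pull_request:
  merge_group:

jobs:
  tests:
    name: tests        # this exact name is what you mark as "required"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --all
```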

A security nightmare?

A few days ago, someone compromised a popular GitHub Action. The response? "Just pin your dependencies to a hash." Except, as commenters pointed out, almost no one does.

Even setting aside supply-chain attacks, GitHub's security model is a confusing maze to me. My view is that if I can't understand a security model easily, it's probably doomed to fail or break at some point. Disclaimer: I'm writing this as a GitHub Actions user with only a vague understanding of it, so I'd be delighted to hear that it is not just "things piled on top of things until it's safe", which is my current impression. I do understand that secure CI for distributed source control is a hard problem.

In GitHub, there is a default token called GITHUB_TOKEN. It is initialized with a set of default permissions, which you can configure in your repository settings (under Actions -> General -> Workflow permissions). Here is what the GitHub documentation says about it:

If the default permissions for the GITHUB_TOKEN are restrictive, you may have to elevate the permissions to allow some actions and commands to run successfully. If the default permissions are permissive, you can edit the workflow file to remove some permissions from the GITHUB_TOKEN.

- GitHub Documentation

Removing permissions that aren't necessary sounds nice (though I think a better default would be to start with no privileges and require the user to add whatever is needed). Unfortunately, there are many permissions, and unless you're a GitHub expert it's hardly clear what each of them protects.
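You can at least approximate that better default yourself. A sketch of a least-privilege setup (the permission scopes shown are real, but which ones a workflow actually needs depends on the actions it uses): drop everything at the workflow level, then grant back per job.

```yaml
# Hypothetical workflow fragment: start from zero and add scopes back.
permissions: {}           # drop all default GITHUB_TOKEN permissions

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read      # needed by actions/checkout
    steps:
      - uses: actions/checkout@v4
```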

Your workflow permissions also don't straightforwardly follow from the actions you use. Here is an example: I'm using softprops/action-gh-release to automatically create a new release on GitHub:

- name: Release on GitHub
  if: env.version_exists == 'false'
  uses: softprops/action-gh-release@v2
  with:
    tag_name: v${{ env.CURRENT_VERSION }}
    generate_release_notes: true
    make_latest: true
    token: ${{ secrets.CI_RELEASE }}

Why do I need a custom token? Because with the default GITHUB_TOKEN, the release completes but doesn't trigger our post-release workflow (events created with GITHUB_TOKEN deliberately don't start new workflow runs, to prevent recursion). The sad part is that you get no indication of this until you eventually find an issue where someone hit the same problem, which points you in the right direction.
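For context, the downstream workflow looks roughly like this (workflow name and job are hypothetical). A trigger of this shape silently never fires if the release was created with the default GITHUB_TOKEN:

```yaml
# Hypothetical post-release workflow: this `release: published` trigger
# never fires when the release was created using the default GITHUB_TOKEN,
# because events produced by that token don't start new workflow runs.
name: Post-release
on:
  release:
    types: [published]

jobs:
  announce:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Released ${{ github.event.release.tag_name }}"
```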

You can also elevate permissions in your workflow YAML file. That seems like a strange place to do it: inside the very code you're trying to protect. At least there are some limitations, according to the GitHub docs:

You can use the permissions key to add and remove read permissions for forked repositories, but typically you can't grant write access. The exception to this behavior is where an admin user has selected the Send write tokens to workflows from pull requests option in the GitHub Actions settings. For more information, see Managing GitHub Actions settings for a repository.

This is just one of many instances of what I believe is the root of the GitHub Actions security model's obscurity: too many pitfalls, each with exceptions you have to account for. The system is clearly very powerful and lets you do many things, but that same power expands the attack surface for breaking things.

As far as I can tell, I'm not alone in this. I ran into another instance of the same problem when I read the paragraph where GitHub recommends against using self-hosted runners in public repositories:

We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

- GitHub Documentation

However, GitHub also has a setting for self-hosted runners that requires pull requests from external collaborators to be approved before they run. A practical question: are self-hosted runners combined with this setting safe? I believe so, but the GitHub documentation doesn't say, and there is no consensus on the rest of the internet either. It's hard to be 100% confident given how much complexity there is; even GitHub's own documentation writers don't seem to fully understand their security model anymore.

Docker and Github Actions, an Unholy Combination

If you thought GitHub Actions was bad, try mixing in Docker.

GitHub lets you run jobs inside a container. This is great in theory—you can prepackage dependencies into a dev container instead of installing them every run. In practice:

  1. File permissions break constantly. The container image builds files as one user, but GitHub runners may run it as another (different uid and gid), so the job may be unable to access files inside the container, or files in the GitHub workspace and temporary host directories that get mounted.
  2. The $HOME directory moves. Your dev container may install tools into /home/ubuntu, but inside GitHub Actions, it’s suddenly /github/home. Tools that rely on files in $HOME may no longer find them.
  3. Any action that interacts with the host system might break now. For example, I use blacksmith’s sticky disk action to mount an NVMe drive for caching (since GitHub caches are limited to 10GB). It didn’t work inside a container until they made a fix for me (thanks to Aditya Jayaprakash from blacksmith.sh for the one day turn-around time on this!).

Meanwhile, the container field itself has weird limitations. Want to override the entrypoint? Nope. Want to run some steps inside a container and others outside? Nope.
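For reference, this is roughly what a container job looks like (image name is hypothetical); the comments mark the behaviors above:

```yaml
# Hypothetical: run an entire job inside a prebuilt dev container.
jobs:
  build:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/example/dev-container:latest  # hypothetical image
      # GitHub supplies its own entrypoint and mounts the workspace from
      # the host; $HOME becomes /github/home, not what the image set up.
    steps:
      - uses: actions/checkout@v4
      - run: make build   # every step runs inside the container; there is
                          # no way to run only some steps on the host
```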

Developing Workflows with YAML

All of this logic ends up written in YAML, which gets complicated quickly, and you're bound to make mistakes. I was using RustRover as my IDE; its built-in linter checks for GitHub Actions YAML helped a lot, but I still found myself wishing for much better static checking. It doesn't help that you can't really try any of this locally (I know of act, but it only supports a small subset of what you do in real CI). I found that the best way to debug CI is to create a repo identical to the one you're changing and run git commit -a -m "wip" && git push test-ci branch until CI works as expected.

Since I didn't want to run the whole CI pipeline every time I made a change, I tried to keep individual workflows small and have them push artifacts at the end of their steps; subsequent workflows could then download the artifacts and reuse them instead of rebuilding everything from scratch. This lets you test workflows in isolation, because you can just download artifacts from a previous run until things work. (Of course, when downloading from a previous run you need to provide a token to the download-artifact action, even though the default token suffices. Why it still needs to be provided explicitly is yet another unsolved mystery...)
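A sketch of the cross-run download (artifact name and run id are placeholders): with actions/download-artifact@v4, fetching from a different run requires `run-id` and, oddly, an explicit token even when it's just the default one.

```yaml
# Hypothetical step: reuse a build artifact from an earlier workflow run.
- name: Download prebuilt binaries
  uses: actions/download-artifact@v4
  with:
    name: rust-binaries                        # hypothetical artifact name
    run-id: 1234567890                         # earlier run's id (placeholder)
    github-token: ${{ secrets.GITHUB_TOKEN }}  # required for cross-run downloads
```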

The main workflow file then becomes a chain of invoking other YAML files:

jobs:  
  invoke-build-rust:  
    name: Build Rust  
    uses: ./.github/workflows/build-rust.yml  
  
  invoke-build-java:  
    name: Build Java  
    uses: ./.github/workflows/build-java.yml  
  
  invoke-tests-unit:  
    name: Unit Tests  
    needs: [invoke-build-rust, invoke-build-java]  
    uses: ./.github/workflows/test-unit.yml  
  
  invoke-tests-adapter:  
    name: Adapter Tests  
    needs: [invoke-build-rust]  
    uses: ./.github/workflows/test-adapters.yml  
    secrets: inherit  
  
  invoke-build-docker:  
    name: Build Docker  
    needs: [invoke-build-rust, invoke-build-java]  
    uses: ./.github/workflows/build-docker.yml  
  
  invoke-tests-integration:  
    name: Integration Tests  
    needs: [invoke-build-docker]  
    uses: ./.github/workflows/test-integration.yml  
  
  invoke-tests-java:  
    name: Java Tests  
    needs: [invoke-build-java]  
    uses: ./.github/workflows/test-java.yml

Notice the secrets: inherit added to some jobs. Another gotcha that took me too long to figure out: every time I ran the entire CI pipeline, things wouldn't work, yet when I ran the steps individually they worked just fine. The reason is that when you call a workflow from another workflow, secrets aren't shared by default.

There are many more gotchas I wanted to write about, but this post is already quite long. Overall, I'm still happy with our new CI scripts, because they reduced our time to merge significantly. I just wish the process of getting there had been less time-consuming, and that debugging were easier when things go wrong. I guess I'm hoping for some innovation here.



