Simple Database CI with Spawn and GitHub Actions

Running tests against databases in CI pipelines is an essential part of testing your application. 

Provisioning databases in CI pipelines can be hard work, however. Broadly speaking, you have two options:

  • Have all your pipelines use shared databases
  • Use Docker to run containerised database instances

The first option has the advantage that you can test against real data, perhaps a recently restored copy of production, but it effectively serializes your pipelines as they contend for the shared database. You may be able to scale by adding multiple database servers, but ultimately the parallelism of your CI pipelines is limited by the number of database servers you have available to the pipelines.
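
To make the contention concrete, here is a rough back-of-the-envelope sketch. It assumes every pipeline run holds one database exclusively for its whole duration and that all runs take the same time; the function name and the numbers are purely illustrative.

```python
import math


def batch_wall_clock_minutes(pipelines: int, db_servers: int, minutes_per_run: float) -> float:
    """Estimate how long a batch of pipeline runs takes to finish.

    Assumption: each run needs exclusive use of one shared database for its
    whole duration, so at most `db_servers` runs can execute at once.
    """
    if pipelines < 0 or db_servers < 1:
        raise ValueError("need pipelines >= 0 and db_servers >= 1")
    # Runs proceed in "waves" of at most db_servers at a time.
    waves = math.ceil(pipelines / db_servers)
    return waves * minutes_per_run


if __name__ == "__main__":
    for servers in (1, 3, 12):
        total = batch_wall_clock_minutes(12, servers, 5)
        print(f"{servers} database server(s): {total} minutes")
```

Under these assumptions, twelve five-minute runs take an hour against one shared database and only drop to five minutes once every run has a database of its own.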

The second option substantially improves parallelism, because each pipeline run gets a dedicated database that is spun up and torn down for its exclusive use. However, the problem of testing against realistic data becomes more acute. Where Docker is used to provision databases in CI pipelines, the containerised database is typically populated from seed data stored in the code repository. This means you lose the confidence that comes from testing against a realistic data set. If you don't want to rely on seed data, you need to manage a Docker volume inside your pipeline or run a lengthy database restore in each pipeline run.

Development databases in Docker aren’t good enough

Development databases in Docker aren’t good enough on their own. Why? Because their characteristics are almost always so far from those of the production environment that you get a false sense of security in development.

Having isolated databases is far better than a shared environment where other developers trample over your changes. But because dev databases tend to either be empty, or have “happy path” data within them, they never truly demonstrate the behaviours you’ll end up seeing in production.

This leads to a variety of different problems:

  • Unexpected data loss during schema migrations
  • Unacceptable latency on specific queries because of vastly different data sizes
  • Poor UX due to unanticipated user-provided data
  • UI glitches or performance issues not caught in lower environments because of unrealistic data
  • Entire branches of code left unexercised due to conditions on the data not caught in lower environments
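
As an illustration of the first problem in the list, the sketch below runs the same migration against "happy path" seed data and against messier, more realistic values. Everything in it is hypothetical: the `orders` table, the `quantity` column, and the migration itself are made up for the example, and it uses an in-memory SQLite database purely so that it is self-contained.

```python
import sqlite3


def migrate_quantity_to_integer(conn: sqlite3.Connection) -> None:
    """Hypothetical migration: copy free-text `quantity` into an INTEGER column."""
    conn.execute("ALTER TABLE orders ADD COLUMN quantity_int INTEGER")
    # SQLite's CAST keeps only a leading integer prefix and falls back to 0,
    # so values that are not clean integers are silently altered.
    conn.execute("UPDATE orders SET quantity_int = CAST(quantity AS INTEGER)")


def rows_changed_by_migration(quantities: list[str]) -> list[tuple[str, int]]:
    """Return (original, migrated) pairs where the migration altered the value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, quantity TEXT)")
    conn.executemany(
        "INSERT INTO orders (quantity) VALUES (?)", [(q,) for q in quantities]
    )
    migrate_quantity_to_integer(conn)
    lossy = conn.execute(
        "SELECT quantity, quantity_int FROM orders "
        "WHERE CAST(quantity_int AS TEXT) <> quantity ORDER BY id"
    ).fetchall()
    conn.close()
    return lossy


if __name__ == "__main__":
    print(rows_changed_by_migration(["1", "2", "10"]))         # seed-style data
    print(rows_changed_by_migration(["3", "1.5", "two dozen"]))  # messier data
```

With the seed-style values nothing is reported, so the migration looks safe. With the messier values, the rows that were not clean integers come back changed, which is the kind of loss that only shows up when the test data resembles production.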