Skip to main content

6 posts tagged with "development"

View All Tags

Instant GitHub Codespaces are dangerous without realistic data

Picture the scene:

  • You click a button on a GitHub repository
  • A new tab launches with VS Code containing your code
  • All of the prerequisites are preinstalled and you can start coding, compiling, and running your application instantly

Amazing! Forget about walking line-by-line through “” to configure your local machine.

But let's fast-forward a little bit more…

  • You start your web app and load it in a new tab
  • You log in to your web app
  • You’re greeted with an empty screen

There’s no data. There are no other users. There’s no realistic data in this application at all.

Codespaces with databases

With GitHub Codespaces you can set up a cloud-hosted, containerized VS Code environment. You can then connect to a codespace through the browser or through VS Code.

The main question we are trying to answer now is:

Can we have virtualized environments with databases created from backups or scripts, not just empty databases, or fake data sets?

Simple Database CI with Spawn and Github Actions

Running tests against databases in CI pipelines is an essential part of testing your application. 

Provisioning databases in CI pipelines can be hard work, however. Broadly speaking you have two options:

  • Have all your pipelines use shared databases
  • Use Docker to run containerised database instances

The first option has the advantage that you can test against real data, perhaps a recently restored copy of production, but it effectively serializes your pipelines as they contend for the shared database. You may be able to scale by adding multiple database servers, but ultimately the parallelism of your CI pipelines is limited by the number of database servers you have available to the pipelines.

The second option is a substantial improvement in terms of parallelism as each pipeline run now has a dedicated database spun up and torn down for exclusive use. However, the problem of testing against realistic data is now more acute. Typically where we see Docker being used to provision databases in CI pipelines, we see the use of seed data stored in the code repository used to populate the containerised database. This means you lose the confidence that you gain from testing against a realistic data set. If you don't want to go down the route of using seed data, you need to manage a docker volume inside your pipeline, or run a lengthy database restore operation in each pipeline run.

Development databases in Docker aren’t good enough

Development databases in Docker aren’t good enough on their own. Why? Because they’re almost always so far from the production environment characteristics that you get a false sense of security in development.

Having isolated databases is far better than a shared environment where other developers trample over your changes. But because dev databases tend to either be empty, or have “happy path” data within them, they never truly demonstrate the behaviours you’ll end up seeing in production.

This leads to a variety of different problems:

  • Unexpected data loss during schema migrations
  • Unacceptable latency on specific queries because of vastly different data sizes
  • Poor UX due to unanticipated user-provided data
  • UI glitches or performance issues not caught in lower environments because of unrealistic data
  • Entire branches of code left unexercised due to conditions on the data not caught in lower environments