How to - Scheduling data image creation

One of the many benefits of Spawn is that instant data container creation lets you work with production-like datasets in every environment, regardless of their size.

However, to take advantage of this, you first need a data image containing that data.

In this guide, you'll explore how to set up a scheduled pipeline in a CI environment to regularly create data images from your production-like datasets.

Spawn is currently in open beta. Complete the installation instructions to get access.

Prerequisites

We'll assume you already have a masked backup of your production environment.
We'll assume that the agent you're using to invoke Spawn has access to the masked production backup file.

Backing up your database

Depending on your environment, you'll need to obtain a backup of the database you'd like to create an image from. The following table suggests how to do this for each engine. The table is by no means exhaustive, but backup files generated by following these instructions have been tested and confirmed to work with Spawn.

Engine	Documentation
PostgreSQL	pg_dump
MySQL	mysqldump
MSSQL RDS	Native MSSQL RDS Backups
MSSQL On-prem	MSSQL Backups
Mongo	mongodump

Setting up Spawn in CI

Authenticating

When you're using Spawn interactively, you'll start off by running spawnctl auth. The authentication token you receive has a configured expiration time. This is no good for CI environments, where an interactive authentication workflow is impossible and the token cannot be renewed once it expires.

Therefore, we must use access tokens to authenticate against Spawn in CI, as these have no expiration (though they can be revoked if necessary).

spawnctl create access-token --purpose "Scheduled data image creation from masked production backup"

Access token generated: <long_access_token_string>

This command will create an access token with a given purpose. It's best practice to give clear, human-readable purposes for your access tokens so you can understand what they're used for in the future.

Now that you have this access token, set it up as a secret in your CI pipeline of choice so that your agents can access it.
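A missing secret is easier to diagnose if the script checks for it before calling spawnctl. The following is a minimal sketch, assuming the secret is exposed to the agent as the SPAWNCTL_ACCESS_TOKEN environment variable (the name used later in this guide); the function name require_access_token is hypothetical.

```shell
# Fail fast when the access token secret has not been injected into the agent.
# How SPAWNCTL_ACCESS_TOKEN gets populated depends on your CI system.
require_access_token() {
  if [ -z "${SPAWNCTL_ACCESS_TOKEN:-}" ]; then
    echo "SPAWNCTL_ACCESS_TOKEN is not set; check the CI secret configuration" >&2
    return 1
  fi
}
```

Calling require_access_token || exit 1 at the top of the pipeline script stops the run with a clear message instead of an authentication error further down.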

Creating the data image

Spawn can be used in any CI environment that supports running scripts. In this case, we're using a Bash script on a Linux agent, but you could use whichever OS and scripting language you like.

Defining the data image to create

First, we'll need a file in source control that represents the data image we'd like to create:

name: WidgetStore
sourceType: backup
engine: postgresql
version: 11.0
teams:
  - myorg:developers
  - myorg:dbas
tags:
  - latest-production
backups:
  - folder: /backups/
    file: production-masked-latest.bak

There are some important best practices to call out in this YAML:

- The image is shared with multiple teams: in this case, the developers and DBAs in my organisation.
- The image is tagged with latest-production. This means that consumers can always run spawnctl create data-container --image WidgetStore:latest-production and they'll receive a data container with the latest production data.

As called out in the prerequisites, we've assumed this agent can access the masked production backup. The YAML assumes that the backup resides in the /backups/ directory on the CI agent.
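Because the YAML only names the backup location, it can be worth confirming on the agent that the file is really there before asking Spawn to build an image from it. This is a rough sketch of such a pre-flight check; check_backup_file is a hypothetical helper, and the folder and file arguments are whatever your YAML specifies.

```shell
# Verify that the backup referenced by the data image YAML exists and is non-empty.
# Arguments: $1 = backup folder, $2 = backup file name.
# Prints the resolved path on success; returns non-zero otherwise.
check_backup_file() {
  backup_path="${1%/}/$2"   # drop one trailing slash from the folder, then join
  if [ ! -s "$backup_path" ]; then
    echo "Backup not found or empty: $backup_path" >&2
    return 1
  fi
  echo "$backup_path"
}
```

For the YAML above, the call would be check_backup_file /backups/ production-masked-latest.bak.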

Creating the data image

Now that we have the data image YAML defined in source control, we can create the image in our CI pipeline.

Here's an example of the script we'll use to do just that:

#!/bin/bash

# Install the latest version of spawnctl on the agent
curl https://run.spawn.cc/install | sh
# Add spawnctl to the PATH without discarding the existing entries
export PATH="$PWD:$HOME/.spawnctl/bin:$PATH"

# Create the data image; a lifetime of 168h is 7 days
spawnctl create data-image \
  -f "$GIT_CHECKOUT_DIR/widgetstore-backup.yaml" \
  --accessToken "$SPAWNCTL_ACCESS_TOKEN" \
  --tag "$PIPELINE_RUN_ID" \
  --lifetime 168h \
  -q

This script is very short, as we're only downloading spawnctl and then creating a data image.

The data image YAML file contains all the information about how to construct that image.

This pipeline can be configured to run as often as you'd like to refresh your data images.
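Many CI systems express scheduled triggers as cron expressions, and a plain cron job works too if the script lives on a long-running agent. As an illustration only, the helper below prints a crontab-style line for a daily run; daily_cron_entry and the script path in the usage note are hypothetical, and your CI system's own schedule syntax may differ.

```shell
# Print a crontab-style line that runs a script once a day.
# Arguments: $1 = hour of day (0-23), $2 = absolute path of the script to run.
daily_cron_entry() {
  printf '0 %s * * * %s\n' "$1" "$2"
}
```

For example, daily_cron_entry 2 /opt/ci/create-widgetstore-image.sh produces an entry that runs at 02:00 every day, which keeps the latest-production image at most a day behind the backup.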

Authenticating

The $SPAWNCTL_ACCESS_TOKEN environment variable is the access token we created and made available to the agents in previous steps.

An extra tag for tracing

You'll notice that we've also appended the --tag $PIPELINE_RUN_ID flag to the command. This is another best practice, as it will add a tag in addition to latest-production defined in the YAML file. In this case, the additional tag is the pipeline run identifier that triggered this data image creation. This means you'll be able to identify which images were created by which pipeline invocation.
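If the same script is sometimes run by hand, the pipeline variable may be empty. One possible way to handle that, sketched here under the assumption that a timestamped tag is acceptable for manual runs, is to fall back to a generated value; resolve_image_tag and the manual- prefix are hypothetical and not part of Spawn.

```shell
# Use the CI run identifier as the tracing tag; fall back to a timestamped
# "manual-" tag when the script runs outside the pipeline.
resolve_image_tag() {
  if [ -n "${PIPELINE_RUN_ID:-}" ]; then
    echo "$PIPELINE_RUN_ID"
  else
    echo "manual-$(date -u +%Y%m%d%H%M%S)"
  fi
}
```

The result can then be passed to the command as --tag "$(resolve_image_tag)".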

Image lifetimes

We've also specified a lifetime for the image.

This sets a retention period for the data image. In this case, our data image is only valid for 7 days before automatically being cleaned up by Spawn. This prevents us from having stale data images that would no longer be useful.
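The --lifetime value in the script is written in hours, so a retention period expressed in days has to be converted. A small sketch of that arithmetic, using a hypothetical helper named lifetime_from_days:

```shell
# Convert a retention period in whole days to the hour-based form
# passed to --lifetime (for example, 7 days becomes 168h).
lifetime_from_days() {
  echo "$(( $1 * 24 ))h"
}
```

With this helper, --lifetime "$(lifetime_from_days 7)" requests the 7-day retention described above.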

Suppressing progress output

We've also added the -q flag, which suppresses progress output from spawnctl so that it doesn't pollute the CI pipeline logs.

Reviewing the new images

As a developer in my organisation, I can now see these newly created data images and start using them in development:

$ spawnctl get data-images
ID                  Name                      Tags                     Engine              Status              CreatedAt           Teams                              ExpiresAt
10001               WidgetStore               234                      PostgreSQL:11.0     Completed           3 days ago          myorg:developers, myorg:dbas       4 days from now
10002               WidgetStore               235                      PostgreSQL:11.0     Completed           2 days ago          myorg:developers, myorg:dbas       5 days from now
10003               WidgetStore               236, latest-production   PostgreSQL:11.0     Completed           18 hours ago        myorg:developers, myorg:dbas       6 days from now
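From here, a developer can either follow the moving latest-production tag or pin to the image from a specific pipeline run. The wrapper below is a sketch of both options; create_widgetstore_container is a hypothetical helper, and it assumes that a pipeline run tag such as 235 can be used in the same name:tag form as latest-production.

```shell
# Create a data container from a WidgetStore image.
# Argument: $1 = image tag (optional; defaults to latest-production).
create_widgetstore_container() {
  tag="${1:-latest-production}"
  spawnctl create data-container --image "WidgetStore:$tag"
}
```

Calling create_widgetstore_container with no argument tracks the newest image, while create_widgetstore_container 235 reproduces the data from that particular run for as long as the image has not expired.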