Simple Backups to Tarsnap

It's important to keep backups of your data. Everyone knows this, but it's easy to let it slip, or to confuse having a RAID array with having backups. I'm going to quickly go over a process I use to back up databases (or other data) to the online service Tarsnap, which provides deduplication and secure encryption on top of Amazon S3 storage, for very little money.

RAID isn't backups

If you have dedicated hardware for your server, it's probably got RAID storage. You may even have a RAID array on your local workstation, or in the network-attached storage that holds all your MP3 downloads. It's very common for people to consider RAID - especially RAID-1/mirroring - as a form of 'backup', but this is a terrible, terrible mistake.

RAID is about resiliency, about the uptime of the system using the RAID array. Backups are about point-in-time snapshots of your data, which you can keep distant from your primary working copy. A corrupted file will be replicated throughout your RAID array, but hopefully can be restored from an external backup, and no RAID array can protect you against fire, flood, theft, or zombie attack in your home/office/data-centre. For that you need off-site backups.

There is one aspect of using a RAID array which can be very much like backups, however, and we'll come back to that later.

Database Replicas aren't backups

Another system which can be confused for having a backup is running a replica of your database. A database replica - also known as a 'slave' or 'follower' - receives the data from a 'primary' (or 'master', or 'leader') database in real-time. Different kinds of replication work in different ways, but the core concept is that you have a copy of your data on another host which you can perform read-only operations on, or promote to primary if your primary database fails.

Like RAID, database replicas are more about resilience and maintaining high availability than about keeping backups, and although a database replica can be used as part of a backup process, it's not enough to just have the replica by itself.

Tarsnap

There are a lot of options for handling off-site backups, but one service I particularly like is Tarsnap, which provides secure storage on top of Amazon's S3. The guy behind Tarsnap is a well-respected cryptographer and computer security expert, and the service can be set up in such a way that not even he can decrypt your backups. This level of security is overkill for most individuals, but if you're handling personally identifying information in the EU, or credit card details anywhere, then your business needs it.

Tarsnap is only available as source - the author doesn't make OS packages available - so you'll need to download the source and compile it yourself. If you're managing backups for a server, this shouldn't be traumatic - just follow the instructions on the Tarsnap website.
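
For reference, a typical source build looks something like the following. The version number and URL here are illustrative - check the Tarsnap website for the current release (and verify it against the published signature) - and you'll need a C compiler plus the OpenSSL and zlib development headers installed:

wget https://www.tarsnap.com/download/tarsnap-autoconf-1.0.40.tgz
tar xzf tarsnap-autoconf-1.0.40.tgz
cd tarsnap-autoconf-1.0.40
./configure
make
sudo make install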

Tarsnap both encrypts your backups and authorises access using a single key file. I recommend using one key per host, and only accessing that set of backups from the single host. The Tarsnap website has instructions on setting up your first key.
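
As a sketch, generating a key looks something like this - the email address is the one registered with your Tarsnap account, the machine name is just a friendly label, and the key file path matches the commands used later in this article:

tarsnap-keygen --keyfile /root/tarsnap.key --user you@example.com --machine mail.example.com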

It's important to be aware of Tarsnap's local cache: if the archives generated by a key are changed (archives added or deleted) from more than one place, the cache will need to be rebuilt before you can create or delete archives again. Rebuilding the cache can take ages, so I recommend only altering a given key's backups from one machine.
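
If you do end up with a desynchronised cache, the rebuild itself is a one-liner (using the same paths as the backup command later in this article):

tarsnap --fsck --keyfile /root/tarsnap.key --cachedir /var/cache/tarsnap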

Whatever you do, it's incredibly important that you keep a copy of the Tarsnap key you generate. If you lose that key you literally lose access to all your backups, and it cannot be recovered from the Tarsnap service - they don't have access to it, which is one of the ways in which they guarantee the security of your data.

At a small scale, keeping a copy of the key in a secure password store like 1Password is a good balance between convenience and security.

Get your ducks in a row

It's good to spend a little time thinking about what data you actually need to back up. If your hosts are managed using configuration management (via Chef, Puppet, Ansible, or another tool), then there's no need to keep copies of things like the system configuration. That can be easily generated again from the configuration management scripts.

The main thing to focus on is going to be data provided by your users, and in the majority of cases this takes the form of stuff in a database.

The most effective way to get a file to back up with Tarsnap is to dump your database server, using a tool like mysqldump or pg_dump. Be aware that some MySQL storage engines require the entire database to be locked while the backup is being taken, which would effectively take down a website while backups were running - not ideal. A common way around this is to have a database replica which is used just for backups; this also avoids the relatively heavy load a big dump can place on the database server, by shifting it to a dedicated machine.

If possible I recommend moving the database dump and any other related files (such as binlog files from MySQL, which can allow you to replay queries performed between dumps) to a single manageable location. This makes it easier to keep control over the exact files you're backing up, and keeps it separate from anything being used live in the database.
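
As a sketch, assuming a staging directory of /srv/backups/db (the paths and database names here are made up for illustration):

mkdir -p /srv/backups/db

# MySQL: --single-transaction avoids locking, but only gives a consistent
# snapshot for InnoDB tables
mysqldump --single-transaction --all-databases > /srv/backups/db/dump.sql

# PostgreSQL: -Fc writes the compressed custom format, restorable with pg_restore
pg_dump -Fc mydatabase > /srv/backups/db/mydatabase.pgdump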

Backups and pruning

Tarsnap does block-level deduplication, which means that each 'full' backup only needs to send the data that's changed or is new since the last backup. This massively reduces the amount of data actually stored by Tarsnap, and simplifies the backup process, as you can just do a 'full' backup at every run. It also means you can do backup runs very frequently: for a production database containing data that's really important, you could run as often as every 15 minutes. For my personal email server I take a backup every two hours.

The actual backup command, running from cron, ends up looking something like:

/usr/local/bin/tarsnap --cachedir /var/cache/tarsnap --keyfile /root/tarsnap.key --quiet -cf "mail_$(date +\%Y\%m\%d-\%H\%M)" /srv/mail/mail.xybur.net
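
In the crontab itself, the two-hourly schedule I mentioned for my email server looks something like this - note that the % characters must be escaped, as cron treats an unescaped % as a newline:

0 */2 * * * /usr/local/bin/tarsnap --cachedir /var/cache/tarsnap --keyfile /root/tarsnap.key --quiet -cf "mail_$(date +\%Y\%m\%d-\%H\%M)" /srv/mail/mail.xybur.net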

You can actually specify the cachedir and keyfile in the config file (/usr/local/etc/tarsnap.conf, by default) and keep this command a bit cleaner, but I prefer to keep it explicit during setup at least. Even if you do specify these options in the command, it's worth also ensuring the config file is updated as it makes using the tarsnapper utility simpler.
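
The config file takes the same options, one per line, without the leading dashes - a minimal tarsnap.conf along these lines:

cachedir /var/cache/tarsnap
keyfile /root/tarsnap.key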

tarsnapper is a Python script which we'll use to expire older backups, keeping archives according to a schedule we'll define. This isn't especially necessary for keeping costs down (thanks to Tarsnap's deduplication), but Tarsnap's --list-archives and other commands get extremely unwieldy when there's a large number of archives.
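
tarsnapper is distributed on PyPI, so installing it is typically just:

pip install tarsnapper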

tarsnapper can be used to create the archives as well as expire them, as detailed in its documentation, but I prefer to create the archives myself. If expiring archives doesn't work, I'm just left with more archives than planned - I can live with that. If backups aren't created when expected, then I'm going to end up losing data one day.

You need to decide how many backups to keep based on your own personal needs. I went for keeping one backup archive for each day for a week, each week for a month, and each month for a year. That gives me a reasonable set of snapshots back into my data, if I need to recover something later.

The actual command for this looks like:

/usr/local/bin/tarsnapper --target "mail_\$date" --deltas 1d 7d 30d 365d - expire

One 'gotcha' to watch out for: tarsnapper invokes the tarsnap binary itself, and that binary lives in /usr/local/bin, which won't be in the PATH of cron's restricted environment. To get around this I put env PATH=/usr/local/bin:$PATH before the tarsnapper command in the crontab, for a full cron entry that looks like:

50 3 * * * env PATH=/usr/local/bin:$PATH /usr/local/bin/tarsnapper --target "mail_\$date" --deltas 1d 7d 30d 365d - expire

Another thing to watch for is that while tarsnapper is expiring archives, Tarsnap can't create new backups. I run backups infrequently enough that tarsnapper can always do a daily clean-up run without conflicting, but if you need more frequent backups then it's something to be aware of.

Untested backups don't exist

It's a rule of practical systems administration that untested backups will turn out to have some flaw which renders them useless - and that you will only discover this in an emergency. For safety, you should assume that untested backups simply don't exist.

This is similar to RAID arrays, which without careful monitoring can suddenly turn out to be less reliable than planned. If the first you hear of a RAID array running degraded - that is, with a failed drive - is when a second drive fails, you'll be glad you had proper backups!

While it's best to automate backup restore tests, it can be difficult to justify spending the time in some cases. At the very least you should set a quarterly reminder in your calendar app to do a test restore - even if the manual process eats an afternoon, the loss of an afternoon once every quarter is generally going to be a lot less expensive than the loss of all your data when it turns out your backups were flawed or incomplete.

Partial automation of backup restores can allow testing as part of an everyday workflow - for example, if you have a script which can bootstrap a local developer database by downloading the most recent backup, then any time you set up a new development environment you're also proving that your backups work.
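
A minimal sketch of such a bootstrap step, assuming the keyfile and cachedir are set in tarsnap.conf and that archive names sort chronologically (as the mail_YYYYMMDD-HHMM names above do):

# find the most recent archive and extract it into a scratch directory
mkdir -p /tmp/restore-test
LATEST=$(tarsnap --list-archives | sort | tail -n 1)
tarsnap -x -f "$LATEST" -C /tmp/restore-test
# for a database backup, you'd then load the extracted dump into a scratch database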

In smaller teams, where new developers aren't brought on very often, full automation might be preferable. A scripted restore followed by enough of a smoke test to show that the data is acceptably recent (perhaps check the sessions table on a database for a recent log-in, or some other action that you know your users perform at least daily) should give you confidence in your backups.
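
As a sketch, with a hypothetical sessions table - the table and column names are made up, so use whatever daily activity your application actually records:

# fail loudly if nobody has logged in within the last day
RECENT=$(mysql -N -e "SELECT COUNT(*) FROM sessions WHERE created_at > NOW() - INTERVAL 1 DAY" scratchdb)
[ "$RECENT" -gt 0 ] || { echo "restored data looks stale" >&2; exit 1; }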

Other providers are available

Tarsnap isn't the be-all and end-all of online server backups, it's just an option that I personally like.

CrashPlan is a popular alternative 'backup as a service' provider. Unfortunately I couldn't find any information about cost, they don't offer their pro service in the UK, and they don't appear to offer a simple scriptable Linux client install, all of which put me off investigating them any further.

Least Authority S4 appears to offer unlimited storage for a set monthly fee, which could be more appealing than trying to work out how much Tarsnap is going to cost. (Tarsnap's idiosyncratic billing increments - and compression/block-deduplication - can make it difficult to work out what monthly costs are going to be before you implement it.) Otherwise they seem similar to Tarsnap. From my understanding of both services, Tarsnap is simpler to set up and has better documentation available. S4's website implies you need to run both a client and a special server process, which will then in turn feed your data out to storage server back-ends. It's also not completely clear if the fixed monthly fee includes Amazon S3 costs, which could make it much more expensive than Tarsnap.

S4 has plans to include other storage providers in the future, which could be nice if you don't want to use Amazon or if you want to keep your data across multiple companies just in case.

It's also possible to manage all the backups yourself - tools like s3cmd can give you what amounts to 'rsync to S3' - but then you're on the hook for ensuring decent encryption and archive snapshotting. It's worth paying extra for the security of having someone smarter than you work out the details, especially when it comes to encryption.

Similarly, traditional Linux backup tools like Amanda aren't really applicable unless you've got a network's worth of machines to manage backups for - and if you're in that position you probably don't need this guide.

While I prefer Tarsnap to the other options, the critical thing is to have backups, which must be encrypted and off-site, and that you do restore testing to ensure that you're as safe as you think you are.

Acknowledgements

Thanks to Alan Gardner, Mike McQuaid, and Adam Millerchip for comments and feedback. All errors in this piece remain entirely my own.