The Ideal Backup System: An Unwritten Program?

In light of my recent disk crash (which I also mentioned alongside yesterday’s poem), I’ve been thinking about backups and about what I want in a backup program, since my only backup of the now-dead hard drive had been manual synchronization of my most important directories to my desktop. Because the list of features I outline below is quite likely so ambitious (or so naive) that no one program meets every one, we’ll file this under “Unwritten Programs”. But I’d like your advice on choosing a “backup solution” that comes as close to fitting these as possible, or on revising my “wants” to better fit my real needs.

(Oh, and if you know of a reputable “magnetic forensics” service that could try to get my data off the platters of this mechanically-dead drive for less than a few hundred dollars, I’d be delighted to learn that the prices I’ve gathered for such a service are wrong, and that it’s actually fiscally possible for me …)

First of all, because my setup includes (well, will include once I get my replacement laptop hard drive and get my system reinstalled) two similar systems, each dual-booting Linux and Windows (different versions of Windows, by the way), any “backup solution” will need to be “cross-platform” and able to back up all of these to the same device(s), and have the backups play well together.

Second, while the drive I’m getting to use for my backups is plenty big, there’s no sense wasting space (or time), or putting unnecessary wear-and-tear on the drives, by backing up “system data.” (I also don’t want a restoration from backup and a “restoration from first principles,” i.e. a restoration by reinstalling, to end up with conflicts except where I explicitly made a change.) This means that any file that Windows, or the package manager, controls, that has not been modified (by me) since installation should not be backed up. (I’m of two minds about files controlled by Steam.) On the other hand, any such file that has been modified—such as a configuration file—should be backed up. (It’d also be nice to ignore the Portage tree—the “database” of scripts describing how to build the software that can be installed—and the source tarballs and such the package manager downloaded to build the installed software from, and the detritus left over from failed builds, but that’s less essential.)
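To make the “unmodified since installation” test concrete, here is a rough Python sketch. It assumes the backup program can somehow obtain a table mapping each package-owned path to the checksum recorded at install time; the `installed_hashes` dictionary and both function names are hypothetical, and I have not checked what the Windows side could offer as an equivalent.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_backup(path, installed_hashes):
    """Return True unless the file is package-owned and still pristine.

    installed_hashes is a hypothetical dict of path -> SHA-256 hex digest
    recorded when the package was installed.
    """
    recorded = installed_hashes.get(str(path))
    if recorded is None:
        return True  # no package owns this file, so it is my data
    return sha256_of(path) != recorded  # owned, back up only if changed
```

A configuration file I edited would hash differently from its recorded value and so be picked up, while an untouched system binary would be skipped.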

Third, the “backup system” needs to be aware of and “play nice with” version control systems. Most of my most important data is already “backed up” on a DVCS hosting service, and so each “repository” already allows me to “restore” previous states far better than any “backup software” could. Ideally, the backup software I end up using should take advantage of that, rather than treating the repository as yet another collection of tiny files that need to be tracked in their own right. Files that the VCS has “ignored” should be backed up, as should any that have uncommitted changes, but the backup software should make its own “clone” of the repository rather than being naive about version control. (For checkouts of non-distributed version control systems, I fear nothing can be done; those remain as they are because the servers have vanished!)
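For Git checkouts, at least, the list of files that the repository’s history does not already cover can be read from `git status --porcelain --ignored`. The sketch below only parses that output; the function name is my own, and it assumes the plain (non `-z`) porcelain format with no unusual characters in the paths, so treat it as an illustration rather than a robust parser.

```python
def paths_outside_history(porcelain_output):
    """Pick out paths that a clone of the repository would not restore.

    Expects the text printed by `git status --porcelain --ignored`:
    two status characters, a space, then the path.
    """
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue  # blank or malformed line
        status, path = line[:2], line[3:]
        if "D" in status:
            continue  # deleted in the working tree, nothing to copy
        if " -> " in path:
            path = path.split(" -> ", 1)[1]  # renamed: keep the new name
        paths.append(path)
    return paths
```

Everything else in the working tree could then be restored from the backup program’s own clone instead of being stored file by file.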

Fourth, the “backup system” should avoid any unnecessary duplication. Many, if not most, of the files will be identical between the two systems, including many that are fairly or very large … and while, again, at this point the system could handle those naively without running out of space, it shouldn’t back up the same hundred-gigabyte file twice just because it appears on both systems.
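One common way to get this kind of deduplication is a content-addressed store, where each file is saved under the hash of its contents, so identical files from either system land on the same name. This is a minimal sketch under that assumption; `store_file` and the two-level directory layout are illustrative choices of mine, not a description of any particular backup program.

```python
import hashlib
import shutil
from pathlib import Path


def store_file(source, store_dir, chunk_size=1 << 20):
    """Copy source into store_dir under its SHA-256 digest, once only."""
    digest = hashlib.sha256()
    with open(source, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    key = digest.hexdigest()
    target = Path(store_dir) / key[:2] / key
    if not target.exists():  # identical content was already stored
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return key
```

Each machine’s backup would then only need a small index mapping its paths to keys, and the hundred-gigabyte file shared by both systems would occupy the store once.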

Fifth, on the other hand, the system should add enough “redundancy” to recover from disk corruption. I plan on using it in RAID mirroring mode, but I want another level of checking on top of that … and in particular I want to be able to recover from consistency problems. I’ll have the disk space; this is one good use of it.

Sixth, it should keep a good rotation of incremental backups. For ordinary text files and such I’d like something like a version control system, but that would not work at all well for the binary files that are most of the system, and some things (like some log files) don’t have any reason to be kept around indefinitely. (Though I would like some way to keep around Portage build logs until that version is neither installed nor “in the tree” anymore …) So a standard full/incremental/sub-incremental backup system feels like the way to go … but I’m not certain.
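As one possible shape for the rotation, here is a sketch of a retention rule in the “grandfather-father-son” style: keep the last few daily backups, plus the newest backup from each of the last few weeks and months. That scheme is my assumption about what a “good rotation” could look like, not something the list above pins down, and the function name and default counts are made up.

```python
from datetime import date


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=6):
    """Return the set of backup dates that the rotation retains."""
    newest_first = sorted(set(backup_dates), reverse=True)

    def newest_per_bucket(bucket_of, limit):
        seen, chosen = set(), []
        for day in newest_first:
            if len(chosen) >= limit:
                break
            bucket = bucket_of(day)
            if bucket not in seen:  # first hit is the newest in its bucket
                seen.add(bucket)
                chosen.append(day)
        return chosen

    keep = set(newest_first[:daily])
    keep.update(newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly))
    keep.update(newest_per_bucket(lambda d: (d.year, d.month), monthly))
    return keep
```

Anything not in the returned set would be a candidate for pruning; special cases like the Portage build logs would need their own rule on top.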

Seventh, the software needs to be essentially “painless.” I should be able to install the software once, then plug the drive into whichever computer I’m using today and have it get and keep the backup up-to-date. And since I have enough problems with latency that’s probably mostly caused by disk I/O, it needs to work in the background and avoid “taking over the system.” But, on the gripping hand, I don’t like to leave a computer on overnight, so it shouldn’t schedule backups for 2 in the morning or something like that.

Eighth, the software shouldn’t use a proprietary archive format. If I need to restore something (or, worse, everything) from backups, I don’t want to have to use the interface its developers designed just to browse through the backed-up files. Except for the performance problems lots and lots of hard links induce (as I know from experience lately …), and the fact that they wouldn’t work for the Windows side of things, I would say that directory trees using hard-linked duplicates for files that haven’t changed would be ideal. But it should certainly use archive formats that widely available programs can read. (Zip, though not with the recent proprietary compression algorithms, or RAR in a pinch, but tarballs compressed with a major free-software compression program would be my first choice. Even something like dar would do.)
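To show what I mean by an open format, this sketch writes a directory to an xz-compressed tarball with Python’s standard `tarfile` module and lists it back. The function names are mine, and a real program would also have to deal with incremental snapshots and file metadata; the point is only that any stock `tar` could read the result without the backup program installed.

```python
import tarfile
from pathlib import Path


def write_snapshot(source_dir, archive_path):
    """Pack source_dir into an xz-compressed tar archive."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:xz") as archive:
        # arcname keeps paths inside the archive relative to the directory
        archive.add(source_dir, arcname=source_dir.name)


def list_snapshot(archive_path):
    """Return the sorted member names, detecting compression automatically."""
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(archive.getnames())
```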

Ninth. Speaking of compression. Because I keep running into disk space limitations, large segments of my data (largely those moderately to very large binary files I mentioned earlier) are kept compressed, using whichever of five programs (gzip, bzip2, xz, rzip, or lrzip) gives the smallest end result. I’d prefer for the backup program to (where possible) decompress them before backing up.
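Detecting the compressed files is at least straightforward for three of the five formats, since gzip, bzip2, and xz files start with well-known signature bytes and Python’s standard library can decompress all three. The sketch below covers only those; rzip and lrzip have no standard-library support, so they are left out, and the function name is my own. It works on whole byte strings for brevity, where a real tool would stream.

```python
import bz2
import gzip
import lzma

# Leading signature bytes paired with the matching decompressor.
SIGNATURES = (
    (b"\x1f\x8b", gzip.decompress),
    (b"BZh", bz2.decompress),
    (b"\xfd7zXZ\x00", lzma.decompress),
)


def decompressed(data):
    """Return the decompressed payload, or data unchanged if unrecognized."""
    for signature, decompress in SIGNATURES:
        if data.startswith(signature):
            return decompress(data)
    return data
```

Storing the decompressed form would let the backup program apply its own compression and deduplication uniformly, whichever tool I happened to use on the original.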

And since many of those big files are PDF (or, worse, bundled-image-file) ebooks that I’ve been planning to convert to smaller formats (using OCR or simply typing them myself), I want the ability to, on occasion, purge certain files from the backup. (By all means ask me to confirm twice or three times, and if the disk pressure isn’t urgent postpone actually doing it until after asking me again in a week or two. But the option should be there.)

That’s the short list of features I can think of at the moment. As I said at the beginning, it’s quite likely that no one program meets every one of these features, but I’d like your advice on choosing a “backup solution” that comes as close to fitting these as possible, or in revising my “wants” to better fit my real needs. Do you have any thoughts?

