Moving data in 10 minutes
(The list at the bottom is regularly updated.)
Data needs to move. Those backup hard drives sitting in the closet? Quite a few of their bits have already flipped, and they may not work at all the next time I use them. And that is digital, binary data. Let's not even think about those photo albums in my grandma's basement, which “we are going to digitise soon.”
And what about link rot? The stories on this website were only written a few years ago, and yet when I ran an analysis[1], many links didn't work any more. The world moves, and data needs to move with it.
But at least those hard drives and photo albums are my own responsibility. What worries me more is the trust we place in institutions. Merely 10 years ago I used a social network that I never thought would simply vanish.[2] From photo albums to VHS to DVD to MKV. From tapes, to zip drives, to SSD, to the cloud. From IRC to MSN to Facebook to Ello[3]. Data needs to move.
[1] This link will probably be broken within a few years…
[2] Thanks to the heroic efforts of the Archive Team, I was able to retrieve some of my data.
[3] Just kidding.
Now, most of my data can be moved around easily, with several backups in different locations. But I realised that the data I have entrusted to some institutions (Facebook, Twitter, YouTube, and more) cannot easily be moved. Therefore I set a goal for myself:
For any system that I use, I should be able to move my data to another system within 10 minutes.
The time constraint is important. With sufficient effort one can move data out of any system, but one shouldn't need to. Also, ideally it would be done automatically, so I can create a backup script that periodically copies my data from various systems.
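As a rough illustration (not my actual setup): such a backup script could be little more than a wrapper that calls one command per service and drops everything into a dated folder. The backup-<service> steps below are hypothetical placeholders for the per-service commands listed later on this page.

#!/bin/sh
# Minimal sketch of a periodic backup script; every step is a placeholder
# for the per-service commands shown further down this page.
set -e
backup_dir="$HOME/backups/$(date +%Y-%m-%d)"
mkdir -p "$backup_dir"
cd "$backup_dir"
"$HOME/bin/backup-twitter.sh"    # hypothetical: save the latest Twitter archive
"$HOME/bin/backup-youtube.sh"    # hypothetical: the youtube-dl invocation below
"$HOME/bin/backup-slack.sh"      # hypothetical: the SlackBackup invocation below
"$HOME/bin/backup-calendar.sh"   # hypothetical: the curl invocation below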
I realised that for a lot of the systems I use, this isn't true yet, so I set out to make it so. This page tracks my efforts so far, and I hope to include other relevant projects here as well.
Luckily, for once, we have a major force on our side: the European Union. Article 20 of the General Data Protection Regulation, which will go into effect on 25 May 2018, requires service providers to hand users their data “in a structured, commonly used and machine-readable format.” Hopefully this legislation will speed up the efforts described on this page.
Here's an overview of the services that I actively back up my data from. Some methods are more technical than others. For the technically inclined I’ve included some script snippets, which I actually use (in zsh, though they should typically also work in sh). I’ll update this list when I discover new methods.
Facebook
Their archive tool exports a lot, but not the most important thing: my likes and comments. It says “Jan Paul Posma likes a link,” but it doesn’t say which link. This is unacceptable, and I haven’t found a solution for this yet.
Twitter
Their archive tool is pretty good: it exports tweets, retweets, and pictures, all with links to the originals. It does link to its URL shortener (which is problematic), but has the original URL in the HTML title attribute. 👏 Unfortunately it doesn’t export direct messages, and there doesn’t seem to be a way to automate exporting an archive.
YouTube
I want to back up the videos that I watched, but their Terms of Service don’t allow that. However, if I wanted to (I would never do this, YouTube, don’t sue me), I could use youtube-dl pretty easily, for example like this:
youtube-dl --write-info-json --write-description --write-thumbnail --write-annotations --all-subs --ignore-errors --output "%(title)s-%(id)s.%(ext)s" <my-liked-video-playlist-url>
This even skips over already downloaded videos, and resumes aborted downloads!
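To make repeated runs cheaper still, youtube-dl also has a --download-archive flag that keeps a list of already-downloaded video IDs, so a periodic run only fetches new videos (downloaded.txt is just an arbitrary filename here):
youtube-dl --download-archive downloaded.txt --write-info-json --write-description --write-thumbnail --write-annotations --all-subs --ignore-errors --output "%(title)s-%(id)s.%(ext)s" <my-liked-video-playlist-url>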
Slack
They provide exports for admins, but unfortunately not for regular users. I also want to back up my private conversations, which their exports typically don’t provide. They do provide an API which we can use to create a backup. For this you can use SlackBackup, which is easy to set up. I fixed some bugs in that script, and my fixed version can be found here (they have also merged it into their repo, but just haven't cut a release yet). Use it like this:
java -jar SlackBackup_JP.jar <your-token> cgpf
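If you'd rather not depend on SlackBackup, the same token also works against Slack's Web API directly. A rough sketch using the conversations.history method (the channel ID is made up, and pagination and other channels are left out):
curl -s -H "Authorization: Bearer <your-token>" "https://slack.com/api/conversations.history?channel=C0123ABC456" > slack-channel-history.json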
Gmail
E-mail is a super-portable format, and Google Mail is no exception. Besides being able to (incrementally) make a copy using IMAP or POP, they also provide an archive tool through Google Takeout. If that is not enough, you can use Gmvault, which claims to smooth over some bugs in Google's APIs (that is what I use).
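For reference, the Gmvault invocations are roughly as follows (a sketch based on its documented sync command; check gmvault --help for the exact options in your version):
gmvault sync myaddress@gmail.com             # first run: full copy of the mailbox
gmvault sync -t quick myaddress@gmail.com    # subsequent runs: only recent changes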
Google Calendar
Super easy. Use exports to manually download a copy, or private addresses if you want to download them automatically. I use:
curl -o "jpp-`date +%Y-%m-%d`.ics" <private-address-url>
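To meet the “automatically” part of the goal, the snippets above can be chained in a single script (like the sketch near the top of this page) and scheduled with cron. For example, a crontab entry along these lines (the path and schedule are just an illustration):
# illustrative crontab entry: run the wrapper script every Sunday at 03:00
0 3 * * 0 /home/me/bin/backup-everything.sh >> /home/me/backups/backup.log 2>&1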
One interesting question is where to store the data. I’ve thought about several options, and for now settled on something simple: multiple external hard drives (mirrors with the same data), plus Dropbox for the most important stuff. As my friend Jethro said, you need at least 3 hard drives: one will fail, then the second one might also fail as you copy data off it, and then you'll be happy to have a third one!
I’ve also considered cloud storage, but it's not very attractive. For example, Amazon S3 costs $0.023 per GB per month, but I can buy a 5TB external hard drive for $120 (in June 2017), which is $0.024 per GB. That means I can buy one external hard drive per month for the same price! Amazon Glacier and Backblaze seem a lot cheaper, so I might use those for last-resort backups.
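As a quick sanity check of those numbers (shell arithmetic with the figures above):
echo "scale=4; 120 / 5000" | bc    # a 5TB drive at $120 ≈ $0.024 per GB, one-time
echo "scale=2; 0.023 * 5000" | bc  # keeping 5TB in S3 ≈ $115 per month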
How do you move your data? Please let me know about your backup methods by tweeting me at @JanPaul123 or emailing me at j@npaulpos.ma!