I've recently been testing out git-annex, and I've just started trusting my data to it. My use case has been archiving old project and research data that I no longer want stored on my workstation, and managing source tarballs from the cports stuff that I have been hacking at.

The main advantage of git-annex in my case is that I can now have a directory full of stub entries with no actual data. All my data is stored on a remote server, with all the RAID 5/6 and tape backup available to it, and an additional copy is stored on a portable disk which will probably end up sitting on my shelf.

My setup is currently...

  • Workstation - annex'd project directory in ~/Projects which may or may not have all the data that is being tracked.
  • A portable disk - with a bare repo storing a replica of my workstation
  • A remote server - with effectively a clone of my workstation's ~/Projects directory

I guess the sensible thing to do is actually this...

  • Remote server - with a bare repo (storing all the data as well)
  • Workstation - where I keep a clone of the annex'd files but have only a partial copy of the data.
  • Portable disk - for storing a replica as a secondary backup that can be removed.
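As a rough sketch of that second layout, the small Python script below only prints the commands that would be run on the workstation; it does not execute anything. The server URL, the disk path, the remote name "portable" and the repository description "workstation" are all made-up placeholders, and it assumes both bare repos already exist and have had `git init --bare` and `git annex init` run in them.

```python
import shlex


def layout_commands(server_url, disk_repo, workdir="Projects"):
    """Return the commands (as argv lists) to run on the workstation.

    server_url and disk_repo are hypothetical locations of the two bare
    repos; nothing here is executed, the lists are only built.
    """
    return [
        # Clone the bare repo from the server: this brings the file
        # stubs across, but none of the annexed content.
        ["git", "clone", server_url, workdir],
        # Register this clone with git-annex under a readable description.
        ["git", "-C", workdir, "annex", "init", "workstation"],
        # Add the portable disk's bare repo as a second remote.
        ["git", "-C", workdir, "remote", "add", "portable", disk_repo],
        # Ask git-annex to keep at least two copies of every file.
        ["git", "-C", workdir, "annex", "numcopies", "2"],
        # Copy whatever content is present locally to the portable disk.
        ["git", "-C", workdir, "annex", "copy", ".", "--to", "portable"],
    ]


if __name__ == "__main__":
    # Placeholder locations, purely for illustration.
    for argv in layout_commands(
        "ssh://server.example/srv/annex/Projects.git",
        "/media/portable/Projects.git",
    ):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try; each line can be pasted into a shell once the placeholder locations have been replaced with real ones.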

The nice thing about using git-annex is that I can now track my files on a big fat server or workstation, then take a clone of the annex'd repo, get whatever files I need and work offline. It gives me the possibility of managing my large archives of files offline and of tracking replicas of the data.
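The clone, get and work-offline cycle might look something like the sketch below. Again it only builds and prints the commands, to be run from inside the annex'd clone; the file paths are invented for illustration. `git annex get` fetches the content of a file from a remote that has it, `git annex whereis` lists which repositories hold a copy, and `git annex drop` removes the local copy once git-annex is satisfied that enough copies exist elsewhere.

```python
import shlex


def fetch_for_offline(paths):
    """Commands to pull the content of the given files into this clone."""
    return [["git", "annex", "get", p] for p in paths]


def show_replicas(paths):
    """Commands to list which repositories hold a copy of each file."""
    return [["git", "annex", "whereis", p] for p in paths]


def release_after_work(paths):
    """Commands to free local disk space again once the work is done."""
    return [["git", "annex", "drop", p] for p in paths]


if __name__ == "__main__":
    # Hypothetical file names inside the annex'd repository.
    wanted = ["thesis/raw-data.tar.gz", "old-project/results.tar.bz2"]
    steps = (
        fetch_for_offline(wanted)
        + show_replicas(wanted)
        + release_after_work(wanted)
    )
    for argv in steps:
        print(shlex.join(argv))
```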

I'm not totally sure I would recommend that anyone use it for industrial-grade data archiving or anything like that just yet. It is certainly a tool for system admins, techies and power users who like to work offline and truly manage their data in a distributed fashion.

I can see myself using this more for tracking and managing tarballs of user data that I need to back up to protect students and postdocs from user stupidity.
