[Dirvish] Vaults, Branches, and Images

Loren M. Lang lorenl at alzatex.com
Tue Jul 5 21:14:06 UTC 2011


I had always been confused by how branches are used in the Dirvish 
arena, and the wiki article on dirvish.org didn't seem to be very clear 
for me.  Recently, while pondering it (and digging through source code), 
it dawned on me how they work, and also a possible use case (for me at 
least).  First, a quick summary:

A vault represents a collection of branches all residing on the same 
file system.  This is required so that hard links can exist between 
copies and keep disk usage to a minimum.  Each branch shares the same 
directory structure in the vault, but has its own independent 
configuration and history file.  A branch contains one or more images, 
and each image after the first references an earlier image from the 
same branch.  Each run of dirvish creates one new image for one branch 
in one vault.  The branch named default is assumed if none is 
specified.  The vault must always be specified.  When creating a new 
image, the --reference option can be passed to dirvish to specify that 
an image from a different branch is to be used as the reference point.  
This is the one feature that differentiates using branches from using 
vaults.  If two branches don't share any images as a reference point, 
then it's no different than using two vaults.
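
To illustrate why everything in a vault has to live on one file system, 
here is a minimal Python sketch of what a hard link between two images 
amounts to.  The directory names image-1 and image-2 and the file 
data.txt are made up for the example; this is not how Dirvish itself 
is implemented, only the underlying file system behaviour.

```python
import os
import tempfile


def link_image_file(reference_path, new_path):
    """Create new_path as a hard link to reference_path and return its stat."""
    # os.link raises OSError if the two paths are on different file systems.
    os.link(reference_path, new_path)
    return os.stat(new_path)


with tempfile.TemporaryDirectory() as vault:
    # Two hypothetical image directories inside one vault directory.
    old_image = os.path.join(vault, "image-1")
    new_image = os.path.join(vault, "image-2")
    os.mkdir(old_image)
    os.mkdir(new_image)

    original = os.path.join(old_image, "data.txt")
    with open(original, "w") as f:
        f.write("unchanged between backups\n")

    linked = os.path.join(new_image, "data.txt")
    info = link_image_file(original, linked)

    # Both names point at the same inode, so the data is stored only once.
    same_inode = os.stat(original).st_ino == info.st_ino
    link_count = info.st_nlink
    print(same_inode, link_count)
```

An unchanged file therefore costs one extra directory entry per image 
rather than another copy of its data.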

Now, for a different use case than the one described in the Wiki 
article.  What if two branches point to the same filesystem, but use 
different configurations, perhaps a different exclude list?  I might use 
the default branch to back up the whole filesystem, but use a second 
branch with an exclude list to leave out heavily modified but 
non-critical areas of the filesystem, such as the directory tree used 
for automatic nightly builds.  There's no need to keep each night's copy 
backed up, so I will exclude it from the daily branch.  The default 
branch can run on Sunday, and for the other 6 days of the week I will 
run the daily branch.
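
The schedule could be driven by something like the following sketch.  
The vault name fileserver is hypothetical, and I am assuming the 
--vault and --branch options as I understand them from the dirvish man 
page, so check the spelling against your installed version.  The 
command is only built here, not executed.

```python
import datetime


def branch_for(day):
    """Return the branch to run on a date: default on Sunday, daily otherwise."""
    # date.weekday() numbers Monday as 0, so Sunday is 6.
    return "default" if day.weekday() == 6 else "daily"


def dirvish_command(vault, day):
    """Build the argument list for one dirvish run (assumed option names)."""
    return ["dirvish", "--vault", vault, "--branch", branch_for(day)]


# 2011-07-03 was a Sunday, 2011-07-05 a Tuesday.
print(dirvish_command("fileserver", datetime.date(2011, 7, 3)))
print(dirvish_command("fileserver", datetime.date(2011, 7, 5)))
```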

One issue remains with this approach: files that get modified and are 
in both the daily branch and the default branch will fork into two 
copies, as Dirvish by default only uses images from the same branch for 
hard links in new images.  One solution might be using --reference to 
keep daily based on default, but I'd still end up with two copies if the 
new version of the file is first noticed by a daily backup and only 
later by the default backup.  After that next default backup, all 
dailies from then on will hard link to the copy from default instead of 
the original copy in the first daily.  Since my daily backups will 
expire sooner than my weekly full default backups, those extra copies 
will eventually go away instead of staying around as they would if the 
two branches stayed independent history-wise.
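
The copy counting above can be checked with a toy model.  This is only 
a sketch under my own assumptions: each daily image references the 
newest image from either branch, each default image references the 
previous default image, and a single file changes once, on the first 
Monday.  It models the linking decision only and does not call Dirvish 
or rsync.

```python
import itertools


def make_image(branch, version, reference, copy_ids):
    """Model one image holding a single file.

    The file is hard linked to the reference image's copy when its content
    version is unchanged; otherwise a fresh copy is stored.
    """
    if reference is not None and reference["version"] == version:
        copy_id = reference["copy"]
    else:
        copy_id = next(copy_ids)
    return {"branch": branch, "version": version, "copy": copy_id}


def simulate():
    copy_ids = itertools.count(1)
    images = []
    # Week 1 Sunday: the default branch backs up version 1 of the file.
    images.append(make_image("default", 1, None, copy_ids))
    # The file changes to version 2 on Monday; six dailies follow,
    # each referencing the newest existing image (assumption).
    for _ in range(6):
        images.append(make_image("daily", 2, images[-1], copy_ids))
    # Week 2 Sunday: default references the previous default image.
    images.append(make_image("default", 2, images[0], copy_ids))
    # Later dailies reference the newest image, now the default one.
    for _ in range(6):
        images.append(make_image("daily", 2, images[-1], copy_ids))
    return images


images = simulate()
copies_of_v2 = {img["copy"] for img in images if img["version"] == 2}
# Once the first week's dailies expire, only images 0 and 7 onward remain.
survivors = [images[0]] + images[7:]
copies_after_expiry = {img["copy"] for img in survivors if img["version"] == 2}
print(len(copies_of_v2), len(copies_after_expiry))
```

In this model the new version is stored twice while the first week's 
dailies exist, and once after they expire.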

This brings up another point that I would like to eventually solve, 
though the issue is really with rsync, not Dirvish.  Moving a large 
amount of data from one place to another causes all of that data to be 
duplicated in the backups.  This makes restructuring data on the server 
difficult if I don't have huge margins on my backup media.  Ideally, 
rsync could track inode numbers and hard link the files at their new 
location to the existing copies at their old location, but alas, rsync 
does not.  Other backup solutions, such as dump/restore, track files 
solely by their metadata, looking at the ctime/mtime for updates and 
only backing up files that have newer times than the start of the last 
backup.  In the case of moving a single 4GB directory, only the file 
listings of the old parent directory and new parent directory need to be 
backed up, which can be measured in kilobytes or less.  This level of 
backup could reasonably be achieved with rsync if it maintained a 
mapping of source inode/device numbers to destination paths in the sync.  
Then, when rsync detects a new file on the source, it could look up the 
old path on the destination by inode/device and determine whether it can 
hard link or at least save bandwidth by using rolling checksums.
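
A rough sketch of that inode/device mapping, written in Python rather 
than as an rsync patch.  The function names scan and find_moves are my 
own, and nothing like this exists in rsync as far as I know.  It only 
shows that a moved file can be recognised from its (device, inode) 
pair.

```python
import os
import stat
import tempfile


def scan(root):
    """Map (device, inode) to the root-relative path of each regular file."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            if stat.S_ISREG(st.st_mode):
                index[(st.st_dev, st.st_ino)] = os.path.relpath(full, root)
    return index


def find_moves(previous, current):
    """Return {new_path: old_path} for files seen before under another path."""
    moves = {}
    for key, new_path in current.items():
        old_path = previous.get(key)
        if old_path is not None and old_path != new_path:
            moves[new_path] = old_path
    return moves


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "old"))
    with open(os.path.join(root, "old", "big.bin"), "wb") as f:
        f.write(b"x" * 1024)
    before = scan(root)

    # A rename within one file system keeps the inode number.
    os.makedirs(os.path.join(root, "new"))
    os.rename(os.path.join(root, "old", "big.bin"),
              os.path.join(root, "new", "big.bin"))
    after = scan(root)

    moves = find_moves(before, after)
    print(moves)
```

A real implementation would have to cope with inode numbers being 
reused after a delete, for example by also comparing size and mtime 
before trusting a match, and with source files that are hard linked to 
each other.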

-- 
Loren M. Lang
lorenl at alzatex.com
http://www.alzatex.com/


Public Key: ftp://ftp.tallye.com/pub/lorenl_pubkey.asc
Fingerprint: 10A0 7AE2 DAF5 4780 888A  3FA4 DCEE BB39 7654 DE5B


