[Dirvish] Backing up millions of files...

Keith Lofstrom keithl at kl-ic.com
Tue Feb 15 15:12:31 PST 2005

On Tue, Feb 15, 2005 at 05:07:40PM -0500, Brian Johnson wrote:
> We have a need to back up 50000 directories that contain around
> 2 million files total.  Some small, some large.  The problem we
> have had with rsync in the past is that it uses up a tremendous
> amount of memory creating the file list in memory.  Does dirvish
> have a workaround for this, or should I continue to look for
> another backup solution?  Does anybody have suggestions?
> I've been creating a flat spot on my forehead smashing it against
> the proverbial desk for months now.  ANY help would be GREATLY
> appreciated!

It is still rsync - however, I can think of a cheesy trick.

Depending on how the client directories are arranged, you can break
the dirvish run up into hundreds of vaults corresponding to
subdirectories, and dirvish will call rsync multiple times to
work through them one by one. 

Since dirvish configuration files accumulate, you could write your own
script that generates an extra configuration file, included from
master.conf, containing an additional Runall: section that lists
these hundreds of vaults.  The script would also have to generate a
$VAULT/dirvish/default.conf config file for each vault.
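
As a rough sketch of such a generator (not something that ships with
dirvish), the Python below writes one default.conf per subdirectory and
returns the text of a Runall: section.  The function names, the
vault-naming rule, and the particular options written into each
default.conf are my assumptions for illustration; check them against
the dirvish.conf man page for your version before relying on them.

```python
from __future__ import annotations

from pathlib import Path


def vault_name(subdir: str) -> str:
    """Turn a client path like /data/a into a flat vault name (data_a)."""
    return subdir.strip("/").replace("/", "_")


def default_conf(client: str, subdir: str) -> str:
    """Text for one $VAULT/dirvish/default.conf.

    The option set here is an assumed minimal one; adjust to taste.
    """
    return (
        f"client: {client}\n"
        f"tree: {subdir}\n"
        "xdev: 1\n"
        "index: gzip\n"
    )


def runall_section(vaults: list[str]) -> str:
    """Text of a Runall: section, one tab-indented vault per line."""
    lines = ["Runall:"] + [f"\t{v}" for v in vaults]
    return "\n".join(lines) + "\n"


def write_vaults(bank: Path, client: str, subdirs: list[str]) -> str:
    """Create a default.conf for every subdirectory under the bank.

    Returns the Runall: text to place in the extra configuration file.
    """
    names = []
    for subdir in sorted(subdirs):
        name = vault_name(subdir)
        conf_dir = bank / name / "dirvish"
        conf_dir.mkdir(parents=True, exist_ok=True)
        (conf_dir / "default.conf").write_text(default_conf(client, subdir))
        names.append(name)
    return runall_section(names)
```

Calling write_vaults(Path("/backup/bank"), "fileserver", subdirs) with
your own bank path, client name, and subdirectory list (all placeholders
here) would lay out the vault directories.  As far as I know, each new
vault still needs its first image created with an --init run before a
regular run will work.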

Dirvish backs up by directory (actually, it tells rsync to).  If your
50000 directories are flat in one directory, you might find yourself
rearranging them so they are less flat, or doing complicated things
with "exclude". 
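
One way to make a flat layout "less flat" is to assign each directory
to a bucket derived from its name, then back up one bucket per vault.
The sketch below only plans the grouping; it does not move anything.
The choice of a CRC32 of the name and of 256 buckets is my assumption,
not anything dirvish or rsync requires.

```python
import zlib
from collections import defaultdict


def bucket_for(name, buckets=256):
    """Map a directory name to a stable two-digit hex bucket label."""
    return f"{zlib.crc32(name.encode('utf-8')) % buckets:02x}"


def plan_buckets(names, buckets=256):
    """Group directory names by bucket label; returns {label: [names]}."""
    plan = defaultdict(list)
    for name in sorted(names):
        plan[bucket_for(name, buckets)].append(name)
    return dict(plan)
```

With 50000 directories and 256 buckets, each rsync run would see about
195 directories on average.  Because the bucket depends only on the
name, a given directory lands in the same bucket on every run.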

So what is dirvish doing, if you are doing a lot of work already?
It does wrap rsync nicely, and turns what would otherwise be an
enormous script into a file traversal, makes logs, saves expire
information, etc. Further, if you come up with additional small
tools to shape your job to dirvish, and add them to the next
release, others will use and debug and help maintain them. 
However, dirvish may get stretched a little to fit this job.

If the contents of these subdirectories change a lot, you may
find that (1) rsync's ability to overlay images won't have much
to work with, and (2) different data ends up in different places
on different days.  That would be too nasty.

Also, Wayne Davison could change the rsync memory usage in a future
release, and make all the gyrations unnecessary.  He has been asked
before, mostly to deal with files that change during backup.  Consider
asking on the rsync list;  Wayne is very helpful and rsync is improving
daily.  Check the list history, first - search on:

Some data structures are resistant to backup and to management in
general, or require additional script writing to transform into a
shape that dirvish can handle.  I hope the suggestions above give
you some ideas.


Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
