[Dirvish] Combining branches after upgrades

Kenneth Lerman
Sat Nov 10 13:45:42 UTC 2007

[I'm new to this list and to dirvish. I'm so new that I'm not even using it
yet, but I just took a quick spin around the historic mailing list to try to
see what's up.]

I think Keith's requirement should be pretty easy to implement at the vault
level.

Simply search the vault (or even the entire "bank") for matching files and
then link them together. The trick is to do this efficiently.

Keep an md5 hash for each file, together with the path to the file, in a
large (disk-based) hash file. After dirvish runs (in the post processing),
find each new file that has changed. For each such file, compute the md5 and
look it up in the hash file. If the md5 matches an existing entry, the files
are the same (barring a hash collision) and can be linked.
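
A minimal sketch of that lookup-and-link step, in Python rather than
anything dirvish itself ships. The names link_duplicates and md5_of, the
".dedup-tmp" suffix, and the use of a dbm file as the "hash file" are all
assumptions made for illustration. It skips the metadata comparison and
assumes every file lives on one filesystem, since hard links cannot cross
filesystems.

```python
import dbm
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 of a file as a hex string, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_duplicates(new_files, index_path):
    """Replace each new file whose MD5 is already indexed with a hard link.

    Files with an unseen MD5 are added to the index instead.
    Returns the number of files that were replaced by links.
    """
    linked = 0
    with dbm.open(index_path, "c") as index:
        for path in new_files:
            key = md5_of(path).encode("ascii")
            if key in index:
                original = os.fsdecode(index[key])
                if os.path.exists(original):
                    if not os.path.samefile(original, path):
                        # Link under a temporary name, then rename over the
                        # duplicate so the path never goes missing.
                        tmp = path + ".dedup-tmp"
                        os.link(original, tmp)
                        os.replace(tmp, path)
                        linked += 1
                    continue
            # Unseen hash, or the indexed file has since disappeared.
            index[key] = os.fsencode(path)
    return linked
```

A cautious version would also compare the two files byte for byte before
linking, rather than trusting the md5 alone.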

Either the md5 would have to include the permissions, times, ownership, etc.
of the file, or a separate compare could be done to make sure that they
match. An "ignore times" flag could be provided if the times aren't
important. "Ignore ownership" could be another flag. Adding these flags,
though, might require changing existing code.
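
The separate compare matters because hard-linked names share one inode, and
so share one set of permissions, owner, and times. A rough sketch of such a
check follows. The function name same_metadata and its two keyword flags are
made up here to mirror the flags suggested above; they are not existing
dirvish options.

```python
import os


def same_metadata(path_a, path_b, ignore_times=False, ignore_ownership=False):
    """Return True if two files could share an inode without losing metadata."""
    a = os.lstat(path_a)
    b = os.lstat(path_b)
    # st_mode carries both the file type and the permission bits.
    if a.st_mode != b.st_mode:
        return False
    if not ignore_ownership and (a.st_uid, a.st_gid) != (b.st_uid, b.st_gid):
        return False
    if not ignore_times and a.st_mtime_ns != b.st_mtime_ns:
        return False
    return True
```

Only the modification time is compared; access and change times move too
often to be useful here.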

Since hash lookup runs in (roughly) constant time regardless of the number
of entries, one big hash file could be used for all of the files in the bank.

A nice thing about doing this is that it would require no change to existing
code. Just add the proper post process and we're done. We'd need another
piece of code to remove files from the hash file as they expire.
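
One simple-minded way to do that cleanup, sketched under the assumption that
the hash file is a dbm file mapping an md5 to a path: walk every entry and
drop the ones whose path no longer exists. The name prune_index is
hypothetical, and a full scan like this would be slow on a big bank; it only
illustrates the idea.

```python
import dbm
import os


def prune_index(index_path):
    """Delete index entries whose file is gone; return how many were removed."""
    removed = 0
    with dbm.open(index_path, "c") as index:
        # Copy the key list first so entries can be deleted while looping.
        for key in list(index.keys()):
            if not os.path.lexists(os.fsdecode(index[key])):
                del index[key]
                removed += 1
    return removed
```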

It shouldn't be more than a few days' work -- at least for a version that
limits the big hash file to 2 gig or so. Let's see: 16 bytes for the hash
plus (say) 100 bytes for the file name. That leaves us room for only about
15 million files. That's a pretty good start.
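
The arithmetic can be checked quickly. This sketch ignores any per-record
overhead in the hash file, which is why its raw bound comes out somewhat
above the 15 million estimate; the function name max_entries is just for
illustration.

```python
def max_entries(limit_bytes, hash_bytes=16, name_bytes=100):
    """Upper bound on records that fit, ignoring hash-file overhead."""
    return limit_bytes // (hash_bytes + name_bytes)


# 2 GiB of 116-byte records
print(max_entries(2 * 1024**3))  # 18512790
```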

It wouldn't be too hard to split the hash over N files based on the low bits
of the MD5.
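
A sketch of that split, assuming N is a power of two so that the low
shard_bits bits of the digest select the file. The helper names and the
"md5-index" file prefix are invented for the example.

```python
def shard_for(digest, shard_bits):
    """Pick a shard number from the low bits of a 16-byte MD5 digest."""
    return int.from_bytes(digest, "big") & ((1 << shard_bits) - 1)


def shard_filename(digest, shard_bits, prefix="md5-index"):
    """Name of the hash file that should hold this digest."""
    return f"{prefix}.{shard_for(digest, shard_bits):04x}"
```

With shard_bits = 4 the index is spread over 16 files, and each lookup
touches only one of them.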

So... shoot me down. Am I missing something? As I said, I'm new here.


