Tips for Transferring Large Numbers of Files
Moving a large number of files from one computer to another carries extra overhead. If the source and destination are configured similarly and you have time to spare, you might simply use rsync or a raw file copy.
There are other -- faster -- ways, particularly if you're trying to archive those files or are in a hurry.
A significant part of the overhead comes from having to handle every file in the collection individually. Archiving the files into a single file reduces that cost, and in most cases reduces the total size as well. Many files compress well: text by approximately 85%, and repetitive log files by as much as 98%.
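As a rough illustration of both points, the sketch below packs a directory into a single gzip-compressed tar archive using Python's standard tarfile module and reports the sizes before and after. The names archive_directory and demo, and the sample log line, are invented for this example; real ratios depend entirely on your data.

```python
import os
import tarfile
import tempfile


def archive_directory(src_dir, archive_path):
    """Pack src_dir into one gzip-compressed tar file.

    Returns (raw_bytes, archive_bytes) so the saving can be inspected.
    """
    raw_bytes = 0
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            raw_bytes += os.path.getsize(os.path.join(root, name))
    # "w:gz" writes a tar stream through gzip: one file to move instead of many.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return raw_bytes, os.path.getsize(archive_path)


def demo():
    """Build 200 small, repetitive log files (made-up data) and archive them."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "logs")
        os.mkdir(src)
        for i in range(200):
            with open(os.path.join(src, f"app-{i}.log"), "w") as fh:
                fh.write("INFO request served in 12 ms\n" * 500)
        return archive_directory(src, os.path.join(tmp, "logs.tar.gz"))


if __name__ == "__main__":
    raw, packed = demo()
    print(f"{raw} bytes in 200 files -> {packed} bytes in 1 file")
```

The archive is written outside the source directory so that it does not end up inside itself.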
With this technique, avoid compressing directories full of images or other already-compressed files: these should be archived with no compression. zip has a store-only setting for this (zip -0), and plain vanilla tar works wonders.
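A minimal sketch of store-only archiving, assuming Python's standard tarfile and zipfile modules are an acceptable stand-in for the command-line tools. The function names and the fake photo.jpg (random bytes standing in for already-compressed image data) are hypothetical.

```python
import os
import tarfile
import tempfile
import zipfile


def store_tar(paths, archive_path):
    """Write a plain tar archive; mode "w" applies no compression."""
    with tarfile.open(archive_path, "w") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))


def store_zip(paths, archive_path):
    """Write a zip archive whose members are stored, not deflated."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in paths:
            zf.write(path, arcname=os.path.basename(path))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        photo = os.path.join(tmp, "photo.jpg")
        with open(photo, "wb") as fh:
            fh.write(os.urandom(100_000))  # stand-in for compressed image data
        zip_path = os.path.join(tmp, "photos.zip")
        tar_path = os.path.join(tmp, "photos.tar")
        store_zip([photo], zip_path)
        store_tar([photo], tar_path)
        with zipfile.ZipFile(zip_path) as zf:
            info = zf.infolist()[0]
            stored = (info.compress_type == zipfile.ZIP_STORED,
                      info.compress_size == info.file_size)
        return stored, tarfile.is_tarfile(tar_path), os.path.getsize(tar_path)
```

Either way the files are still bundled into one, so the per-file overhead goes away without spending CPU time on data that will not shrink.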
Rzip is an excellent general-purpose tool for large file sets that have many files in common, provided you have the time (it is rather slow) and the disk space.
Subversion working copies (which contain pristine copies of the checked-out files) compress fairly well, but Rzip can compress them further than gzip or bzip2 because it looks for long-distance redundancies.
As the site for rzip points out, there are environments where multiple copies of the same file exist but there is no reasonable facility for removing them:
It is quite common these days to need to compress files that contain long distance redundancies. For example, when compressing a set of home directories several users might have copies of the same file, or of quite similar files. It is also common to have a single file that contains large duplicated chunks over long distances, such as pdf files containing repeated copies of the same image.
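The effect of long-distance redundancy can be seen with a small experiment. This sketch does not use rzip (it is not part of Python's standard library); it assumes lzma is a fair stand-in, since its default 8 MiB dictionary reaches much further back than the 32 KiB window used by zlib (the algorithm behind gzip). The function name compare_long_distance is made up for the example.

```python
import lzma
import os
import zlib


def compare_long_distance(chunk_size=1 << 20):
    """Compress two identical 1 MiB random chunks placed back to back.

    Random bytes do not compress on their own, so any saving comes purely
    from noticing that the second chunk repeats the first, 1 MiB earlier.
    Returns (raw_size, zlib_size, lzma_size) in bytes.
    """
    chunk = os.urandom(chunk_size)
    data = chunk + chunk
    zlib_size = len(zlib.compress(data, 9))  # 32 KiB window: cannot see the repeat
    lzma_size = len(lzma.compress(data))     # 8 MiB dictionary: can see it
    return len(data), zlib_size, lzma_size


if __name__ == "__main__":
    raw, z, x = compare_long_distance()
    print(f"raw={raw} zlib={z} lzma={x}")
```

The short-window compressor should leave the data at roughly its original size, while the long-window one should get close to half. Rzip pushes the same idea to far longer distances.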
To get the benefits of multiple techniques, it's often necessary to split the files to be moved into large groups. Don't be afraid to do this, but human time is valuable too: at some point the returns diminish, and the time is better spent settling on just one technique.
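One cheap way to form such groups is by file extension. The sketch below is only a starting point: the extension list, the group names "store" and "compress", and the function names are all assumptions made for the example, not a recommendation of where the line should be drawn.

```python
import os
from collections import defaultdict

# Hypothetical list of extensions whose contents are usually already compressed.
ALREADY_COMPRESSED = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4",
                      ".zip", ".gz", ".bz2"}


def classify(filename):
    """Return "store" for already-compressed files, "compress" otherwise."""
    ext = os.path.splitext(filename)[1].lower()
    return "store" if ext in ALREADY_COMPRESSED else "compress"


def group_files(root):
    """Walk root and split every file path into the two groups."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            groups[classify(name)].append(os.path.join(dirpath, name))
    return dict(groups)
```

Two coarse groups are usually enough; finer splits quickly run into the diminishing returns mentioned above.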
Look for ways to combine things, too. For example, home directories that contain Subversion working copies are a wonderful target for rsync.
In summary:
- Transmitting fewer files is better
- It often takes less time to compress a file and transmit the compressed version than to transfer the uncompressed version
- If you have time, run rzip on specific sets of files where there is a large distance between duplications
- Don't waste your time: settle on something that's approximately right rather than trying to make things ideal.
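The second point in the summary comes down to simple arithmetic, sketched below under simplifying assumptions: compression finishes before the transfer starts (no pipelining), and decompression time on the far side is ignored. The function names and the example link and compressor speeds are hypothetical; only the 85% reduction for text comes from the figures above.

```python
def transfer_seconds(size_bytes, link_bytes_per_s):
    """Time to send the data uncompressed."""
    return size_bytes / link_bytes_per_s


def compressed_transfer_seconds(size_bytes, link_bytes_per_s,
                                ratio, compress_bytes_per_s):
    """Time to compress first, then send the smaller result.

    ratio is compressed size divided by original size (0.15 for an 85% saving).
    """
    compress_time = size_bytes / compress_bytes_per_s
    send_time = (size_bytes * ratio) / link_bytes_per_s
    return compress_time + send_time


if __name__ == "__main__":
    size = 10_000_000_000   # 10 GB of text (example figure)
    link = 10_000_000       # 10 MB/s link (example figure)
    speed = 50_000_000      # 50 MB/s compressor throughput (example figure)
    print("plain:     ", transfer_seconds(size, link), "s")
    print("compressed:", compressed_transfer_seconds(size, link, 0.15, speed), "s")
```

Compression stops paying off once the compressor is slower than the link, or when the data barely shrinks.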