We periodically have to transfer files to a collection of machines in a cluster. Without a distributed filesystem, we rely on user-level processes to move these files to their target destinations. Previously we had been making a sequence of N rsync calls to populate a collection of N machines. When looking for a preexisting solution that would improve our workflow, I could not find one that met the following requirements:
- Must not require additional processes or daemons running in the cluster.
- Must not require any superuser privileges to operate.
- Able to transfer an arbitrary collection of files and directories.
- Able to transfer to an arbitrary file path on the destination machines.
- A push-based system as opposed to a pull-based system.
We ended up writing ssync, a thin recursive wrapper around the rsync Unix tool. ssync is a divide-and-conquer tool for copying files to multiple destination hosts. It transfers to N remote machines in log N iterations. The command line arguments to ssync are identical to the command line arguments for rsync, with the following two exceptions: (1) You must use the dummy name ‘remote’ as the destination host, i.e. ‘[USER@]remote:DEST’. (2) The destination hosts are specified with the option ‘--hosts’ followed by a space-separated list of hostnames. The option ‘--hosts’ must appear after all the ordinary command line arguments to rsync.
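To make the two argument rules concrete, here is a minimal Python sketch of how a wrapper could split its arguments at ‘--hosts’ and swap the dummy name ‘remote’ for a real host. This is not taken from the ssync source: the function names `split_args` and `rsync_command` are hypothetical, and the sketch assumes the destination is the last ordinary argument.

```python
from typing import List, Tuple


def split_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv into (rsync arguments, destination hosts) at '--hosts'."""
    if "--hosts" not in argv:
        raise ValueError("missing --hosts option")
    i = argv.index("--hosts")
    rsync_args, hosts = argv[:i], argv[i + 1:]
    if not hosts:
        raise ValueError("--hosts must be followed by at least one hostname")
    return rsync_args, hosts


def rsync_command(rsync_args: List[str], host: str) -> List[str]:
    """Build an rsync command line for one real host.

    Assumption: the destination '[USER@]remote:DEST' is the last argument.
    """
    *rest, dest = rsync_args
    prefix, colon, path = dest.partition(":")
    user, at, name = prefix.rpartition("@")
    if not colon or name != "remote":
        raise ValueError("destination must have the form [USER@]remote:DEST")
    return ["rsync", *rest, f"{user}{at}{host}:{path}"]
```

Everything before ‘--hosts’ is left untouched, which is what lets the wrapper stay thin: only the destination host name changes between invocations.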
ssync will divide the input list of hosts in half generating list1 and list2. Each half is separated into the first elements, head1 and head2, and the remaining elements, tail1 and tail2. The file contents are concurrently rsync-ed from the current machine onto head1 and head2. The recursive step performs two concurrent ssync operations: one on host head1 applied to the list tail1, and another on host head2 applied to the list tail2. The base case performs a fixed number of concurrent rsync operations.
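The splitting step can be sketched in a few lines of Python. This is an illustration of the description above, not the actual ssync code. The names `split_hosts` and `rounds` are hypothetical, and two details are assumptions: the first half receives the extra host when the list has odd length, and the base case handles up to two hosts in a single round.

```python
from typing import List, Tuple


def split_hosts(hosts: List[str]) -> Tuple[str, List[str], str, List[str]]:
    """Return (head1, tail1, head2, tail2) for a list of at least two hosts."""
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to split")
    mid = (len(hosts) + 1) // 2  # assumption: first half gets the extra host
    list1, list2 = hosts[:mid], hosts[mid:]
    return list1[0], list1[1:], list2[0], list2[1:]


def rounds(n: int, base: int = 2) -> int:
    """Count sequential copy rounds needed to reach n hosts.

    Assumption: up to `base` hosts are handled by one round of
    concurrent rsync calls (the base case).
    """
    if n <= 0:
        return 0
    if n <= base:
        return 1
    mid = (n + 1) // 2
    # One round copies to both heads; the larger tail then dominates.
    return 1 + rounds(mid - 1, base)
```

Under these assumptions, `rounds(4)` is 2 and `rounds(16)` is 4, which matches the log N behavior described earlier, compared with N sequential rsync calls.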
Here are some examples of ssync:
    ssync -t *.h *.c remote:dest/ --hosts foo bar
    ssync -avz /data/tmp remote:/dest --hosts foo bar baz quux
All command line arguments to ssync are passed along to the underlying rsync invocations. If any child process is killed or exits with a non-zero status code, then the parent process will exit with that status code. ssync offers atomicity in case of failures at the granularity of the underlying rsync invocations. Each remote host should have observed either all of the updates or none of them.
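The exit-status rule can be illustrated with a small Python sketch that starts several child processes at once and reports the first non-zero status. This is a simplified stand-in rather than ssync's own code: the helper name `run_concurrently` is hypothetical, and it waits for every child to finish instead of exiting as soon as one fails.

```python
import subprocess
import sys
from typing import List, Sequence


def run_concurrently(commands: Sequence[List[str]]) -> int:
    """Start all commands at once; return the first non-zero status, else 0.

    Note: Python's subprocess reports a child killed by signal N as -N.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    status = 0
    for proc in procs:
        rc = proc.wait()
        if rc != 0 and status == 0:
            status = rc
    return status


if __name__ == "__main__":
    # Stand-ins for two rsync invocations: one succeeds, one fails with 3.
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(3)"]
    print(run_concurrently([ok, bad]))
```

A caller would pass the returned value to `sys.exit` so that a failure anywhere in the tree of copies surfaces at the top-level invocation.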
As with any open-source software, there are a few known limitations and caveats. A trailing slash is not supported in the SRC arguments: ‘ssync -avz /data/tmp/ remote:/dest’ will return a non-zero exit status. Contacting the rsync daemon (using the :: notation) has not been tested and may or may not work.
Below is an example comparing the performance of rsync versus ssync. We are transferring a 100MB file of random information. On the horizontal axis is the number of destination machines. On the vertical axis is the run time normalized to the rsync transfer time for a single host. rsync and ssync are invoked with the arguments ‘-Wq’.
We’ve made ssync available on our GitHub page under the Apache License.
Do you find this tool useful? Are you using another tool to meet similar goals? Please feel free to make pull requests, bug reports, and/or feature requests.