So you have this /var/lib/mysql directory that you need to copy to three other machines. A quick and dirty solution is to use ssh and tee (it goes without saying that passwordless ssh is needed, here and for all the other examples):
$ tar -C /var/lib/mysql -cvzf - . |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null
If the directory tree to be transferred is not local, it is again possible to use ssh to get to it:
$ ssh srcbox 'tar -C /var/lib/mysql -cvzf - .' |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null
This means that all the data flows from the source, through the machine where the pipeline runs, to the targets. On the other hand, this solution has the advantage that there is no need to set up passwordless ssh between the source and the targets; the only machine that needs passwordless ssh to all the others is the one where the command runs.
Now this is all basic stuff, but after doing it I wondered whether it would be possible to generalize the logic to a variable number of target machines, so that, for example, a nettar-style operation becomes possible, as in
$ nettar2.sh /var/lib/mysql dstbox1:/var/lib/mysql dstbox2:/var/tmp dstbox3:/var/lib/mysql ...
This would mean: take the (local) /var/lib/mysql and replicate it to dstbox1 under /var/lib/mysql, to dstbox2 under /var/tmp, to dstbox3 under /var/lib/mysql, and so on for any extra arguments supplied. Arguments would have the form targetname:[targetpath], with a missing targetpath meaning the same path as the source (ie, /var/lib/mysql in this example).
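Splitting each argument into its two parts is easy with parameter expansion (this is what all the scripts below do); for instance:

arg=dstbox2:/var/tmp
dstbox=${arg%:*}                        # dstbox2
dstpath=${arg#*:}                       # /var/tmp
[ -n "$dstpath" ] || dstpath=$srcpath   # an argument like "dstbox2:" falls back to the source path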
It turns out that such a generalization is not easy.
Note that in the following code, all error checking and other refinements are omitted for simplicity. In particular, care should be taken at least to:
- validate the arguments passed to the script for number (at least two) and correct syntax
- check that paths exist (or create them if not, etc)
- properly escape arguments to commands that are executed remotely via ssh (for example using printf %q; a sketch is shown after this list)
- validate data that is used to dynamically build commands to be run with eval
None of the above is done in the code that follows.
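Just to give an idea, a minimal sketch of such checks and of the printf %q escaping (hypothetical, and still far from complete) could look like this:

# sketch only (not used in the scripts below): argument validation and escaping
[ $# -ge 2 ] || { echo "usage: $0 /src/dir dstbox1:[/dst/dir] ..." >&2; exit 1; }

srcpath=$1
[ -d "$srcpath" ] || { echo "$srcpath: not a directory" >&2; exit 1; }
shift

for arg in "$@"; do
  case $arg in
    *:*) ;;                                    # looks like targetname:[targetpath]
    *) echo "bad argument: $arg" >&2; exit 1 ;;
  esac
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  # quote the path so it can be safely embedded in the remote command string
  printf -v qpath '%q' "${dstpath:-$srcpath}"
  # ... then something like: ssh "$dstbox" "tar -C $qpath -xvzf -"
done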
Concurrent transfers
An obvious way to do it is to run three (or however many) concurrent transfers, eg
#!/bin/bash

# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# parallel transfers

srcpath=$1
shift

for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  tar -C "$srcpath" -cvzf - . | ssh "$dstbox" "tar -C '$dstpath' -xvzf -" &
done
wait
This simply reads $srcpath multiple times and transfers it independently to each target machine; we are not exploiting the data duplication done by tee. If the source directory is huge, this is not efficient, as multiple processes will try to read it at the same time; the OS will probably cache most of it, but it still doesn't look like a satisfactory solution.
So what if we actually want to use tee (which in turn implies that we need process substitution or an equivalent facility)?
Using eval
The first thing that comes to mind is to use the questionable eval command:
#!/bin/bash

# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + eval

do_sshtar(){
  local dstbox=$1 dstpath=$2
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -"
}

declare -a args

srcpath=$1
shift

for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  args+=( ">(do_sshtar '$dstbox' '$dstpath')" )
done

tar -C "$srcpath" -cvzf - . | eval tee "${args[@]}" ">/dev/null"
This effectively builds the full list of process substitutions at runtime and executes them. However, when using eval we should be well aware of what we're doing. See the following pages for a good discussion of the implications of using eval: http://mywiki.wooledge.org/BashFAQ/048 and http://wiki.bash-hackers.org/commands/builtin/eval.
Note that with process substitution there is also the (in this case minor) issue that the created processes are run asynchronously in the background, and we have no way to wait for their full termination (not even using wait), so the script might give us back the prompt slightly before all the background processes have fully completed their job.
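The effect is easy to observe even without ssh or tar; in the following contrived example the prompt comes back immediately, and the process substitution finishes roughly two seconds later:

$ tee >(sleep 2; echo "substitution done") </dev/null >/dev/null; echo "prompt is back"
prompt is back
$
substitution done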
Coprocesses
Bash and other shells have coprocesses (see also here), so it would seem that they could be useful for our purposes.
However, at least in bash, it seems that it's not possible to create a coprocess whose name is stored in a variable (which is how we would create a bunch of coprocesses programmatically), eg:
$ coproc foo { command; }                  # works
$ cname=foo; coproc $cname { command; }    # does not work as expected (creates a coproc literally named $cname)
So to use coprocesses for our task, we would need again to resort to eval.
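For completeness, something along these lines should work (a sketch only; note also that bash warns when more than one coprocess is active at the same time, which doesn't help here):

# sketch: a coprocess with a computed name can be created via eval
cname=cp1
eval "coproc ${cname} { ssh dstbox1 'tar -C /var/lib/mysql -xzvf -'; }"

# the coprocess' standard input is then available as ${cp1[1]}, eg
# tar -C /var/lib/mysql -czvf - . >&"${cp1[1]}"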
Named pipes
Let's see if there is some other possibility. Indeed there is, and it involves using named pipes (aka FIFOs):
#!/bin/bash

# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (ssh version)

declare -a fifos

srcpath=$1
shift

count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  curfifo=/tmp/FIFO${count}
  mkfifo "$curfifo"
  fifos+=( "$curfifo" )
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -" < "$curfifo" &
  ((count++))
done

tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
wait

# cleanup the FIFOs
rm -- "${fifos[@]}"
Here we're creating N named pipes, whose names are saved in an array, and for each pipe an instance of ssh + tar to the target machine is launched in the background, reading from that pipe. Finally, tee is run against all the named pipes to send them the data; all the FIFOs are removed at the end.
This is not too bad, but we have to set up the interprocess communication manually (ie, create and delete the FIFOs); the beauty of process substitution is that bash sets up those channels for us, and here we're not taking advantage of that.
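As a side note, and in the spirit of the error checking omitted at the beginning, a real script should probably avoid fixed names like /tmp/FIFO1; one possible sketch, using a private directory created with mktemp:

# sketch: create the FIFOs in a private temporary directory and clean up on exit
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf -- "$tmpdir"' EXIT

curfifo=$tmpdir/fifo$count
mkfifo "$curfifo"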
A point to note is that here we used ssh for the data transfer; it's always possible to change the code to use netcat, as explained in the nettar article. Here's an adaptation of the last example to use the nettar method (the other cases are similar):
#!/bin/bash

# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (netcat version)

declare -a fifos

srcpath=$1
shift

count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  if ssh "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    curfifo=/tmp/FIFO${count}
    mkfifo "$curfifo"
    fifos+=( "$curfifo" )
    nc "$dstbox" 1234 < "$curfifo" &
    ((count++))
  else
    echo "Warning, skipping $dstbox" >&2   # or whatever
  fi
done

tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
wait

# cleanup the FIFOs
rm -- "${fifos[@]}"
There should be other ways; I'll update the list if I discover any. As always, suggestions welcome.
Recursion
Update 19/05/2014: Marlon Berlin suggested (thanks) that recursion could be used to build an implicit chain of >(...) process substitutions, and indeed that's true. So here it is:
#!/bin/bash

# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (ssh version)

do_sshtar(){
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift

  if [ $# -eq 0 ]; then
    # end recursion
    ssh "$dstbox" "tar -C '$dstpath' -xzvf -"
  else
    # send data to "current" $dstbox and recurse
    tee >(ssh "$dstbox" "tar -C '$dstpath' -xzvf -") >(do_sshtar "$@") >/dev/null
  fi
}

srcpath=$1
shift

tar -C "$srcpath" -czvf - . | do_sshtar "$@"
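To see what's going on, with the arguments of the nettar2.sh example above (dstbox1:/var/lib/mysql dstbox2:/var/tmp dstbox3:/var/lib/mysql) the recursion effectively builds a nested chain of process substitutions, roughly equivalent to:

tar -C /var/lib/mysql -czvf - . |\
  tee >(ssh dstbox1 "tar -C '/var/lib/mysql' -xzvf -") \
      >(tee >(ssh dstbox2 "tar -C '/var/tmp' -xzvf -") \
            >(ssh dstbox3 "tar -C '/var/lib/mysql' -xzvf -") >/dev/null) \
      >/dev/null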
When the do_sshtar function receives only one argument, it just transfers the data directly via ssh, terminating the recursion. Otherwise, it uses tee to send the data both to the "current" target and to a recursive invocation of itself for the remaining ones. Simple and elegant. Here's the netcat version:
#!/bin/bash

# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (netcat version)

do_nctar(){
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift

  # set up listening nc on $dstbox
  if ssh -n "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    if [ $# -eq 0 ]; then
      # end recursion
      nc "$dstbox" 1234
    else
      # send data to "current" $dstbox and recurse
      tee >(nc "$dstbox" 1234) >(do_nctar "$@") >/dev/null
    fi
  else
    echo "Warning, skipping $dstbox" >&2
    # one way or another, we must consume the input
    if [ $# -eq 0 ]; then
      cat > /dev/null
    else
      do_nctar "$@"
    fi
  fi
}

srcpath=$1
shift

tar -C "$srcpath" -czvf - . | do_nctar "$@"
The -n switch to ssh is important, otherwise it will try to read from stdin, consuming our tar data.
Hi:
I guess you mention this as a proof of concept, as doing that through rsync would be much simpler, wouldn't it?
As far as I know, rsync can't transfer to multiple target machines simultaneously.