Poor man’s directory tree replication

Posted by waldner on 18 May 2014, 12:15 am

So you have this /var/lib/mysql directory that you need to copy to three other machines. A quick and dirty solution is to use ssh and tee (it goes without saying that passwordless ssh is needed, here and for all the other examples):

$ tar -C /var/lib/mysql -cvzf - . |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null

If the directory tree to be transferred is not local, it is again possible to use ssh to get to it:

$ ssh srcbox 'tar -C /var/lib/mysql -cvzf - .' |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null

This means that all the data flows from the source, through the machine where the pipeline runs, to the targets. On the other hand, this solution has the advantage that there is no need to set up passwordless ssh between the origin and the target(s); the only machine that needs passwordless ssh to all the others is the machine where the command runs.

Now this is all basic stuff, but after doing this I wondered whether it would be possible to generalize the logic for a variable number of target machines, so for example a nettar-style operation could be possible, as in

$ nettar2.sh /var/lib/mysql dstbox1:/var/lib/mysql dstbox2:/var/tmp dstbox3:/var/lib/mysql ...

This would mean: take (local) /var/lib/mysql and replicate it to dstbox1 under /var/lib/mysql, to dstbox2 under /var/tmp, to dstbox3 under /var/lib/mysql, and so on for any extra argument supplied. Arguments have the form targetname:[targetpath], with a missing targetpath indicating the same path as the source (i.e., /var/lib/mysql in this example).
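
As a rough sketch of how that argument form could be split, bash parameter expansion is enough. The helper name parse_target is made up for illustration and is not part of any script in this article; it assumes each argument contains exactly one colon.

```shell
#!/bin/bash

# parse_target (hypothetical helper): split "host:[path]" into host and path,
# falling back to the source path when the path part is empty.
parse_target() {
  local arg=$1 srcpath=$2
  local dstbox=${arg%:*}     # everything before the colon
  local dstpath=${arg#*:}    # everything after the colon
  [ -n "$dstpath" ] || dstpath=$srcpath
  printf '%s %s\n' "$dstbox" "$dstpath"
}

parse_target dstbox2:/var/tmp /var/lib/mysql   # prints: dstbox2 /var/tmp
parse_target dstbox1: /var/lib/mysql           # prints: dstbox1 /var/lib/mysql
```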

It turns out that such a generalization is not easy.

Note that in the following code, all error checking and other refinements are omitted for simplicity. In particular, care should be taken at least to:

validate the arguments passed to the script for number (at least two) and correct syntax
check that paths exist (or create them if not, etc)
properly escape arguments to commands that are executed using ssh (for example using printf %q)
validate data that is used to dynamically build commands to be run with eval

None of the above is done in the code that follows.
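
For reference, here is a minimal sketch of what the first and third items could look like. The function names validate_args and remote_tar_cmd are hypothetical, and the quoting produced by printf %q is bash-oriented, so this assumes that the login shell on the target machines understands it (for unusual characters bash may emit $'...' quoting, which a plain POSIX sh does not accept).

```shell
#!/bin/bash

# validate_args (hypothetical): require a source path plus at least one
# target of the form host:[path]; returns non-zero on bad usage.
validate_args() {
  [ $# -ge 2 ] || { echo "usage: script /src/dir host:[/dst/dir] ..." >&2; return 1; }
  local arg
  shift                       # skip the source path
  for arg in "$@"; do
    case $arg in
      ?*:*) ;;                # non-empty host followed by a colon
      *) echo "bad target: $arg" >&2; return 1 ;;
    esac
  done
}

# remote_tar_cmd (hypothetical): build the command string handed to ssh,
# with the destination path quoted by printf %q.
remote_tar_cmd() {
  printf 'tar -C %q -xzf -' "$1"
}

remote_tar_cmd '/var/tmp/my dir'; echo
# prints: tar -C /var/tmp/my\ dir -xzf -
```

The resulting string would then be passed as a single argument, as in ssh "$dstbox" "$(remote_tar_cmd "$dstpath")".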

Concurrent transfers

An obvious way to do it is to run three (or however many) concurrent transfers, eg

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# parallel transfers
 
srcpath=$1
shift
 
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  tar -C "$srcpath" -cvzf - . | ssh "$dstbox" "tar -C '$dstpath' -xvzf -" &
done
 
wait

This simply reads $srcpath once per target and transfers it to each target machine, so we are not exploiting the data duplication that tee provides. If the source directory is huge, this is inefficient, since several processes try to read it at the same time; although the OS will probably cache most of it, it doesn't look like a satisfactory solution.
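
One thing the parallel approach makes easy is per-target error reporting, because every transfer is an ordinary background job. The following sketch records each job's PID and waits for them one by one; the subshell is only a local stand-in for the tar | ssh pipeline, and the job names are invented for the example.

```shell
#!/bin/bash

pids=() names=() failed=()

for job in ok1 bad ok2; do
  # Stand-in for: tar ... | ssh "$dstbox" "tar ..." &
  # This subshell exits non-zero only for the job named "bad".
  ( [ "$job" != bad ] ) &
  pids+=( "$!" )
  names+=( "$job" )
done

# wait PID returns the exit status of that specific background job
for i in "${!pids[@]}"; do
  wait "${pids[i]}" || failed+=( "${names[i]}" )
done

echo "failed: ${failed[*]}"    # prints: failed: bad
```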

So what if we actually want to use tee (which in turn implies that we need process substitution or an equivalent facility)?

Using eval

The first thing that comes to mind is to use the questionable eval command:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + eval
 
do_sshtar(){
  local dstbox=$1 dstpath=$2
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -"
}
 
declare -a args
 
srcpath=$1
shift
 
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  args+=( ">(do_sshtar '$dstbox' '$dstpath')" )
done
 
tar -C "$srcpath" -cvzf - . | eval tee "${args[@]}" ">/dev/null"

This effectively builds the full list of process substitutions at runtime and executes them. However, when using eval we should be well aware of what we're doing. See the following pages for a good discussion of the implications of using eval: http://mywiki.wooledge.org/BashFAQ/048 and http://wiki.bash-hackers.org/commands/builtin/eval.
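
As a hedged illustration of one way to reduce the risk, the words that end up inside the eval string can be quoted with printf %q instead of being wrapped in single quotes. The helper build_sink is hypothetical, and this is only a sketch of the quoting step, not a complete hardening of the script.

```shell
#!/bin/bash

# build_sink (hypothetical): render one ">(do_sshtar host path)" word, with
# both values quoted by printf %q so that eval re-reads them as literal words.
build_sink() {
  printf '>(do_sshtar %q %q)' "$1" "$2"
}

build_sink dstbox1 '/var/tmp/my dir'; echo
# prints: >(do_sshtar dstbox1 /var/tmp/my\ dir)

# Round trip: a hostile-looking value passes through eval as plain data
evil='x; echo INJECTED'
eval "set -- $(printf '%q' "$evil")"
[ "$1" = "$evil" ] && echo "value preserved"    # prints: value preserved
```

With plain single quotes, a value containing a single quote would break out of the quoting; %q escapes it.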

Note that with process substitution there is also the (in this case minor) issue that the created processes are run asynchronously in background, and we have no way to wait for their full termination (not even using wait), so the script might give us back the prompt slightly before all the background processes have fully completed their job.
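
A commonly used workaround, sketched here with local stand-ins instead of ssh, is to pipe the whole tee command into cat. The substituted processes inherit the pipe as their stdout, so cat only sees end-of-file once all of them have exited. This relies on bash expanding the >(...) words before applying tee's own >/dev/null redirection, which is an assumption worth checking on your bash version; it also means that anything the substituted processes print goes through cat.

```shell
#!/bin/bash

workdir=$(mktemp -d) || exit 1

# Two stand-in sinks; the first one is deliberately slow
printf 'data\n' |
  tee >(sleep 1; cat > "$workdir/a") \
      >(cat > "$workdir/b") >/dev/null | cat

# Reached only after both sinks have finished writing
cat "$workdir/a" "$workdir/b"
```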

Coprocesses

Bash and other shells have coprocesses, so it would seem that they could be useful for our purposes.
However, at least in bash, it seems that it's not possible to create a coprocess whose name is stored in a variable (which is how we would create a bunch of coprocesses programmatically), eg:

$ coproc foo { command; }      # works
$ cname=foo; coproc $cname { command; }  # does not work as expected (creates a coproc literally named $cname)

So to use coprocesses for our task, we would need again to resort to eval.

Named pipes

Let's see if there is some other possibility. Indeed there is, and it involves using named pipes (aka FIFOs):

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (ssh version)
 
declare -a fifos
 
srcpath=$1
shift
 
count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  curfifo=/tmp/FIFO${count}
  mkfifo "$curfifo"
  fifos+=( "$curfifo" )
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -" < "$curfifo" &
  ((count++))
done
 
tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
 
wait
# cleanup the FIFOs
rm -- "${fifos[@]}"

Here we're creating N named pipes, whose names are saved in an array, and an instance of ssh + tar to the target machine is launched in background reading from each pipe. Finally, tee is run against all the existing named pipes to send them the data; all the FIFOs are removed at the end.
This is not too bad, but we have to set up the interprocess communication manually (i.e., create and delete the FIFOs); the beauty of process substitution is that bash sets up those channels for us, and here we're not taking advantage of that.
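
If the manual bookkeeping is a concern, one possible refinement (not what the script above does) is to create the FIFOs inside a private directory from mktemp -d and remove it with a trap, so that cleanup also happens when the script fails midway and the names cannot clash with existing files in /tmp. The sketch below uses a local cat as a stand-in for the ssh + tar reader.

```shell
#!/bin/bash

# Private scratch directory, removed on exit even if the script fails midway
workdir=$(mktemp -d) || exit 1
trap 'rm -rf -- "$workdir"' EXIT

fifos=()
for i in 1 2 3; do
  fifo=$workdir/fifo$i
  mkfifo "$fifo"
  fifos+=( "$fifo" )
  # Stand-in for: ssh "$dstbox" "tar -C '$dstpath' -xvzf -" < "$fifo" &
  cat < "$fifo" > "$workdir/copy$i" &
done

printf 'hello\n' | tee -- "${fifos[@]}" >/dev/null
wait    # the readers are ordinary background jobs, so wait covers them

cat "$workdir/copy3"    # prints: hello
```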

A point to note is that here we used ssh for the data transfer; it's always possible to change the code to use netcat, as explained in the nettar article. Here's an adaptation of the last example to use the nettar method (the other cases are similar):

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (netcat version)
 
declare -a fifos
 
srcpath=$1
shift
 
count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
 
  if ssh "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    curfifo=/tmp/FIFO${count}
    mkfifo "$curfifo"
    fifos+=( "$curfifo" )
    nc "$dstbox" 1234 < "$curfifo" &
    ((count++))
  else
    echo "Warning, skipping $dstbox" >&2   # or whatever
  fi
done
 
tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
 
wait
# cleanup the FIFOs
rm -- "${fifos[@]}"

There should be some other way. I'll update the list if I discover some other method. As always, suggestions welcome.

Recursion

Update 19/05/2014: Marlon Berlin suggested (thanks) that recursion could be used to build an implicit chain of >(...) process substitutions, and indeed that's true. So here it is:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (ssh version)
 
do_sshtar(){
 
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift
 
  if [ $# -eq 0 ]; then
    # end recursion
    ssh "$dstbox" "tar -C '$dstpath' -xzvf -"
  else
    # send data to "current" $dstbox and recurse
    tee >(ssh "$dstbox" "tar -C '$dstpath' -xzvf -") >(do_sshtar "$@") >/dev/null
  fi
}
 
srcpath=$1
shift
 
tar -C "$srcpath" -czvf - . | do_sshtar "$@"

When the do_sshtar function receives only one argument, it just transfers the data directly via ssh to terminate the recursion. Otherwise, it uses tee to transfer the data and continue the recursion. Simple and elegant. Here's the netcat version:
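
To see the shape of the recursion without any remote machines, here is a local sketch in which each "target" is just a file. The function fan_out is a hypothetical stand-in for do_sshtar. The trailing "| cat" is only there so that the example does not continue before the nested substitutions have finished (they all inherit the pipe to cat as stdout); that behaviour is an assumption about how bash sets up the substitutions and is not part of the technique itself.

```shell
#!/bin/bash

# fan_out (hypothetical): copy stdin to every file named as an argument,
# peeling off one target per recursion level.
fan_out() {
  local dst=$1
  shift
  if [ $# -eq 0 ]; then
    cat > "$dst"                                      # last target: end recursion
  else
    tee >(cat > "$dst") >(fan_out "$@") >/dev/null    # current target + recurse
  fi
}

workdir=$(mktemp -d) || exit 1
printf 'payload\n' | fan_out "$workdir/a" "$workdir/b" "$workdir/c" | cat

cat "$workdir/c"    # prints: payload
```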

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (netcat version)
 
do_nctar(){
 
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift
 
  # set up listening nc on $dstbox
  if ssh -n "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    if [ $# -eq 0 ]; then
      # end recursion
      nc "$dstbox" 1234
    else
      # send data to "current" $dstbox and recurse
      tee >(nc "$dstbox" 1234) >(do_nctar "$@") >/dev/null
    fi
  else
    echo "Warning, skipping $dstbox" >&2
    # one way or another, we must consume the input
    if [ $# -eq 0 ]; then
      cat > /dev/null
    else
      do_nctar "$@"
    fi
  fi
}
 
srcpath=$1
shift
 
tar -C "$srcpath" -czvf - . | do_nctar "$@"

The -n switch to ssh is important, otherwise it will try to read from stdin, consuming our tar data.
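
The effect is easy to reproduce locally. In this sketch, greedy and polite are made-up stand-ins for ssh without and with -n respectively (ssh -n redirects stdin from /dev/null); whatever is left on the pipe afterwards is printed by the final cat.

```shell
#!/bin/bash

greedy() { cat >/dev/null; }              # like ssh without -n: drains stdin
polite() { cat >/dev/null </dev/null; }   # like ssh -n: stdin is /dev/null

printf 'tar data\n' | { greedy; cat; }    # prints nothing: the data is gone
printf 'tar data\n' | { polite; cat; }    # prints: tar data
```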

Filed under shell, tips Tagged bash, eval, process substitution, ssh, tar, tee


2 Comments

xavy says:

May 19, 2014 at 08:42

Hi:

I guess you mention this as a proof of concept, since doing this with rsync would be much simpler, wouldn't it?
- waldner says:
  
  May 19, 2014 at 09:41
  
  As far as I know, rsync can't transfer to multiple target machines simultaneously.
