Remote-to-remote data copy

...going through the local machine, which is what people normally want and try to do.

Of course it's not as efficient as a direct copy between the involved boxes, but many times it's the only option, for various reasons.

Here are some ways (some with standard tools, some home-made) to accomplish the task. We'll call the two remote machines between which data has to be transferred remote1 and remote2. We assume no direct connectivity between them is possible, but we have access to both from the local machine (with passwordless SSH where appropriate).

remote1 to local, local to remote2

This is of course the obvious and naive way: just copy everything temporarily from remote1 to the local machine (with whatever method), then again from the local machine to remote2. If copying remote-to-remote is bad, doing it this way is even worse, as we actually need space on the local machine to store the data, albeit only temporarily. Sample code using rsync (options are only indicative):

$ rsync -avz remote1:/src/dir/ /local/dir/
sending incremental file list
...
$ rsync -avz /local/dir/ remote2:/dest/dir/
sending incremental file list
...

For small or even medium amounts of data this solution can be workable, but it's clearly not very satisfactory.
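
If local space is at a premium, remember to remove the temporary copy once the second transfer has completed, for example:

$ rm -rf /local/dir    # reclaim the space used by the intermediate copy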

scp -3

Newer versions of scp have a command line switch (-3) which does just what we want: remote-to-remote copy going through the local machine. In this case at least, we don't need local disk space:

$ scp -3 -r remote1:/src/dir remote2:/dest/dir    # recursive to copy everything; adapt as needed

An annoying "feature" of scp -3 is that there's no indication of progress whatsoever (whereas the default for non-remote-to-remote copy is to show progress of each file as it's copied), and no option to enable it. Sure, with -v that information is printed, but so is a lot of other stuff.

SSH + tar

We can also of course use SSH and tar:

$ ssh remote1 'tar -C /src/dir/ -cvzf - .' | ssh remote2 'tar -C /dest/dir/ -xzvf -'
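
This pipeline also offers an easy remedy for the progress problem we lamented with scp -3: if pv is installed on the local machine, it can be spliced between the two SSH invocations to show throughput and the amount of data transferred so far (a sketch; the v flags are dropped here so tar's file listing doesn't mix with pv's display):

$ ssh remote1 'tar -C /src/dir/ -czf - .' | pv | ssh remote2 'tar -C /dest/dir/ -xzf -'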

tar + netcat/socat

Can we modify our nettar tool to support remote-to-remote copies? The answer is yes, and here's the code for a generalized version that automatically detects whether local-to-remote, remote-to-local or remote-to-remote copy is desired. This version uses socat instead of netcat, which implies that socat must be installed on the involved remote machines, as well as on the local box. It also implies that traffic is allowed between the local box and the remote ones on the remote TCP port used (in this example 1234).

#!/bin/bash
 
# nettar_gen.sh
# copy directory trees between local and/or remote locations, using tar + socat

# Usage: $0 src dst

# if either src or dst contains a colon, it's assumed to mean machine:path; otherwise it's
# assumed to be a local path

# examples
#
# $0 remote:/src/dir /local/dst
# $0 /local/src remote:/dst/dir
# $0 remote1:/src/dir remote2:/dst/dir

# NOTE: error checking is very rudimentary. Argument sanity checking is missing.

src=$1
dst=$2

port=1234
remotesrc=0
remotedst=0
user=root

if [[ "$src" =~ : ]]; then
  remotesrc=1
  srcmachine=${src%%:*}
  srcdir=${src#*:}
  if ! ssh "$user"@"$srcmachine" "cd '$srcdir' || exit 1; { tar -cf - . | socat - TCP-L:$port,reuseaddr ; } </dev/null >/dev/null 2>&1 &"; then
    echo "Error setting up source on $srcmachine" >&2
    exit 1
  fi
fi

if [[ "$dst" =~ : ]]; then
  remotedst=1
  dstmachine=${dst%%:*}
  dstdir=${dst#*:}
  if ! ssh "$user"@"$dstmachine" "cd '$dstdir' || exit 1; { socat TCP-L:$port,reuseaddr - | tar -xf - ; } </dev/null >/dev/null 2>&1 &"; then
    echo "Error setting up destination on $dstmachine" >&2
    exit 1
  fi
fi

# sometimes remote initialization takes a bit longer...
sleep 0.5

if [ $remotesrc -eq 0 ] && [ $remotedst -eq 0 ]; then
  # local src, local dst
  tar -cf - -C "$src" . | tar -xvf - -C "$dst"
elif [ $remotesrc -eq 0 ]; then
  # local src, remote dst
  tar -cvf - -C "$src" . | socat - TCP:"$dstmachine":$port
elif [ $remotedst -eq 0 ]; then
  # remote src, local dst
  socat TCP:"$srcmachine":$port - | tar -xvf - -C "$dst"
else
  # remote src, remote dst
  socat TCP:"$srcmachine":$port - | socat - TCP:"$dstmachine":$port
fi

So with this code we can say

$ nettar_gen.sh remote1:/src/dir remote2:/dst/dir

and transfer the files unencrypted, without the overhead of SSH (though, as tar runs remotely, we won't be able to see the names of the files being transferred). Compression can be added to tar if desired (it doesn't always make things faster, so it may or may not be an improvement).
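
For instance, here's a sketch of the changes to nettar_gen.sh to enable gzip compression end-to-end (worthwhile mostly when the data is compressible and the network, rather than the CPU, is the bottleneck):

# remote source setup: compress while archiving
tar -czf - . | socat - TCP-L:$port,reuseaddr
# remote destination setup: decompress while extracting
socat TCP-L:$port,reuseaddr - | tar -xzf -
# (the local tar invocations need the same z flag)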

Real rsync?

The approaches so far (except the first one, which however has other drawbacks) have the problem that they are not incremental, so if a transfer is interrupted, we have to restart it from the beginning (ok, we can cheat and move or delete the already-copied data on the origin, so it doesn't have to be copied again, but it should be obvious that this is neither an optimal nor a desirable workaround).
The tool of choice when we need to resume partial transfers is, of course, rsync, but as the man page kindly informs us,

Rsync copies files either to or from a remote host, or locally on the current host (it does not support copying files between two remote hosts).

However, we can leverage SSH's port forwarding capabilities and "bring", so to speak, a "tunnel" to remote1 that connects to remote2 via the local machine, for example:

$ ssh -R10000:remote2:10000 remote1

If we do the above, anything sent to localhost:10000 on remote1 will be sent to port 10000 on remote2. In particular, we can forward to port 22 on remote2 (or whatever port SSH is using there):

$ ssh -R10000:remote2:22 remote1

Now "ssh -p 10000 localhost" on remote1 gives us a password request from remote2's SSH daemon.

So, since rsync runs over SSH, with this tunnel in place we can run this on remote1 (all the examples use root as the user on remote2, adapt as needed):

remote1$ rsync -e 'ssh -l root -p 10000' -avz /src/dir/ localhost:/dest/dir/

and we'll effectively be transferring stuff to remote2. We can run the above directly from the local box (the -t option to SSH is to force a pseudo-tty allocation, otherwise we couldn't be asked for the password):

$ ssh -t -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'
The authenticity of host '[localhost]:10000 ([127.0.0.1]:10000)' can't be established.
ED25519 key fingerprint is 9a:fd:f3:7f:55:1e:6b:44:b2:88:fd:a3:e9:c9:b9:ed.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:10000' (ED25519) to the list of known hosts.
root@localhost's password:
sending incremental file list
...

So this way we get almost what we wanted, except we're still prompted for a password (which, as should be clear by now, is really the password for root@remote2). This is expected, since remote1 has probably no relation whatsoever with remote2 (we are also asked to accept remote2's SSH host key).

Although this solution is already quite satisfactory, can we do better? The answer is: yes.

An option is to set up passwordless SSH between remote1 and remote2, which means installing the appropriate SSH keys in remote1's ~/.ssh directory (and adapting the -e option to rsync to use them, if necessary). This may or may not be OK, depending on the exact circumstances, but in any case it still requires changes on remote1, which may not be desirable (it also requires further work if some day we want to transfer, say, between remote1 and remote3, and again for any new remote we want to work with).
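
For completeness, here's a sketch of what that setup could look like (the key name id_r2r is just an example; note that the public key travels to remote2 via the local machine, since the remotes can't reach each other directly):

$ ssh remote1 'ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_r2r'
$ ssh remote1 'cat ~/.ssh/id_r2r.pub' | ssh root@remote2 'cat >> ~/.ssh/authorized_keys'

after which rsync's -e option becomes something like "ssh -i ~/.ssh/id_r2r -l root -p 10000".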

Can't we somehow exploit the fact that we already have passwordless SSH to both remotes from our local machine?
The answer is again: yes. If we use ssh-agent (or gpg-agent, which can store SSH keys as well), whose job is to read and store private SSH keys, we can take advantage of the -A option to SSH (which can also be specified in ~/.ssh/config as ForwardAgent) to forward our agent connection to remote1; there, the keys held by the agent will be accessible and thus usable for passwordless login on remote2 (well, on [localhost]:10000 actually, which is how remote1 sees it). Simply put, forwarding the SSH agent means that all SSH authentication challenges are relayed back to the local machine, so in particular it's possible to take advantage of locally available keys even for authentications happening remotely. (Be sure to read and understand the security implications of using -A, as explained in the man page.)
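
On the local machine, this could be as simple as (assuming the key that grants access to remote2 lives in ~/.ssh/id_ed25519; adjust the path as needed):

$ eval "$(ssh-agent)"          # start an agent, if one isn't running already
$ ssh-add ~/.ssh/id_ed25519    # load the key for remote2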
With an agent running on the local machine and holding the relevant key, and with agent forwarding in place, we can finally have a seamless remote-to-remote rsync:

$ ssh -t -A -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'

An annoyance with this approach is that, since remote1 stores the host key of [localhost]:10000 in its ~/.ssh/known_hosts file, if we do this:

$ ssh -t -A -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'

and then this:

$ ssh -t -A -R10000:remote3:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'

SSH will complain loudly, and rightly so, that the key for [localhost]:10000 has changed.
A workaround, if this kind of operation is needed frequently, is to set up some sort of mapping between remote hosts and ports used on remote1 (and stick to it).
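
Such a mapping could be as simple as this (the port assignments are of course hypothetical):

# convention: remote2 -> port 10000, remote3 -> port 10001, and so on
$ ssh -t -A -R10001:remote3:22 remote1 'rsync -e "ssh -l root -p 10001" -avz /src/dir/ localhost:/dest/dir/'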

A slightly better method could be to clean up the relevant entry from remote1's ~/.ssh/known_hosts file just before starting the transfer (e.g. with sed or some other tool), and then use StrictHostKeyChecking=no to have the key automatically added without confirmation, for example:

# cleanup, then do the copy
$ ssh remote1 'sed -i "/^\[localhost\]:10000 /d" .ssh/known_hosts'
$ ssh -t -A -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000 -o StrictHostKeyChecking=no" -avz /src/dir/ localhost:/dest/dir/'
Warning: Permanently added '[localhost]:10000' (ED25519) to the list of known hosts.
sending incremental file list
...

Update 13/02/2015: it turns out that ssh-keygen, despite its name, has an option (-R) to remove a host key from the known_hosts file, so it can be used instead of sed in the above example:

# ssh-keygen -R '[localhost]:10000'
# Host [localhost]:10000 found: line 21 type ED25519
/root/.ssh/known_hosts updated.
Original contents retained as /root/.ssh/known_hosts.old

However, it leaves behind a file with the .old suffix, and outputs a message which can't be suppressed with -q, despite what the man page says, so one would need to resort to shell redirection if silent operation is wanted.
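
Something along these lines should do (also removing the leftover .old file):

$ ssh remote1 'ssh-keygen -R "[localhost]:10000" >/dev/null 2>&1; rm -f ~/.ssh/known_hosts.old'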