Three text processing tasks

Just three problems that came up in different circumstances in the last couple of months.

Ranges, again

Ranges strike again; this time the task is to print or select everything from the first occurrence of /START/ in the input to the last occurrence of /END/, with or without the extremes. So, given this sample input:

 1 xxxx
 2 xxxx
 3 END
 4 aaa
 5 START
 6 START
 7 zzz
 8 START
 9 hhh
10 END
11 ppp
12 END
13 mmm
14 START

we want to match from line 5 to 12 (or from line 6 to 11 in the noninclusive version).

The logic is something along the lines of: when /START/ is seen, start collecting lines. Each time an /END/ is seen (and /START/ was previously seen), print what we have so far, empty the buffer and start collecting lines again, in case we see another /END/ later.

Here's an awk solution for the inclusive case:

awk '!ok && /START/ { ok = 1 }
ok { p = p sep $0; sep = RS }
ok && /END/ { print p; p = sep = "" }' file.txt

and here's the noninclusive case, which is mostly the same code with the order of the blocks reversed:

awk 'ok && /END/ { if (content) print p; p = sep = "" }
ok { p = p sep $0; sep = RS; content = 1 }
!ok && /START/ { ok = 1 }' file.txt

The "content" variable is necessary for the obscure corner case in which the input contains something like

...
START

END
...

If we relied upon "p" not being empty to decide whether to print or not, this case would be indistinguishable from this other one:

...
START
END
...

We could also (perhaps a bit cryptically) avoid the extra variable and rely on "sep" being set instead. We keep the extra variable for the sake of clarity.
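
A sketch of that variant (noninclusive case; sep is only set once at least one line has been collected, so testing it is equivalent to testing "content"):

awk 'ok && /END/ { if (sep) print p; p = sep = "" }
ok { p = p sep $0; sep = RS }
!ok && /START/ { ok = 1 }' file.txt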

Here are two sed solutions implementing the same logic (not really recommended, but the original request was to solve this with sed). The hold buffer is used to accumulate lines.
Inclusive:

# sed -n
# from first /START/ to last /END/, inclusive version

/START/ {
  H
  :loop
  $! {
    n
    H
    # if we see an /END/, sanitize and print
    /END/ {
      x
      s/^\n//
      p
      s/.*//
      x
    }
    bloop
  }
}

The noninclusive version uses the same logic, except that we discard the first /START/ line that we see (done by the "n" in the loop), and, when we see an /END/, we print what we have accumulated so far, which crucially does not include the /END/ line itself (the /END/ line is however included for the next round of accumulation).

# sed -n
# from first /START/ to last /END/, noninclusive version

/START/ {
  :loop
  $! {
    n
    /END/ {
      # recover lines accumulated so far
      x

      # if there is something, print
      /./ {
        # remove leading \n added by H
        s/^\n//
        p
      }

      # empty the buffer
      s/.*//

      # recover the /END/ line for next round
      x
    }
    H
    bloop
  }
}

Note that the above solutions assume that no line matches both /START/ and /END/. Other solutions are of course possible.
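
For reference, assuming the two scripts above are saved to files (the names here are arbitrary), they can be run like this:

$ sed -n -f inclusive.sed file.txt
$ sed -n -f noninclusive.sed file.txt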

Conditional line join

In this case we have some special lines (identified by a pattern). Every time a special line is seen, all the previous or following lines should be joined to it. An example to make it clear, using /SPECIAL/ as our pattern:

SPECIAL 1
line2
line3
SPECIAL 2
line5
line6
line7
SPECIAL 3
SPECIAL 4
line10
SPECIAL 5

So we want one of the two following outputs, depending on whether we join the special lines to the preceding or the following ones:

# join with following lines
SPECIAL 1 line2 line3
SPECIAL 2 line5 line6 line7
SPECIAL 3
SPECIAL 4 line10
SPECIAL 5
# join with preceding lines
SPECIAL 1
line2 line3 SPECIAL 2
line5 line6 line7 SPECIAL 3
SPECIAL 4
line10 SPECIAL 5

The sample input has been artificially crafted to work with both types of join; in real inputs, either the first or the last line typically won't match /SPECIAL/, depending on the processing needed.

So here's some awk code that joins each special line with the following ones, until a new special line is found, thus producing the first of the two outputs shown above:

awk -v sep=" " '/SPECIAL/ && done == 1 {
  print ""
  s = ""
  done = 0
}
{
  printf "%s%s", s, $0
  s = sep
  done = 1
}
END {
  if (done) print""
}' file.txt

And here's the idiomatic solution to produce the second output (join with preceding lines):

awk -v sep=" " '{ ORS = /SPECIAL/ ? RS : sep }1' file.txt

The variable "sep" should be set to the desired separator to be used when joining lines (here it's simply a space).

Intra-block sort

(for want of a better name)

Let's imagine an input file like

alpha:9832
alpha:11
alpha:449
delta:23847
delta:113
gamma:1
gamma:10
gamma:100
gamma:101
beta:5768
beta:4

The file has sections, where the first field names the section (alpha, beta etc.). Now we want to sort each section according to its second field (numeric), but without changing the overall order of the sections. In other words, we want this output:

alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768

As a variation, blocks can be separated by a blank line, as follows:

alpha:9832
alpha:11
alpha:449

delta:23847
delta:113

gamma:1
gamma:10
gamma:100
gamma:101

beta:5768
beta:4

So the corresponding output should be

alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768

Shell

The blatantly obvious solution using the shell is to number each section by adding a new field at the beginning, then sort according to field 1 plus field 3, and finally print the result, removing the extra field that we added:

awk -F ':' '$1 != prev {count++} {prev = $1; print count FS $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768

Instead of reusing awk, the job of the last part of the pipeline could have been done for example with cut or sed.
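
For example, either of these would do as the final stage (both strip everything up to and including the first ":"):

... | sort -t ':' -k1,1n -k3,3n | cut -d ':' -f 2-
... | sort -t ':' -k1,1n -k3,3n | sed 's/^[^:]*://'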

For the variation with separated blocks, an almost identical solution works. Paragraphs are numbered by prepending a new field, the result is sorted, and the prepended numbers are removed before printing:

awk -v count=1 '/^$/{count++}{print count ":" $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768

A crucial property of this solution is that empty lines are always thought of as being part of the next paragraph (not the previous one), so when sorting they remain where they are. This also means that runs of empty lines in the input are preserved in the output.

Perl

The previous solutions treat the input as a single entity, regardless of how many blocks it has. After preprocessing, sort is applied to the whole data, and if the file is very big, many temporary resources (disk, memory) are needed to do the sorting.

Let's see if it's possible to be a bit more efficient and sort each block independently.

Here is an example with perl that works with both variations of the input (without and with separated blocks).

#!/usr/bin/perl

use warnings;
use strict;

sub printblock {
  print $_->[1] for (sort { $a->[0] <=> $b->[0] } @_);
}

my @block = ();
my ($prev, $cur, $val);

while(<>){

  my $empty = /^$/;

  if (!$empty) {
    ($cur, $val) = /^([^:]*):([^:]*)/;
    chomp($val);
  }

  if (@block && ($empty || $cur ne $prev)) {
    printblock(@block);
    @block = ();
  }

  if ($empty) {
    print;
  } else {
    push @block, [ $val, $_ ];
    $prev = $cur;
  }
}

printblock(@block) if (@block);

Of course all the sample code given here must be adapted to the actual input format.

File encryption on the command line

This list is just a reference which hopefully saves some googling.

Let's make it clear that we're talking about symmetric encryption here, that is, a password (or better, a passphrase) is supplied when the file is encrypted, and the same password can be used to decrypt it. No public/private key stuff or other preparation should be necessary. We want a quick and simple way of encrypting stuff (for example, before moving it to the cloud or to an offsite backup not under our control). As said, file encryption, not whole filesystems or devices.

Another important thing is that symmetric encryption is vulnerable to brute force attacks, so a strong password should always be used and the required level of security should always be evaluated. It may be that symmetric encryption is not the right choice for a specific situation.

It is worth noting that the password or passphrase supplied to the commands is not used directly for encryption/decryption, but rather is used to derive the actual encryption/decryption keys. However this is done transparently by the tools (usually through some sort of hashing) and for all practical purposes, these passwords or passphrases are the keys, and should be treated as such.

In particular, one thing that should be avoided is putting them directly on the command line. Although some tools allow that, the same tools generally also offer options to avoid it, and they should definitely be used.

Openssl

Probably the simplest and most commonly installed tool is openssl.

# Encrypt
$ openssl enc -aes-192-cbc -in plain.txt -out encrypted.enc
# Decrypt
$ openssl enc -d -aes-192-cbc -in encrypted.enc -out plain.txt

The above is the basic syntax. The cipher name can of course be different; the man page for the enc openssl subcommand lists the supported algorithms (the official docs also say: "The output of the enc command run with unsupported options (for example openssl enc -help) includes a list of ciphers, supported by your version of OpenSSL, including ones provided by configured engines." Still, it seems that adding a regular -help or -h option wouldn't be too hard). Other useful options:

  • -d to decrypt
  • -pass to specify a password source. In turn, the argument can have various formats: pass:password to specify the password directly in the command, env:var to read it from the environment variable $var, file:pathname to read it from the file at pathname, fd:number to read it from a given file descriptor, and stdin to read it from standard input (equivalent to fd:0, but NOT equivalent to reading it from the user's terminal, which is the default behavior if -pass is not specified)
  • -a to base64-encode the encrypted file (or assume it's base64-encoded if decrypting)

Openssl can also read the data to encrypt from standard input (if no file is specified with -in) and/or write to standard output (if -out is not given). Example with password from file:

# Encrypt
$ tar -czvf - file1 file2 ... | openssl enc -aes-192-cbc -pass file:/path/to/keyfile -out archive.tar.gz.enc
# Decrypt
$ openssl enc -d -aes-192-cbc -pass file:/path/to/keyfile -in archive.tar.gz.enc | tar -xzvf -
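
As a sketch of another password source, here's env: combined with -a (base64 armor); MYPASS is an arbitrary variable name:

$ export MYPASS=supersecretpassphrase
# Encrypt (base64-encoded output)
$ openssl enc -aes-192-cbc -a -pass env:MYPASS -in plain.txt -out encrypted.asc
# Decrypt
$ openssl enc -d -aes-192-cbc -a -pass env:MYPASS -in encrypted.asc -out plain.txt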

GPG

There are two main versions of GPG, the 1.x series and the 2.x series (respectively 1.4.x and 2.0.x at the time of writing).

gpg comes with a companion program, gpg-agent, that can be used to store and retrieve the passphrases used to unlock private keys, in much the same way that ssh-agent caches password-protected SSH private keys (actually, in addition to its own job, gpg-agent can optionally do the job of ssh-agent and replace it). Using gpg-agent is optional with gpg 1.x, but mandatory with gpg 2.x. In practice, when doing symmetric encryption, the agent is not used, so we won't talk about it here (although we will briefly mention it later when talking about aespipe, since that tool can use it).

GPG 1.4.x
# Encrypt file
$ gpg --symmetric --cipher-algo AES192 --output encrypted.enc plain.txt
# Decrypt file
$ gpg --decrypt --output plain.txt encrypted.enc

# Encrypt stdin to file
$ tar -czvf - file1 file2 ... | gpg --symmetric --cipher-algo AES192 --output archive.tar.gz.enc
# Decrypt file to stdout
$ gpg --decrypt archive.tar.gz.enc | tar -xzvf -

Useful options:

  • -a (when encrypting) create ascii-armored file (ie, a special text file)
  • --cipher-algo ALG (when encrypting) use ALG as cipher algorithm (run gpg --version to get a list of supported ciphers)
  • --batch avoid asking questions to the user (eg whether to overwrite a file). If the output file exists, the operation fails unless --yes is also specified
  • --yes assume an answer of "yes" to most questions (eg when overwriting an output file, which would otherwise ask for confirmation)
  • --no-use-agent to avoid the "gpg: gpg-agent is not available in this session" message that, depending on configuration, might be printed if gpg-agent is not running (it's only to avoid the message; as said, the agent is not used anyway with symmetric encryption)
  • --passphrase string use string as the passphrase
  • --passphrase-file file read passphrase from file
  • --passphrase-fd n read passphrase from file descriptor n (use 0 for stdin)
  • --quiet suppress some output messages
  • --no-mdc-warning (when decrypting) suppress the "gpg: WARNING: message was not integrity protected" message. Probably, a better thing to do is use --force-mdc when encrypting, so GPG won't complain when decrypting.

In any case, GPG will create and populate a ~/.gnupg/ directory if it's not present (I haven't found a way to avoid it - corrections welcome).

Similar to openssl, GPG reads from standard input if no filename is specified at the end of the command line. Getting it to write to standard output, however, is less obvious.

When encrypting, if no --output option is given, GPG will create a file with the same name as the input file, with the added .gpg extension (eg file.txt becomes file.txt.gpg), unless input comes from stdin, in which case output goes to stdout. If the input comes from a regular file and writing to standard output is desired, --output - can be used. --output can of course also be used if we want an output file name other than the default with .gpg appended.
On the other hand, when decrypting using --decrypt, output goes to stdout unless --output is used to override it. If --decrypt is not specified, GPG still decrypts, but the default operation is to decrypt to a file named like the one on the command line but with the .gpg suffix removed (eg file.txt.gpg becomes file.txt); if the specified file does not end in .gpg, then --output must be specified (--output - writes to stdout), otherwise gpg exits with an "unknown suffix" error.
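
For example (a sketch; gpg will prompt for the passphrase):

# encrypt a regular file to stdout
$ gpg --symmetric --cipher-algo AES192 --output - plain.txt > encrypted.enc
# decrypt explicitly to stdout
$ gpg --decrypt encrypted.enc > plain.txt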

GPG 2.0.x
# Encrypt file
$ gpg --symmetric --batch --yes --passphrase-file key.txt --cipher-algo AES256 --output encrypted.enc plain.txt
# Decrypt file
$ gpg --decrypt --batch --yes --passphrase-file key.txt --output plain.txt encrypted.enc

# Encrypt stdin to file
$ tar -czvf - file1 file2 ... | gpg --symmetric --batch --yes --passphrase-file key.txt --cipher-algo AES256 --output archive.tar.gz.enc
# Decrypt file to stdout
$ gpg --decrypt --batch --yes --passphrase-file key.txt archive.tar.gz.enc | tar -xzvf -

In this case, the --batch option is mandatory (and thus probably --yes too) if we don't want gpg to prompt for the passphrase and instead use the one supplied on the command line with one of the --passphrase* options. The --no-use-agent option is ignored in gpg 2.0.x, as using the agent is mandatory and thus it should always be running (even though it's not actually used when doing symmetric encryption).

aespipe

As the name suggests, aespipe only does AES in its three variants (128, 192, 256). Aespipe tries hard to prevent the user from specifying the passphrase on the command line (and rightly so), so the passphrase(s) must normally be in a file (plaintext or encrypted with GPG). It is of course possible to come up with kludges to work around these restrictions, but they are there for a reason.

Aespipe can operate in single-key mode, where only one key/passphrase is necessary, and in multi-key mode, for which at least 64 keys/passphrases are needed. With 64 keys it operates in multi-key-v2 mode, with 65 keys it switches to multi-key-v3 mode, which is the safest and recommended mode, and the one that will be used for the examples.

So we need a file with 65 lines of random garbage; one way to generate it is as follows:

$ tr -dc '[:print:]' < /dev/random | fold -b | head -n 65 > keys.txt

If the above command blocks, it means that the entropy pool of the system isn't providing enough data. Either generate some entropy by doing some work or using an entropy-gathering daemon, or use /dev/urandom instead (at the price of losing some randomness).
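
The same command with /dev/urandom:

$ tr -dc '[:print:]' < /dev/urandom | fold -b | head -n 65 > keys.txt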

Aespipe can also use a pgp-encrypted key file; more on this later. For now let's use the cleartext one.

# Encrypt a file using aes256
$ aespipe -e AES256 -P keys.txt < plain.txt > encrypted.enc
# Decrypt
$ aespipe -d -P keys.txt < encrypted.enc > plain.txt

As can be seen from the examples, given the way aespipe works (that is, as a pipe), it is not necessary to show its usage to encrypt to/from stdin/stdout, since that's its default and only mode of operation.

Useful options:

  • -C count run count rounds of hashing when generating the encryption key from the passphrase. This stretching helps to slow down brute force attacks. Recommended if using single-key mode, not needed in multi-key mode(s)
  • -e ENCALG (when encrypting) use ENCALG as cipher algorithm (AES128, AES192, AES256)
  • -h HASHALG use HASHALG to generate the actual key from the passphrase (default depends on encryption algorithm, see the man page)

One very important thing to note is that aespipe has a minimum block granularity when encrypting and decrypting; in simple terms, this means that the result of the decryption is always a multiple of this minimum (16 bytes in single-key mode, 512 bytes in multi-key modes). NULs are added to pad if needed. Here is a blatant demonstration of this fact:

$ echo hello > file.txt.orig
$ ls -l file.txt.orig
-rw-r--r-- 1 waldner users 6 Jul 11 16:52 file.txt.orig
$ aespipe -P keys.txt < file.txt.orig > file.txt.enc
$ aespipe -d -P keys.txt < file.txt.enc > file.txt.dec
$ ls -l file.txt.*
-rw-r--r-- 1 waldner users 512 Jul 11 16:58 file.txt.dec
-rw-r--r-- 1 waldner users 512 Jul 11 16:57 file.txt.enc
-rw-r--r-- 1 waldner users   6 Jul 11 16:52 file.txt.orig
$ od -c file.txt.dec 
0000000   h   e   l   l   o  \n  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0001000

Some file formats can tolerate garbage at the end (eg tar), others can't, so this is something to take into account when using aespipe. In the cases where the original size is known, it may be possible to postprocess the decrypted file to remove the padding, but this may not always be practical:

$ origsize=$(wc -c < file.txt.orig)
$ truncate -s "$origsize" file.txt.dec
# alternatively
$ dd if=file.txt.dec bs="$origsize" count=1 > file.txt.likeorig

In the cases where the exact byte size is needed and no postprocessing is possible or wanted, another tool should be used (eg gpg or openssl).

Ok, so let's now see how to use an encrypted keyfile with aespipe. The file should be encrypted with GPG, which in turn can do symmetric encryption (as previously seen in this same article) or public-key encryption (using a public/private key pair, which should be already generated and available - not covered here).
Let's encrypt our keys.txt file both with symmetric and with public-key encryption (separately):

# using symmetric encryption
$ gpg --symmetric --output keys.enc.sym keys.txt
# enter passphrase, or use some --passphrase* option to specify one

# using public key encryption
$ gpg --encrypt --recipient 199705C4 --output keys.enc.pubk keys.txt
# no passphrase is required, as only the public key is used to encrypt
# here "199705C4" is the id of the (public) key

Now, we want to encrypt or decrypt some file using the keys contained in our password-protected keyfile(s). This is done using the -K option (instead of -P) with aespipe. Let's start with the symmetrically encrypted keyfile (keys.enc.sym):

# encrypt
$ aespipe -e aes256 -K keys.enc.sym < plain.txt > encrypted.enc
# aespipe prompts for the gpg passphrase to decrypt the keyfile

# decrypt
$ aespipe -d -e aes256 -K keys.enc.sym < encrypted.enc > plain.txt
# same thing, passphrase for keyfile is prompted

Now with the public-key encrypted keyfile:

# encrypt
$ aespipe -e aes256 -K keys.enc.pubk < plain.txt > encrypted.enc
# to decrypt keys.enc.pubk, the private key is needed, 
# aespipe prompts for the passphrase to unlock the private key

# decrypt
$ aespipe -d -e aes256 -K keys.enc.pubk < encrypted.enc > plain.txt
# same thing, passphrase to unlock the private key is prompted

So far, nothing special. However, for this last case (keyfile encrypted with public key cryptography), aespipe can actually use gpg-agent (if it's running) to obtain the passphrase needed to unlock the private key. This is done with the -A option, which tells aespipe the path to the socket where gpg-agent is listening. Assuming gpg-agent has already seen the passphrase to unlock the private key, it can transmit it to aespipe.

# The gpg-agent socket information is in the GPG_AGENT_INFO environment variable
# in the session where the agent is running, or one to which the variable has been exported. For example:
$ echo "$GPG_AGENT_INFO"
/tmp/gpg-gXM3Pm/S.gpg-agent:4897:1
# encrypt using a public-key encrypted keyfile, but tell aespipe to ask gpg-agent for the passphrase
$ aespipe -e aes256 -A "$GPG_AGENT_INFO" -K keys.enc.pubk < plain.txt > encrypted.enc
# similar for decryption

Other utilities

Let's have a look at some other utilities that are simpler but lack the flexibility provided by the previous ones.

mcrypt

This seems to be almost unusable, as doing practically anything beyond simple, optionless encryption produces a message like

Signal 11 caught. Exiting.

so it doesn't seem to be a good candidate for serious use. Some research shows many users in the same situation. More information is welcome.

aescrypt

This is a little-known program; however, aescrypt is open source and very simple to use. It is multiplatform and even has a GUI for graphical operation. Here we'll use the command-line version.

# encrypt a file
$ aescrypt -e -p passphrase file.txt
# creates file.txt.aes

# decrypt a file
$ aescrypt -d -p passphrase file.txt.aes
# creates file.txt

# encrypt standard input
$ tar -czvf - file1 file2 ... | aescrypt -e -p passphrase - -o archive.tar.gz.aes

# decrypt to stdout
$ aescrypt -d -p passphrase -o - archive.tar.gz.aes | tar -xzvf -

If no -p option is specified, aescrypt interactively prompts for the passphrase.
If no -o option is specified, a file with the same name and the .aes suffix is created when encrypting, and one with the .aes suffix removed when decrypting.

Since putting passwords directly on the command line is bad, it is possible to put the passphrase in a file and tell aescrypt to read it from the file. However, the file is not a simple text file; it has to be in a format that aescrypt recognizes. To create it, the documentation suggests using the aescrypt_keygen utility as follows:

$ aescrypt_keygen -p somesupercomplexpassphrase keyfile.key

The aescrypt_keygen program is only available in the source code package and not in the binary one (at least in the Linux version). However, since this file, according to the documentation, is nothing more than the UTF-16 encoding of the passphrase string, it's easy to produce the same result without the dedicated utility:

# generate keyfile
$ echo somesupercomplexpassphrase | iconv -f ascii -t utf-16 > keyfile.key

Once we have a keyfile, we can encrypt/decrypt using it:

$ aescrypt -e -k keyfile.key file.txt
# etc.

ccrypt

The ccrypt utility is another easy-to-use encryption program that implements the AES(256) algorithm. Be sure to read the man page and the FAQ.

Warning: when not reading from standard input, ccrypt overwrites the source file with the result of the encryption or decryption. This means that, if the encryption process is interrupted, a file could be left in an only partially encrypted state. On the other hand, when encrypting standard input this (obviously) does not happen. Sample usage:

# encrypt a file; overwrites the unencrypted version, creates file.txt.cpt
$ ccrypt -e file.txt

# decrypt a file; overwrites the encrypted version, creates file.txt
$ ccrypt -d file.txt.cpt

In this mode, multiple file arguments can be specified, and they will all be encrypted/decrypted. It is possible to recursively encrypt files contained in subdirectories if the -r/--recursive option is specified.

If no files are specified, ccrypt operates like a pipe:

# Encrypt standard input (example)
$ tar -czvf - file1 file2 ... | ccrypt -e > archive.tar.gz.enc
# Decrypt to stdout (example)
$ ccrypt -d < archive.tar.gz.enc | tar -xzvf -

To use the command non-interactively, it is possible to specify the passphrase in different ways:

  • -K|--key passphrase: directly in the command (not recommended)
  • -E|--envvar var: the passphrase is the content of environment variable $var

A useful option is -x|--keychange, which allows changing the passphrase of an already encrypted file; the old and new passphrases are prompted for (or specified on the command line with -K/-H (--key/--key2) or -E/-F (--envvar/--envvar2) respectively), and the file is decrypted with the old passphrase and reencrypted with the new one.
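
A sketch of a non-interactive passphrase change using environment variables (the variable names are arbitrary):

$ export OLDKEY=oldpassphrase NEWKEY=newpassphrase
$ ccrypt -x -E OLDKEY -F NEWKEY file.txt.cpt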

7-zip

The compression/archiving utility 7-zip can apparently do AES256 encryption, deriving the encryption key from the passphrase specified by the user with the -p option:

# encrypt/archive, prompt for passphrase
$ 7z a -p archive.enc.7z file1 file2 ...

# encrypt/archive, passphrase on the command line
$ 7z a -ppassphrase archive.enc.7z file1 file2 ...

# encrypt/archive standard input (prompt for passphrase)
$ tar -cvf - file1 file2 ... | 7z a -si -p archive.enc.tar.7z

# decrypt/extract, prompt for passphrase
$ 7z x -p archive.enc.7z [ file1 file2 ... ]

# decrypt/extract, passphrase on the command line
$ 7z x -ppassphrase archive.enc.7z [ file1 file2 ... ]

# decrypt/extract to stdout (prompt for passphrase)
$ 7z x -so -p archive.enc.tar.7z | tar -xvf -

It looks like there's no way to run in batch (ie, non-interactive) mode without explicitly specifying the passphrase on the command line.

A simple SameGame implementation

Writing a game is a good way to learn and/or practice a new language, so here it is: a samegame implementation written in Python using the good pygame library.

Link to download: same.py. The only dependency is the pygame library. The game should work with python2 and python3, on all the platforms where python is available (tested on Linux and Windows).

Screenshot:

[image: same-screenshot]

Running the program with -h or --help shows the supported options:

$ ./same.py -h
Usage:
same.py [ -h|--help ]
same.py [ -l|--load f ] [ -g|--gameid n ] [ -c|--colors n ] [ -s|--cellsize n ] [ -x|--cols n ] [ -y|--rows n ]

-h|--help        : this help
-l|--load f      : load saved game from file "f" (disables all options)
-g|--gameid n    : play game #n (default: random between 0 and 100000)
-c|--colors n    : use "n" colors (default: 5)
-s|--cellsize n  : force a cellsize of "n" pixels (default: 30)
-x|--cols n      : force "n" columns (default: 17)
-y|--rows n      : force "n" rows (default: 15)

During the game, the following keybindings are supported:

u       undo move
ctrl-r  redo move
r       restart current game (same number)
n       start new game (different number)
q/ESC   exit the game
ctrl-s  save the current state of the game (for later retrieval with --load)
1-3     change the color scheme
a       toggle highlighting of current cell group

Some random notes:

  • At any time, the current state of the game is held in a big dictionary called gameinfo. When saving the game, this data structure is serialized to a file using version 2 of the pickle protocol (so it can be read both from python2 and python3). Games are saved in the current directory using a filename like "_samepy.64622-20.sav" where the two numbers indicate respectively the game number and the current move at the time of saving.
    It would have been nice to use some more standard format (eg JSON), but the data structures used here cannot be serialized into JSON (eg dictionaries with tuples as keys). (Ok, I cheated: with some work it is in fact possible using custom encoders and decoders, but here it's probably not worth the effort.)
  • It is possible to override the default values for cellsize, rows and columns, even all three at the same time (within reason). If overriding one or more of these values results in too small/big cells, or too few/many rows or columns, an error is printed.
  • The game number that can be specified with -g is used to seed the random number generator before (randomly) populating the game board, so, on the same machine and with the same python version, the same number will always produce the same game layout. If the python major version changes, that is no longer true: game #100 with python2 is different from game #100 with python3. It might even change between, say, python3.3 and python3.4 (although it seems not to), or when using the same python version on different machines; more information is welcome, as usual.
  • By default 5 colors are used; this can be changed with the -c command line switch. The fewer the colors, the easier it is to solve the game; with two colors success is practically certain. There are three different palettes (ie, color schemes) that can be activated during the game with the keys 1-3. If you don't like them (I don't like them too much, but I'm also too lazy), or want to add more palettes, it's easy to find the place in the code where they can be changed.
  • There is more than one way to keep game history for undo/redo purposes. One could just remember the moves made by the player (ie the groups of cells that were removed at each turn), and upon undo/redo go backwards/forwards in this history, each time readding/removing a group of cells and recalculating the resulting board after the insertion or removal. This needs little memory to save the game history, but needs some calculation for each undo and redo. It's true that one of those two functions (the one that removes the cells) must be written anyway, to allow the player to actually play.
    However, the approach followed here is to separately save each board layout in sequence, and designate one of those states as "current" using an index into the sequence. This way, undo and redo are as simple as updating this index to point to the previous or the next saved state respectively (ie, subtracting or adding 1 to it). Restarting the game is (almost) just a matter of setting the pointer to move 0.
    So undo/redo/restart are very simple, but more memory is used to store all the information (this is also apparent by the size of the serialized saved game).
    In retrospect, if I were to rewrite it, I would probably use the first approach.
  • The scoring system is quite simple: removing a group of N cells scores N^2 points. This differs slightly from other implementations of the game.
  • For some reason, the game is slow on machines with few resources. The highlighting of the current cell group, for example, has a certain lag, and so has the removal of cells following a click. It is possible to toggle highlighting on/off using the a key during the game, which makes it a bit better. The algorithms are certainly not optimal, however I think that alone doesn't explain these delays. Is it really all redrawing overhead? More info welcome.

Port mirroring with Linux bridges

Many commercial switches allow replication of traffic from one or more ports to one designated port (usually chosen by the user) for monitoring and analysis purposes. Some models offer the option to choose whether to replicate only incoming or outgoing traffic (or both, of course).
Typical uses cases for this are traffic analysis systems like IDS/IPS, but it can also be used for troubleshooting.

This feature goes by many names, among which are "SPAN", "port mirroring", "port monitoring", "monitor mode", "roving" and surely others. Although the actual setup procedures vary from vendor to vendor (or even from model to model), what they do in the end is the same.
There can be differences, however, in the way tagged (ie VLAN) packets are mirrored; in some cases, VLAN tags are stripped from the mirrored copy.

Since Linux implements at least two types of bridging (nowadays used mostly to create virtual networks to connect virtual machines), one may wonder whether port mirroring is possible. The answer is yes, although the procedures may be a bit tricky. So let's see how to set up port mirroring under Linux with the two prevailing bridging implementations (Openvswitch and in-kernel bridging), plus another kludge at the end.

Openvswitch

Let's start with Openvswitch, the (by now not so) new, multiplatform, all-singing, all-dancing bridge implementation.

Extremely simplified, openvswitch uses a kernel module to manage the data path (ie, the actual forwarding of frames), and keeps everything else in user space. A daemon (ovs-vswitchd) manages the switch operations (a single daemon can manage multiple bridges, so only one needs to run), and another daemon (ovsdb-server) manages the database which contains the various tables that make up the configuration(s) for all the bridges managed by ovs-vswitchd.

Each of these two functions is driven by a corresponding protocol: OpenFlow for the management of flows and data paths (not mandatory), and OVSDB for the management of the switch itself (to add/remove ports, interfaces, bridges etc., and for configuration in general).

In fact, a basic installation of openvswitch runs a local OVSDB daemon, and all the various ovs-vsctl management commands (including those shown below) connect to this local OVSDB instance via a UNIX socket, asking it to carry out the tasks.

So we have our bridge ovsbr0, with three VMs connected respectively to vnet0, vnet1 and vnet2 (of course, everything remains perfectly valid and applicable if we have real physical interfaces instead).

# ovs-vsctl show
...
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "vnet2"
            Interface "vnet2"
        Port "vnet1"
            Interface "vnet1"
        Port "vnet0"
            Interface "vnet0"
...
# ovs-vsctl list bridge ovsbr0
_uuid               : 0141452d-efc1-47f8-a3b4-24f0c2bc1c36
controller          : []
datapath_id         : "00002e454101f847"
datapath_type       : ""
external_ids        : {}
fail_mode           : []
flood_vlans         : []
flow_tables         : {}
ipfix               : []
mirrors             : [8a547c29-a171-4412-b7ed-b2a1b88815de]
name                : "ovsbr0"
netflow             : []
other_config        : {}
ports               : [1d1da575-73ac-4bac-8e81-1042da415103, a8333e72-cb12-4777-bf55-e339ff41ece1, ccd87251-f61f-47ff-84f3-9e8864e6c2d8, f66298f8-02e8-48cc-a2c8-92181bea2c56]
protocols           : []
sflow               : []
status              : {}
stp_enable          : false

A thing to note (besides the awkward command names, that is) is that in openvswitch, absolutely everything that can be referenced has a UUID (this is by design).
In this case, we see that the switch has three ports (plus the "internal" port that is created by default), whose UUIDs are as shown in the ports field (which is a list of values).
(Each port, in turn, may be and usually is composed of one or more interfaces, which are also objects and have their own UUIDs, but that's not relevant here).

Just to get an idea, we can fetch the actual UUIDs of our ports with this command:

# for p in vnet{0..2}; do echo "$p: $(ovs-vsctl get port "$p" _uuid)"; done
vnet0: f66298f8-02e8-48cc-a2c8-92181bea2c56
vnet1: ccd87251-f61f-47ff-84f3-9e8864e6c2d8
vnet2: a8333e72-cb12-4777-bf55-e339ff41ece1

To do mirroring with openvswitch, the first thing to do is to create and add a mirror (doh!) to the bridge.

# ovs-vsctl -- --id=@m create mirror name=mymirror -- add bridge ovsbr0 mirrors @m
cd94ea72-bb7f-4a26-816f-983a085a4bfd

The syntax may look a bit awkward, but it's not complicated (and it's well explained in the ovs-vsctl man page). We're running two commands at once, each introduced by --. The first command creates a mirror named mymirror and, thanks to the --id=@m part, saves its UUID in the "variable" @m, which remains available for later commands. And indeed we use it in the second command, which associates the newly-created mirror mymirror with the bridge ovsbr0.

As said, everything has a UUID, and mirrors are no exception: the UUID of the new mirror is output as a result of the (successful) command. Let's check:

# ovs-vsctl list bridge ovsbr0
_uuid               : 0141452d-efc1-47f8-a3b4-24f0c2bc1c36
controller          : []
datapath_id         : "00002e454101f847"
datapath_type       : ""
external_ids        : {}
fail_mode           : []
flood_vlans         : []
flow_tables         : {}
ipfix               : []
mirrors             : [cd94ea72-bb7f-4a26-816f-983a085a4bfd]
name                : "ovsbr0"
netflow             : []
other_config        : {}
ports               : [1d1da575-73ac-4bac-8e81-1042da415103, a8333e72-cb12-4777-bf55-e339ff41ece1, ccd87251-f61f-47ff-84f3-9e8864e6c2d8, f66298f8-02e8-48cc-a2c8-92181bea2c56]
protocols           : []
sflow               : []
status              : {}
stp_enable          : false

So everything is as before, but now our bridge has the mirror (and since mirrors is a list, as shown by the square brackets, there can be more than one).

Now that we have our mirror created and added to the bridge, we should configure its source and destination ports. We want to mirror all traffic going in/out of port vnet0, and we want to send it to bridge port vnet2 (where presumably we have a traffic monitoring application).

We must be careful with the terminology here. A mirror has a set of "source" and "destination" ports, but those refer only to origin ports, that is, those whose traffic we want to mirror. If a port is included in the source port set (select_src_port in openvswitch terms), its outgoing traffic will be mirrored; if it's included in the destination port set (select_dst_port), its incoming traffic will be mirrored. So if we want to mirror both incoming and outgoing traffic for vnet0, we must include it in both sets:

# f66298f8-02e8-48cc-a2c8-92181bea2c56 is the UUID of vnet0
# ovs-vsctl set mirror mymirror select_src_port=f66298f8-02e8-48cc-a2c8-92181bea2c56 select_dst_port=f66298f8-02e8-48cc-a2c8-92181bea2c56
# ovs-vsctl list mirror mymirror
_uuid               : cd94ea72-bb7f-4a26-816f-983a085a4bfd
external_ids        : {}
name                : mymirror
output_port         : []
output_vlan         : []
select_all          : false
select_dst_port     : [f66298f8-02e8-48cc-a2c8-92181bea2c56]
select_src_port     : [f66298f8-02e8-48cc-a2c8-92181bea2c56]
select_vlan         : []
statistics          : {}

Thanks to the previously introduced --id=@name feature, we could have done the same thing without having to specify the actual UUID of vnet0:

# ovs-vsctl -- --id=@vnet0 get port vnet0 -- set mirror mymirror select_src_port=@vnet0 select_dst_port=@vnet0

In general, this syntax is both clearer and easier, so we're going to use it for the remaining steps.

If we wanted to mirror both vnet0 and vnet1 in both directions, we would do:

# ovs-vsctl \
  -- --id=@vnet0 get port vnet0 \
  -- --id=@vnet1 get port vnet1 \
  -- set mirror mymirror 'select_src_port=[@vnet0,@vnet1]' 'select_dst_port=[@vnet0,@vnet1]'

So the trick is to populate select_src_port and select_dst_port with the (list(s) of) UUIDs of the ports that we're interested in.

So far we've told openvswitch which port(s) we want to mirror, but we haven't said yet to which port we want to send this mirrored traffic. That is the purpose of the output_port attribute, which again is the UUID of the port which will receive the mirrored traffic. In our case, we know that this port is vnet2, so here's how we add it:

# ovs-vsctl -- --id=@vnet2 get port vnet2 -- set mirror mymirror output-port=@vnet2
# ovs-vsctl list mirror mymirror
_uuid               : cd94ea72-bb7f-4a26-816f-983a085a4bfd
external_ids        : {}
name                : mymirror
output_port         : a8333e72-cb12-4777-bf55-e339ff41ece1
output_vlan         : []
select_all          : false
select_dst_port     : [f66298f8-02e8-48cc-a2c8-92181bea2c56]
select_src_port     : [f66298f8-02e8-48cc-a2c8-92181bea2c56]
select_vlan         : []
statistics          : {}

So if we now go to our VM connected to vnet2 we're going to see the mirrored traffic from vnet0. Try it and see.
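
For a quick check directly from the host, a tcpdump on the mirror port should now show vnet0's traffic:

# tcpdump -e -n -i vnet2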

Now that we have seen the step-by-step procedure, it should not come as a surprise that we could also have done all the above in a single command (reformatted for clarity):

# ovs-vsctl \
  -- --id=@m create mirror name=mymirror \
  -- add bridge ovsbr0 mirrors @m \
  -- --id=@vnet0 get port vnet0 \
  -- set mirror mymirror select_src_port=@vnet0 select_dst_port=@vnet0 \
  -- --id=@vnet2 get port vnet2 \
  -- set mirror mymirror output-port=@vnet2
cd94ea72-bb7f-4a26-816f-983a085a4bfd

A quick and dirty way to mirror all traffic passing through the bridge to a given port is to use the select_all property of the mirror:

# ovs-vsctl -- --id=@vnet2 get port vnet2 -- set mirror mymirror select_all=true output-port=@vnet2
# ovs-vsctl list mirror mymirror
_uuid               : cd94ea72-bb7f-4a26-816f-983a085a4bfd
external_ids        : {}
name                : mymirror
output_port         : a8333e72-cb12-4777-bf55-e339ff41ece1
output_vlan         : []
select_all          : true
select_dst_port     : []
select_src_port     : []
select_vlan         : []
statistics          : {tx_bytes=216769, tx_packets=1400}

Openvswitch mirrors preserve VLAN tags, so the traffic is received untouched.

To remove a specific mirror, the following command can be used:

# ovs-vsctl -- --id=@m get mirror mymirror -- remove bridge ovsbr0 mirrors @m

To remove all existing mirrors from a bridge:

# ovs-vsctl clear bridge ovsbr0 mirrors

Traditional bridging

Before Openvswitch came about, Linux already had (and of course still has) in-kernel bridging, which has been around since about forever.
This is a much simpler yet functional bridge implementation in the Linux kernel, which provides basic functionality like STP but not much more. In particular, there is no native port mirroring functionality.
But fear not: Linux has a powerful tool which, among a lot of other things, can also mirror traffic. We're talking of the traffic control subsystem (tc for short), which can do all sorts of magic things.
Since it's a generic framework, its capabilities (including mirroring) are not limited to bridges; this means that we can mirror traffic for any interface(s) and send it to any other(s), regardless of whether they are physical, virtual, part of a bridge or not, etc.

Indeed, for this example we're going to mirror the incoming/outgoing traffic of the interface bond0 and have it copied to the dummy interface dummy0 (very useful for testing). Replace with vnetx/vifx.y/whatever as needed. It works just the same.

First a very brief and simplified recap, since tc is very akin to a black art. Every interface, in Linux, has a so-called queuing discipline (qdisc), which basically defines the criteria used to send packets out of the interface. This is for outgoing packets; it is also possible, though not usually done, to set a qdisc for incoming traffic, whose usefulness is somewhat limited (but it is definitely used for mirroring).
These qdiscs are usually referred to as the "root qdisc" (for outgoing traffic) and the "ingress qdisc" (for incoming traffic).
So the idea is: to mirror the traffic for an interface, we configure the relevant qdisc (root and/or ingress) to mirror packets before doing anything else.

To do this, we need to attach a classifier (filter in tc speak) to the relevant qdisc. Simply put, a filter tries to match packets according to some criteria and, if the match succeeds, performs certain actions on them.

Let's start with the code to mirror incoming traffic for an interface, which is simpler. The first thing to do is to establish an ingress qdisc for the interface, as there's none by default:

# tc qdisc add dev bond0 ingress

This creates an ingress qdisc for bond0 and gives it the ffff: identifier (it's always ffff:, for any interface, so no surprises):

# tc qdisc show dev bond0
qdisc ingress ffff: parent ffff:fff1 ----------------

Now, as said, we attach a filter to it. This filter simply matches all packets, and mirrors them to dummy0. A filter is attached to a qdisc, so it must have a reference to the parent. Here's the syntax to create the filter:

# tc filter add dev bond0 parent ffff: \
    protocol all \
    u32 match u8 0 0 \
    action mirred egress mirror dev dummy0

The syntax is arcane (and, in this case, not really immediately understandable), but there are basically 3 parts. Let's break it down. The first part is the filter creation linked to the parent qdisc for interface bond0:

tc filter add dev bond0 parent ffff:

Then come the matching rules; first, we say that the match should be attempted on any protocol, since we want all the traffic:

protocol all

This is not yet part of the actual filter; it's just part of the syntax that tc needs to know which packets it should attempt to apply actual matching rules to (ok, it is effectively a filter, but not in the tc sense).
Then we give the actual filter rule:

u32 match u8 0 0

This is the syntax used to tell the u32 filter that, of the packets it's seeing (that is, all of them), all should be matched. "u32" informs the parser that a u32 match follows, and the actual matching happens in the "u8 0 0" part, which, in simple language, returns true if the first byte of the packet (u8), ANDed with 0, gives 0. Some basic knowledge of bitwise operations tells us that X AND 0 == 0 for any X, so the match is always true.

Finally, the third part of the command specifies the action that is to be executed on matching packets (again, all of them):

action mirred egress mirror dev dummy0

Here we use the mirred action, which basically has two modes of operation: mirror (which is what we want here) to, er, mirror the packet, and redirect, to, uhm, redirect it. Both do their job using the device specified in the "dev" argument. As for the "egress" part, that's the only supported mode as of this writing.

If we wanted to mirror to multiple devices, all we would have to do is to specify multiple actions:

action mirred egress mirror dev dummy0 \
action mirred egress mirror dev dummy1 ...

So if you've made it this far, you'll be happy to know that applying these rules for outgoing traffic is almost the same, just a bit more complicated. The thing is, unlike the ingress case, interfaces normally do have an egress (outgoing) qdisc, but we can't attach filters directly to it since it's a classless qdisc ("classless" just means that it can't have "child" classes and filters). So the first thing to do is add a classful egress qdisc; once we've done that, the filter is attached in the same way as for the ingress qdisc.
As a side note, the mq qdisc found on multiqueue (eg wireless) interfaces, despite claiming to be classful, doesn't seem to support direct filter attachment.

If we add a classful qdisc, we should decide which one to use, since there are a few of them. The most common ones are PRIO, CBQ and HTB. Of these, the simplest is PRIO, which is what we're going to use for our example. So without further ado, let's add our classful egress qdisc to our interface:

# tc qdisc add dev bond0 handle 1: root prio

We chose to give it the handle 1:; we could just as well have used 100: or 42:, it doesn't matter as long as we use the same number when attaching the filter.
Once we have a classful qdisc to play with, we can finally attach the filter to it, exactly in the same way as we did for the ingress qdisc:

# tc filter add dev bond0 parent 1: \
    protocol all \
    u32 match u8 0 0 \
    action mirred egress mirror dev dummy0

Now, let's bring dummy0 up and check:

# ip link set dummy0 up
# tcpdump -e -v -n -i dummy0
tcpdump: WARNING: dummy0: no IPv4 address assigned
tcpdump: listening on dummy0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:56:41.237966 00:13:72:af:11:23 > 00:16:3e:fd:aa:67, ethertype IPv4 (0x0800), length 153: (tos 0x0, ttl 64, id 57195, offset 0, flags [DF], proto TCP (6), length 139)
    192.168.1.3.17569 > 192.168.1.232.514: Flags [P.], cksum 0x84b9 (correct), seq 3603440679:3603440766, ack 1213686729, win 229, options [nop,nop,TS val 1217617195 ecr 69837571], length 87
18:56:41.238131 00:16:3e:fd:aa:67 > 00:13:72:af:11:23, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 51990, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.232.514 > 192.168.1.3.17569: Flags [.], cksum 0x9889 (correct), ack 87, win 1307, options [nop,nop,TS val 69844202 ecr 1217617195], length 0
...
18:57:06.687832 00:26:b9:72:16:99 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 14, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.7.1.1 is-at 00:26:b9:72:16:99, length 46

As can be seen above, VLAN tags are copied.

So to sum it up, here's how to enable bidirectional mirroring from bond0 to dummy0:

sif=bond0
dif=dummy0

# ingress
tc qdisc add dev "$sif" ingress
tc filter add dev "$sif" parent ffff: \
          protocol all \
          u32 match u8 0 0 \
          action mirred egress mirror dev "$dif"

# egress
tc qdisc add dev "$sif" handle 1: root prio
tc filter add dev "$sif" parent 1: \
          protocol all \
          u32 match u8 0 0 \
          action mirred egress mirror dev "$dif"

Of course, to mirror traffic for multiple source interfaces, the above (all or only half of it, depending on whether we want traffic in both or only one direction) should be repeated for each of them.
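
As an aside, recent kernels also provide the matchall classifier, which expresses the "match everything" intent more directly than the u32 trick; a sketch, assuming a kernel and iproute2 recent enough to support it:

tc filter add dev "$sif" parent ffff: \
          protocol all \
          matchall \
          action mirred egress mirror dev "$dif"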

To remove the mirroring, it's enough to delete the root and ingress qdiscs from all the involved source interfaces (the default root qdisc will be restored automatically):

tc qdisc del dev bond0 ingress
tc qdisc del dev bond0 root

Daemonlogger

So, just for the sake of it, let's see another method to mirror traffic under Linux.

There's a nice utility called daemonlogger, which, according to its description, "is able to log packets to file or mirror to another interface", which sounds just like what we are looking for. Debian has it in its standard repositories.

A quick read of the man page shows that we can use it as follows:

# daemonlogger -i bond0 -o dummy0
[-] Interface set to bond0
[-] Log filename set to "daemonlogger.pcap"
[-] Tap output interface set to dummy0
[-] Pidfile configured to "daemonlogger.pid"
[-] Pidpath configured to "/var/run"
[-] Rollover size set to 18446744071562067968 bytes
[-] Rollover time configured for 0 seconds
[-] Pruning behavior set to oldest IN DIRECTORY

-*> DaemonLogger <*-
Version 1.2.1
By Martin Roesch
(C) Copyright 2006-2007 Sourcefire Inc., All rights reserved

sniffing on interface bond0

At this point, tcpdump on dummy0 gives us all the traffic of bond0. Admittedly, less sophisticated than both Openvswitch and tc, but definitely much more "quick and dirty". It's also worth mentioning that it supports BPF filters just like tcpdump, so traffic can be filtered out before mirroring.
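
For example, to mirror only SSH traffic (a sketch; the BPF expression goes at the end of the command line, tcpdump-style):

# daemonlogger -i bond0 -o dummy0 tcp port 22
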
Nevertheless, a word of caution; the README file says, at the end:

This code is largely untested and probably completely shoddy.

Poor man’s directory tree replication

So you have this /var/lib/mysql directory that you need to copy to three other machines. A quick and dirty solution is to use ssh and tee (it goes without saying that passwordless ssh is needed, here and for all the other examples):

$ tar -C /var/lib/mysql -cvzf - . |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null

If the directory tree to be transferred is not local, it is again possible to use ssh to get to it:

$ ssh srcbox 'tar -C /var/lib/mysql -cvzf - .' |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null

This means that all the data flows from the source, through the machine where the pipeline runs, to the targets. On the other hand this solution has the advantage that there is no need to set up passwordless ssh between the origin and the target(s); the only machine that needs passwordless ssh to all the others is the machine where the command runs.

Now this is all basic stuff, but after doing this I wondered whether it would be possible to generalize the logic for a variable number of target machines, so for example a nettar-style operation could be possible, as in

$ nettar2.sh /var/lib/mysql dstbox1:/var/lib/mysql dstbox2:/var/tmp dstbox3:/var/lib/mysql ...

This would mean: take (local) /var/lib/mysql and replicate it to dstbox1 under /var/lib/mysql, to dstbox2 under /var/tmp, to dstbox3 under /var/lib/mysql, and so on for any extra arguments supplied. Arguments could have the form targetname:[targetpath], with a missing targetpath indicating the same path as the source (ie, /var/lib/mysql in this example).

It turns out that such a generalization is not easy.

Note that in the following code, all error checking and other refinements are omitted for simplicity. In particular, care should be taken at least to:

  • validate the arguments passed to the script for number (at least two) and correct syntax
  • check that paths exist (or create them if not, etc)
  • properly escape arguments to commands that are executed using ssh (for example using printf %q; see the sketch after this list)
  • validate data that is used to dynamically build commands to be run with eval

None of the above is done in the code that follows.
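
As an example of the escaping point above, a sketch using printf %q before embedding a remote path in the ssh command:

dstpath_q=$(printf '%q' "$dstpath")
ssh "$dstbox" "tar -C $dstpath_q -xvzf -"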

Concurrent transfers

An obvious way to do it is to run three (or however many) concurrent transfers, eg

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# parallel transfers
 
srcpath=$1
shift
 
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  tar -C "$srcpath" -cvzf - . | ssh "$dstbox" "tar -C '$dstpath' -xvzf -" &
done
 
wait

This simply reads $srcpath multiple times and transfers it to each target machine, without exploiting the data duplication done by tee. If the source directory is huge, this will not be efficient, as multiple processes will try to read it at once; although the OS will probably cache most of it, it doesn't look like a satisfactory solution.

So what if we actually want to use tee (which in turn implies that we need process substitution or an equivalent facility)?

Using eval

The first thing that comes to mind is to use the questionable eval command:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + eval
 
do_sshtar(){
  local dstbox=$1 dstpath=$2
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -"
}
 
declare -a args
 
srcpath=$1
shift
 
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  args+=( ">(do_sshtar '$dstbox' '$dstpath')" )
done
 
tar -C "$srcpath" -cvzf - . | eval tee "${args[@]}" ">/dev/null"

This effectively builds the full list of process substitutions at runtime and executes them. However, when using eval we should be well aware of what we're doing. See the following pages for a good discussion of the implications of using eval: http://mywiki.wooledge.org/BashFAQ/048 and http://wiki.bash-hackers.org/commands/builtin/eval.

Note that with process substitution there is also the (in this case minor) issue that the created processes are run asynchronously in background, and we have no way to wait for their full termination (not even using wait), so the script might give us back the prompt slightly before all the background processes have fully completed their job.

Coprocesses

Bash and other shells have coprocesses (see also here), so it would seem that they could be useful for our purposes.
However, at least in bash, it seems that it's not possible to create a coprocess whose name is stored in a variable (which is how we would create a bunch of coprocesses programmatically), eg:

$ coproc foo { command; }      # works
$ cname=foo; coproc $cname { command; }  # does not work as expected (creates a coproc literally named $cname)

So to use coprocesses for our task, we would need again to resort to eval.
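
A minimal sketch of what that would look like, with all of eval's usual caveats:

$ cname=foo
$ eval "coproc $cname { command; }"
$ eval "echo \${${cname}[0]} \${${cname}[1]}"    # the coprocess's file descriptors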

Named pipes

Let's see if there is some other possibility. Indeed there is, and it involves using named pipes (aka FIFOs):

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (ssh version)
 
declare -a fifos
 
srcpath=$1
shift
 
count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  curfifo=/tmp/FIFO${count}
  mkfifo "$curfifo"
  fifos+=( "$curfifo" )
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -" < "$curfifo" &
  ((count++))
done
 
tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
 
wait
# cleanup the FIFOs
rm -- "${fifos[@]}"

Here we're creating N named pipes, whose names are saved in an array, and an instance of ssh + tar to the target machine is launched in the background reading from each pipe. Finally, tee is run against all the existing named pipes to send them the data; all the FIFOs are removed at the end.
This is not too bad, but we have to manually set up the interprocess communication (ie, create/delete the FIFOs); the beauty of process substitution is that bash sets up those channels for us, and here we're not taking advantage of that.

A point to note is that here we used ssh for the data transfer; it's always possible to change the code to use netcat, as explained in the nettar article. Here's an adaptation of the last example to use the nettar method (the other cases are similar):

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (netcat version)
 
declare -a fifos
 
srcpath=$1
shift
 
count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
 
  if ssh "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    curfifo=/tmp/FIFO${count}
    mkfifo "$curfifo"
    fifos+=( "$curfifo" )
    nc "$dstbox" 1234 < "$curfifo" &
    ((count++))
  else
    echo "Warning, skipping $dstbox" >&2   # or whatever
  fi
done
 
tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
 
wait
# cleanup the FIFOs
rm -- "${fifos[@]}"

There should be some other way. I'll update the list if I discover some other method. As always, suggestions welcome.

Recursion

Update 19/05/2014: Marlon Berlin suggested (thanks) that recursion could be used to build an implicit chain of >(...) process substitutions, and indeed that's true. So here it is:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (ssh version)
 
do_sshtar(){
 
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift
 
  if [ $# -eq 0 ]; then
    # end recursion
    ssh "$dstbox" "tar -C '$dstpath' -xzvf -"
  else
    # send data to "current" $dstbox and recurse
    tee >(ssh "$dstbox" "tar -C '$dstpath' -xzvf -") >(do_sshtar "$@") >/dev/null
  fi
}
 
srcpath=$1
shift
 
tar -C "$srcpath" -czvf - . | do_sshtar "$@"

When the do_sshtar function receives only one argument, it just transfers the data directly via ssh to terminate the recursion. Otherwise, it uses tee to transfer the data and continue the recursion. Simple and elegant. Here's the netcat version:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (netcat version)
 
do_nctar(){
 
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift
 
  # set up listening nc on $dstbox
  if ssh -n "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    if [ $# -eq 0 ]; then
      # end recursion
      nc "$dstbox" 1234
    else
      # send data to "current" $dstbox and recurse
      tee >(nc "$dstbox" 1234) >(do_nctar "$@") >/dev/null
    fi
  else
    echo "Warning, skipping $dstbox" >&2
    # one way or another, we must consume the input
    if [ $# -eq 0 ]; then
      cat > /dev/null
    else
      do_nctar "$@"
    fi
  fi
}
 
srcpath=$1
shift
 
tar -C "$srcpath" -czvf - . | do_nctar "$@"

The -n switch to ssh is important, otherwise it will try to read from stdin, consuming our tar data.