
Argument juggling with awk

This seems to be a sort of FAQ. A typical formulation goes like "I have a bash array, how do I pass it to awk so that it becomes an awk array"?

Leaving aside the fact that it may be possible to extend the awk code to do whatever one is doing with the shell array (in which case the problem goes away), let's focus on how to do strictly what is requested (and more).

ARGC and ARGV

Like many other languages, awk has two special variables ARGC and ARGV that give information on the arguments passed to the awk program. ARGC contains the total number of arguments (including the awk interpreter or script), and ARGV is an array of ARGC elements (indexed from 0 to ARGC - 1) that contains all the arguments (ARGV[0] is always the name of the awk interpreter or script).
Let's demonstrate this with a simple example:

awk 'BEGIN{print "ARGC is " ARGC; for(i = 0; i < ARGC; i++) print "ARGV["i"] is " ARGV[i]}' foo bar
ARGC is 3
ARGV[0] is awk
ARGV[1] is foo
ARGV[2] is bar

There are two important things to know:

  • Unlike in many other languages, in awk ARGC and ARGV can be modified
  • When awk's main loop starts (and only then), awk processes whatever it finds in ARGV, starting from ARGV[1] up to ARGV[ARGC - 1].

Of course, these should normally be file names or variable assignments. But this is only relevant when the main loop starts; before then, in the BEGIN block we can manipulate ARGC and ARGV to our taste, and as long as what's left afterwards in ARGV is a list of files to process (or variable assignments), awk doesn't really care how those values got there.

So let's see some use cases for ARGC/ARGV manipulation.

Double pass over a file

Some code uses the two-file idiom to process the same file twice. So instead of doing

awk .... file.txt file.txt

we could just specify the file name once and double it in the BEGIN block so awk sees it twice:

# this is as if we said awk ..... file.txt file.txt
awk 'BEGIN{ARGV[ARGC++] = ARGV[1]} { ... }' file.txt
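A classic use of the two-pass idiom is computing an aggregate in the first pass and using it in the second. A minimal runnable sketch (with hypothetical sample data; NR == FNR is only true while reading the first copy of the file):

```shell
# hypothetical sample data
printf '1\n2\n3\n' > file.txt

# first pass: compute the average of column 1
# second pass: print each value and its deviation from the average
awk 'BEGIN{ARGV[ARGC++] = ARGV[1]}
NR == FNR { sum += $1; n++; next }
{ print $1, $1 - sum/n }' file.txt
# prints:
# 1 -1
# 2 0
# 3 1
```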

Fixed arguments

Let's assume that our awk code always has to process one or more files, whose names do not change. Of course, we could specify those names at each invocation of awk; nothing new here. However, for some reason we don't want to specify those names at each invocation, since they never change anyway; we only want to specify the variable file names. So if we have two never-changing files ("fixed1.txt" and "fixed2.txt"), we want to invoke our code with

process.awk file1 file2 file3 ...

but in fact we want awk to run as if we said

process.awk fixed1.txt fixed2.txt file1 file2 file3 ...

Let's see what the code to accomplish this might look like (of course it has to be adapted to the specific situation):

awk 'BEGIN {
  for(i = ARGC+1; i > 2; i--)
    ARGV[i] = ARGV[i - 2]
  ARGC += 2
  ARGV[1] = "fixed1.txt"
  ARGV[2] = "fixed2.txt"
}
# now awk processes fixed1.txt and fixed2.txt first, then whatever was specified on the command line
{
  ...
}' file1 file2 file3 ...
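A quick way to convince ourselves that the juggling works is to print each file name as it is opened. A runnable sketch with hypothetical file names:

```shell
# hypothetical files: two "fixed" ones and one variable one
printf 'a\n' > fixed1.txt; printf 'b\n' > fixed2.txt; printf 'c\n' > varfile.txt

# FNR == 1 fires once per input file, showing the processing order
awk 'BEGIN {
  for(i = ARGC+1; i > 2; i--)
    ARGV[i] = ARGV[i - 2]
  ARGC += 2
  ARGV[1] = "fixed1.txt"
  ARGV[2] = "fixed2.txt"
}
FNR == 1 { print "processing " FILENAME }' varfile.txt
# prints:
# processing fixed1.txt
# processing fixed2.txt
# processing varfile.txt
```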

Passing a shell array (and more or less arbitrary data)

So, to get back to the original question, how can we take advantage of this juggling to pass in an array? A simple way is to pass all the array elements as normal awk arguments, process them in the BEGIN block, then remove them, so that when the main loop starts awk is unaware of what happened. Let's see an example:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk 'BEGIN{
 
  # ARGV[1] is the number of elements we have
  arrlen = ARGV[1]
 
  for(i = 2; i <= arrlen + 1; i++)
    awkarr[i - 1] = ARGV[i]
 
  # clean up
  j = 1
  for(i = arrlen + 2; i < ARGC; i++)
    ARGV[j++] = ARGV[i]
  ARGC = j
}
 
# here awk starts processing from file1, unaware of what we did earlier
# but we have awkarr[] populated with the values from shellarr (and arrlen is its length)
{
  ...
}
 
' ${#shellarr[@]} "${shellarr[@]}" file1 file2

awkarr has its elements indexed starting from 1, as is customary in awk; it's easy to adapt the code to use 0-based indexing or any other starting index.
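As a sanity check, we can populate awkarr as above and print it directly from the BEGIN block; since awk with only a BEGIN block reads no input, no files are needed (hypothetical sample values):

```shell
shellarr=( 'foo' 'bar' 'baz' )

# same population logic as above, plus a loop that dumps the resulting array
awk 'BEGIN{
  arrlen = ARGV[1]
  for(i = 2; i <= arrlen + 1; i++)
    awkarr[i - 1] = ARGV[i]
  for(i = 1; i <= arrlen; i++)
    print i, awkarr[i]
}' "${#shellarr[@]}" "${shellarr[@]}"
# prints:
# 1 foo
# 2 bar
# 3 baz
```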
We could also pass the number of elements in the array as a normal value using -v, which simplifies processing somewhat:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk -v arrlen="${#shellarr[@]}" 'BEGIN{
 
  for(i = 1; i <= arrlen; i++)
    awkarr[i] = ARGV[i]
 
  # clean up
  for(i = arrlen + 1; i < ARGC; i++)
    ARGV[i - arrlen] = ARGV[i]
  ARGC -= arrlen
}
# ... as before
 
' "${shellarr[@]}" file1 file2

If the number of files to process is known (which should be the most common case), then it's even easier as we can specify them first and the array elements afterwards. Let's assume we know that we always process two files:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk -v nfiles=2 'BEGIN{
  for(i = nfiles + 1; i < ARGC; i++)
    awkarr[i - nfiles] = ARGV[i]
  arrlen = ARGC - (nfiles + 1)
  ARGC = nfiles + 1
}
# ... as before
 
' file1 file2 "${shellarr[@]}"

Finally, if we want to "pass" a shell associative array to awk (such that it exists with the same keys and values in the awk code), we could do this:

declare -A shellarr
shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' [xxxk]='xxxv' [yyyk]='yyyv' )
 
awk -v nfiles=2 'BEGIN{
  arrlen = ( ARGC - (nfiles + 1) ) / 2
  for(i = nfiles + 1; i < nfiles + 1 + arrlen; i++)
    awkarr[ARGV[i]] = ARGV[i + arrlen]
  ARGC = nfiles + 1
}
# ... as before
 
' file1 file2 "${!shellarr[@]}" "${shellarr[@]}"

This works because in bash, the order of expansion of "${!shellarr[@]}" and "${shellarr[@]}" is the same (currently, at least). To be 100% sure, however, we could of course copy all the key/value pairs to another array and pass that one, as in the following example:

declare -A shellarr
shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' [xxxk]='xxxv' [yyyk]='yyyv' )
 
declare -a temp
for key in "${!shellarr[@]}"; do
  temp+=( "$key" "${shellarr[$key]}" )
done
 
awk -v nfiles=2 'BEGIN{
  arrlen = ( ARGC - (nfiles + 1) ) / 2
  for(i = nfiles + 1; i < ARGC; i += 2)
    awkarr[ARGV[i]] = ARGV[i + 1]
  ARGC = nfiles + 1
}
# ... as before
 
' file1 file2 "${temp[@]}"

In the last two examples, it should be noted that, as usual with associative arrays, the concept of array "length" doesn't make much sense; it's just an indication of how many elements the hash has, and nothing more (in awk, all arrays are associative regardless, though they can be used as "normal" ones as we did in the first examples).

Update 31/10/2013: So there's always something new to learn, and in my case it was that if an element of ARGV is the empty string, awk just skips it. This simplifies the examples where the ARGV elements are moved down to fill the positions where the shell array elements were. In fact, all that's needed is to set those elements to "", and awk will naturally skip them. So the first two examples above become:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk 'BEGIN{
 
  # ARGV[1] is the number of elements we have
  arrlen = ARGV[1]
  ARGV[1] = ""
 
  for(i = 2; i <= arrlen + 1; i++) {
    awkarr[i - 1] = ARGV[i]
    ARGV[i] = ""
  }
}
...' ${#shellarr[@]} "${shellarr[@]}" file1 file2

Second example:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk -v arrlen="${#shellarr[@]}" 'BEGIN{
  for(i = 1; i <= arrlen; i++) {
    awkarr[i] = ARGV[i]
    ARGV[i] = ""
  }
}
...' "${shellarr[@]}" file1 file2
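A minimal check of the empty-string behavior (hypothetical file names): blank out the first ARGV element in the BEGIN block and verify that only the second file is read by the main loop:

```shell
printf 'one\n' > skipme.txt
printf 'two\n' > readme.txt

# ARGV[1] is set to "", so awk silently skips skipme.txt
awk 'BEGIN{ARGV[1] = ""} {print FILENAME ": " $0}' skipme.txt readme.txt
# prints:
# readme.txt: two
```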

Quick file sharing over HTTP

Download here: tws.pl. Note that a recent version of Perl is required (definitely works with 5.18).

This is (hopefully) an evolution (perhaps suffering from creeping featurism) of the excellent wwwshare (thanks pgas), which itself is based on Vidar's one (which gets the credit for the original idea). This is a simple throwaway web server (tws), or rather something that pretends to be one to a client, which can be useful when we need to quickly transfer a file or some data to a friend or remote party. The program prints a list of URLs, and the remote end can then download the file by pointing a normal HTTP client (browser, curl, whatever) at one of these URLs. As the original author says, "when the file is downloaded, it exits. No setup or cleanup required".

The new features are:

  • Written in Perl
  • MIME support (to help the client know the file type)
  • Progress bar!
  • Streaming mode, using chunked transfer encoding (introduced by HTTP 1.1)

Usage

Run the program with -h to see a summary:

$ tws.pl -h
Usage:
tws.pl [ -a ] [ -u ] [ -n ] [ -b bufsize ] [ -p port ] [ -m mimetype ] [ -U url ] [ -f filename ] name

-a          : consider all addresses for URLs (including loopback and link-local addresses)
-u          : flush output buffer as soon as it's written
-n          : do not resolve IPs to names
-b bufsize  : read/write up to bufsize bytes per cycle (default: 16384)
-p port     : listen on this port (default: random)
-m mimetype : force MIME type (default: autodetect if possible, otherwise application/octet-stream)
-U url      : include this URL among the listed alternative URLs
-f filename : use 'filename' to build the request part of the URL (default: dynamically computed)

'name' (mandatory argument) must exist in normal mode; in streaming mode it's only used to build the URL

Examples:
$ tws.pl -p 1025 /path/to/file.zip
Listen for connections on port 1025; send file.zip upon client connection. The specified path must exist.

$ tws.pl -p 4444 -U 'publicname.example.com:5555' -f archive.zip '/path/to/funny file.zip'
Listen on port 4444, suggest http://publicname.example.com:5555/archive.zip as download URL (presumably a port forwarding exists)

$ tar -cjf - file1 file2 file3 | tws.pl -m application/x-bzip2 result.tbz2
Listen on random port; upon connection, send the data coming from the pipe with the specified MIME type.
result.tbz2 need not exist; it's only used to build the URL

In the simplest case, one just does

$ tws.pl /path/to/some/file.iso
Listening on port 8052, MIME type is application/x-iso9660-image

Possible URLs that should work to retrieve the file:

http://scooter.example.com:8052/file.iso
http://10.4.133.1:8052/file.iso
http://[2001:db8:1::2]:8052/file.iso

Hopefully at least one of the printed URLs is valid and can be communicated to the other party, which then connects to download the file:

Client connected: 10.112.1.18 (colleague.example.com) from port 51066
 100% [=======================================================>]   3,224,686,592 (29s) 104.8M/s

The listening port is random; it is possible to force a specific value if needed (see the help). The part after the / in the URL is determined based on the supplied filename, to give some hint to the client or browser that downloads the file. Here too it is possible to force a specific string.

If the program detects that its standard input is connected to a pipe, it automatically operates in streaming mode, which is a fancy way of saying that it reads from standard input rather than from a given file. A filename should still be specified, though, so the download URL can be "correctly" built (to be more helpful to the client). Streaming mode means that one can do something like this, for instance:

$ tar -cjf - file1 file2 file3 | tws.pl -m application/x-bzip2 result.tbz2
Listening on port 8787 (streaming mode), MIME type is application/x-bzip2

Possible URLs that should work to retrieve the file:

http://scooter.example.com:8787/result.tbz2
http://10.4.133.1:8787/result.tbz2
http://[2001:db8:1::2]:8787/result.tbz2

In streaming mode, the content length is of course not known, so the program sends the data using chunked transfer encoding. Since this is an HTTP 1.1 feature, HTTP 1.0-only clients will not understand it (notably, wget versions prior to 1.13 have this limitation, so don't use them to download in streaming mode). Another issue with streaming mode is that the MIME type is also not known; it's possible to give hints on the command line (see the above example and the help); in any case, the program defaults to application/octet-stream, which should always work (though it is not extremely helpful to the client).
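For the curious, the chunked encoding itself is simple: each chunk is preceded by its size in hexadecimal and a CRLF, followed by the data and another CRLF, and a zero-sized chunk terminates the body. A sketch of what the wire format looks like (not actual tws output):

```shell
# the body "hello world" sent as two chunks of 5 and 6 bytes,
# followed by the terminating zero-sized chunk
printf '5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'
```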

The program can also operate in unbuffered mode (-u), which means that data sent to the client is flushed as it is written, so the client receives it immediately. This feature, coupled with streaming mode, can be used as a rudimentary tool to send live streaming data to an HTTP client, for example like this:

$ tail -f /var/log/messages | tws.pl -u -m text/plain log.txt

or actual output from a pipeline, eg

$ bigprocessing.sh hugefile.csv | tee results.txt | tws.pl -u -m text/plain results.txt

Connecting with a browser or another HTTP client should show the data in (near) real time. This doesn't seem terribly useful, but perhaps someone can come up with a valid use case. Keep in mind that for this to work you need to make sure that whatever program is writing to the pipe is not buffering its output (many programs buffer if they detect that stdout is not a terminal). Tools like stdbuf or unbuffer help in this case. On the client side, curl has a --no-buffer/-N option that tells it to show data as it arrives without buffering. Also, it seems some browsers do a bit of initial buffering of the data they receive, after which they start showing new data in real time (more info welcome).

Notes

If the address or name in the URL that the other party should use to download is not local, the program cannot know it. In principle, this could be done (somewhat unreliably) by querying some external IP address check service like dyndns and friends, but in practice it's easier to leave this to the user, who surely knows better. Thus, it's possible to supply a URL that the user knows leads to the local machine (see help for an example). And of course, this is only so it can be copied/pasted; it doesn't really change what the program does.

The way the program works is: once a connection is received, it reads the client's HTTP request and discards it (the only check performed is that the method is GET, and even that could probably be avoided); after that, a minimal set of HTTP reply headers is sent, followed by the actual data. This means the code is simple, but it also means that picky clients that only accept certain encodings, expect specific headers or rely on other special features will probably not work. If more sophisticated behavior is desired, use a real web server (of which there are many).

The code makes a number of assumptions and uses some tools that practically make it very Linux-specific; it has not been tested under other platforms. Also it relies on some external programs to get some information (local IPs, terminal size, MIME types etc); none of these external programs is critical, so the absence of some or all of them will not cause failure.

URL encoding is done using the URI::Escape module, if available; otherwise, no URL encoding is performed at all. With "normal" filenames this is not a problem, however in cases where weird URLs would result, it is possible to explicitly supply a name (see help).

To handle IPv4 and IPv6 clients with a single IPv6 socket, IPv4-mapped addresses are used. The program disables the socket option IPV6_V6ONLY, so both IPv4 and IPv6 clients can be accepted regardless of the setting in /proc/sys/net/ipv6/bindv6only. However, people should be using IPv6 already!

If the terminal is resized while the program is sending data, the progress bar will NOT be resized accordingly. However, since the terminal width is not checked until after a client has connected, it is possible to resize the terminal while the program is still waiting for a client to connect.

And btw, only one client is handled. As said, for anything more complex use a real webserver.

That's it. Any comment or bug report is welcome, as usual.

Run cron job every N days

Let's say we want to run a job every N days, or weeks, regardless of month or year boundaries. For example, every third Tuesday, or once every 17 days, or whatever.

Cron itself (at least the variants I have access to) has no way to specify these time periods, so it would seem this could not be done.

But there's a simple way to do it. It is based on modular arithmetic and on the fact that we know that measurement of time on Unix starts on a concrete date, which is the well-known January the 1st, 1970 (also known as "the Epoch"). For the remainder, I'm assuming UTC and a running time of midnight for simplicity; it should be easy to consider the appropriate time differences where needed.

With this kind of requirement we need to have an actual starting date for the job, that is, when it has to run for the first time, so we can use it as a starting point for the "every N days" intervals.
Once we have an actual date of first execution for our task (say, 2013-01-15, a Tuesday, at 00:00), we can divide the time passed since the Epoch until our date into groups of N days. For this first example, let's say N == 14, two weeks. With the following calculation we can see which place our starting day occupies in a period of 14 days (two weeks):

$ echo $(( $(date +%s -d "2013-01-15 00:00") / 86400 % 14 ))
11

Dividing by 86400 gives the number of days passed since the Epoch, of which the modulo 14 is then taken. The result is 11, which tells us that, at any given time, performing the above calculation with the current date will yield 11 only on our starting date and on dates that are an exact multiple of 14 days away from it, that is, every second Tuesday (well, every 14 days, which is the same) starting from 2013-01-15 (or going backwards from it, which is not important here). Simple test code to show that it's true:

#!/bin/bash
 
# starting from 2013-01-10, calculate the modulo for each day over a period of
# 40 days, checking that only the days we're interested in have modulo 11
 
begin=2013-01-10
 
for i in {0..39}; do
  curdate=$(date +%s -d "$begin + $i days 00:00")
 
  modulo=$(( curdate / 86400 % 14 ))
 
  [ $modulo -eq 11 ] && prefix="*** " || prefix=
 
  echo "${prefix}Date $(date "+%F %T (%a)" -d @$curdate) has modulo $modulo"
done

Sample run:

$ ./modcheck.sh
Date 2013-01-10 00:00:00 (Thu) has modulo 6
Date 2013-01-11 00:00:00 (Fri) has modulo 7
Date 2013-01-12 00:00:00 (Sat) has modulo 8
Date 2013-01-13 00:00:00 (Sun) has modulo 9
Date 2013-01-14 00:00:00 (Mon) has modulo 10
*** Date 2013-01-15 00:00:00 (Tue) has modulo 11
Date 2013-01-16 00:00:00 (Wed) has modulo 12
Date 2013-01-17 00:00:00 (Thu) has modulo 13
Date 2013-01-18 00:00:00 (Fri) has modulo 0
Date 2013-01-19 00:00:00 (Sat) has modulo 1
Date 2013-01-20 00:00:00 (Sun) has modulo 2
Date 2013-01-21 00:00:00 (Mon) has modulo 3
Date 2013-01-22 00:00:00 (Tue) has modulo 4
Date 2013-01-23 00:00:00 (Wed) has modulo 5
Date 2013-01-24 00:00:00 (Thu) has modulo 6
Date 2013-01-25 00:00:00 (Fri) has modulo 7
Date 2013-01-26 00:00:00 (Sat) has modulo 8
Date 2013-01-27 00:00:00 (Sun) has modulo 9
Date 2013-01-28 00:00:00 (Mon) has modulo 10
*** Date 2013-01-29 00:00:00 (Tue) has modulo 11
Date 2013-01-30 00:00:00 (Wed) has modulo 12
Date 2013-01-31 00:00:00 (Thu) has modulo 13
Date 2013-02-01 00:00:00 (Fri) has modulo 0
Date 2013-02-02 00:00:00 (Sat) has modulo 1
Date 2013-02-03 00:00:00 (Sun) has modulo 2
Date 2013-02-04 00:00:00 (Mon) has modulo 3
Date 2013-02-05 00:00:00 (Tue) has modulo 4
Date 2013-02-06 00:00:00 (Wed) has modulo 5
Date 2013-02-07 00:00:00 (Thu) has modulo 6
Date 2013-02-08 00:00:00 (Fri) has modulo 7
Date 2013-02-09 00:00:00 (Sat) has modulo 8
Date 2013-02-10 00:00:00 (Sun) has modulo 9
Date 2013-02-11 00:00:00 (Mon) has modulo 10
*** Date 2013-02-12 00:00:00 (Tue) has modulo 11
Date 2013-02-13 00:00:00 (Wed) has modulo 12
Date 2013-02-14 00:00:00 (Thu) has modulo 13
Date 2013-02-15 00:00:00 (Fri) has modulo 0
Date 2013-02-16 00:00:00 (Sat) has modulo 1
Date 2013-02-17 00:00:00 (Sun) has modulo 2
Date 2013-02-18 00:00:00 (Mon) has modulo 3

So there we have it, every second Tuesday starting from 2013-01-15. The code shown in modcheck.sh can be made generic so that values can be passed from the command line:

#!/bin/bash
 
# use: modcheck.sh [startdate yyyy-mm-dd] [period] [wanted modulo]
 
begin=$1
length=$2
wantedmod=$3 
 
for i in {0..39}; do
  curdate=$(date +%s -d "$begin + $i days 00:00")
 
  modulo=$(( curdate / 86400 % length ))
 
  [ $modulo -eq $wantedmod ] && prefix="*** " || prefix=
 
  echo "${prefix}Date $(date "+%F %T (%a)" -d @$curdate) has modulo $modulo"
done

Another test: let's say we want every fifth day starting from 2012-12-02. Let's calculate the modulo first:

$ echo $(( $(date +%s -d "2012-12-02 00:00") / 86400 % 5 ))
0

And let's verify it:

$ ./modcheck.sh 2012-12-01 5 0
Date 2012-12-01 00:00:00 (Sat) has modulo 4
*** Date 2012-12-02 00:00:00 (Sun) has modulo 0
Date 2012-12-03 00:00:00 (Mon) has modulo 1
Date 2012-12-04 00:00:00 (Tue) has modulo 2
Date 2012-12-05 00:00:00 (Wed) has modulo 3
Date 2012-12-06 00:00:00 (Thu) has modulo 4
*** Date 2012-12-07 00:00:00 (Fri) has modulo 0
Date 2012-12-08 00:00:00 (Sat) has modulo 1
Date 2012-12-09 00:00:00 (Sun) has modulo 2
Date 2012-12-10 00:00:00 (Mon) has modulo 3
Date 2012-12-11 00:00:00 (Tue) has modulo 4
*** Date 2012-12-12 00:00:00 (Wed) has modulo 0
Date 2012-12-13 00:00:00 (Thu) has modulo 1
Date 2012-12-14 00:00:00 (Fri) has modulo 2
Date 2012-12-15 00:00:00 (Sat) has modulo 3
Date 2012-12-16 00:00:00 (Sun) has modulo 4
*** Date 2012-12-17 00:00:00 (Mon) has modulo 0
Date 2012-12-18 00:00:00 (Tue) has modulo 1
Date 2012-12-19 00:00:00 (Wed) has modulo 2
Date 2012-12-20 00:00:00 (Thu) has modulo 3
Date 2012-12-21 00:00:00 (Fri) has modulo 4
*** Date 2012-12-22 00:00:00 (Sat) has modulo 0
Date 2012-12-23 00:00:00 (Sun) has modulo 1
Date 2012-12-24 00:00:00 (Mon) has modulo 2
Date 2012-12-25 00:00:00 (Tue) has modulo 3
Date 2012-12-26 00:00:00 (Wed) has modulo 4
*** Date 2012-12-27 00:00:00 (Thu) has modulo 0
Date 2012-12-28 00:00:00 (Fri) has modulo 1
Date 2012-12-29 00:00:00 (Sat) has modulo 2
Date 2012-12-30 00:00:00 (Sun) has modulo 3
Date 2012-12-31 00:00:00 (Mon) has modulo 4
*** Date 2013-01-01 00:00:00 (Tue) has modulo 0
Date 2013-01-02 00:00:00 (Wed) has modulo 1
Date 2013-01-03 00:00:00 (Thu) has modulo 2
Date 2013-01-04 00:00:00 (Fri) has modulo 3
Date 2013-01-05 00:00:00 (Sat) has modulo 4
*** Date 2013-01-06 00:00:00 (Sun) has modulo 0
Date 2013-01-07 00:00:00 (Mon) has modulo 1
Date 2013-01-08 00:00:00 (Tue) has modulo 2
Date 2013-01-09 00:00:00 (Wed) has modulo 3

So to use all this in our crons, we need to know the starting date, the frequency (every N days) and calculate the modulo. Once the modulo is known, we run the job if the modulo calculated for "now" (when the job is invoked) matches the modulo we want. So for instance if the period is 13 days and the modulo we want is 6, in our script we do:

#!/bin/bash
 
if (( $(date +%s) / 86400 % 13 != 6 )); then exit; fi
 
# run the task here
...

Or, as usual, it can also be done in the crontab itself, so the script does not need any special knowledge (it may not even be a script, in which case the check would have to be external anyway). Note that % is special in crontab lines, so it has to be escaped with a backslash:

0 0 * * *  bash -c '(( $(date +\%s) / 86400 \% 13 == 6 )) && runmyjob.sh'

Note: so far, it doesn't seem to have trouble with DST time changes. Corrections welcome.
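To avoid doing the modulo calculation by hand every time, a small helper can print the value to test for, given the start date and the period (startmod is a hypothetical name, not a standard tool; as with the examples above, the result depends on the local timezone, and GNU date is assumed for the -d option):

```shell
# print the modulo for a given start date (yyyy-mm-dd) and period in days
startmod() {
  echo $(( $(date +%s -d "$1 00:00") / 86400 % $2 ))
}

startmod 2013-01-15 14   # 11 in the timezone used for the examples above
```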

GRE bridging, IPsec and NFQUEUE

Lots of stuff, apparently unrelated, but it all came together recently, so here it is. Probably the whole thing is useless, but along the way I found many interesting things.

The network topology used for this experiment is as follows:

[diagram: network topology, site A and site B connected through routerA and routerB over the gretap tunnel]

The initial task was to find a way to bridge site A and site B using IPsec (for starters, something that's not terribly useful; and yes, I know that there are other ways, but that's not the point here). While working on that, I came across the (utterly undocumented) gretap tunnel of iproute2, which is, well, a GRE interface that can encapsulate ethernet frames (rather than IP packets, which is the more usual use case for GRE).

A gretap interface is created thus:

routerA# ip link add gretap type gretap local 192.168.200.1 remote 172.16.0.1 dev eth1
routerA# ip link show gretap
6: gretap@eth1: <BROADCAST,MULTICAST> mtu 1462 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 62:24:67:45:44:ad brd ff:ff:ff:ff:ff:ff

The idea is: add the gretap interface to a bridge at both ends of the tunnel, and voila, site A and site B are bridged. Then, use IPsec to encrypt the GRE traffic between the sites: mission accomplished. Right? No. Turns out it isn't so easy. Let's go step by step.

The MTU problem

As shown above, when the gretap interface is created, it has a default MTU of 1462, which is correct: 1500 (underlying physical interface) - 20 (outer IP header added by GRE) - 4 (GRE header) - 14 (ethernet header of the encapsulated frame) = 1462.

However, the rest of machines in the LAN have an MTU of 1500, of course.

For a normal GRE interface (encapsulating pure IP) a lower tunnel MTU is a bit of an annoyance, but nothing critical: when too big a packet is received, an ICMP error (type 3, code 4 for IPv4) is sent back to the originator of the packet, which will then hopefully take the necessary actions (lower its MTU, retransmit but this time allowing fragmentation, whatever).
However, here with our gretap interface we're bridging, which means that there's no "previous hop" to send the ICMP error to.
Furthermore, since the gretap interface is added to a bridge, the bridge MTU is lowered in turn. Since ethernet networks work on the assumption that all the participating interfaces have the same MTU, if a bridge with an MTU of 1462 (maximum frame size 1476) receives a full-sized frame (up to 1514 bytes), it just silently drops it. Not nice, although there's nothing else it can do.

To check, let's add the gretap interface to a bridge on routerA:

routerA# ip link add br0 type bridge
routerA# ip link set eth0 down
routerA# ip addr del 10.32.x.x/24 dev eth0    # remove whatever IP address it had
routerA# ip link set eth0 master br0
routerA# ip link set eth0 up
routerA# ip link set br0 up
routerA# ip addr add 10.32.0.254/24 dev br0
routerA# ip link set gretap up
routerA# ip link set gretap master br0

and the same on routerB:

routerB# ip link add br0 type bridge
routerB# ip link set eth0 down
routerB# ip addr del 10.32.x.x/24 dev eth0   # remove whatever IP address it had
routerB# ip link set eth0 master br0
routerB# ip link set eth0 up
routerB# ip link set br0 up
routerB# ip addr add 10.32.0.253/24 dev br0
routerB# ip link set gretap up
routerB# ip link set gretap master br0

(we assign 10.32.0.253 to br0 on routerB to avoid conflicts, since the networks are bridged; these addresses are not used for these examples, anyway).

Now, from host A, we produce a maximum-sized frame:

hostA# ping -s 1472 10.32.0.111
PING 10.32.0.111 (10.32.0.111) 1472(1500) bytes of data.
^C
--- 10.32.0.111 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2015ms

If we capture the traffic, we see that the small ARP request/reply frames pass through the tunnel, but the frames containing the actual ICMP echo request packets are silently dropped by br0 at routerA (which at this point still has an MTU of 1462).

Seems like the obvious thing to do is to raise the MTU of the gretap interface to 1500, so the bridge can have an MTU of 1500 again:

routerA# ip link show br0
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1462 qdisc noqueue state UP mode DEFAULT 
    link/ether 00:16:3e:c3:8c:ef brd ff:ff:ff:ff:ff:ff
routerA# ip link set gretap mtu 1500
routerA# ip link show br0
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT 
    link/ether 00:16:3e:c3:8c:ef brd ff:ff:ff:ff:ff:ff

and the same thing at routerB. Are we done now? Of course not, because while the bridge now accepts full-sized frames, there is a problem when a big frame has to be sent out of the gretap "bridge port": after encapsulation, the size of the resulting IP packet can be up to 1514 + 4 + 20 = 1538 bytes. It may or may not be possible to fragment it, and here comes the next interesting point.

IP fragmentation

Now it turns out that, by default, GRE interfaces do Path MTU Discovery (PMTUD for short), which means the packets they produce have the DF (Don't Fragment) flag bit set. As mentioned above, this behavior is useful when encapsulating IP, so the previous hop can be notified of the tunnel bottleneck if needed. But again, here we're encapsulating ethernet and bridging, so there's no one to notify: if the oversized packet cannot be fragmented (and by default it can't, since PMTUD is performed), it's just dropped.

What we want, ideally, is that the encapsulated packets resulting from gretap encapsulation be fragmentable, that is, have the DF bit set to 0. It should be possible to do it, right?

Reading through the scarce iproute2 documentation, we learn that tunnel interfaces can be given a special option nopmtudisc at creation time, whose function, according to the manual, is to "disable Path MTU Discovery on this tunnel." Sounds just like the feature we want, so let's set the flag when creating the interface:

routerA# ip link del gretap
routerA# ip link add gretap type gretap local 192.168.200.1 remote 172.16.0.1 dev eth1 nopmtudisc
routerA# ip link set gretap mtu 1500
routerA# ip link set gretap up
routerA# ip link set gretap master br0

(same at routerB). However, if we now retry the oversized ping (-s 1472), it still doesn't work. How is that possible? A sample traffic capture shows that the IP packets leaving hostA have the DF bit set, and apparently this bit gets copied to the outer IP header added by GRE (despite the nopmtudisc flag), which results in an unfragmentable packet that is silently dropped. If we explicitly disable DF on the IP packets produced by ping, it finally works:

ping -Mdont -s 1472 10.32.0.111
PING 10.32.0.111 (10.32.0.111) 1472(1500) bytes of data.
1480 bytes from 10.32.0.111: icmp_req=1 ttl=64 time=1.26 ms
1480 bytes from 10.32.0.111: icmp_req=2 ttl=64 time=0.932 ms
1480 bytes from 10.32.0.111: icmp_req=3 ttl=64 time=1.01 ms
^C
--- 10.32.0.111 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.932/1.071/1.264/0.143 ms

and capturing outgoing traffic at routerA and routerB shows fragmented GRE/IP packets (yes, fragments are bad, but probably better than nothing at all).
However, this is no real solution, since we artificially modified the behavior of the application on the client, which is not doable for all the LAN hosts.

More bad news: Linux (and probably most operating systems) performs PMTUD by default, meaning that each and every IP packet produced by applications has the DF bit set (unless overridden by the application, which is not usual). There is a /proc entry that can disable this behavior on a global basis (/proc/sys/net/ipv4/ip_no_pmtu_disc, which has to be set to 1 to disable PMTUD), but we'd like to solve the problem on the routers without touching the other machines. So the goal now is: clear the DF bit from the outer IP header, so packets can be fragmented (regardless of the original value of DF).

Here are some ugly drawings showing how the gretap encapsulation works. First, there's the original ethernet frame, which can be up to 1514 bytes (1500 IP + 14 ethernet header):

[diagram: the original ethernet frame, up to 1514 bytes]

When this frame is encapsulated, we get this:

[diagram: the same frame after gretap encapsulation, with the outer IP and GRE headers added]

Clearing DF: NFQUEUE

One may imagine that it should be possible to perform this apparently simple task using standard tools; after all, it's "normal" packet mangling, which iptables can perform. Again, it turns out it's not so easy. There seems to be no native iptables way to do this (while it can touch other parts of the IP header, like TTL or TOS, it cannot touch DF).

However, what iptables does have is a generic mechanism to pass packets to userspace: to use it, we create iptables rules that match the packets we want and apply the NFQUEUE (formerly QUEUE) target to them. Then, a user-space program can register to receive and process these packets, and decide their fate (or "verdict" in nfqueue speak): send them back to iptables (after optionally modifying them), or discard them. The mechanism used for this communication is called nfnetlink. The library (libnetfilter_queue) is written in C, but Perl and Python bindings exist.

Before writing the actual code, however, we have to create the iptables rule that matches the tunneled packets. Thinking about it a bit, it's not obvious which tables and chains it would traverse on the router. Would it traverse the INPUT chain? It's not destined to the local machine (or is it?). What about the OUTPUT chain? It's not locally generated (or is it?). Fortunately, in this case an easy way exists to clear all doubts, since iptables allows tracing and the complete set of tables and chains that are traversed can be easily seen and logged.

 1 TRACE: raw:PREROUTING:policy:2 IN=br0 OUT= PHYSIN=eth0 MAC=00:16:3e:52:ba:6c:00:16:3e:93:08:ca:08:00 SRC=10.32.0.50 DST=10.32.0.111 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=819 SEQ=1
 2 TRACE: filter:FORWARD:policy:1 IN=br0 OUT=br0 PHYSIN=eth0 PHYSOUT=gretap MAC=00:16:3e:52:ba:6c:00:16:3e:93:08:ca:08:00 SRC=10.32.0.50 DST=10.32.0.111 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=819 SEQ=1
 3 TRACE: raw:OUTPUT:rule:1 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47
 4 TRACE: raw:OUTPUT:policy:2 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47
 5 TRACE: filter:OUTPUT:policy:2 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47

 6 TRACE: raw:PREROUTING:policy:2 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=40301 PROTO=47
 7 TRACE: filter:INPUT:policy:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=40301 PROTO=47
 8 TRACE: raw:PREROUTING:rule:1 IN=br0 OUT= PHYSIN=gretap MAC=00:16:3e:93:08:ca:00:16:3e:52:ba:6c:08:00 SRC=10.32.0.111 DST=10.32.0.50 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18065 PROTO=ICMP TYPE=0 CODE=0 ID=819 SEQ=1
 9 TRACE: raw:PREROUTING:policy:2 IN=br0 OUT= PHYSIN=gretap MAC=00:16:3e:93:08:ca:00:16:3e:52:ba:6c:08:00 SRC=10.32.0.111 DST=10.32.0.50 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18065 PROTO=ICMP TYPE=0 CODE=0 ID=819 SEQ=1
10 TRACE: filter:FORWARD:policy:1 IN=br0 OUT=br0 PHYSIN=gretap PHYSOUT=eth0 MAC=00:16:3e:93:08:ca:00:16:3e:52:ba:6c:08:00 SRC=10.32.0.111 DST=10.32.0.50 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18065 PROTO=ICMP TYPE=0 CODE=0 ID=819 SEQ=1

Lines 1 to 5 show the outgoing packet, and lines 6 to 10 the return packet containing the ICMP echo reply.

Since a bridge is involved, the actual result depends on whether the /proc entry /proc/sys/net/bridge/bridge-nf-call-iptables is set to 0 or 1. If it's set to 1, as most distributions do, the trace will be similar to the one shown above; if it's set to 0, the steps where the packet passes through the bridge will not be shown (lines 1, 2, 8, 9 and 10 above will be missing).

So where do we tap into the flow to get our packets? Obviously we want to see the GRE packets, not the raw ethernet traffic (so whether iptables is called for bridge traffic or not does not matter here), and we're only interested in traffic from the local LAN to the tunnel (lines 3 to 5 in the above trace). So a good place could be the OUTPUT chain, either in the raw or the filter table. Let's choose the filter table, which is the more common choice:

routerA# iptables -A OUTPUT -s 192.168.200.1 -d 172.16.0.1 -p gre -j NFQUEUE --queue-bypass
routerB# iptables -A OUTPUT -s 172.16.0.1 -d 192.168.200.1 -p gre -j NFQUEUE --queue-bypass

This sends all matching traffic to NFQUEUE queue number 0 (the default), and moves on to the next rule or policy if there's no user-space application listening on that queue (that's the --queue-bypass part), so at least small packets can pass by default.

Clearing DF, finally

Now all that's left is writing the user code that receives packets from queue 0, clears the DF bit, and sends them back to iptables. Fortunately, the library provides some sample C code that can be adapted for our purposes. So without further ado, let's fire up an editor and write this code:

/***************************************************************************************
 * clear_df.c: clear, uh, DF bit from IPv4 packets. Heavily borrowed from              * 
 * http://netfilter.org/projects/libnetfilter_queue/doxygen/nfqnl__test_8c_source.html *
 ***************************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>                 /* for recv() */
#include <netinet/in.h>
#include <linux/types.h>
#include <linux/netfilter.h>            /* for NF_ACCEPT */
#include <arpa/inet.h>
#include <time.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Standard IPv4 header checksum calculation, as per RFC 791 */
u_int16_t ipv4_header_checksum(char *hdr, size_t hdrlen) {

  unsigned long sum = 0;
  const u_int16_t *bbp;
  int count = 0;

  bbp = (u_int16_t *)hdr;
  while (hdrlen > 1) {
    /* the checksum field itself should be considered to be 0 (ie, excluded) when calculating the checksum */
    if (count != 10) {
      sum += *bbp;
    } 
    bbp++; hdrlen -= 2; count += 2;
  }

  /* in case hdrlen was an odd number, there will be one byte left to sum */
  if (hdrlen > 0) {
    sum += *(unsigned char *)bbp;
  }

  while (sum >> 16) {
    sum = (sum & 0xffff) + (sum >> 16);
  }

  return (~sum);
}

/* callback function; this is called for every matched packet. */       
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg, struct nfq_data *nfa, void *data) {

  u_int32_t queue_id;
  struct nfqnl_msg_packet_hdr *ph;
  int pkt_len;

  char *buf;
  size_t hdr_len;

  /* determine the id of the packet in the queue */
  ph = nfq_get_msg_packet_hdr(nfa);
  if (ph) {
    queue_id = ntohl(ph->packet_id);
  } else {
    return -1;
  }

  /* try to get at the actual packet */
  pkt_len = nfq_get_payload(nfa, &buf);

  if (pkt_len >= 0) {

    hdr_len = ((buf[0] & 0x0f) * 4);

    /* clear DF bit */
    buf[6] &= 0xbf;

    /* set new packet ID */
    *((u_int16_t *)(buf + 4)) = htons((rand() % 65535) + 1);

    /* recalculate checksum */
    *((u_int16_t *)(buf + 10)) = ipv4_header_checksum(buf, hdr_len);
  }

  /* "accept" the (possibly mangled) packet; if we couldn't get at the
     payload, issue the verdict without data instead of passing a bogus length */
  if (pkt_len < 0)
    return nfq_set_verdict(qh, queue_id, NF_ACCEPT, 0, NULL);
  return nfq_set_verdict(qh, queue_id, NF_ACCEPT, pkt_len, buf);
}

int main(int argc, char **argv) {

    struct nfq_handle *h;
    struct nfq_q_handle *qh;
    int fd;
    int rv;
    char buf[4096] __attribute__ ((aligned));

    /* printf("opening library handle\n"); */
    h = nfq_open();
    if (!h) {
        fprintf(stderr, "error during nfq_open()\n");
        exit(1);
    }

    /* printf("unbinding existing nf_queue handler for AF_INET (if any)\n"); */
    if (nfq_unbind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_unbind_pf()\n");
        exit(1);
    }

    /* printf("binding nfnetlink_queue as nf_queue handler for AF_INET\n"); */
    if (nfq_bind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_bind_pf()\n");
        exit(1);
    }

    /* printf("binding this socket to queue '0'\n"); */
    qh = nfq_create_queue(h,  0, &cb, NULL);
    if (!qh) {
        fprintf(stderr, "error during nfq_create_queue()\n");
        exit(1);
    }

    /* printf("setting copy_packet mode\n"); */
    if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
        fprintf(stderr, "can't set packet_copy mode\n");
        exit(1);
    }

    fd = nfq_fd(h);

    /* initialize random number generator */
    srand(time(NULL));

    while ((rv = recv(fd, buf, sizeof(buf), 0)) && rv >= 0) {
        nfq_handle_packet(h, buf, rv);
    }

    /* printf("unbinding from queue 0\n"); */
    nfq_destroy_queue(qh);

    /* printf("closing library handle\n"); */
    nfq_close(h);

    exit(0);
}

It's useful to refer to this illustration of the IPv4 header to better follow the explanation.
The main thing worth noting about the above code is that, if we clear DF, we also have to fill in the "identification" (id) field of the IPv4 header (bytes 4 and 5). This field is generally set to 0 when DF is set, since in that case the packet will of course never be fragmented. However, we're now allowing fragmentation of a packet for which it was possibly not intended, so we fill the id field with a random 16-bit integer between 1 and 65535; whoever reassembles the packet uses this value to tell which fragments belong to the same original packet. If we left the field at 0, fragments of different packets would all carry the same ID, and the receiver would have a hard time reassembling the originals, especially if the fragments arrive out of order.
And of course, since we're changing the header, the checksum (bytes 10 and 11 of the header) has to be recalculated.

An obvious optimization of the above code (not implemented here, as it's just a proof of concept) would be to immediately accept the packet without mangling it if the DF bit is already 0. Another possibility would be to leave the packet alone if its length is less than the outgoing interface's MTU (the output interface can be obtained with nfq_get_outdev, for example); however, in that case we'd be trusting all the hops along the path to have an MTU greater than or equal to ours, which may not be true. So, when in doubt, we just always clear DF.

Compilation requires the appropriate header files to be present (libnfnetlink-dev and libnetfilter-queue-dev under debian). To compile the code, do:

gcc -o clear_df clear_df.c -lnfnetlink -lnetfilter_queue

So now, let's run our program and retry the damned ping (forcing DF, just to be sure):

routerA# clear_df
routerB# clear_df
hostA# ping -s 1472 -Mdo 10.32.0.111
PING 10.32.0.111 (10.32.0.111) 1472(1500) bytes of data.
1480 bytes from 10.32.0.111: icmp_req=1 ttl=64 time=1.48 ms
1480 bytes from 10.32.0.111: icmp_req=2 ttl=64 time=0.969 ms
1480 bytes from 10.32.0.111: icmp_req=3 ttl=64 time=1.11 ms
1480 bytes from 10.32.0.111: icmp_req=4 ttl=64 time=0.946 ms
1480 bytes from 10.32.0.111: icmp_req=5 ttl=64 time=0.944 ms
^C
--- 10.32.0.111 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.944/1.091/1.481/0.208 ms

So it finally works, and without changing anything on host A. On the routers, capturing the traffic on the link between routerA and routerB does indeed show fragmented packets. So we can finally say that "it works".

Final piece: IPsec

Although this was the original goal, it somehow got lost along the way while we were troubleshooting, so let's get back on track after the detour. Now that we've got this far, adding IPsec should be a piece of cake. To make it a bit more interesting (not much), routerA is going to use ipsec-tools + racoon, while routerB will use Openswan. These are probably the two most common IPsec implementations under Linux.

Since the GRE packets already contain router A's and router B's public IPs in the IP header, effectively making this a host-to-host tunnel, we can use IPsec's transport mode to just encrypt and authenticate the payload (this is a case where transport mode is actually useful). Note however that IPsec transport mode inserts a new header (the ESP header) after the first IP header, so the "protocol" field of the latter will change from 47 (GRE) to 50 (ESP). Graphically, after ESP is applied (which however always happens before fragmentation) we now have this monster:

[drawing: gretap + ESP encapsulation]

Should we modify the iptables rule that sends the packets to userspace? Let's see. The easiest way to check is, again, to trace a packet and see which chains it traverses; doing so gives us (omitting the bridging parts, which are not relevant here):

 1 TRACE: raw:OUTPUT:policy:2 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47 
 2 TRACE: filter:OUTPUT:rule:2 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47 
 3 TRACE: raw:OUTPUT:rule:1 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=152 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ESP SPI=0xdb28966a 
 4 TRACE: raw:OUTPUT:policy:2 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=152 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ESP SPI=0xdb28966a 
 5 TRACE: filter:OUTPUT:policy:3 IN= OUT=eth1 SRC=192.168.200.1 DST=172.16.0.1 LEN=152 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ESP SPI=0xdb28966a 

 6 TRACE: raw:PREROUTING:policy:2 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=152 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=ESP SPI=0x4802e17
 7 TRACE: filter:INPUT:policy:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=152 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=ESP SPI=0x4802e17 
 8 TRACE: raw:PREROUTING:rule:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=47 
 9 TRACE: raw:PREROUTING:policy:2 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=47 
10 TRACE: filter:INPUT:policy:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC=172.16.0.1 DST=192.168.200.1 LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=47 

So interestingly enough, we see that (despite Linux not having a real IPsec virtual interface) IPsec packets traverse the chains twice, once before encryption (with PROTO=47, lines 1-2) and again after encryption (with PROTO=ESP, that is, protocol 50, lines 3-5).
This means that we can either keep the existing iptables rules, in which case our code will receive the unencrypted packets, or change them to match protocol ESP (-p 50 or -p esp), in which case we'll see ESP packets. Note that we can do the latter only because ESP does not protect the outer header; with AH (protocol 51), the ID field is considered immutable and is thus authenticated - ie, signed - so we couldn't change it without rendering the packet invalid. So if we were using AH in transport mode, we would definitely want to match the unencrypted packets (ie, -p 47 or -p gre). Though according to most people, AH is next to useless anyway.

However, since NFQUEUE has to copy packets between kernel space and user space and back, and since unencrypted packets are smaller, it's more efficient to match on protocol 47 (thus we're leaving the existing iptables rules unchanged).

For completeness, here are the sample IPsec configurations used on routerA and routerB.

ipsec-tools.conf on routerA:

#!/usr/sbin/setkey -f

## Flush the SAD and SPD
#
flush;
spdflush;

spdadd 192.168.200.1/32 172.16.0.1/32 gre -P out ipsec
   esp/transport//require;

spdadd 172.16.0.1/32 192.168.200.1/32 gre -P in ipsec
   esp/transport//require;

racoon.conf on routerA:

log notify;
path pre_shared_key "/etc/racoon/psk.txt";
path certificate "/etc/racoon/certs";

remote 172.16.0.1 {
        exchange_mode main;
        proposal {
                encryption_algorithm 3des;
                hash_algorithm md5;
                authentication_method pre_shared_key;
                dh_group modp1024;
        }
}

sainfo address 192.168.200.1/32 gre address 172.16.0.1/32 gre {
        pfs_group modp1024;
        encryption_algorithm 3des;
        authentication_algorithm hmac_md5;
        compression_algorithm deflate;
}

ipsec.conf on routerB:

version 2.0     # conforms to second version of ipsec.conf specification

# basic configuration
config setup
  nhelpers=0
  interfaces="%none"
  protostack=netkey
  klipsdebug=""
  plutodebug=""

conn to-routerA
  type=transport
  left=172.16.0.1
  leftsubnet=172.16.0.1/32
  right=192.168.200.1
  rightsubnet=192.168.200.1/32
  authby=secret
  phase2alg=3des-md5;modp1024
  keyexchange=ike
  ike=3des-md5;modp1024
  auto=start
  leftprotoport=gre
  rightprotoport=gre

For authentication we'll use pre-shared keys, but the changes to use certificate-based authentication are trivial (and not directly related to the main point of the article anyway).

The main thing to note is that we explicitly specify in our policies that we want to encrypt GRE traffic only, since that's what carries the tunneled ethernet frames. Everything in the above configuration can be changed; the policy can be changed to encrypt all traffic (using "any"), or the hash and encryption algorithms can be changed. There's nothing magic or special; it's just plain IPsec configuration.

Some notes on veth interfaces

Since there's not much documentation, here are the results of some experimentation.

veth interfaces are virtual ethernet interfaces that always exist in pairs. Whatever enters one interface exits from the other one, and vice versa. A simple test to check:

# ip link add type veth
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 52:54:00:5a:d2:86 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 52:54:00:d7:26:a6 brd ff:ff:ff:ff:ff:ff
23: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether ee:c0:0e:d6:ae:09 brd ff:ff:ff:ff:ff:ff
24: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 4e:e8:84:bd:01:f0 brd ff:ff:ff:ff:ff:ff
# ip addr add 10.0.0.1/24 dev veth0
# ip link set veth0 up
# ip link set veth1 up
# force sending packets out veth0
# ping 10.0.0.2

While the ping is running, tcpdump on veth1 will show traffic (most likely ARP).

One thing to note is that, for one end of the pair to be fully up, the other end must be up too.

The main (or perhaps only?) use for veth interfaces seems to be in the context of container virtualization, especially LXC. Once a veth pair is created, one end is assigned to the container and one end is assigned to the main host. Communication can then happen either using the pair as a point-to-point direct link (assigning IPs to both ends and doing routing on the host) or via bridging (the host-side interface is added to a bridge, perhaps where other interfaces are already connected, and the guest-side interface is assigned an IP inside the container).

Since containers are made possible by namespaces (two very good introductory articles can be found here and here), this is also the method used to "assign" the guest-side interface to the container. Note that doing so makes the interface disappear from the host.

Let's try a simple example. First we start a container without network (lxc.network.type = empty):

container# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Then, on the host, we create a pair of veth:

host# ip link add vHOST type veth peer name vGUEST
host# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 52:54:00:5a:d2:86 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 52:54:00:d7:26:a6 brd ff:ff:ff:ff:ff:ff
25: vGUEST: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether be:86:db:5b:ec:a5 brd ff:ff:ff:ff:ff:ff
26: vHOST: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether d6:c9:65:e3:bb:e9 brd ff:ff:ff:ff:ff:ff

Bring the links up:

host# ip link set vHOST up
host# ip link set vGUEST up

Now we have to find out the PID of the container's main process; there are a few ways to do this, pstree is probably visually easier:

host# pstree -Apc
...
        |-lxc-start(3615)---init(3619)-+-getty(4236)
        |                              |-getty(4238)
        |                              |-getty(4239)
        |                              |-getty(4240)
        |                              |-login(4237)---bash(4447)
        |                              `-sshd(4223)
...

so we want PID 3619, and thus we do

host# ip link set vGUEST netns 3619

Doing this makes the interface disappear from the host:

host# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 52:54:00:5a:d2:86 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 52:54:00:d7:26:a6 brd ff:ff:ff:ff:ff:ff
26: vHOST: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether d6:c9:65:e3:bb:e9 brd ff:ff:ff:ff:ff:ff

and appear in the guest container:

container# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
25: vGUEST: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether be:86:db:5b:ec:a5 brd ff:ff:ff:ff:ff:ff

Now we can configure IPs and we're set:

host# ip addr add 10.0.0.1/24 dev vHOST
container# ip addr add 10.0.0.2/24 dev vGUEST
container# ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_req=1 ttl=64 time=0.103 ms
...

As mentioned, another possibility would be adding vHOST to a bridge.

Of course, all this doesn't have to be done manually, since the common LXC tools can automate things; we did it here just for explanation purposes. In fact, when lxc.network.type = veth is used, what happens is that a veth pair is created, one end is assigned to the container's namespace, and the other end (on the host) is added to an existing bridge (specified with lxc.network.link).

As said, veth interfaces always exist in pairs, so when the guest is destroyed (eg, shut down), both interfaces disappear. In the same way, if we delete one end on the host, the other end disappears from the guest.