
PXE server with dnsmasq, apache and iPXE

Here we're going to set up a PXE server that can boot even machines whose network cards have bad or buggy PXE firmware, without having to flash them.

First, some words about PXE.


PXE, acronym of Preboot eXecution Environment, is a specification originally developed by Intel that allows a computer to boot over the network. This has obvious applications in the case of diskless boxes (eg thin clients), but it can also be useful for normal machines, for example to temporarily boot a rescue disk, or (re)install the OS over the network without needing any physical medium.

Simplifying a bit, it goes like this:

  • A machine is turned on. In the BIOS, the boot order says to try PXE first (or a key can be pressed during the POST to the same effect, normally).
  • Its network card (NIC) has a chip with a special firmware, which implements a minimal stack of TCP/IP protocols (DHCP, TFTP, possibly DNS).
  • This firmware is loaded and performs a DHCP broadcast to get an IP address and other pieces of information.
  • If a suitable DHCP server sees the request, it selects an IP address and assigns it to the client.
  • Up to here, it's not different from normal DHCP. However, the server also sends two special pieces of information to the client in the DHCP offer: one is the name or IP address of a server (the so-called "next server", which may be the same DHCP server or not), the other one is the name of a file to download from there (so-called "boot filename" in DHCP speak, or "network boot program" (NBP) in PXE speak).
  • The PXE client configures its TCP/IP stack with the received information, then tries to download the boot filename from the next server via TFTP.
  • If it succeeds, it loads the NBP in memory and runs it. From now on, the NBP takes over and does whatever it takes to fully boot the machine.
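In DHCP terms, the "next server" travels in the siaddr header field (or option 66, "TFTP server name") and the boot filename in the file field (or option 67, "bootfile name"). Just as an illustration of those two fields, with ISC dhcpd (not the server used later in this article) a host entry supplying both could look like this (the MAC, address and filename are examples):

```
host pxeclient {
  hardware ethernet 00:12:34:56:78:90;
  # the "next server" (TFTP server) the client should contact
  # the name of the boot file (NBP) to download from it
  filename "pxelinux.0";
}
```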

Sounds simple, but as usual life isn't as simple as it seems. There are a few things to be noted.

First, while originally the NBP was downloaded via TFTP (and often still is), some enhanced PXE implementations (like gPXE or iPXE) can use HTTP. They also support extra protocols like iSCSI or AoE (to boot from SANs).

Second, PXE isn't just a sequence of steps to bootstrap a machine; it also specifies an API. This means that the NBP runs in a special environment and can make use of many functionalities made available by the PXE that loaded it. In particular, if the calling PXE supports HTTP networking, this means that the NBP can too, via the PXE API, even if it wouldn't otherwise support it natively.

Let's take the case of pxelinux, probably the most used NBP for its flexibility. Only recent versions support HTTP natively; however, older versions (starting from 3.70, which is quite old) can use the PXE API and do HTTP if they are invoked from an HTTP-capable PXE implementation like gPXE or iPXE mentioned above. Since ideally we want our PXE server to serve stuff over HTTP as much as possible rather than TFTP, all this is quite good.

However, these enhanced PXE implementations are normally not found in consumer-end NICs, which instead tend to come with limited or buggy PXE implementations. There are a few workarounds for this:

  • Load the enhanced PXE firmware from a floppy, CDROM, or USB stick. So in the BIOS, the machine is configured to boot from the appropriate removable media, which loads the PXE firmware, which in turn boots from the network. In general, this is not very practical (the media can be lost or damaged, the reader can break, and many machines don't even have a floppy or CD reader anymore).
  • The NIC ROM can be flashed with the enhanced firmware. This is better, but it still requires some special action. For hundreds of machines, again this is not very practical.
  • The enhanced PXE firmware can be downloaded (chainloaded) by the buggy PXE as if it were an NBP (via TFTP), then take over and do the "real" PXE boot, downloading the "real" NBP which will then be able to use the API in the enhanced environment (with HTTP and all).

The last option is the easiest and most convenient to implement, since it does not require messing around with sneakernet or ROM flashing, and is what is described here.

The plan

So we are going to use dnsmasq as our DHCP and TFTP server, apache to serve HTTP (for no particular reason, just because it's easy to set up with PHP), and iPXE for the enhanced PXE firmware. All running on the same machine for convenience, but there's no reason why the web server could not run on another box.

Since the DHCP server will possibly see (at least) two different DHCP queries (first one from the buggy PXE firmware, then one from iPXE), and has to send different NBP strings to them, a way is needed to tell which query we are seeing.

This is quite straightforward: if we capture the traffic with tcpdump, we see that the requests coming from iPXE have at least two identifying characteristics that other requests lack. The first is DHCP option number 175, which carries iPXE/gPXE-specific information. The second is the "iPXE" user class (option 77), which again is not normally present.

15:14:41.719114 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 415) > BOOTP/DHCP, Request from 00:12:34:56:78:90, length 387, xid 0x71ceb4, secs 4, Flags [none]
	  Client-Ethernet-Address 00:12:34:56:78:90
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    MSZ Option 57, length 2: 1472
	    ARCH Option 93, length 2: 0
	    NDI Option 94, length 3: 1.2.1
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
	    CLASS Option 77, length 4: "iPXE"
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
	      Hostname, Domain-Name, RP, Vendor-Option
	      Vendor-Class, TFTP, BF, Option 175
	      Option 203
	    T175 Option 175, length 45:
	    Client-ID Option 61, length 7: ether 00:12:34:56:78:90
	    GUID Option 97, length 17:

It's also easy to see the same information in the DHCP server log.

In dnsmasq, we set a tag if we detect that the request comes from iPXE, and do different things depending on whether or not the tag is set. If the request is from a non-enhanced PXE client, we send it the iPXE firmware; otherwise, it's iPXE, so we direct it to an HTTP URL to continue the boot process (see below).

To have maximum flexibility, we want to be able to tell which client we're talking to, and possibly give different orders to different clients. ("Orders" here means "iPXE scripts", which are textual sequences of iPXE directives that tell the clients to do certain things.)

To this end, we direct iPXE to do an HTTP GET request containing various parameters that identify the client. On the server, this runs a PHP script that decides what to do based on the received values. We thus send back an iPXE script containing further instructions to the client (eg "chainload pxelinux.0", "boot from iscsi", etc. See below for the examples).

This allows us to do things like (for example) "Client X: go get pxelinux from the local HTTP server to boot a rescue environment. Client Y: boot from iSCSI, here is the LUN URL. Client Z: boot pxelinux from another HTTP server to do an unattended Debian install..."


Now that we have defined the plan, let's finally get to the practical bits. It is assumed that the PXE server, the network's default gateway and the DNS server all have known, fixed IP addresses (substitute the actual values in the examples below). It is also assumed that no other DHCP servers are present in the network.

dnsmasq configuration

The configuration of dnsmasq is short (of course adapt as needed):


# enable logging
log-dhcp

# set tag "ENH" if request comes from iPXE ("iPXE" user class)
dhcp-userclass=set:ENH,iPXE

# alternative way, look for option 175
#dhcp-match=set:ENH,175

# if request comes from dumb firmware, send them iPXE (via TFTP)
dhcp-boot=tag:!ENH,undionly.kpxe

# if request comes from iPXE, direct it to boot from boot1.txt
# (<server> is a placeholder for this machine's address or name)
dhcp-boot=tag:ENH,http://<server>/boot1.txt

# a dhcp-range directive for the local subnet goes here as usual

# enable the internal TFTP server, serving files from /var/www
enable-tftp
tftp-root=/var/www

So we set the tag ENH (set:ENH) if the request comes from iPXE. The tag:!ENH syntax means "if the ENH tag is NOT set". Note that this syntax requires a reasonably recent version of dnsmasq; in older versions, "net:" had to be used instead of "tag:", and "#ENH" instead of "!ENH" (ie, "net:#ENH") to say "tag ENH not set".

The file undionly.kpxe (or a symlink to it) has to be in /var/www, and is the iPXE implementation used for chainloading, which is sent to the dumb clients via TFTP. This is the only TFTP transaction in the whole process. Once the client has loaded iPXE, everything happens over HTTP.

As a special case (in a positive sense), when PXE-booting a KVM virtual machine the very first request that the server sees already comes from iPXE, since that's what qemu uses to implement the VM's PXE "firmware". This means that in that case the process will be faster, since the chainloading phase is skipped and the client is sent directly to the HTTP URL.

Regardless of whether the client is originally dumb or not, it will eventually end up fetching boot1.txt (see below) via HTTP.

The last configuration lines enable dnsmasq's internal TFTP server, telling it to serve files (not coincidentally) from /var/www. And so...

Apache configuration

Any web server with PHP support would work, in fact; it's just that with apache, a running PHP is just two commands away with zero configuration.
And of course, it doesn't even have to be PHP: any server-side scripting language will do.

So our client (which is running iPXE, and can do HTTP) fetches boot1.txt, which lives in /var/www. Here's what it looks like (<server> is a placeholder, and the variable list is a representative subset; add as many as desired):

#!ipxe
chain http://<server>/boot2.php?mac=${mac}&ip=${ip}&netmask=${netmask}&gateway=${gateway}&dns=${dns}&domain=${domain}&hostname=${hostname}&uuid=${uuid}&manufacturer=${manufacturer}&product=${product}

This is an iPXE script that chainloads another URL. Basically, it's just a cheap trick to send as much information as possible about the client to the server via a gigantic HTTP GET, so the client can be identified for further processing (though 99% of the time only the MAC address will be looked at, it's good to have as many variables as possible). iPXE replaces the various ${mac}, ${ip} etc. variables with the actual values for the client, and also does URL-encoding. The full list of available parameters is in the iPXE docs.

The above URL could also be supplied directly from dnsmasq, by replacing the boot1.txt URL in the dhcp-boot=tag:ENH,... line with the long one above. However, it looks like the URL gets truncated that way if it's too long, so it's better to be safe and put it in its own file.

Now, finally, let's look at boot2.php (which must also be in /var/www). This is where we actually decide what to do with each client.

<?php
# send a suitable iPXE script to a client
# (the URLs, IQNs and menu address below are placeholders; adapt as needed)

echo "#!ipxe\n";
switch ($_GET['mac']) {
  case '00:12:34:56:78:90':
    # boot pxelinux from this server
    echo "chain http://<server>/pxelinux.0\n";
    break;
  case '00:11:22:33:44:55':
    # boot from iSCSI
    echo "set initiator-iqn <initiator IQN>\n";
    # see the iPXE docs for the SAN URI syntax
    echo "sanboot iscsi:<target address>::::<target IQN>\n";
    break;
  case '00:77:21:ab:cd:ee':
    # boot a third-party service's super cool boot menu
    echo "chain <menu URL>\n";
    break;
  default:
    # exit iPXE and let machine go on with BIOS boot sequence
    echo "exit\n";
}
In short, each client will receive an iPXE script telling it what to do. Here clients are identified by their MACs, but any of the variables we pass can be used, of course.
If a client has no specific treatment set up for it, it will end up in the "default" branch of the switch statement, which will just direct it to exit iPXE and try the next device in the BIOS boot sequence, which normally means it will boot from its local hard disk (again, this can be changed, of course). Another option is to chainload another bootloader that is able to boot a local disk, for example GRUB4DOS as explained in this page.

It's even possible to fetch and boot stuff off the Internet, as with the iPXE demo image, which can be loaded by directing the client to chain its URL from the iPXE site. It really works. But the coolest service is the one used for the third client in the above example, which allows booting and installing a lot of operating systems off the Internet. It's really impressive. Well done!


If we direct a client to load pxelinux, then there is another degree of flexibility there, since pxelinux will try to load several configuration files, named from the most specific to the most generic, until it succeeds. Normally the sequence of attempts looks something like this:

GET /pxelinux.cfg/44454c4c-3900-104e-804e-b9c04f4d344a
GET /pxelinux.cfg/01-00-26-b9-5e-30-3a
GET /pxelinux.cfg/C0A80744
GET /pxelinux.cfg/C0A8074
GET /pxelinux.cfg/C0A807
GET /pxelinux.cfg/C0A80
GET /pxelinux.cfg/C0A8
GET /pxelinux.cfg/C0A
GET /pxelinux.cfg/C0
GET /pxelinux.cfg/C
GET /pxelinux.cfg/default

So again, what a given client does can be decided by assigning it a pxelinux configuration file with a name more specific than "default", which is what gets loaded if nothing better is found.

And of course, pxelinux.0 plus any other file needed by the configuration files (eg menu.c32 etc.) need to be present in the document root of the web server (or symlinks to them).

Since pxelinux is running with HTTP support thanks to iPXE, HTTP URLs can be used anywhere a file name would, eg (a made-up fragment, with <server> as a placeholder):

# ok, this doesn't make much sense
LABEL remote
  KERNEL http://<server>/vmlinuz
  APPEND initrd=http://<server>/initrd.img

and even if no HTTP URL is explicitly specified, pxelinux implicitly assumes that it has to use HTTP anyway (in that case, it automatically prepends the URL it's booting from to the names).


With this system it really becomes possible to do whatever one may imagine via PXE, and everything is controlled and managed from a single place.

Further reading (on the interactions between pxelinux and gPXE, but also relevant for iPXE):

Clarifying the relationship between PXELinux, Etherboot and gPXE/iPXE

Argument juggling with awk

This seems to be a sort of FAQ. A typical formulation goes like: "I have a bash array, how do I pass it to awk so that it becomes an awk array?"

Leaving aside the fact that it may be possible to extend the awk code to do whatever one is doing with the shell array (in which case the problem goes away), let's focus on how to do strictly what is requested (and more).


Like many other languages, awk has two special variables, ARGC and ARGV, that give information on the arguments passed to the awk program. ARGC contains the total number of arguments (including the awk interpreter or script), and ARGV is an array of ARGC elements (indexed from 0 to ARGC - 1) that contains all the arguments (ARGV[0] is always the name of the awk interpreter or script).
Let's demonstrate this with a simple example:

awk 'BEGIN{print "ARGC is " ARGC; for(i = 0; i < ARGC; i++) print "ARGV["i"] is " ARGV[i]}' foo bar
ARGC is 3
ARGV[0] is awk
ARGV[1] is foo
ARGV[2] is bar

There are two important things to know:

  • Unlike other languages, in awk ARGC and ARGV can be modified
  • When awk's main loop starts (and only then), awk processes whatever it finds in ARGV, starting from ARGV[1] up to ARGV[ARGC - 1].

Of course, these should normally be file names or variable assignments. But this is only relevant when the main loop starts; before then, in the BEGIN block we can manipulate ARGC and ARGV to our taste, and as long as what's left afterwards in ARGV is a list of files to process (or variable assignments), awk doesn't really care how those values got there.
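A quick demonstration of that (the file path is an example): an argument shifted out of ARGV in the BEGIN block is never opened as a file, so awk only sees the remaining one.

```shell
# drop the bogus first argument by shifting ARGV down and decrementing ARGC;
# awk then reads only /tmp/keep.txt, so NR is 1 at the END
printf 'one line\n' > /tmp/keep.txt
awk 'BEGIN{ARGV[1] = ARGV[2]; ARGC--} END{print NR}' /nonexistent.file /tmp/keep.txt   # prints 1
```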

So let's see some use cases for ARGC/ARGV manipulation.

Double pass over a file

Some code uses the two-file idiom to process the same file twice. So instead of doing

awk .... file.txt file.txt

we could just specify the file name once and double it in the BEGIN block so awk sees it twice:

# this is as if we said awk ..... file.txt file.txt
awk 'BEGIN{ARGV[ARGC++] = ARGV[1]} { ... }' file.txt

Fixed arguments

Let's assume that our awk code always has to process one or more files, whose names do not change. Of course, we could specify those names at each invocation of awk; nothing new here. However, for some reason we don't want to specify those names at each invocation, since they never change anyway; we only want to specify the variable file names. So if we have two never-changing files ("fixed1.txt" and "fixed2.txt"), we want to invoke our code with

process.awk file1 file2 file3 ...

but in fact we want awk to run as if we said

process.awk fixed1.txt fixed2.txt file1 file2 file3 ...

Let's see how the code to accomplish this may look like (of course it has to be adapted to the specific situation):

awk 'BEGIN {
  for(i = ARGC+1; i > 2; i--)
    ARGV[i] = ARGV[i - 2]
  ARGC += 2
  ARGV[1] = "fixed1.txt"
  ARGV[2] = "fixed2.txt"
# now awk processes fixed1.txt and fixed2.txt first, then whatever was specified on the command line
}' file1 file2 file3 ...

Passing a shell array (and more or less arbitrary data)

So, to get back to the original question, how can we take advantage of this juggling to pass in an array? A simple way is to pass all the array elements as normal awk arguments, process them in the BEGIN block, then remove them, so that when the main loop starts awk is unaware of what happened. Let's see an example:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
awk 'BEGIN{
  # ARGV[1] is the number of elements we have
  arrlen = ARGV[1]
  for(i = 2; i <= arrlen + 1; i++)
    awkarr[i - 1] = ARGV[i]
  # clean up
  j = 1
  for(i = arrlen + 2; i < ARGC; i++)
    ARGV[j++] = ARGV[i]
  ARGC = j
}
# here awk starts processing from file1, unaware of what we did earlier,
# but with awkarr[] populated with the values from shellarr (and arrlen being its length)
' ${#shellarr[@]} "${shellarr[@]}" file1 file2

awkarr has its elements indexed starting from 1, as is customary in awk; it's easy to adapt the code to use 0-based or another number.
We could also pass the number of elements in the array as a normal value using -v, which simplifies processing somewhat:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
awk -v arrlen="${#shellarr[@]}" 'BEGIN{
  for(i = 1; i <= arrlen; i++)
    awkarr[i] = ARGV[i]
  # clean up
  for(i = arrlen + 1; i < ARGC; i++)
    ARGV[i - arrlen] = ARGV[i]
  ARGC -= arrlen
}
# ... as before
' "${shellarr[@]}" file1 file2

If the number of files to process is known (which should be the most common case), then it's even easier as we can specify them first and the array elements afterwards. Let's assume we know that we always process two files:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
awk -v nfiles=2 'BEGIN{
  for(i = nfiles + 1; i < ARGC; i++)
    awkarr[i - nfiles] = ARGV[i]
  arrlen = ARGC - (nfiles + 1)
  ARGC = nfiles + 1
}
# ... as before
' file1 file2 "${shellarr[@]}"

Finally, if we want to "pass" a shell associative array to awk (such that it exists with the same keys and values in the awk code), we could do this:

declare -A shellarr
shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' [xxxk]='xxxv' [yyyk]='yyyv' )
awk -v nfiles=2 'BEGIN{
  arrlen = ( ARGC - (nfiles + 1) ) / 2
  for(i = nfiles + 1; i < nfiles + 1 + arrlen; i++)
    awkarr[ARGV[i]] = ARGV[i + arrlen]
  ARGC = nfiles + 1
}
# ... as before
' file1 file2 "${!shellarr[@]}" "${shellarr[@]}"

This works because in bash, the order of expansion of "${!shellarr[@]}" and "${shellarr[@]}" is the same (currently, at least). To be 100% sure, however, we could of course copy all the key, value pairs to another array and pass that one, as in the following example:

declare -A shellarr
shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' [xxxk]='xxxv' [yyyk]='yyyv' )
declare -a temp
for key in "${!shellarr[@]}"; do
  temp+=( "$key" "${shellarr[$key]}" )
done
awk -v nfiles=2 'BEGIN{
  arrlen = ( ARGC - (nfiles + 1) ) / 2
  for(i = nfiles + 1; i < ARGC; i += 2)
    awkarr[ARGV[i]] = ARGV[i + 1]
  ARGC = nfiles + 1
}
# ... as before
' file1 file2 "${temp[@]}"

In the last two examples, it should be noted that, as usual with associative arrays, the concept of array "length" doesn't make much sense; it's just an indication of how many elements the hash has, and nothing more (in awk, all arrays are associative regardless, though they can be used as "normal" ones as we did in the first examples).
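To illustrate that last point, a minimal example (made-up keys): a string key and a numeric key live happily in the same array, because the numeric index 42 is internally converted to the string "42".

```shell
# count the elements, and look up the numeric key as a string
awk 'BEGIN{a["key"] = "v1"; a[42] = "v2"; n = 0; for (k in a) n++; print n, a["42"]}'   # prints: 2 v2
```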

Update 31/10/2013: So there's always something new to learn, and in my case it was that if an element of ARGV is the empty string, awk just skips it. This simplifies the examples where the ARGV elements are moved down to fill the positions where the shell array elements were. In fact, all that's needed is to set those elements to "", and awk will naturally skip them. So the first two examples above become:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
awk 'BEGIN{
  # ARGV[1] is the number of elements we have
  arrlen = ARGV[1]
  ARGV[1] = ""
  for(i = 2; i <= arrlen + 1; i++) {
    awkarr[i - 1] = ARGV[i]
    ARGV[i] = ""
  }
...' ${#shellarr[@]} "${shellarr[@]}" file1 file2

Second example:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
awk -v arrlen="${#shellarr[@]}" 'BEGIN{
  for(i = 1; i <= arrlen; i++) {
    awkarr[i] = ARGV[i]
    ARGV[i] = ""
  }
...' "${shellarr[@]}" file1 file2

Quick file sharing over HTTP

Download here. Note that a recent version of Perl is required (it definitely works with 5.18).

This is (hopefully) an evolution (perhaps suffering from creeping featurism) of the excellent wwwshare (thanks pgas), which itself is based on Vidar's version (which gets the credit for the original idea). This is a simple throwaway web server (tws), or rather something that pretends to be one to a client, which can be useful when we need to quickly transfer some file or data to a friend or remote party. The program prints a list of URLs, and the remote end can then download the file by pointing a normal HTTP client (browser, curl, whatever) at one of those URLs. As the original author says, "when the file is downloaded, it exits. No setup or cleanup required".

The new features are:

  • Written in Perl
  • MIME support (to help the client know the file type)
  • Progress bar!
  • Streaming mode, using chunked transfer encoding (introduced by HTTP 1.1)


Run the program with -h to see a summary:

$ -h
Usage: [ -a ] [ -u ] [ -n ] [ -b bufsize ] [ -p port ] [ -m mimetype ] [ -U url ] [ -f filename ] name

-a          : consider all addresses for URLs (including loopback and link-local addresses)
-u          : flush output buffer as soon as it's written
-n          : do not resolve IPs to names
-b bufsize  : read/write up to bufsize bytes per cycle (default: 16384)
-p port     : listen on this port (default: random)
-m mimetype : force MIME type (default: autodetect if possible, otherwise application/octet-stream)
-U url      : include this URL among the listed alternative URLs
-f filename : use 'filename' to build the request part of the URL (default: dynamically computed)

'name' (mandatory argument) must exist in normal mode; in streaming mode it's only used to build the URL

$ -p 1025 /path/to/
Listen for connections on port 1025; send the given file upon client connection. The specified path must exist.

$ -p 4444 -U '' -f '/path/to/funny'
Listen on port 4444, suggest the supplied URL as the download URL (presumably a port forwarding exists)

$ tar -cjf - file1 file2 file3 | -m application/x-bzip2 result.tbz2
Listen on random port; upon connection, send the data coming from the pipe with the specified MIME type.
result.tbz2 need not exist; it's only used to build the URL

In the simplest case, one just does

$ /path/to/some/file.iso
Listening on port 8052, MIME type is application/x-iso9660-image

Possible URLs that should work to retrieve the file:

Hopefully at least one of the printed URLs is valid and can be communicated to the other party, which then connects to download the file:

Client connected: ( from port 51066
 100% [=======================================================>]   3,224,686,592 (29s) 104.8M/s

The listening port is random; it is possible to force a specific value if needed (see the help). The part after the / in the URL is determined based on the supplied filename, to give some hint to the client or browser that downloads the file. Here too it is possible to force a specific string.

If the program detects that its standard input is connected to a pipe, it automatically operates in streaming mode, which is a fancy name to mean that it reads from standard input rather than a given file. A filename should still be specified, though, so the download URL can be "correctly" built (to be more helpful to the client). Streaming mode means that one can do something like this, for instance:

$ tar -cjf - file1 file2 file3 | -m application/x-bzip2 result.tbz2
Listening on port 8787 (streaming mode), MIME type is application/x-bzip2

Possible URLs that should work to retrieve the file:

In streaming mode, the content length is of course not known, so the program sends the data using chunked transfer encoding; since this is an HTTP 1.1 feature, HTTP 1.0-only clients will not understand it (notably wget versions prior to 1.13 have this limitation, so don't use it to download when in streaming mode). Another issue with streaming mode is that the MIME type is also not known; it's possible to give hints on the command line (see the above example and the help); in any case, the program defaults to application/octet-stream which should always work (though not extremely helpful to the client).
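For reference, the chunked wire format itself is simple: each chunk is its size in hexadecimal, then CRLF, the data, and another CRLF; a zero-sized chunk terminates the body. A sketch of what a two-chunk body looks like on the wire:

```shell
# emit "hello world" as two chunks ("hello", 5 bytes, and " world", 6 bytes)
# followed by the terminating zero-size chunk
printf '5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'
```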

The program can also operate in unbuffered mode (-u), which means that data sent to the client is flushed as it is written, so the client receives it immediately. This feature, coupled with streaming mode, can be used as a rudimentary tool to send live streaming data to an HTTP client, for example like this:

$ tail -f /var/log/messages | -u -m text/plain log.txt

or actual output from a pipeline, eg

$ hugefile.csv | tee results.txt | -u -m text/plain results.txt

Connecting with a browser or another HTTP client should show the data in (near) real time. This doesn't seem terribly useful, but perhaps someone can come up with a valid use case. Keep in mind that for this to work you need to make sure that whatever program is writing to the pipe is not buffering the output (many programs do buffering if they detect that stdout is not a terminal). Tools like stdbuf or unbuffer help in this case. On the client side, curl has a --no-buffer/-N option that tells it to show data as it arrives without buffering. Also, it seems some browsers do a bit of initial buffering of the data they receive, after which they start showing new data in realtime (more info welcome).


If the address or name in the URL that the other party should use to download is not local, the program cannot know it. In principle, it could find out (somewhat unreliably) by querying some external IP address check service like dyndns and friends, but in practice it's easier to leave this to the user, who surely knows better. Thus, it's possible to supply a URL that the user knows leads to the local machine (see help for an example). And of course, this is only so it can be copied/pasted; it doesn't really change what the program does.

The way the program works is: once a connection is received, it reads the client's HTTP request and discards it (the only check that is performed is that it is a GET method, but even that could probably be avoided); after that, a minimal set of HTTP reply headers are sent, followed by the actual data. This means the code is simple, but it also means that picky clients that only accept certain encodings, expect specific headers or other special features will probably not work. If more sophisticated behavior is desired, use a real web server (of which there are many).

The code makes a number of assumptions and uses some tools that practically make it very Linux-specific; it has not been tested under other platforms. Also it relies on some external programs to get some information (local IPs, terminal size, MIME types etc); none of these external programs is critical, so the absence of some or all of them will not cause failure.

URL encoding is done using the URI::Escape module, if available; otherwise, no URL encoding is performed at all. With "normal" filenames this is not a problem, however in cases where weird URLs would result, it is possible to explicitly supply a name (see help).

To handle IPv4 and IPv6 clients with a single IPv6 socket, IPv4-mapped addresses are used. The program disables the socket option IPV6_V6ONLY, so both IPv4 and IPv6 clients can be accepted regardless of the setting in /proc/sys/net/ipv6/bindv6only. However, people should be using IPv6 already!

If the terminal is resized while the program is sending data, the progress bar will NOT be resized accordingly. However, since the terminal width is not checked until after a client has connected, it is possible to resize the terminal while the program is still waiting for a client to connect.

And btw, only one client is handled. As said, for anything more complex use a real webserver.

That's it. Any comment or bug report is welcome, as usual.

Run cron job every N days

Let's say we want to run a job every N days, or weeks, regardless of month or year boundaries. For example, every third Tuesday, or once every 17 days, or whatever.

Cron itself (at least the variants I have access to) has no way to specify these time periods, so it would seem this could not be done.

But there's a simple way to do it. It is based on modular arithmetic and on the fact that we know that measurement of time on Unix starts on a concrete date, which is the well-known January the 1st, 1970 (also known as "the Epoch"). For the remainder, I'm assuming UTC and a running time of midnight for simplicity; it should be easy to consider the appropriate time differences where needed.
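For example, the day number of a given date is just its epoch timestamp divided (with integer division) by 86400, the number of seconds in a day. A quick check with GNU date (-u forces UTC; without it, the local timezone decides which epoch second local midnight maps to, and thus possibly the day number):

```shell
# whole days elapsed from the Epoch to 2000-01-01 00:00 UTC
echo $(( $(date -u +%s -d "2000-01-01 00:00") / 86400 ))   # -> 10957
```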

With this kind of requirement we need to have an actual starting date for the job, that is, when it has to run for the first time, so we can use it as a starting point for the "every N days" intervals.
Once we have an actual date of first execution for our task (say, 2013-01-15, a Tuesday, at 00:00), we can divide the time passed since the Epoch until our date into groups of N days. For this first example, let's say N == 14, two weeks. With the following calculation we can see which place our starting day occupies in a period of 14 days (two weeks):

$ echo $(( $(date +%s -d "2013-01-15 00:00") / 86400 % 14 ))
11

Dividing by 86400 gives the number of days passed since the Epoch, from which the modulo 14 is calculated. The result is 11 (note that the exact value depends on the timezone date runs in). This tells us that performing the above calculation on any given date will yield 11 only on our starting date and on dates that are a whole multiple of 14 days away from it, that is, on every second Tuesday (going forward or backward from the starting date, which is not important here). Simple test code to show that it's true:

# starting from 2013-01-10, calculate the modulo for each day over a period of
# 40 days, checking that only the days we're interested in have modulo 11
begin=2013-01-10
for i in {0..39}; do
  curdate=$(date +%s -d "$begin + $i days 00:00")
  modulo=$(( curdate / 86400 % 14 ))
  [ $modulo -eq 11 ] && prefix="*** " || prefix=
  echo "${prefix}Date $(date "+%F %T (%a)" -d @$curdate) has modulo $modulo"
done
Sample run:

$ ./
Date 2013-01-10 00:00:00 (Thu) has modulo 6
Date 2013-01-11 00:00:00 (Fri) has modulo 7
Date 2013-01-12 00:00:00 (Sat) has modulo 8
Date 2013-01-13 00:00:00 (Sun) has modulo 9
Date 2013-01-14 00:00:00 (Mon) has modulo 10
*** Date 2013-01-15 00:00:00 (Tue) has modulo 11
Date 2013-01-16 00:00:00 (Wed) has modulo 12
Date 2013-01-17 00:00:00 (Thu) has modulo 13
Date 2013-01-18 00:00:00 (Fri) has modulo 0
Date 2013-01-19 00:00:00 (Sat) has modulo 1
Date 2013-01-20 00:00:00 (Sun) has modulo 2
Date 2013-01-21 00:00:00 (Mon) has modulo 3
Date 2013-01-22 00:00:00 (Tue) has modulo 4
Date 2013-01-23 00:00:00 (Wed) has modulo 5
Date 2013-01-24 00:00:00 (Thu) has modulo 6
Date 2013-01-25 00:00:00 (Fri) has modulo 7
Date 2013-01-26 00:00:00 (Sat) has modulo 8
Date 2013-01-27 00:00:00 (Sun) has modulo 9
Date 2013-01-28 00:00:00 (Mon) has modulo 10
*** Date 2013-01-29 00:00:00 (Tue) has modulo 11
Date 2013-01-30 00:00:00 (Wed) has modulo 12
Date 2013-01-31 00:00:00 (Thu) has modulo 13
Date 2013-02-01 00:00:00 (Fri) has modulo 0
Date 2013-02-02 00:00:00 (Sat) has modulo 1
Date 2013-02-03 00:00:00 (Sun) has modulo 2
Date 2013-02-04 00:00:00 (Mon) has modulo 3
Date 2013-02-05 00:00:00 (Tue) has modulo 4
Date 2013-02-06 00:00:00 (Wed) has modulo 5
Date 2013-02-07 00:00:00 (Thu) has modulo 6
Date 2013-02-08 00:00:00 (Fri) has modulo 7
Date 2013-02-09 00:00:00 (Sat) has modulo 8
Date 2013-02-10 00:00:00 (Sun) has modulo 9
Date 2013-02-11 00:00:00 (Mon) has modulo 10
*** Date 2013-02-12 00:00:00 (Tue) has modulo 11
Date 2013-02-13 00:00:00 (Wed) has modulo 12
Date 2013-02-14 00:00:00 (Thu) has modulo 13
Date 2013-02-15 00:00:00 (Fri) has modulo 0
Date 2013-02-16 00:00:00 (Sat) has modulo 1
Date 2013-02-17 00:00:00 (Sun) has modulo 2
Date 2013-02-18 00:00:00 (Mon) has modulo 3

So there we have it, every second Tuesday starting from 2013-01-15. The code shown above can be made generic so that the values can be passed from the command line:

# use: [startdate yyyy-mm-dd] [period] [wanted modulo]
begin=$1
length=$2
wantedmod=$3
for i in {0..39}; do
  curdate=$(date +%s -d "$begin + $i days 00:00")
  modulo=$(( curdate / 86400 % length ))
  [ $modulo -eq $wantedmod ] && prefix="*** " || prefix=
  echo "${prefix}Date $(date "+%F %T (%a)" -d @$curdate) has modulo $modulo"
done

Another test: let's say we want every fifth day starting from 2012-12-02. Let's calculate the modulo first:

$ echo $(( $(date +%s -d "2012-12-02 00:00") / 86400 % 5 ))
0

And let's verify it:

$ ./ 2012-12-01 5 0
Date 2012-12-01 00:00:00 (Sat) has modulo 4
*** Date 2012-12-02 00:00:00 (Sun) has modulo 0
Date 2012-12-03 00:00:00 (Mon) has modulo 1
Date 2012-12-04 00:00:00 (Tue) has modulo 2
Date 2012-12-05 00:00:00 (Wed) has modulo 3
Date 2012-12-06 00:00:00 (Thu) has modulo 4
*** Date 2012-12-07 00:00:00 (Fri) has modulo 0
Date 2012-12-08 00:00:00 (Sat) has modulo 1
Date 2012-12-09 00:00:00 (Sun) has modulo 2
Date 2012-12-10 00:00:00 (Mon) has modulo 3
Date 2012-12-11 00:00:00 (Tue) has modulo 4
*** Date 2012-12-12 00:00:00 (Wed) has modulo 0
Date 2012-12-13 00:00:00 (Thu) has modulo 1
Date 2012-12-14 00:00:00 (Fri) has modulo 2
Date 2012-12-15 00:00:00 (Sat) has modulo 3
Date 2012-12-16 00:00:00 (Sun) has modulo 4
*** Date 2012-12-17 00:00:00 (Mon) has modulo 0
Date 2012-12-18 00:00:00 (Tue) has modulo 1
Date 2012-12-19 00:00:00 (Wed) has modulo 2
Date 2012-12-20 00:00:00 (Thu) has modulo 3
Date 2012-12-21 00:00:00 (Fri) has modulo 4
*** Date 2012-12-22 00:00:00 (Sat) has modulo 0
Date 2012-12-23 00:00:00 (Sun) has modulo 1
Date 2012-12-24 00:00:00 (Mon) has modulo 2
Date 2012-12-25 00:00:00 (Tue) has modulo 3
Date 2012-12-26 00:00:00 (Wed) has modulo 4
*** Date 2012-12-27 00:00:00 (Thu) has modulo 0
Date 2012-12-28 00:00:00 (Fri) has modulo 1
Date 2012-12-29 00:00:00 (Sat) has modulo 2
Date 2012-12-30 00:00:00 (Sun) has modulo 3
Date 2012-12-31 00:00:00 (Mon) has modulo 4
*** Date 2013-01-01 00:00:00 (Tue) has modulo 0
Date 2013-01-02 00:00:00 (Wed) has modulo 1
Date 2013-01-03 00:00:00 (Thu) has modulo 2
Date 2013-01-04 00:00:00 (Fri) has modulo 3
Date 2013-01-05 00:00:00 (Sat) has modulo 4
*** Date 2013-01-06 00:00:00 (Sun) has modulo 0
Date 2013-01-07 00:00:00 (Mon) has modulo 1
Date 2013-01-08 00:00:00 (Tue) has modulo 2
Date 2013-01-09 00:00:00 (Wed) has modulo 3

So to use all this in our crons, we need to know the starting date, the frequency (every N days) and calculate the modulo. Once the modulo is known, we run the job if the modulo calculated for "now" (when the job is invoked) matches the modulo we want. So for instance if the period is 13 days and the modulo we want is 6, in our script we do:

if (( $(date +%s) / 86400 % 13 != 6 )); then exit; fi
# run the task here

Or, as usual, it can also be done in the crontab itself, so the script does not need to have special knowledge (it may not even be a script, in which case the check would have to be external anyway):

0 0 * * *  bash -c '(( $(date +\%s) / 86400 \% 13 == 6 )) && /path/to/task'

Note: so far, it doesn't seem to have trouble with DST time changes. Corrections welcome.
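As a quick sanity check of the DST claim (a minimal sketch; it assumes GNU date with tzdata installed, and uses America/New_York, where DST started on 2013-03-10), we can verify that the integer day number advances by exactly one per calendar day across the change, since epoch seconds are UTC-based:

```shell
# day numbers around the US DST change of 2013-03-10 (America/New_York);
# epoch seconds are UTC-based, so each local midnight still falls on a
# distinct, consecutive "day since the Epoch"
export TZ=America/New_York
d1=$(( $(date +%s -d "2013-03-09 00:00") / 86400 ))
d2=$(( $(date +%s -d "2013-03-10 00:00") / 86400 ))
d3=$(( $(date +%s -d "2013-03-11 00:00") / 86400 ))
echo "$d1 $d2 $d3"
```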

GRE bridging, IPsec and NFQUEUE

Lots of stuff, apparently unrelated, but it all came together recently, so here it is. Probably the whole thing is useless, but along the way I found many interesting things.

The network topology used for this experiment is as follows:


The initial task was to find a way to bridge site A and site B using IPsec (for starters, something that's not terribly useful; and yes, I know that there are other ways, but that's not the point here). While looking into it, I came across the (utterly undocumented) gretap tunnel of iproute2, which is, well, a GRE interface that can encapsulate ethernet frames (rather than IP packets, which is the more usual use case for GRE).

A gretap interface is created thus:

routerA# ip link add gretap type gretap local remote dev eth1
routerA# ip link show gretap
6: gretap@eth1: <BROADCAST,MULTICAST> mtu 1462 qdisc noop state DOWN mode DEFAULT qlen 1000
    link/ether 62:24:67:45:44:ad brd ff:ff:ff:ff:ff:ff

The idea is: add the gretap interface to a bridge at both ends of the tunnel, and voila, site A and site B are bridged. Then, use IPsec to encrypt the GRE traffic between the sites: mission accomplished. Right? No. Turns out it isn't so easy. Let's go step by step.

The MTU problem

As shown above, when the gretap interface is created, it has a default MTU of 1462, which is correct: 1500 (underlying physical interface) - 20 (outer IP header added by GRE) - 4 (GRE header) - 14 (ethernet header of the encapsulated frame) = 1462.
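The overhead arithmetic can be double-checked with a trivial shell calculation:

```shell
# default gretap MTU = physical MTU minus the encapsulation overhead
phys=1500      # underlying physical interface
outer_ip=20    # outer IP header added by GRE
gre=4          # GRE header
eth=14         # ethernet header of the encapsulated frame
echo $(( phys - outer_ip - gre - eth ))    # 1462
```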

However, the rest of machines in the LAN have an MTU of 1500, of course.

For a normal GRE interface (encapsulating pure IP) a lower tunnel MTU is a bit of an annoyance, but nothing critical: when a packet too big is received, an ICMP error (type 3, code 4 for IPv4) is sent back to the originator of the packet, which will then hopefully take the necessary actions (lower its MTU, retransmit but this time allowing fragmentation, whatever).
However here with our gretap interface we're bridging, which means that there's no "previous hop" where to send the ICMP error.
Furthermore, since the gretap interface is added to a bridge, the bridge MTU is lowered in turn. Since ethernet networks work on the assumption that all the participating interfaces have the same MTU, if a bridge with an MTU of 1462 (maximum frame size 1476) receives a full-sized frame (up to 1514 bytes), it just silently drops it. Not nice, although there's nothing else it can do.

To check, let's add the gretap interface to a bridge on routerA:

routerA# ip link add br0 type bridge
routerA# ip link set eth0 down
routerA# ip addr del 10.32.x.x/24 dev eth0    # remove whatever IP address it had
routerA# ip link set eth0 master br0
routerA# ip link set eth0 up
routerA# ip link set br0 up
routerA# ip addr add dev br0
routerA# ip link set gretap up
routerA# ip link set gretap master br0

and the same on routerB:

routerB# ip link add br0 type bridge
routerB# ip link set eth0 down
routerB# ip addr del 10.32.x.x/24 dev eth0   # remove whatever IP address it had
routerB# ip link set eth0 master br0
routerB# ip link set eth0 up
routerB# ip link set br0 up
routerB# ip addr add dev br0
routerB# ip link set gretap up
routerB# ip link set gretap master br0

(we assign to br0 on routerB to avoid conflicts, since the networks are bridged; these addresses are not used for these examples, anyway).

Now, from host A, we produce a maximum-sized frame:

hostA# ping -s 1472
PING ( 1472(1500) bytes of data.
--- ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2015ms

If we capture the traffic, we see that the small ARP request/reply frames pass through the tunnel, but the frames containing the actual ICMP echo request packets are silently dropped by br0 at routerA (which at this point still has an MTU of 1462).

Seems like the obvious thing to do is to raise the MTU of the gretap interface to 1500, so the bridge can have an MTU of 1500 again:

routerA# ip link show br0
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1462 qdisc noqueue state UP mode DEFAULT 
    link/ether 00:16:3e:c3:8c:ef brd ff:ff:ff:ff:ff:ff
routerA# ip link set gretap mtu 1500
routerA# ip link show br0
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT 
    link/ether 00:16:3e:c3:8c:ef brd ff:ff:ff:ff:ff:ff

and the same thing at routerB. Are we done now? Of course not, because while the bridge now accepts full-sized frames, there is a problem when a big frame has to be sent out the gretap "bridge port": after encapsulation, the size of the resulting IP packet can be up to 1514 + 4 + 20 = 1538 bytes. It may or may not be possible to fragment it, and here comes the next interesting point.

IP fragmentation

Now it turns out that, by default, GRE interfaces do Path MTU Discovery (PMTUD for short), which means the packets they produce have the DF (Don't Fragment) flag bit set. As mentioned above, this behavior is useful when encapsulating IP, so the previous hop can be notified of the tunnel bottleneck if needed. But again, here we're encapsulating ethernet and bridging, so there's no one to notify: if the oversized packet cannot be fragmented (and by default it can't, since PMTUD is performed), it's just dropped.
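To see the problem on the wire, one possible check (a sketch; the interface name is the one from this setup) is to watch for GRE packets leaving eth1 that still carry DF, using the fact that DF is bit 0x40 of byte 6 of the IPv4 header:

```shell
# GRE packets (IP protocol 47) with the DF bit set in the outer header
routerA# tcpdump -ni eth1 'ip proto 47 and ip[6] & 0x40 != 0'
```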

What we want, ideally, is that the encapsulated packets resulting from gretap encapsulation be fragmentable, that is, have the DF bit set to 0. It should be possible to do it, right?

Reading through the scarce iproute2 documentation, we learn that tunnel interfaces can be given a special option nopmtudisc at creation time, whose function, according to the manual, is to "disable Path MTU Discovery on this tunnel." Sounds just like the feature we want, so let's set the flag when creating the interface:

routerA# ip link del gretap
routerA# ip link add gretap type gretap local remote dev eth1 nopmtudisc
routerA# ip link set gretap mtu 1500
routerA# ip link set gretap up
routerA# ip link set gretap master br0

(same at routerB). However, if we now retry the oversized ping (-s 1472), it still doesn't work. How is that? A sample traffic capture shows that IP packets leaving hostA have the DF bit set, and apparently this bit gets copied to the outer IP header added by GRE (despite the nopmtudisc flag), which thus results in an unfragmentable packet which is silently dropped. If we explicitly disable DF on the IP packets produced by ping, it finally works:

ping -Mdont -s 1472
PING ( 1472(1500) bytes of data.
1480 bytes from icmp_req=1 ttl=64 time=1.26 ms
1480 bytes from icmp_req=2 ttl=64 time=0.932 ms
1480 bytes from icmp_req=3 ttl=64 time=1.01 ms
--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.932/1.071/1.264/0.143 ms

and capturing outgoing traffic at routerA and routerB shows fragmented GRE/IP packets (yes, fragments are bad, but probably better than nothing at all).
However, this is no solution, since we artificially modified the behavior of the application on the client, which is not doable for all the LAN hosts.

More bad news: Linux (and probably most operating systems) performs PMTUD by default, meaning that each and every IP packet produced by applications has the DF bit set (unless overridden by the application, which is not usual). There is a /proc entry that can disable this behavior on a global basis (/proc/sys/net/ipv4/ip_no_pmtu_disc, which has to be set to 1 to disable PMTUD), but we'd like to solve the problem on the routers without touching the other machines. So the goal now is: clear the DF bit from the outer IP header, so packets can be fragmented (regardless of the original value of DF).
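For reference, this is the global switch just mentioned, shown only to be explicit; as said, we don't want to use it, since every LAN host would need it:

```shell
# disable PMTUD box-wide (too blunt for our purposes)
echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc
```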

Here are some ugly drawings showing how the gretap encapsulation works. First, there's the original ethernet frame, which can be up to 1514 bytes (1500 IP + 14 ethernet header):


When this frame is encapsulated, we get this:


Clearing DF: NFQUEUE

One may imagine that it should be possible to perform this apparently simple task using standard tools; after all, it's "normal" packet mangling, which iptables can perform. Again, it turns out it's not so easy. There seems to be no native iptables way to do this (while it can touch other parts of the IP header, like TTL or TOS).

However, what iptables does have is a generic mechanism to pass packets to userspace: to use it, we create iptables rules that match the packets we want and apply the NFQUEUE (formerly QUEUE) target to them. Then, a user-space program can register to receive and process these packets, and decide their fate (or "verdict" in nfqueue speak): send them back to iptables (after optionally modifying them), or discard them. The mechanism used for this communication is called nfnetlink. The library is in C, but Perl and Python bindings exist.

Before writing the actual code, however, we have to create the iptables rule that matches the tunneled packets. Thinking about it a bit, it's not obvious which tables and chains it would traverse on the router. Would it traverse the INPUT chain? It's not destined to the local machine (or is it?). What about the OUTPUT chain? It's not locally generated (or is it?). Fortunately, in this case an easy way exists to clear all doubts, since iptables allows tracing and the complete set of tables and chains that are traversed can be easily seen and logged.
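For instance, a trace like the one below can be produced by adding TRACE rules in the raw table (the exact matches here are an assumption, adapted to this setup; the trace output ends up in the kernel log):

```shell
routerA# iptables -t raw -A PREROUTING -p icmp -j TRACE
routerA# iptables -t raw -A OUTPUT -p gre -j TRACE
```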

 1 TRACE: raw:PREROUTING:policy:2 IN=br0 OUT= PHYSIN=eth0 MAC=00:16:3e:52:ba:6c:00:16:3e:93:08:ca:08:00 SRC= DST= LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=819 SEQ=1
 2 TRACE: filter:FORWARD:policy:1 IN=br0 OUT=br0 PHYSIN=eth0 PHYSOUT=gretap MAC=00:16:3e:52:ba:6c:00:16:3e:93:08:ca:08:00 SRC= DST= LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ICMP TYPE=8 CODE=0 ID=819 SEQ=1
 3 TRACE: raw:OUTPUT:rule:1 IN= OUT=eth1 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47
 4 TRACE: raw:OUTPUT:policy:2 IN= OUT=eth1 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47
 5 TRACE: filter:OUTPUT:policy:2 IN= OUT=eth1 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47

 6 TRACE: raw:PREROUTING:policy:2 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=40301 PROTO=47
 7 TRACE: filter:INPUT:policy:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=40301 PROTO=47
 8 TRACE: raw:PREROUTING:rule:1 IN=br0 OUT= PHYSIN=gretap MAC=00:16:3e:93:08:ca:00:16:3e:52:ba:6c:08:00 SRC= DST= LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18065 PROTO=ICMP TYPE=0 CODE=0 ID=819 SEQ=1
 9 TRACE: raw:PREROUTING:policy:2 IN=br0 OUT= PHYSIN=gretap MAC=00:16:3e:93:08:ca:00:16:3e:52:ba:6c:08:00 SRC= DST= LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18065 PROTO=ICMP TYPE=0 CODE=0 ID=819 SEQ=1
10 TRACE: filter:FORWARD:policy:1 IN=br0 OUT=br0 PHYSIN=gretap PHYSOUT=eth0 MAC=00:16:3e:93:08:ca:00:16:3e:52:ba:6c:08:00 SRC= DST= LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18065 PROTO=ICMP TYPE=0 CODE=0 ID=819 SEQ=1

Lines 1 to 5 show the outgoing packet, and lines 6 to 10 the return packet containing the ICMP echo reply.

Since a bridge is involved, the actual result depends on whether the /proc entry /proc/sys/net/bridge/bridge-nf-call-iptables is set to 0 or 1. If it's set to 1, as most distributions do, the trace will be similar to the one shown above; if it's set to 0, the steps where the packet passes through the bridge will not be shown (lines 1, 2, 8, 9 and 10 above will be missing).
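The current value of that toggle can be checked (and changed) like this; on this box it's enabled:

```shell
routerA# cat /proc/sys/net/bridge/bridge-nf-call-iptables
1
routerA# echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables   # to disable
```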

So where do we tap into the flow to get our packets? Obviously we want to see the GRE packets, not the raw ethernet traffic (so whether iptables is called for bridge traffic or not is not important here), and we're only interested in traffic from the local LAN to the tunnel (lines 3 to 5 in the above trace). So a good place could be the OUTPUT chain, either in the raw or the filter table. Let's choose the filter table which is more common:

routerA# iptables -A OUTPUT -s -d -p gre -j NFQUEUE --queue-bypass
routerB# iptables -A OUTPUT -s -d -p gre -j NFQUEUE --queue-bypass

This sends all matching traffic to NFQUEUE queue number 0 (the default), and moves on to the next rule or policy if there's no user-space application listening on that queue (that's the --queue-bypass part), so at least small packets can pass by default.

Clearing DF, finally

Now all that's left is writing the user code that receives packets from queue 0, clears the DF bit, and sends them back to iptables. Fortunately, the library provides some sample C code that can be adapted for our purposes. So without further ado, let's fire up an editor and write this code:

/*
 * clear_df.c: clear, uh, DF bit from IPv4 packets. Heavily borrowed from
 * the sample code that comes with libnetfilter_queue.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>                 /* for recv() */
#include <netinet/in.h>
#include <linux/types.h>
#include <linux/netfilter.h>            /* for NF_ACCEPT */
#include <arpa/inet.h>
#include <time.h>
#include <libnetfilter_queue/libnetfilter_queue.h>

/* Standard IPv4 header checksum calculation, as per RFC 791 */
u_int16_t ipv4_header_checksum(char *hdr, size_t hdrlen) {

  unsigned long sum = 0;
  const u_int16_t *bbp;
  int count = 0;

  bbp = (u_int16_t *)hdr;
  while (hdrlen > 1) {
    /* the checksum field itself should be considered to be 0 (ie, excluded) when calculating the checksum */
    if (count != 10) {
      sum += *bbp;
    }
    bbp++; hdrlen -= 2; count += 2;
  }

  /* in case hdrlen was an odd number, there will be one byte left to sum */
  if (hdrlen > 0) {
    sum += *(unsigned char *)bbp;
  }

  while (sum >> 16) {
    sum = (sum & 0xffff) + (sum >> 16);
  }

  return (~sum);
}

/* callback function; this is called for every matched packet. */
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg, struct nfq_data *nfa, void *data) {

  u_int32_t queue_id;
  struct nfqnl_msg_packet_hdr *ph;
  int pkt_len;

  char *buf;
  size_t hdr_len;

  /* determine the id of the packet in the queue */
  ph = nfq_get_msg_packet_hdr(nfa);
  if (ph) {
    queue_id = ntohl(ph->packet_id);
  } else {
    return -1;
  }

  /* try to get at the actual packet */
  pkt_len = nfq_get_payload(nfa, &buf);

  if (pkt_len >= 0) {

    hdr_len = ((buf[0] & 0x0f) * 4);

    /* clear DF bit */
    buf[6] &= 0xbf;

    /* set new packet ID */
    *((u_int16_t *)(buf + 4)) = htons((rand() % 65535) + 1);

    /* recalculate checksum */
    *((u_int16_t *)(buf + 10)) = ipv4_header_checksum(buf, hdr_len);
  }

  /* "accept" the mangled packet */
  return nfq_set_verdict(qh, queue_id, NF_ACCEPT, pkt_len, buf);
}

int main(int argc, char **argv) {

    struct nfq_handle *h;
    struct nfq_q_handle *qh;
    int fd;
    int rv;
    char buf[4096] __attribute__ ((aligned));

    /* printf("opening library handle\n"); */
    h = nfq_open();
    if (!h) {
        fprintf(stderr, "error during nfq_open()\n");
        exit(1);
    }

    /* printf("unbinding existing nf_queue handler for AF_INET (if any)\n"); */
    if (nfq_unbind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_unbind_pf()\n");
        exit(1);
    }

    /* printf("binding nfnetlink_queue as nf_queue handler for AF_INET\n"); */
    if (nfq_bind_pf(h, AF_INET) < 0) {
        fprintf(stderr, "error during nfq_bind_pf()\n");
        exit(1);
    }

    /* printf("binding this socket to queue '0'\n"); */
    qh = nfq_create_queue(h, 0, &cb, NULL);
    if (!qh) {
        fprintf(stderr, "error during nfq_create_queue()\n");
        exit(1);
    }

    /* printf("setting copy_packet mode\n"); */
    if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
        fprintf(stderr, "can't set packet_copy mode\n");
        exit(1);
    }

    fd = nfq_fd(h);

    /* initialize random number generator */
    srand(time(NULL));

    while ((rv = recv(fd, buf, sizeof(buf), 0)) && rv >= 0) {
        nfq_handle_packet(h, buf, rv);
    }

    /* printf("unbinding from queue 0\n"); */
    nfq_destroy_queue(qh);

    /* printf("closing library handle\n"); */
    nfq_close(h);

    return 0;
}

It's useful to refer to this illustration of the IPv4 header to better follow the explanation.
The main thing worth noting of the above code is that, if we clear DF, we also have to fill in the "identification" (id) field (bytes 4 and 5 of the header) of the IPv4 header. This field is generally set to 0 when DF is set, since in that case of course the packet will never be fragmented. However, we're actually allowing fragmentation for a packet for which it was possibly not intended, so we fill the id field with a random 16-bit integer between 1 and 65535; this value is used by whoever has to reassemble the packet, to tell which fragments are part of the same original packet. If we left the field to 0, then all fragments (if any) would have the same ID and the receiver would have a hard time reassembling the original packet, especially if the fragments arrive out of order.
And of course, since we're changing the header, the checksum (bytes 10 and 11 of the header) has to be recalculated.

An obvious optimization of the above code (not implemented here as it's just a proof of concept) would be to immediately accept the packet without mangling it if the DF bit is already 0. Another possibility could be to not touch the packet if its length is less than the outgoing interface MTU (the output interface can be obtained using nfq_get_outdev, for example); however, in this case we'd be trusting all the hops along the path to have an MTU greater than or equal to ours, which may not be true; so when in doubt, we just always clear DF.

Compilation requires the appropriate header files to be present (libnfnetlink-dev and libnetfilter-queue-dev under debian). To compile the code, do:

gcc -o clear_df clear_df.c -lnfnetlink -lnetfilter_queue

So now, let's run our program and retry the damned ping (forcing DF, just to be sure):

routerA# clear_df
routerB# clear_df
hostA# ping -s 1472 -Mdo
PING ( 1472(1500) bytes of data.
1480 bytes from icmp_req=1 ttl=64 time=1.48 ms
1480 bytes from icmp_req=2 ttl=64 time=0.969 ms
1480 bytes from icmp_req=3 ttl=64 time=1.11 ms
1480 bytes from icmp_req=4 ttl=64 time=0.946 ms
1480 bytes from icmp_req=5 ttl=64 time=0.944 ms
--- ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.944/1.091/1.481/0.208 ms

So it finally works, and without changing anything on host A. On the routers, capturing the traffic on the link between routerA and routerB does indeed show fragmented packets. So we can finally say that "it works".

Final piece: IPsec

Although this was the original goal, it somehow got lost along the road while we were troubleshooting things, so back on track after the detour. Now that we got this far, adding IPsec should be a piece of cake. To make it a bit more interesting (not much), routerA is going to use ipsec-tools + racoon, while routerB will use Openswan. These are probably the most common IPsec implementations under Linux.

Since the GRE packets already contain router A's and router B's public IPs in the IP header, effectively making this a host-to-host tunnel, we can use IPsec's transport mode to just encrypt and authenticate the payload (this is a case where transport mode is actually useful). Note however that IPsec transport mode inserts a new header (the ESP header) after the first IP header, so the "protocol" field of the latter will change from 47 (GRE) to 50 (ESP). Graphically, after ESP is applied (which however always happens before fragmentation) we now have this monster:


Should we modify the iptables rule that sends the packets to userspace? Let's see. The easiest way to check is, again, tracing a packet and seeing which chains it traverses, so if we do it we see (omitting the bridging parts which are not relevant here):

 1 TRACE: raw:OUTPUT:policy:2 IN= OUT=eth1 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47 
 2 TRACE: filter:OUTPUT:rule:2 IN= OUT=eth1 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=47 
 3 TRACE: raw:OUTPUT:rule:1 IN= OUT=eth1 SRC= DST= LEN=152 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ESP SPI=0xdb28966a 
 4 TRACE: raw:OUTPUT:policy:2 IN= OUT=eth1 SRC= DST= LEN=152 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ESP SPI=0xdb28966a 
 5 TRACE: filter:OUTPUT:policy:3 IN= OUT=eth1 SRC= DST= LEN=152 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=ESP SPI=0xdb28966a 

 6 TRACE: raw:PREROUTING:policy:2 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=152 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=ESP SPI=0x4802e17
 7 TRACE: filter:INPUT:policy:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=152 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=ESP SPI=0x4802e17 
 8 TRACE: raw:PREROUTING:rule:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=47 
 9 TRACE: raw:PREROUTING:policy:2 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=47 
10 TRACE: filter:INPUT:policy:1 IN=eth1 OUT= MAC=00:16:3e:c3:8c:12:00:16:3e:11:b8:8e:08:00 SRC= DST= LEN=122 TOS=0x00 PREC=0x00 TTL=63 ID=21840 PROTO=47 

So interestingly enough, we see that (despite Linux not having a real IPsec virtual interface) IPsec packets traverse the chains twice, once before encryption (with PROTO=47, lines 1-2) and again after encryption (with PROTO=ESP, that is, protocol 50, lines 3-5).
This means that we can either keep the existing iptables rules, in which case our code will receive the unencrypted packets, or change them to match protocol ESP (-p 50 or -p esp), in which case we'll see ESP packets. Note that we can do the latter because ESP does not protect the outer header; if we used AH (protocol 51), we wouldn't be able to change the ID field without rendering the packet invalid, since AH considers it immutable and thus authenticates (ie, signs) it. So if we were using AH in transport mode, we would definitely want to match the unencrypted packets (ie, -p 47 or -p gre). Though according to most people, AH is next to useless anyway.

However, since NFQUEUE has to copy packets between kernel space and user space and back, and since unencrypted packets are smaller, it's more efficient to match on protocol 47 (thus we're leaving the existing iptables rules unchanged).

For completeness, here are the sample IPsec configurations used on routerA and routerB.

ipsec-tools.conf on routerA:

#!/usr/sbin/setkey -f

## Flush the SAD and SPD
flush;
spdflush;

spdadd gre -P out ipsec
       esp/transport//require;

spdadd gre -P in ipsec
       esp/transport//require;

racoon.conf on routerA:

log notify;
path pre_shared_key "/etc/racoon/psk.txt";
path certificate "/etc/racoon/certs";

remote {
        exchange_mode main;
        proposal {
                encryption_algorithm 3des;
                hash_algorithm md5;
                authentication_method pre_shared_key;
                dh_group modp1024;
        }
}

sainfo address gre address gre {
        pfs_group modp1024;
        encryption_algorithm 3des;
        authentication_algorithm hmac_md5;
        compression_algorithm deflate;
}

ipsec.conf on routerB:

version 2.0     # conforms to second version of ipsec.conf specification

# basic configuration
config setup

conn to-routerA

For authentication we'll use pre-shared keys, but the changes to use certificate-based authentication are trivial (and not directly related to the main point of the article anyway).

The main thing to note is that we explicitly specify in our policies that we want to encrypt GRE traffic only, since that's what carries the tunneled ethernet frames. Everything in the above configuration can be changed; the policy can be changed to encrypt all traffic (using "any"), or the hash and encryption algorithms can be changed. There's nothing magic or special; it's just plain IPsec configuration.