
Some notes on macvlan/macvtap

There's not a lot of documentation about these interfaces. Here are some notes to summarize what I've been able to gather so far. Surely there's more to it (corrections and/or more information welcome).


Macvlan interfaces can be seen as subinterfaces of a main ethernet interface. Each macvlan interface has its own MAC address (different from that of the main interface) and can be assigned IP addresses just like a normal interface.

So with this it's possible to have multiple IP addresses, each with its own MAC address, on the same physical interface. Applications can then bind specifically to the IP address assigned to a macvlan interface, for example. The physical interface to which the macvlan is attached is often referred to as "the lower device" or "the upper device"; here we'll use the term "lower device".

The main use of macvlan seems to be container virtualization (for example LXC guests can be configured to use a macvlan for their networking and the macvlan interface is moved to the container's namespace), but there are other scenarios, mostly very specific cases, like using virtual MAC addresses (see for example this keepalived feature).

A macvlan interface can work in one of four modes, defined at creation time.

  • VEPA (Virtual Ethernet Port Aggregator) is the default mode. If the lower device receives data from a macvlan in VEPA mode, this data is always sent "out" to the upstream switch or bridge, even if it's destined for another macvlan in the same lower device. Since macvlans are almost always assigned to virtual machines or containers, this makes it possible to see and manage inter-VM traffic on a real external switch (whereas with normal bridging it would not leave the hypervisor), with all the features provided by a "real" switch. However, at the same time this implies that, for VMs to be able to communicate, the external switch should send back inter-VM traffic to the hypervisor out of the same interface it was received from, something that a standard switch normally does not do (a bridge never forwards a frame out of the port it arrived on). This feature (the so-called "hairpin mode" or "reflective relay") isn't widely supported yet, which means that if using VEPA mode with an ordinary switch, inter-VM traffic leaves the hypervisor but never comes back (unless it's sent back at the IP level by a router somewhere, but then there's nothing special about that, it has always worked that way).
    Since there are few switches supporting hairpin mode, VEPA mode isn't used all that much yet. However it's worth mentioning that Linux's own internal bridge implementation does support hairpin mode in recent versions; assuming eth0 is a port of br0, hairpin mode can be enabled by doing

    # echo 1 > /sys/class/net/br0/brif/eth0/hairpin_mode

    or using a recent version of brctl:

    # brctl hairpin br0 eth0 on

    or even better, using the bridge program that comes with recent versions of iproute2:

    # bridge link set dev eth0 hairpin on

    So a Linux box could very well be used in the role of "external switch" as mentioned above.

  • Bridge mode: this works almost like a traditional bridge, in that data received on a macvlan in bridge mode and destined for another macvlan of the same lower device is sent directly to the target (if the target macvlan is also in bridge mode), rather than being sent outside. This of course works well with non-hairpin switches, and inter-VM traffic has better performance than VEPA mode, since the external round-trip is avoided. In the words of a kernel developer,

    The macvlan is a trivial bridge that doesn't need to do learning as it
    knows every mac address it can receive, so it doesn't need to implement
    learning or stp. Which makes it simple stupid and and fast.

  • Private mode: this is essentially like VEPA mode, but with the added feature that no macvlans on the same lower device can communicate, regardless of where the packets come from (so even if inter-VM traffic is sent back by a hairpin switch or an IP router, the target macvlan is prevented from receiving it). I haven't tried, but I suppose that it is the operating mode of the target macvlan that determines whether it receives the traffic or not. This mode is useful, of course, if we really want macvlan isolation.
  • Passthru mode: this mode was added later, to work around some limitations of macvlans (more details here). I'm not 100% clear on what problem passthru mode tries to solve, as I was able to set promiscuous mode, create bridges, vlans and sub-macv{lan,tap} interfaces in KVM guests using a plain macvtap in VEPA mode for their networking (so no need for passthru). Since I'm surely missing something, more information (as usual) is welcome.

VEPA, bridge and private modes come from a standard called EVB (edge virtual bridging); a good article which provides more information can be found here.

Curiously (at least, in the case of the three original operating modes), the operating mode is per-macvlan interface rather than global (per-physical device); I guess that it's then more or less mandatory to configure all the macvlans of the same lower device to operate in the same mode, or at least match the macvlan modes so that only intended inter-VM traffic is possible; not sure what would happen, for instance, if a macvlan using VEPA mode tries to communicate with another one using bridge mode, or vice versa. This may well be worth investigating.

Irrespective of the mode used for the macvlan, there's no connectivity from whatever uses the macvlan (eg a container) to the lower device. This is by design, and is due to the way macvlan interfaces "hook into" their physical interface. If communication with the host is needed, the solution is kind of easy: just create another macvlan on the host on the same lower device, and use this to communicate with the guest.
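
As a rough sketch of that workaround, the commands could be wrapped as follows. The interface names and the address (eth0, macvlan0, 192.0.2.1/24) are placeholders I made up, and bridge mode is only an assumption, chosen so that host-to-guest traffic can stay inside the lower device. With DRY_RUN=1 the commands are just printed; run for real, they need root.

```shell
# Sketch: give the host its own macvlan on the same lower device so it can
# talk to the guests. Names and addresses used with it are placeholders.
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# host_macvlan LOWER NAME ADDR/PREFIX  (hypothetical helper)
host_macvlan() {
    lower=$1 name=$2 addr=$3
    run ip link add link "$lower" "$name" type macvlan mode bridge &&
    run ip addr add "$addr" dev "$name" &&
    run ip link set "$name" up
}

# Example (prints the three commands without touching the system):
#   DRY_RUN=1 host_macvlan eth0 macvlan0 192.0.2.1/24
```

The guests then use an address in the same subnet to reach the host through macvlan0 rather than through the lower device itself.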

The documentation of iproute2 about setting operating mode for macvlans isn't complete, since neither "ip link help" nor the man pages mention how to do that. Fumbling around a bit, it can be seen that the syntax is

# ip link add link eth2 macvlan2 type macvlan mode aaa    # hit enter here to force an error message
Error: argument of "mode" must be "private", "vepa", "bridge" or "passthru"

Even more undocumented (if possible) is the way to show the operating mode of a macvlan, which turns out to be

# ip -d link show macvlan2
27: macvlan2@eth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT 
    link/ether 26:8a:3c:07:7d:f4 brd ff:ff:ff:ff:ff:ff
    macvlan  mode vepa 

Let's hope that all this appears in the documentation soon.
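
In the meantime, a script that needs the mode has to pick it out of the detailed output. Here's a minimal sketch, assuming the output format shown above; the function name macvlan_mode is made up, and newer iproute2 versions may print extra fields after the mode (which this should tolerate, but I haven't checked them all).

```shell
# macvlan_mode: print the operating mode found in "ip -d link show DEV" output.
# Hypothetical helper; reads the command output on standard input and
# assumes a detail line of the form "macvlan  mode <name>".
macvlan_mode() {
    awk '($1 == "macvlan" || $1 == "macvtap") {
        for (i = 2; i < NF; i++)
            if ($i == "mode") { print $(i + 1); exit }
    }'
}

# Typical use (needs an existing interface):
#   ip -d link show macvlan2 | macvlan_mode
```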

The MAC address of the macvlan is normally autogenerated; to explicitly specify one, the following syntax can be used (which also specifies custom name and operating mode at the same time):

# ip link add link eth2 FOOMACVLAN address 56:61:4f:7c:77:db type macvlan mode bridge

Final note, it's also possible to create a macvlan interface and bridge it (eg brctl addif br0 macvlan2); though it's a bit weird, it does work fine.

macvtap interfaces

A macvtap is a virtual interface based on macvlan (and thus tied to another interface). It resembles a regular tap interface in that a program can attach to it and read/write frames; however, the similarities end there. The most prominent user of macvtap interfaces seems to be libvirt/KVM, which allows guests to be connected to macvtap interfaces. Doing so gives guests (almost) bridge-like behavior without the need for a real bridge on the host, as a regular ethernet interface can be used as the macvtap's lower device.

Some notes about macvtap (more information is always welcome):

  • Since it's based on macvlan, macvtap shares the same operating modes it can be in (VEPA, bridge, private and passthru)
  • Similarly, a guest using a macvtap interface cannot communicate directly with its lower device in the host. In fact, if you run tcpdump on the macvtap interface on the host, no traffic will be seen. Again this is by design, but can be surprising. This link has some details and suggests workarounds for KVM in case this functionality is needed. A quick workaround is to create a macvlan (not macvtap) interface on the host, which will then be visible from the guests. (On a side note, this is also a way to use routed mode for the macvtap guests: put the host's macvlan and all guests on the same IP subnet, configure the guests to use the host macvlan's IP as their default gateway, and have the host do NAT between the macvlan and the physical interface. But then, in this case, it's probably easier to use a real bridge).
  • Creation of a macvtap interface is not done by opening /dev/net/tun; instead, it looks like the only way to create one is to directly send appropriate messages to the kernel via a netlink socket (at least, that's how iproute2 and libvirt do it; strace and/or the source will show the details, as there seems to be no documentation whatsoever). This makes it a bit more complicated than a normal tun/tap interface.
  • macvtap interfaces are persistent by default. Once the macvtap interface has been created via netlink, an actual character device file appears under /dev (this does not happen with normal tap interfaces). The device file is called /dev/tapNN, where NN is the interface index of the macvtap (can be seen for example with "ip link show"). It's this device file that has to be opened by programs wanting to use the interface (eg libvirtd/qemu to connect a guest).
  • One consequence of there being an actual device file for the macvtap interface is that traffic entering the interface can be seen and "stolen" from the intended recipient by simply reading from the device file; doing "cat /dev/tap22" (for example) while a guest VM is using it dumps the raw ethernet frames and prevents the VM from seeing them. On the other hand, neither seeing outgoing traffic nor injecting frames by writing to the device file from the outside seems to be possible.
  • If a VM is connected to the macvtap, the MAC address of the macvtap interface as seen on the host is the same that is seen by the guest; this is different from regular tap interfaces, where the guest is somehow "behind" the tap interface (the vnetX interfaces on the host have a MAC address which is not the same that the guest uses).
  • All traffic for guests connected to a macvtap does show up if running tcpdump on the lower device, even in bridge mode and for guest-to-guest traffic. However, as said, tcpdump (on the host) on the macvtap device itself shows no traffic.
  • If the lower device is a wireless card, macvtap doesn't work (the guest is isolated, nothing enters, nothing exits). Perhaps it's just that it only works with some wireless cards, and I happened to have one that doesn't work. Again, I could not find more information.
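
Going back to the /dev/tapNN naming mentioned above: since NN is the interface index, the device path can be derived from sysfs. A small sketch follows; the helper name is made up, and the second argument exists only so the function can be tried against a fake directory tree.

```shell
# macvtap_devnode: print the character device path of a macvtap interface.
# Hypothetical helper. $1 is the interface name; $2 is the sysfs network
# directory (defaults to /sys/class/net, overridable for experimenting).
macvtap_devnode() {
    ifindex_file="${2:-/sys/class/net}/$1/ifindex"
    [ -r "$ifindex_file" ] || return 1
    printf '/dev/tap%s\n' "$(cat "$ifindex_file")"
}

# Example: macvtap_devnode macvtap2
```

This only computes the name; whether the device node actually exists still depends on the macvtap having been created.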

As said, creating a macvtap interface via code is a bit complicated, but luckily iproute2 can do it on the command line. To create a macvtap interface called macvtap2, with eth2 as its lower physical interface:

# ip link add link eth2 macvtap2 address 00:22:33:44:55:66 type macvtap mode bridge
# ip link set macvtap2 up
# ip link show macvtap2
18: macvtap2@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT qlen 500
    link/ether 00:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff
# ls -l /dev/tap18 
crw------- 1 root root 250, 1 May 26 10:51 /dev/tap18

To delete the interface, the usual command can be used:

# ip link del macvtap2

Two links which provide good information about macvtap:

Smart ranges in sed

Since there seem to be still quite a few people who want to do this with sed...let's see how to select ranges of lines in the same way as with awk (explained here).

We should also avoid the same issue described there, that is, if other /BEGIN/ lines are found while we are inside a range, those lines should be printed. So with this input:

1 BEGIN
2 foo
3 bar
4 BEGIN
5 baz
6 END

at least lines 2 to 5 should be printed (line 1, or 6, or both may also be printed, depending on whether and which range endpoint we are including/excluding).

We're going to assume a sed with ERE (-E) support (as should be the norm these days anyway).

From BEGIN to END, inclusive

This is obviously the easy one:

# print lines from /BEGIN/ to /END/, inclusive
$ sed '/BEGIN/,/END/!d'
$ sed -n '/BEGIN/,/END/p'

No mysteries here. Let's get to the interesting cases.

From BEGIN to END, excluding END

# print lines from /BEGIN/ to /END/, excluding /END/
$ sed '/BEGIN/!d; :loop; n; /END/d; $!bloop'

We start a loop when we see a /BEGIN/, and keep looping until we see an /END/, at which point we delete the line so it's not printed.

From BEGIN to END, excluding BEGIN

# print lines from /BEGIN/ to /END/, excluding /BEGIN/
$ sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; p; d;}; $!bloop'

Same loop, but the lines are accumulated in the pattern space, and the first of them is removed before printing the whole block (note that the "D" command cannot be used for that purpose here, as it starts a new cycle).

From BEGIN to END, not inclusive

This is of course just a small variation on the preceding one, in that we delete both the first and the last line:

# print lines from /BEGIN/ to /END/, excluding both lines
$ sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; s/\n?[^\n]*$//; /./p; d;}; $!bloop'

Since we're excluding both the start and the end line, what's left after removing them may be empty, so we check that there's at least one character left and we only print the pattern space if that is the case.
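
To compare the four variants side by side, they can be wrapped in shell functions and fed the same input. This is just a test harness, assuming GNU sed; the function names and the sample data (an outer range with a second BEGIN inside it, plus a trailing line) are made up for illustration.

```shell
# Sample input: one BEGIN..END range containing a second BEGIN line,
# followed by a line outside the range. Invented data.
sample() {
    printf '%s\n' '1 BEGIN' '2 foo' '3 bar' '4 BEGIN' '5 baz' '6 END' '7 after'
}

# The four commands from the text, unchanged, reading standard input.
range_inclusive() { sed -n '/BEGIN/,/END/p'; }
range_no_end()    { sed '/BEGIN/!d; :loop; n; /END/d; $!bloop'; }
range_no_begin()  { sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; p; d;}; $!bloop'; }
range_exclusive() { sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; s/\n?[^\n]*$//; /./p; d;}; $!bloop'; }

sample | range_inclusive    # prints lines 1 to 6
sample | range_no_end       # prints lines 1 to 5
sample | range_no_begin     # prints lines 2 to 6
sample | range_exclusive    # prints lines 2 to 5
```

In all four cases the inner "4 BEGIN" line is printed like any other line of the range.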

For anything more complex, just use awk!

Pulling out strings

This is a generic text-processing need that often occurs in different kinds of scripts. Simply put, you want to get a list of the strings in the file (or files) that match a certain pattern. Let's use this simple file as an example:


Our pattern is (using ERE syntax) "foobar[0-9]+", that is, "foobar" followed by one or more digits. We will refine it a bit later.

Using common shell tools, we have several possibilities.

GNU grep

Probably the simplest one, if GNU grep is available, is to use its -o option, to return only the part of the input that matches the pattern, so:

$ grep -Eo 'foobar[0-9]+' test.txt

As said, this needs GNU grep due to the -o option.

GNU awk and BusyBox awk

These two awk implementations support, as a non-standard extension, the assignment of a regular expression to RS, and make whatever matched RS available in the special variable RT (mawk seems to support the former feature, but not the latter, which makes it unsuitable to be used in the way we describe here). So here's how to use these awks for the task:

$ gawk -v RS='foobar[0-9]+' 'RT{print RT}' test.txt

Note that using RS/RT this way makes it possible to match patterns that contain newlines, something that's not easily achieved with other tools (except Perl, see below).

These methods are easy and quick; however, if none of the above implementations is available, we need to use something more standard.

Standard awk

With standard awk, a way to extract all occurrences is to use a loop over each line, repeatedly using match():

$ cat matches.awk
{
  line = $0
  while (match(line, /foobar[0-9]+/) > 0) {
    print substr(line, RSTART, RLENGTH)
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches.awk test.txt

Here the original line is saved (in case it's needed for further processing) and a copy is used to find matches. Since match() only finds the first match in the string, when a match is found it's removed so running match() again can find the following occurrence (if any). For this reason, the above code will loop forever if it's given a pattern that can match the empty string, like for example a*. When you do that, you really want a+ instead anyway, so use the latter. The code above is a common awk idiom to find all matches of a pattern.
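
If the pattern comes from outside the script and there's a chance it can match the empty string, the loop can be guarded so that it always makes progress. A minimal sketch of one way to do it (the wrapper name extract_matches is made up, and skipping a single character on an empty match is my own choice, not the only possible one):

```shell
# extract_matches: print every non-empty match of an ERE, one per line.
# Hypothetical wrapper around the match() loop; $1 is the pattern, any
# further arguments are input files (standard input if none). Note that
# awk processes backslash escapes in -v assignments.
extract_matches() {
    pattern=$1; shift
    awk -v re="$pattern" '{
        line = $0
        while (line != "" && match(line, re) > 0) {
            if (RLENGTH == 0) {
                # Empty match: drop one character so the loop always advances.
                line = substr(line, 2)
                continue
            }
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

# Example: extract_matches 'foobar[0-9]+' test.txt
```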


sed

With sed the task is a bit complicated. Basically, we need to somehow "mark" the parts of the data that match our pattern, so we can later delete everything that's not between markers, thus leaving only what we want.

A safe character to use as marker is the newline character (\n), since sed guarantees that, under normal conditions, no input line as seen in the pattern space will contain that character. For the first of the following solutions to work, a sed implementation that recognizes \n in the RHS and the special bracket expression [^\n] (any character except \n) is needed. And since our pattern is an ERE (though it could be rewritten as BRE), we need a sed that recognizes EREs. GNU sed has all these features, and we're going to assume it in the examples.

That said, let's see a couple of ways to solve the task with sed.

One somewhat laborious solution is as follows:

$ sed -E '
s/foobar[0-9]+/\n&/g; s/^[^\n]*\n//
t ok
d
:ok
s/(foobar[0-9]+)[^\n]*/\1/g' test.txt

Here we prepend a \n to each match, then delete what's before the very first match in the line (zero or more non-\n followed by a \n at the beginning of the string). Finally we delete all the parts between matches, which leaves us with just the matches, nicely separated by \n characters.

Another approach to the problem is implemented with the following code (which also has the benefit of using standard syntax; changing the ERE into BRE (foobar[0-9][0-9]*) and converting all the "\n" in the RHS to literal escaped newlines would allow this solution to be used with a standard sed):

$ sed -E '
/\n/!s/foobar[0-9]+/\n&\n/g
/^foobar[0-9]+\n/P
D' test.txt

Here the approach is to "isolate" each match with a \n before and one after (if the pattern space doesn't already have one). If the line begins with a match, it's printed with "P" (up to the following \n, which is what we want). Regardless, the part up to and including the first \n is deleted (with "D"). If something is left, go to the beginning to do the previous steps again, until the whole pattern space is entirely consumed. If there were no matches in the original line, "D" will just delete it entirely and start a new cycle. Rinse and repeat for every input line.


Perl

With perl we can do it pretty easily thanks to its powerful regular expression matching operators:

$ perl -ne 'print "$_\n" for (/foobar\d+/g);' test.txt

If the pattern we want has newlines in it, we can just tell perl to slurp the whole file with perl -0777 -ne (or to read it a paragraph at a time with perl -00 -ne) and we're set.

Context comes to town

All the solutions seen so far strictly match a pattern, regardless of where it appears. In other words, they ignore the context of the matches. However there may be cases where this is important. In our example input data, we might want to match foobar[0-9]+ only if it's delimited, where "delimited" here is defined as "preceded by either a hash (#) or beginning of line, and followed by either a hash or end of line". Clearly, with these new requirements we don't want the foobar12 in the last line.

We thus need to consider the context in the regular expressions, making them include a larger text, so that matches only happen where there's data that we want; however, since the matched text will now be larger than what we need, we need to subsequently "clean up" the match, extracting only what we want from it. Our regular expression now becomes (ERE syntax)

(^|#)foobar[0-9]+(#|$)


Let's see how to modify the previous solutions to work with context.

GNU grep

Grep can't really edit text, so it would seem like it's out of the discussion here, but with a silly trick we can still use it:

$ grep -Eo '(^|#)foobar[0-9]+(#|$)' test.txt | grep -Eo 'foobar[0-9]+'

The first grep prints all matches with their context, and the second one, operating only on the good data, strictly "extracts" the matches that we need.

GNU awk and BusyBox awk

Setting RS to a non-default value obviously causes awk to stop working in line-oriented mode, so the beginning-of-line and end-of-line anchors in our regular expression need to be augmented to consider the newline character.

Now, with the extended RS, RT will contain the full match with context, so we use gsub() to clean it up:

$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test.txt

The critical part here is obviously the gsub(), which should be written carefully to remove the context stuff and only leave what we want.

Standard awk

Here we don't change RS so we're using the traditional line-oriented mode:

$ cat matches2.awk
{
  line = $0
  while (match(line, /(^|#)foobar[0-9]+(#|$)/) > 0) {
    m = substr(line, RSTART, RLENGTH)
    gsub(/^#|#$/, "", m); print m
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches2.awk test.txt


sed

Things start to get complicated with sed if we want context. However we can still do it.

Of the two sed solutions presented previously, the easiest to adapt is the second one, so here it is:

$ sed -E '
/\n/!s/(^|#)foobar[0-9]+(#|$)/\n&\n/
/^#?foobar[0-9]+#?\n/ {
  s/^#?(foobar[0-9]+)#?\n/\1\n/; P
}
D' test.txt

Again, the critical bit is the part where the context (that we needed to match only the "correct" parts, but no longer want) is removed. This part will be highly dependent on the actual input data and problem requirements.


Perl

Perl is again an easy winner, as we can match with context and pull out only the interesting parts in a single go:

$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?:#|$)/g);' test.txt

The regular expressions for what comes before and after are non-capturing, so the list returned by the overall match is already made of clean strings, which we thus just need to print.

Overlap problems

You might have noticed that at the same time we introduced context to the matches, we also introduced the potential for overlap. Consider the following sample input data:


If we run for example the above GNU awk solution on this data, we get:

$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test2.txt

The foobar9999 is missed since the regular expression that matches foobar3 also "consumes" its surrounding context (the leading and trailing hash) and thus applying the regex with context again on what's left fails to match the second occurrence of the pattern.

However, this does not happen with all the solutions; only with some of them. The standard awk and the sed solutions still work since the previous match is deleted from the line, and the extended pattern we use to include context works if the match is at the beginning of a line without a delimiter, too. In the example, once #foobar3# has been matched and removed what's left is "^foobar9999#blah$", and the expression we're using for the match can still match it again since the pattern is at the very beginning and ^ is a possible anchor.
Of course, this happens to work because of the specific combination of input data and regular expressions that we're using; generally speaking, this doesn't have to be the case. It will depend on the actual situation.

The modern RE engine answer to safely solve the overlapping context problem is, naturally, lookaround, which turns actual consumed characters into zero-length assertions, and leaves them available for the next match attempt. This means that sed and awk are excluded, since their RE engines do not support lookaround.

What's left is GNU grep (with its -P option to match in PCRE mode, where available), and of course perl.


$ grep -Po '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt

There's also a pcregrep utility that comes with the PCRE library, with a syntax similar to that of grep. In particular, it supports the -o option, so we can also do:

$ pcregrep -o '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt

Let's try perl:

$ perl -ne 'print "$_\n" for (/(?<=^|#)(foobar\d+)(?=#|$)/g);' test2.txt
Variable length lookbehind not implemented in regex m/(?<=^|#)(foobar\d+)(?=#|$)/ at -e line 1.

It seems PCRE is more advanced than perl itself in this particular feature. As man pcrepattern informs us,

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length. Thus

(?<=bullock|donkey)

is permitted, but

(?<!dogs?|cats?)

causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl, which requires all branches to match the same length of string. An assertion such as

(?<=ab(c|de))

is not permitted, because its single top-level branch can match two different lengths, but it is acceptable to PCRE if rewritten to use two top-level branches:

(?<=abc|abde)

So what can we do with perl? We have two possibilities.

We note that, strictly speaking, and in this particular case, only what follows the match has to be preserved for the next attempt; the lookbehind is not strictly needed, and we can replace it with a regular match. Thus:

$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?=#|$)/g);' test2.txt

Another way to solve the problem is a bit ugly, but it works: we can just move the ^ anchor outside the lookbehind and make it part of a regular alternation; since it's a zero-length match anyway, nothing is harmed:

$ perl -ne 'print "$_\n" for (/(?:^|(?<=#))(foobar\d+)(?=#|$)/g);' test2.txt

It is important to understand that there's no generic rule here, and the solution will necessarily have to depend on the problem at hand. Depending on the actual situation, transforming a variable-length lookbehind into something accepted by perl may not always be so easy (or even possible).

Diskless iSCSI boot with PXE HOWTO

Here we will boot a machine (diskless or not, but even if it has a disk it won't be used) entirely from the network using PXE and the iSCSI protocol.

There are a few options to boot a system whose root partition is on iSCSI:

  • The machine could have a local bootloader that loads a local kernel and initrd. With suitable options, the initrd scripts are directed to log into an iSCSI LUN and use it as /. In this case, the LUN that is used as root filesystem does not need to have a kernel or bootloader installed.
  • Same as above, but the kernel and initrd are downloaded using PXE (via TFTP or HTTP).
  • The most interesting option (and the one that will be described here) is booting directly the iSCSI LUN via PXE. In this case, the LUN looks exactly like a local disk, with partitions, MBR, bootloader (grub) etc. The MBR is read and executed, which loads the second-stage bootloader and so on, just as if the disk were local.

A peculiar thing about iSCSI is that it doesn't really like the network going away while a session is connected. For this reason it is very important that the network be stable and reliable, but there are also a few specific boot-time tweaks to do in the Linux distribution that is being run from iSCSI. One of them is, of course, supplying the needed iSCSI information to the kernel; another one is preventing the initscripts from trying to (re)configure the network on the interface that is being used for the iSCSI session, as this may cause it to go down temporarily. In this case, the network is configured early, by the initrd, and should not be touched afterwards.

For this example, we will boot a Debian Wheezy over iSCSI, using PXE to read the LUN right from the very beginning (MBR and bootloader stage). For this to work, a PXE implementation that supports booting from iSCSI is obviously needed. iPXE is one such implementation (see here for more information on how to setup a more complete PXE infrastructure); here we will assume that the booting client is sent iPXE commands.


Debian does not (yet?) support direct installation to iSCSI, so there are two ways to do this: the first way is to transfer an existing installation to the LUN (eg using dd or rsync). The second (described here) is to use debootstrap on an existing helper machine to partition, install and prepare the LUN. The specific tweaks described starting from "iSCSI boot configuration" have to be performed regardless of whether it's an existing or a new install (if it's an existing installation, remember to chroot into it first).

When commands are shown, the prompt shows where they have to be run: "helper" is the helper machine, "client" is the chroot environment (ie the future iSCSI boot client).

Log into the LUN

We assume that our LUN is provided by the SAN at (, is called and has a size of 10G. So from a (possibly Debian or Ubuntu) machine with open-iscsi installed, we can log into it:

helper# iscsiadm -m discovery -t sendtargets -p,1
helper# iscsiadm -m node -T '' -p -l
Logging in to [iface: default, target:, portal:,3260] (multiple)
Login to [iface: default, target:, portal:,3260] successful.
helper# ls -l /dev/disk/by-path
lrwxrwxrwx 1 root root  9 Nov  2 15:03 -> ../../sda

To make things more interesting (not much), we're going to use the newer GPT partitioning. For simplicity, here we'll create a 512MB swap partition and a 9.5G root partition. On BIOS systems, which are still the majority, GPT also needs a small partition at the beginning of the disk, the so-called "BIOS boot partition" (type EF02). See here, here and here for more info (all three documents are very interesting reads). So here's the disk layout:

helper# gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.8.5

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 67D92849-CD16-4CB1-8B3B-0758E62227CA
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            8191   3.0 MiB     EF02  BIOS boot partition
   2            8192         1056767   512.0 MiB   8200  Linux swap
   3         1056768        20971486   9.5 GiB     8300  Linux filesystem
helper# mkfs.ext4 /dev/sda3
mke2fs 1.42.5 (29-Jul-2012)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
622592 inodes, 2489339 blocks
124466 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2550136832
76 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

helper# mkswap /dev/sda2
Setting up swapspace version 1, size = 524284 KiB
no label, UUID=e4f25981-3886-4939-a5cf-b05a0c7058a6

System installation

Let's mount the partition and install a minimal system with debootstrap:

helper# mkdir /mnt/chroot
helper# mount /dev/sda3 /mnt/chroot
helper# debootstrap wheezy /mnt/chroot
I: Retrieving Release
I: Retrieving Release.gpg
I: Checking Release signature
I: Configuring tasksel...
I: Configuring tasksel-data...
I: Base system installed successfully.

Now let's chroot into the system to finish the install:

helper# mount -t proc none /mnt/chroot/proc
helper# mount -t sysfs none /mnt/chroot/sys
helper# mount --bind /dev /mnt/chroot/dev
helper# chroot /mnt/chroot /bin/bash

Let's create /etc/mtab which is needed by many programs:

client# cp /proc/mounts /etc/mtab
client# sed -i '\|^/dev/sda3|,$!d' /etc/mtab

The sed command removes the first lines from the file, which are not relevant for the chrooted system, and keeps only lines from the one starting with /dev/sda3 to the end (replace sda3 if your partition name is different, of course).
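
To see what that address range does without touching a real /etc/mtab, it can be tried on some invented mount data first. A quick sketch (the function name and the sample entries are made up; the device name is used as a regular expression, so names with special characters would need escaping):

```shell
# trim_mtab: keep only the lines from the first one starting with the given
# device up to the end of input, deleting everything before it.
# Hypothetical helper; $1 is the device (e.g. /dev/sda3), input on stdin.
trim_mtab() {
    sed "\\|^$1|,\$!d"
}

# Invented sample data resembling /proc/mounts as seen from a chroot:
printf '%s\n' \
    'rootfs / rootfs rw 0 0' \
    'sysfs /sys sysfs rw 0 0' \
    '/dev/sda3 / ext4 rw 0 0' \
    'none /proc proc rw 0 0' | trim_mtab /dev/sda3
```

Only the last two sample lines survive, which is the effect described above.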

Now let's create /etc/fstab. In this case, the best option is working with UUIDs, so let's find them:

client# blkid /dev/sda2 /dev/sda3
/dev/sda2: UUID="e4f25981-3886-4939-a5cf-b05a0c7058a6" TYPE="swap" 
/dev/sda3: UUID="6c816f51-0613-45e7-a15b-bc2d5cd00f88" TYPE="ext4"
client# echo 'UUID=6c816f51-0613-45e7-a15b-bc2d5cd00f88 / ext4 errors=remount-ro 0 1' >> /etc/fstab
client# echo 'UUID=e4f25981-3886-4939-a5cf-b05a0c7058a6 none swap sw 0 0' >> /etc/fstab
client# cat /etc/fstab
UUID=6c816f51-0613-45e7-a15b-bc2d5cd00f88 / ext4 errors=remount-ro 0 1
UUID=e4f25981-3886-4939-a5cf-b05a0c7058a6 none swap sw 0 0

Here we can install any extra package that we want:

client# apt-get install vim less openssh-server locales

This is also the time to do any other needed customization (eg localization, setting hostname, repositories, etc.).

Finally, we need to install a kernel, a bootloader and the initramfs utilities that we'll use later:

client# apt-get install linux-image-amd64 grub2 initramfs-tools

When prompted, we choose to install grub to /dev/sda, just as we'd do with a local hard disk.

iSCSI boot configuration

Now it's time to finally do what it takes for the actual boot process to work. Basically, we need a special initrd that configures the network, logs into the iSCSI target LUN, mounts it as / and calls pivot_root() on it. We will provide the needed information in the form of kernel command line arguments.

The open-iscsi package includes the necessary initrd hooks to do the above, so let's install it:

client# apt-get install open-iscsi

The relevant bits are in /usr/share/initramfs-tools/scripts/local-top/iscsi, where we learn that we can pass information by setting various ISCSI_* variables. We also want early (ie, kernel-level) IP configuration, which again can be done with special arguments to the kernel. We pass all this information by modifying the grub kernel command line, so we need the following line in the client's /etc/default/grub:

GRUB_CMDLINE_LINUX=" ISCSI_TARGET_IP= ISCSI_TARGET_PORT=3260 root=UUID=6c816f51-0613-45e7-a15b-bc2d5cd00f88 ip="

Here we're using static IP configuration; use "ip=dhcp" for DHCP (here the full story). Also, the GRUB_CMDLINE_LINUX_DEFAULT variable is normally set to "quiet", but it's probably better to remove that to be able to see what happens at boot. It can be re-added later if wanted.
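For reference, the kernel's static "ip=" argument is a colon-separated list of fields, in the order client IP, server IP, gateway, netmask, hostname, device, autoconf. The following is only a sketch of how such a string is put together; the helper name kernel_ip_arg and all the addresses (taken from the 192.0.2.0/24 documentation range) are made up, so substitute your own.

```shell
# Build a static "ip=" kernel argument.
# Field order: ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
# The server-ip field is left empty here; "off" disables autoconfiguration.
kernel_ip_arg() {
  # $1 client IP, $2 gateway, $3 netmask, $4 hostname, $5 device
  printf 'ip=%s::%s:%s:%s:%s:off\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical values, for illustration only
kernel_ip_arg 192.0.2.10 192.0.2.1 255.255.255.0 client eth0
```

The resulting string goes inside GRUB_CMDLINE_LINUX together with the ISCSI_* variables and root=.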

Also note that if the SAN needs authentication more variables are needed, most likely ISCSI_USERNAME and ISCSI_PASSWORD.

Looking into /usr/share/initramfs-tools/hooks/iscsi, we learn that for the initrd update process to know that we want the iSCSI stuff included, we need to create the file /etc/iscsi/iscsi.initramfs:

client# touch /etc/iscsi/iscsi.initramfs

We also see that the file /etc/iscsi/initiatorname.iscsi gets copied into the initrd and sourced to learn the initiator name, so let's write the name into that file in the expected format:

client# echo "" > /etc/iscsi/initiatorname.iscsi
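Since the file is sourced by a shell script, the expected format is a plain shell variable assignment. As a sketch (the IQN below is a made-up placeholder, and a temporary file is used instead of the real /etc/iscsi/initiatorname.iscsi), this is roughly what the initrd script ends up doing with it:

```shell
# Temporary stand-in for /etc/iscsi/initiatorname.iscsi
f=$(mktemp)

# The IQN is hypothetical; use the one your SAN expects
echo "InitiatorName=iqn.2000-01.com.example:client1" > "$f"

# Sourcing the file defines the InitiatorName shell variable
. "$f"
echo "$InitiatorName"
```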

Now to apply all our changes, we regenerate grub config and the initrd:

client# update-grub
Generating grub.cfg ...
Found linux image: /boot/vmlinuz-3.2.0-4-amd64
Found initrd image: /boot/initrd.img-3.2.0-4-amd64
client# update-initramfs -u
update-initramfs: Generating /boot/initrd.img-3.2.0-4-amd64

We also need to set a root password, otherwise we won't be able to login:

client# passwd
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

Lastly, as mentioned, we don't want the Debian initscripts to try to configure eth0 at boot. This is easily achieved by either removing any reference to eth0 in /etc/network/interfaces, or just telling Debian that the configuration is "manual":

auto eth0
    iface eth0 inet manual
# other interfaces here ...

We can finally exit the chroot environment and log out of the iSCSI LUN in the helper machine:

client# exit
helper# umount /mnt/chroot/{dev,proc,sys,}
helper# iscsiadm -m node -T '' -p -u


Let's summarize what happens when our client is booted:

  • iPXE configures the network (either via DHCP or statically)
  • iPXE logs into the iSCSI LUN, mapping it as a local disk.
  • The MBR is read, and the boot process is kickstarted, which loads the kernel and the initrd.
  • Early IP configuration is performed during the boot, and an initrd script logs into the iSCSI LUN as specified on the kernel command line (the kernel is unaware of the PXE login)
  • pivot_root() is called on the iSCSI partition specified on the command line with root=, and from there the boot process proceeds normally

So we need to configure the first three steps. Using iPXE, all we have to do is send this iPXE script to the client:

set initiator-iqn

This is the bare minimum; if your SAN needs authentication, then username and password should also be set before attempting to boot (see the iPXE docs, and SAN URIs explained).
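To make the shape of such a script a bit more concrete, here is a sketch that prints one. The function name ipxe_sanboot_script, the IQNs and the server address are all hypothetical placeholders. The SAN URI follows the iscsi:<server>:<protocol>:<port>:<LUN>:<targetname> layout, with the middle three fields left empty so that the defaults are used.

```shell
# Print a minimal iPXE script that boots from an iSCSI LUN.
ipxe_sanboot_script() {
  # $1 initiator IQN, $2 iSCSI server address, $3 target IQN
  printf '#!ipxe\n'
  printf 'set initiator-iqn %s\n' "$1"
  # protocol, port and LUN are left empty, which selects the defaults
  printf 'sanboot iscsi:%s::::%s\n' "$2" "$3"
}

# Hypothetical names, for illustration only
ipxe_sanboot_script iqn.2000-01.com.example:client1 \
                    192.0.2.20 \
                    iqn.2000-01.com.example:storage.lun1
```

The initiator name set here should match the one the SAN expects, just like the one written into the initrd earlier.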

Test it!

So if we boot our client, we should see that iPXE logs into the LUN and loads GRUB:

Registered as SAN device 0x80
Booting from SAN device 0x80
GRUB loading.
Welcome to GRUB!

and after GRUB has booted the kernel, something like this in the kernel messages:

[    2.073406] scsi2 : iSCSI Initiator over TCP/IP
[    2.335112] scsi 2:0:0:0: Direct-Access     EQLOGIC  100E-00          4.3  PQ: 0 ANSI: 5
[    2.337709] scsi 2:0:0:0: Attached scsi generic sg1 type 0
[    2.349859] sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
[    2.351322] sd 2:0:0:0: [sda] Write Protect is off
[    2.352271] sd 2:0:0:0: [sda] Mode Sense: 77 00 00 08
[    2.353451] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    2.368450]  sda: sda1 sda2 sda3
[    2.370812] sd 2:0:0:0: [sda] Attached SCSI disk
[    3.396538] EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
[    4.810052] Adding 524284k swap on /dev/sda2.  Priority:-1 extents:1 across:524284k 
[    4.824409] EXT4-fs (sda3): re-mounted. Opts: (null)
[    4.959888] EXT4-fs (sda3): re-mounted. Opts: errors=remount-ro

At this point, we can use this machine and do all the normal administrative operations (add/remove packages, upgrades, kernel configuration, etc.) in the usual way, as if it had a local hard disk.

PXE server with dnsmasq, apache and iPXE

Here we're going to set up a PXE server that would boot even cards with bad or buggy PXE firmware, without having to flash them.

First, some words about PXE.


PXE, acronym of Preboot eXecution Environment, is a specification originally developed by Intel that allows a computer to boot over the network. This has obvious applications in the case of diskless boxes (eg thin clients), but it can also be useful for normal machines, for example to temporarily boot a rescue disk, or (re)install the OS over the network without needing any physical medium.

Simplifying a bit, it goes like this:

  • A machine is turned on. In the BIOS, the boot order says to try PXE first (or a key can be pressed during the POST to the same effect, normally).
  • Its network card (NIC) has a chip with a special firmware, which implements a minimal stack of TCP/IP protocols (DHCP, TFTP, possibly DNS).
  • This firmware is loaded and performs a DHCP broadcast to get an IP address and other pieces of information.
  • If a suitable DHCP server sees the request, it selects an IP address and assigns it to the client.
  • Up to here, it's not different from normal DHCP. However, the server also sends two special pieces of information to the client in the DHCP offer: one is the name or IP address of a server (the so-called "next server", which may be the same DHCP server or not), the other one is the name of a file to download from there (so-called "boot filename" in DHCP speak, or "network boot program" (NBP) in PXE speak).
  • The PXE client configures its TCP/IP stack with the received information, then tries to download the boot filename from the next server via TFTP.
  • If it succeeds, it loads the NBP in memory and runs it. From now on, the NBP takes over and does whatever it takes to fully boot the machine.

Sounds simple, but as usual life isn't as simple as it seems. There are a few things to be noted.

First, while originally the NBP was downloaded via TFTP (and many times still is), some enhanced PXE implementations (like gPXE or iPXE) can use HTTP. They also support extra protocols like iSCSI or AoE (to boot from SANs).

Second, PXE isn't just a sequence of steps to bootstrap a machine; it also specifies an API. This means that the NBP runs in a special environment and can make use of many functionalities made available by the PXE that loaded it. In particular, if the calling PXE supports HTTP networking, this means that the NBP can too, via the PXE API, even if it wouldn't otherwise support it natively.

Let's take the case of pxelinux, probably the most used NBP for its flexibility. Only recent versions support HTTP natively; however, older versions (starting from 3.70, which is quite old) can use the PXE API and do HTTP if they are invoked from an HTTP-capable PXE implementation like gPXE or iPXE mentioned above. Since ideally we want our PXE server to serve stuff over HTTP as much as possible rather than TFTP, all this is quite good.

However, these enhanced PXE implementations are normally not found in consumer-end NICs, which instead tend to come with limited or buggy PXE implementations. There are a few workarounds for this:

  • Load the enhanced PXE firmware from a floppy, CDROM, or USB stick. So in the BIOS, the machine is configured to boot from the appropriate removable media, which loads the PXE firmware, which in turn boots from the network. In general, this is not very practical: the media can be lost or damaged, the reader can break, and many machines don't even have a floppy or CD reader anymore.
  • The NIC ROM can be flashed with the enhanced firmware. This is better, but it still requires some special action. For hundreds of machines, again this is not very practical.
  • The enhanced PXE firmware can be downloaded (chainloaded) by the buggy PXE as if it were an NBP (via TFTP), then take over and do the "real" PXE boot, downloading the "real" NBP which will then be able to use the API in the enhanced environment (with HTTP and all).

The last option is the easiest and most convenient to implement, since it does not require messing around with sneakernet or ROM flashing, and is what is described here.

The plan

So we are going to use dnsmasq as our DHCP and TFTP server, apache to serve HTTP (for no particular reason, just because it's easy to set up with PHP), and iPXE for the enhanced PXE firmware. All running on the same machine for convenience, but there's no reason why the web server could not run on another box.

Since the DHCP server will possibly see (at least) two different DHCP queries (first one from the buggy PXE firmware, then one from iPXE), and has to send different NBP strings to them, a way is needed to tell which query we are seeing.

This is quite straightforward: if we capture the traffic with tcpdump, we see that the requests coming from iPXE have at least two identifying characteristics that are not present in requests not coming from iPXE. The first is DHCP option number 175, which is used for iPXE/gPXE-specific information. The second is the iPXE user class, which again is not normally present.

15:14:41.719114 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 415) > BOOTP/DHCP, Request from 00:12:34:56:78:90, length 387, xid 0x71ceb4, secs 4, Flags [none]
	  Client-Ethernet-Address 00:12:34:56:78:90
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    MSZ Option 57, length 2: 1472
	    ARCH Option 93, length 2: 0
	    NDI Option 94, length 3: 1.2.1
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
	    CLASS Option 77, length 4: "iPXE"
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
	      Hostname, Domain-Name, RP, Vendor-Option
	      Vendor-Class, TFTP, BF, Option 175
	      Option 203
	    T175 Option 175, length 45:
	    Client-ID Option 61, length 7: ether 00:12:34:56:78:90
	    GUID Option 97, length 17:

It's also easy to see the same information in the DHCP server log.

In dnsmasq, we set a tag if we detect that the request comes from iPXE, and do different things depending on whether or not the tag is set. If the request is from a non-enhanced PXE client, we send them the iPXE firmware; otherwise, it's iPXE so we direct it to an HTTP URL to continue the boot process (see below).

To have maximum flexibility, we want to be able to tell which client we're talking to, and possibly give different orders to different clients. ("Orders" here means "iPXE scripts", which are textual sequences of iPXE directives that tell the clients to do certain things.)

To this end, we direct iPXE to do an HTTP GET request containing various parameters that identify the client. On the server this runs a PHP script that decides what to do based on the received values. We thus send back an iPXE script containing further instructions to the client (eg "chainload pxelinux.0", "boot from iscsi", etc. See below for the examples).

This allows us to do things like (for example) "Client X: go get pxelinux from the local HTTP server to boot a rescue environment. Client Y: boot from iSCSI, here is the LUN URL. Client Z: boot pxelinux from another HTTP server to do an unattended Debian install..."


Now that we have defined the plan, let's finally get to the practical bits. It is assumed that the PXE server ( has IP address, the network's default gateway is, and the DNS server is It is also assumed that no other DHCP servers are present in the network.

dnsmasq configuration

The configuration of dnsmasq is short (of course adapt as needed):


# enable logging

# set tag "ENH" if request comes from iPXE ("iPXE" user class)

# alternative way, look for option 175

# if request comes from dumb firmware, send them iPXE (via TFTP)

# if request comes from iPXE, direct it to boot from boot1.txt



So we set the tag ENH (set:ENH) if the request comes from iPXE. The tag:!ENH syntax means "if the ENH tag is NOT set". Note that this syntax requires a reasonably recent version of dnsmasq; in older versions, "net:" had to be used instead of "tag:", and "#ENH" instead of "!ENH" (ie, "net:#ENH") to say "tag ENH not set".
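To make the tag logic easier to follow, here is a sketch of what such a configuration might look like, written to a temporary file from the shell. This is not necessarily the exact configuration used here: the address range and the server address (from the 192.0.2.0/24 documentation range) are placeholders, and only the options needed to illustrate the ENH tag are shown.

```shell
conf=$(mktemp)

# Quoted 'EOF' so the shell leaves the "!" and everything else untouched
cat > "$conf" <<'EOF'
# hypothetical address pool; adapt to the real network
dhcp-range=192.0.2.100,192.0.2.200,12h
# enable logging
log-dhcp
# set tag "ENH" if request comes from iPXE ("iPXE" user class)
dhcp-userclass=set:ENH,iPXE
# alternative way, look for option 175
dhcp-match=set:ENH,175
# if request comes from dumb firmware, send them iPXE (via TFTP)
dhcp-boot=tag:!ENH,undionly.kpxe
# if request comes from iPXE, direct it to boot from boot1.txt
dhcp-boot=tag:ENH,http://192.0.2.1/boot1.txt
# built-in TFTP server, serving files from /var/www
enable-tftp
tftp-root=/var/www
EOF

# Show the two dhcp-boot alternatives that the tag selects between
grep '^dhcp-boot=' "$conf"
```

If dnsmasq is installed, something like `dnsmasq --test --conf-file="$conf"` should be able to syntax-check the file before it goes anywhere near /etc.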

The file undionly.kpxe (or a symlink to it) has to be in /var/www, and is the iPXE implementation used for chainloading, which is sent to the dumb clients via TFTP. This is the only TFTP transaction in the whole process. Once the client has loaded iPXE, everything happens over HTTP.

As a special case (in a positive sense), when PXE-booting a KVM virtual machine the very first request that the server sees already comes from iPXE, since that's what qemu uses to implement the VM's PXE "firmware". This means that in that case the process will be faster, since the chainloading phase will be skipped and the client sent directly to the HTTP URL.

Regardless of whether the client is originally dumb or not, it will eventually end up fetching boot1.txt (see below) via HTTP.

The last configuration lines enable dnsmasq's internal TFTP server, telling it to serve files (not coincidentally) from /var/www. And so...

Apache configuration

Any web server with PHP support would work, in fact; it's just that with apache, a running PHP is just two commands away with zero configuration.
And of course, it doesn't even have to be PHP: any server-side scripting language will do.

So our client (which is running iPXE, and can do HTTP) fetches boot1.txt, which lives in /var/www. Here's what it looks like:



This is an iPXE script that chainloads another URL. Basically, it's just a cheap trick to send as much information as possible about the client to the server via a gigantic HTTP GET, so the client can be identified for further processing (though 99% of the time only the MAC address will be looked at, it's good to have as many variables as possible). iPXE replaces the various ${mac}, ${ip} etc. variables with the actual values for the client and also does URL-encoding. The full list of available parameters is here in the docs.
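As an illustration of the idea (not the actual file), a much shorter script along the same lines could be created like this. The server address is a placeholder, the file is written to a temporary location rather than /var/www, and only a handful of iPXE settings are passed.

```shell
boot1=$(mktemp)

# The quoted 'EOF' matters: ${mac} and friends must reach the file literally,
# because it is iPXE (not the shell) that expands them on the client.
cat > "$boot1" <<'EOF'
#!ipxe
chain http://192.0.2.1/boot2.php?mac=${mac}&ip=${ip}&uuid=${uuid}&hostname=${hostname}
EOF

cat "$boot1"
```

With an unquoted heredoc delimiter the shell would replace the variables with empty strings, and every client would send the same useless request.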

The above URL could also be supplied directly from dnsmasq, by replacing the URL in the dhcp-boot=tag:ENH, line with the one in boot1.txt. However it looks like that way the URL gets truncated if it's too long, so it's better to be safe and put it in its own file.

Now, finally, let's look at what boot2.php (which must also be in /var/www) looks like. Here is where we actually decide what to do with each client.

<?php
# send a suitable iPXE script to a client

echo "#!ipxe\n";
switch ($_GET['mac']) {
  case '00:12:34:56:78:90':
    # boot pxelinux from this server
    echo "chain\n";
    break;
  case '00:11:22:33:44:55':
    # boot from iSCSI
    echo "set initiator-iqn\n";
    # see for the syntax
    echo "sanboot\n";
    break;
  case '00:77:21:ab:cd:ee':
    # boot's super cool boot menu
    echo "chain\n";
    break;
  default:
    # exit iPXE and let machine go on with BIOS boot sequence
    echo "exit\n";
    break;
}
In short, each client will receive an iPXE script telling it what to do. Here clients are detected by their MACs, but any variable among those that we pass can be used, of course.
If a client has no specific treatment set up for it, it will end up in the "default" branch of the switch statement, which will just direct it to exit iPXE and try the next device in the BIOS boot sequence, which would normally mean it will boot from its local hard disk (again this can be changed, of course). Another option is to chainload another bootloader that is able to boot a local disk, for example GRUB4DOS as explained in this page.

It's even possible to fetch and boot stuff off the Internet, as in the iPXE demo image, which can be loaded by directing the client to chain It really works. But the coolest service is, as shown for the third client in the above example,, which allows booting and installing a lot of operating systems off the Internet. It's really impressive. Well done!


If we direct a client to load pxelinux, then there is another degree of flexibility there, since pxelinux will try to load several configuration files, named from the most specific to the most generic, until it succeeds. Normally the sequence of attempts looks something like this:

GET /pxelinux.cfg/44454c4c-3900-104e-804e-b9c04f4d344a
GET /pxelinux.cfg/01-00-26-b9-5e-30-3a
GET /pxelinux.cfg/C0A80744
GET /pxelinux.cfg/C0A8074
GET /pxelinux.cfg/C0A807
GET /pxelinux.cfg/C0A80
GET /pxelinux.cfg/C0A8
GET /pxelinux.cfg/C0A
GET /pxelinux.cfg/C0
GET /pxelinux.cfg/C
GET /pxelinux.cfg/default

So again what a given client does can be decided by assigning it a pxelinux configuration file with a name more specific than "default", which is what gets loaded if nothing better is found.
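To work out which file names a given client will ask for, the sequence above can be reproduced with a few lines of shell. This is only a sketch: the function name pxelinux_candidates is made up, and the UUID-based name is left out since it depends on what the client's firmware reports.

```shell
# Print the pxelinux.cfg/ file names tried for a client, most specific first.
pxelinux_candidates() {
  mac=$1
  ip=$2

  # "01" (ARP hardware type for Ethernet), then the MAC in lowercase with dashes
  printf '01-%s\n' "$(printf '%s' "$mac" | tr 'A-F' 'a-f' | tr ':' '-')"

  # Split the dotted IP into its four octets
  old_ifs=$IFS
  IFS=.
  set -- $ip
  IFS=$old_ifs

  # IP address in uppercase hex, then the same string with one digit
  # removed from the end at each attempt
  hex=$(printf '%02X%02X%02X%02X' "$1" "$2" "$3" "$4")
  while [ -n "$hex" ]; do
    printf '%s\n' "$hex"
    hex=${hex%?}
  done

  printf 'default\n'
}

pxelinux_candidates 00:26:B9:5E:30:3A 192.168.7.68
```

For the MAC and IP address used here, the output matches the list of requests shown above, minus the first (UUID) line.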

And of course, pxelinux.0 plus any other file needed by the configuration files (eg menu.c32 etc.) need to be present in the document root of the web server (or symlinks to them).

Since pxelinux is running with HTTP support thanks to iPXE, HTTP URLs can be used anywhere a file name would, eg

# ok, this doesn't make much sense

and even if you don't explicitly specify, it implicitly assumes that it has to use HTTP anyway (in that case, it automatically prepends the URL it's booting from to the names).


With this system it really becomes possible to do whatever one may imagine via PXE, and everything is controlled and managed from a single place.

Further reading (on the interactions between pxelinux and gPXE, but also relevant for iPXE):

Clarifying the relationship between PXELinux, Etherboot and gPXE/iPXE