PXE server with dnsmasq, apache and iPXE

Posted by waldner on 24 November 2013, 4:42 pm

Here we're going to set up a PXE server that would boot even cards with bad or buggy PXE firmware, without having to flash them.

First, some words about PXE.

PXE

PXE, acronym of Preboot eXecution Environment, is a specification originally developed by Intel that allows a computer to boot over the network. This has obvious applications in the case of diskless boxes (eg thin clients), but it can also be useful for normal machines, for example to temporarily boot a rescue disk, or (re)install the OS over the network without needing any physical medium.

Simplifying a bit, it goes like this:

A machine is turned on. In the BIOS, the boot order says to try PXE first (or a key can be pressed during the POST to the same effect, normally).
Its network card (NIC) has a chip with a special firmware, which implements a minimal stack of TCP/IP protocols (DHCP, TFTP, possibly DNS).
This firmware is loaded and performs a DHCP broadcast to get an IP address and other pieces of information.
If a suitable DHCP server sees the request, it selects an IP address and assigns it to the client.
Up to here, it's not different from normal DHCP. However, the server also sends two special pieces of information to the client in the DHCP offer: one is the name or IP address of a server (the so-called "next server", which may be the same DHCP server or not), the other one is the name of a file to download from there (so-called "boot filename" in DHCP speak, or "network boot program" (NBP) in PXE speak).
The PXE client configures its TCP/IP stack with the received information, then tries to download the boot filename from the next server via TFTP.
If it succeeds, it loads the NBP in memory and runs it. From now on, the NBP takes over and does whatever it takes to fully boot the machine.

Sounds simple, but as usual life isn't as simple as it seems. There are a few things to be noted.

First, while originally the NBP was downloaded via TFTP (and many times still is), some enhanced PXE implementations (like gPXE or iPXE) can use HTTP. They also support extra protocols like iSCSI or AoE (to boot from SANs).

Second, PXE isn't just a sequence of steps to bootstrap a machine; it also specifies an API. This means that the NBP runs in a special environment and can make use of many functionalities made available by the PXE that loaded it. In particular, if the calling PXE supports HTTP networking, this means that the NBP can too, via the PXE API, even if it wouldn't otherwise support it natively.

Let's take the case of pxelinux, probably the most used NBP for its flexibility. Only recent versions support HTTP natively; however, older versions (starting from 3.70, which is quite old) can use the PXE API and do HTTP if they are invoked from an HTTP-capable PXE implementation like gPXE or iPXE mentioned above. Since ideally we want our PXE server to serve stuff over HTTP as much as possible rather than TFTP, all this is quite good.

However, these enhanced PXE implementations are normally not found in consumer-end NICs, which instead tend to come with limited or buggy PXE implementations. There are a few workarounds for this:

Load the enhanced PXE firmware from a floppy, CDROM, or USB stick. So in the BIOS, the machine is configured to boot from the appropriate removable media, which loads the PXE firmware, which in turn boots from the network. In general, this is not very practical: the media can be lost or damaged, the reader can break, and many machines don't even have a floppy or CD reader anymore.
The NIC ROM can be flashed with the enhanced firmware. This is better, but it still requires some special action. For hundreds of machines, again this is not very practical.
The enhanced PXE firmware can be downloaded (chainloaded) by the buggy PXE as if it were an NBP (via TFTP), then take over and do the "real" PXE boot, downloading the "real" NBP which will then be able to use the API in the enhanced environment (with HTTP and all).

The last option is the easiest and most convenient to implement, since it does not require messing around with sneakernet or ROM flashing, and is what is described here.

The plan

So we are going to use dnsmasq as our DHCP and TFTP server, apache to serve HTTP (for no particular reason, just because it's easy to set up with PHP), and iPXE for the enhanced PXE firmware. All running on the same machine for convenience, but there's no reason why the web server could not run on another box.

Since the DHCP server will possibly see (at least) two different DHCP queries (first one from the buggy PXE firmware, then one from iPXE), and has to send different NBP strings to them, a way is needed to tell which query we are seeing.

This is quite straightforward: if we capture the traffic with tcpdump, we see that the requests coming from iPXE have at least two identifying characteristics that are not present in requests not coming from iPXE. The first is DHCP option number 175, which is used for iPXE/gPXE-specific information. The second is the iPXE user class, which again is not normally present.

15:14:41.719114 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 415)
    0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:12:34:56:78:90, length 387, xid 0x71ceb4, secs 4, Flags [none]
	  Client-Ethernet-Address 00:12:34:56:78:90
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    MSZ Option 57, length 2: 1472
	    ARCH Option 93, length 2: 0
	    NDI Option 94, length 3: 1.2.1
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
	    CLASS Option 77, length 4: "iPXE"
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
	      Hostname, Domain-Name, RP, Vendor-Option
	      Vendor-Class, TFTP, BF, Option 175
	      Option 203
	    T175 Option 175, length 45: 177.5.1.26.244.16.0.24.1.1.35.1.1.34.1.1.25.1.1.33.1.1.16.1.2.19.1.1.17.1.1.235.3.1.0.0.23.1.1.21.1.1.18.1.1
	    Client-ID Option 61, length 7: ether 00:12:34:56:78:90
	    GUID Option 97, length 17: 0.99.10.237.238.79.65.104.58.28.29.107.2.246.140.217.96

It's also easy to see the same information in the DHCP server log.

In dnsmasq, we set a tag if we detect that the request comes from iPXE, and do different things depending on whether or not the tag is set. If the request is from a non-enhanced PXE client, we send them the iPXE firmware; otherwise, it's iPXE so we direct it to an HTTP URL to continue the boot process (see below).
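
To make the decision concrete, here is a minimal Python sketch of the same logic. It is not how dnsmasq works internally; the function names and the representation of the DHCP options as a dictionary are assumptions made only for illustration. Option numbers 77 (user class) and 175 are the ones visible in the capture, and the two boot filenames are the ones used in the configuration further down.

```python
# Illustration only: mimic the "is this request from iPXE?" decision.
# The dict-of-options representation and the function names are hypothetical.
IPXE_FIRMWARE = "undionly.kpxe"                       # chainloaded via TFTP
IPXE_SCRIPT_URL = "http://pxe.example.com/boot1.txt"  # fetched via HTTP


def is_ipxe(options):
    """Return True if the DHCP options look like they come from iPXE."""
    user_class = options.get(77, b"")
    return user_class == b"iPXE" or 175 in options


def boot_filename(options):
    """Pick the boot filename: firmware for dumb clients, URL for iPXE."""
    return IPXE_SCRIPT_URL if is_ipxe(options) else IPXE_FIRMWARE


if __name__ == "__main__":
    dumb = {60: b"PXEClient:Arch:00000:UNDI:002001"}
    enhanced = {60: b"PXEClient:Arch:00000:UNDI:002001", 77: b"iPXE"}
    print(boot_filename(dumb))
    print(boot_filename(enhanced))
```

Either of the two checks would probably be enough on its own; the sketch accepts both just to show that they are interchangeable.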

To have maximum flexibility, we want to be able to tell which client we're talking to, and possibly give different orders to different clients. ("Orders" here means "iPXE scripts", which are textual sequences of iPXE directives that tell the clients to do certain things.)

To this end, we direct iPXE to do an HTTP GET request containing various parameters that identify the client. On the server this runs a PHP script that decides what to do based on the received values. We thus send back an iPXE script containing further instructions to the client (eg "chainload pxelinux.0", "boot from iscsi", etc. See below for the examples).

This allows us to do things like (for example) "Client X: go get pxelinux from the local HTTP server to boot a rescue environment. Client Y: boot from iSCSI, here is the LUN URL. Client Z: boot pxelinux from another HTTP server to do an unattended Debian install..."

Configuration

Now that we have defined the plan, let's finally get to the practical bits. It is assumed that the PXE server (pxe.example.com) has IP address 10.188.0.10/24, the network's default gateway is 10.188.0.1, and the DNS server is 10.188.0.20. It is also assumed that no other DHCP servers are present in the network.

dnsmasq configuration

The configuration of dnsmasq is short (of course adapt as needed):

interface=eth0
domain=example.com
dhcp-range=10.188.0.60,10.188.0.70,12h
dhcp-option=option:router,10.188.0.1
dhcp-option=option:dns-server,10.188.0.20
dhcp-authoritative

# enable logging
log-queries
log-dhcp

# set tag "ENH" if request comes from iPXE ("iPXE" user class)
dhcp-userclass=set:ENH,iPXE

# alternative way, look for option 175
#dhcp-match=set:ENH,175

# if request comes from dumb firmware, send them iPXE (via TFTP)
dhcp-boot=tag:!ENH,undionly.kpxe,10.188.0.10

# if request comes from iPXE, direct it to boot from boot1.txt
dhcp-boot=tag:ENH,http://pxe.example.com/boot1.txt

dhcp-no-override

enable-tftp
tftp-root=/var/www

So we set the tag ENH (set:ENH) if the request comes from iPXE. The tag:!ENH syntax means "if the ENH tag is NOT set". Note that this syntax requires a reasonably recent version of dnsmasq; in older versions, "net:" had to be used instead of "tag:", and "#ENH" instead of "!ENH" (ie, "net:#ENH") to say "tag ENH not set".

The file undionly.kpxe (or a symlink to it) has to be in /var/www, and is the iPXE implementation used for chainloading, which is sent to the dumb clients via TFTP. This is the only TFTP transaction in the whole process. Once the client has loaded iPXE, everything happens over HTTP.

As a special case (in a positive sense), when PXE-booting a KVM virtual machine the very first request that the server sees already comes from iPXE, since that's what qemu uses to implement the VM's PXE "firmware". This means that in that case the process will be faster, since the chainloading phase will be skipped and the client sent directly to the HTTP URL.

Regardless of whether the client is originally dumb or not, it will eventually end up fetching boot1.txt (see below) via HTTP.

The last configuration lines enable dnsmasq's internal TFTP server, telling it to serve files (not coincidentally) from /var/www. And so...

Apache configuration

Any web server with PHP support would work, in fact; it's just that with apache, a running PHP is just two commands away with zero configuration.
And of course, it doesn't even have to be PHP: any server-side scripting language will do.

So our client (which is running iPXE, and can do HTTP) fetches boot1.txt, which lives in /var/www. Here's what it looks like:

#!ipxe

chain http://pxe.example.com/boot2.php?mac=${mac}&ip=${ip}&netmask=${netmask}&gateway=${gateway}&dns=${dns}&domain=${domain}&filename=${filename}&nextserver=${next-server}&hostname=${hostname}&uuid=${uuid}&userclass=${user-class}&manufacturer=${manufacturer}&product=${product}&serial=${serial}&asset=${asset}

This is an iPXE script that chainloads another URL. Basically, it's just a cheap trick to send as much information as possible about the client to the server via a gigantic HTTP GET, so the client can be identified for further processing (though 99% of the time only the MAC address will be looked at, it's good to have as many variables as possible). iPXE replaces the various ${mac}, ${ip} etc. variables with the actual values for the client and also does URL-encoding. The full list of available parameters is in the iPXE docs.
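
Since the URL is long and easy to get wrong by hand, one option is to generate boot1.txt from a list of settings. The following is only a sketch: the helper name and the list layout are made up, while the parameter and iPXE setting names are the ones used in the script above. Repeated parameter names are dropped.

```python
# Hypothetical generator for boot1.txt. Each pair is
# (query parameter name, iPXE setting name).
SETTINGS = [
    ("mac", "mac"), ("ip", "ip"), ("netmask", "netmask"),
    ("gateway", "gateway"), ("dns", "dns"), ("domain", "domain"),
    ("filename", "filename"), ("nextserver", "next-server"),
    ("hostname", "hostname"), ("uuid", "uuid"),
    ("userclass", "user-class"), ("manufacturer", "manufacturer"),
    ("product", "product"), ("serial", "serial"), ("asset", "asset"),
]


def build_boot1(base_url, settings=SETTINGS):
    """Return the text of an iPXE script that chains to base_url."""
    seen = set()
    pairs = []
    for param, setting in settings:
        if param in seen:
            continue  # keep only the first occurrence of a parameter
        seen.add(param)
        pairs.append("%s=${%s}" % (param, setting))
    return "#!ipxe\n\nchain %s?%s\n" % (base_url, "&".join(pairs))


if __name__ == "__main__":
    print(build_boot1("http://pxe.example.com/boot2.php"), end="")
```

Redirecting the output to /var/www/boot1.txt gives a script equivalent to the hand-written one.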

The above URL could also be supplied directly from dnsmasq, by replacing the URL in the dhcp-boot=tag:ENH,http://pxe.example.com/boot1.txt line with the one in boot1.txt. However it looks like that way the URL gets truncated if it's too long, so it's better to be safe and put it in its own file.

Now, finally, let's look at what boot2.php (which must also be in /var/www) looks like. Here is where we actually decide what to do with each client.

<?php
 
# send a suitable iPXE script to a client

echo "#!ipxe\n";
 
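# a request without a "mac" parameter ends up in the default branch
switch (isset($_GET['mac']) ? strtolower($_GET['mac']) : '') {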
 
  case '00:12:34:56:78:90':
    # boot pxelinux from this server
    echo "chain http://pxe.example.com/pxelinux.0\n";
    break;
 
  case '00:11:22:33:44:55':
    # boot from iSCSI
    echo "set initiator-iqn iqn.2007-08.com.example.initiator:initiator\n";
    # see http://ipxe.org/sanuri for the syntax
    echo "sanboot iscsi:san.example.com:6:3260:0:iqn.2007-08.com.example.san:sometarget\n";
    break;
 
  case '00:77:21:ab:cd:ee':
    # boot boot.salstar.sk's super cool boot menu      
    echo "chain http://boot.salstar.sk\n";
    break;
 
  default:
    # exit iPXE and let machine go on with BIOS boot sequence
    echo "exit\n";
    break;
}
 
 
?>

In short, each client will receive an iPXE script telling it what to do. Here clients are detected by their MACs, but any variable among those that we pass can be used, of course.
If a client has no specific treatment set up for it, it will end up in the "default" branch of the switch statement, which will just direct it to exit iPXE and try the next device in the BIOS boot sequence, which would normally mean it will boot from its local hard disk (again this can be changed, of course). Another option is to chainload another bootloader that is able to boot a local disk, for example GRUB4DOS as explained in this page.
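
As said, it doesn't have to be PHP. Just to show the idea in another language, here is a rough Python equivalent of the same dispatch, written as a plain function that takes the query string and returns the iPXE script. The function name and the table layout are assumptions of this sketch; the MAC addresses and iPXE commands are the ones from the PHP example, and hooking the function up to a web server is left out.

```python
from urllib.parse import parse_qs

# Per-client iPXE commands, keyed by lowercase MAC address.
SCRIPTS = {
    "00:12:34:56:78:90": ["chain http://pxe.example.com/pxelinux.0"],
    "00:11:22:33:44:55": [
        "set initiator-iqn iqn.2007-08.com.example.initiator:initiator",
        "sanboot iscsi:san.example.com:6:3260:0:"
        "iqn.2007-08.com.example.san:sometarget",
    ],
    "00:77:21:ab:cd:ee": ["chain http://boot.salstar.sk"],
}

# Unknown clients leave iPXE and continue with the BIOS boot sequence.
DEFAULT = ["exit"]


def ipxe_script(query_string):
    """Build the iPXE script for the client described by the query string."""
    params = parse_qs(query_string)          # also undoes the URL-encoding
    mac = params.get("mac", [""])[0].lower()
    lines = SCRIPTS.get(mac, DEFAULT)
    return "#!ipxe\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(ipxe_script("mac=00%3A12%3A34%3A56%3A78%3A90&ip=10.188.0.61"), end="")
```

Keeping the per-client commands in a table rather than in a switch makes it a bit easier to load them from a file or a database later, if the number of clients grows.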

Another thing that can be done here, in case the client is told to chainload pxelinux and pxelinux resides on the same server, is generating some pieces of pxelinux config dynamically and writing them to a file which will then be included by the main pxelinux configuration (since syslinux/pxelinux, to the best of my knowledge, do not allow variables in the configuration).
Typical examples are kernel parameters for the client (ie, those that are passed using APPEND in pxelinux), for example console port and speed definition or module parameters related to the actual client hardware, or syslinux/pxelinux menu customizations.
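
A very rough sketch of what such a generator could look like follows. The file naming scheme, the directory and the function name are all invented for the example, and how the fragment gets pulled in by the main pxelinux configuration depends on how that configuration is organized.

```python
import os


def write_client_fragment(directory, mac, append_args):
    """Write a one-line APPEND fragment for a client, return its path.

    The "client-<mac>.cfg" naming is hypothetical; use whatever the main
    pxelinux configuration expects to include.
    """
    name = "client-" + mac.lower().replace(":", "-") + ".cfg"
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("APPEND " + " ".join(append_args) + "\n")
    return path


if __name__ == "__main__":
    import tempfile

    # Use a throwaway directory here; a real setup would write somewhere
    # under the web server's document root.
    with tempfile.TemporaryDirectory() as directory:
        path = write_client_fragment(
            directory, "00:12:34:56:78:90", ["console=ttyS0,115200"]
        )
        with open(path) as handle:
            print(handle.read(), end="")
```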

It's even possible to fetch and boot stuff off the Internet, as in the iPXE demo image, which can be loaded by directing the client to chain http://boot.ipxe.org/demo/boot.php. It really works. But the coolest service is, as shown for the third client in the above example, http://boot.salstar.sk, which allows booting and installing a lot of operating systems off the Internet. It's really impressive. Well done!

pxelinux

If we direct a client to load pxelinux, then there is another degree of flexibility there, since pxelinux will try to load several configuration files, named from the most specific to the most generic, until it succeeds. Normally the sequence of attempts looks something like this:

GET /pxelinux.cfg/44454c4c-3900-104e-804e-b9c04f4d344a
GET /pxelinux.cfg/01-00-26-b9-5e-30-3a
GET /pxelinux.cfg/C0A80744
GET /pxelinux.cfg/C0A8074
GET /pxelinux.cfg/C0A807
GET /pxelinux.cfg/C0A80
GET /pxelinux.cfg/C0A8
GET /pxelinux.cfg/C0A
GET /pxelinux.cfg/C0
GET /pxelinux.cfg/C
GET /pxelinux.cfg/default

So again what a given client does can be decided by assigning it a pxelinux configuration file with a name more specific than "default", which is what gets loaded if nothing better is found.
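
To find out which file names a given client will ask for, the sequence can be reproduced with a few lines of Python. This is a sketch based on the list above (UUID, then "01-" plus the MAC with dashes, then the IP address in uppercase hex with one digit removed at each step, then "default"); the function name is made up, and the IP address 192.168.7.68 is simply what C0A80744 decodes to.

```python
import ipaddress


def pxelinux_candidates(uuid, mac, ip):
    """Return the config file paths pxelinux tries, most specific first."""
    names = []
    if uuid:
        names.append(uuid.lower())
    # "01" is the hardware type for Ethernet, followed by the MAC address
    names.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as 8 uppercase hex digits, then ever shorter prefixes
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    for length in range(8, 0, -1):
        names.append(hex_ip[:length])
    names.append("default")
    return ["/pxelinux.cfg/" + name for name in names]


if __name__ == "__main__":
    for path in pxelinux_candidates(
        "44454c4c-3900-104e-804e-b9c04f4d344a",
        "00:26:b9:5e:30:3a",
        "192.168.7.68",
    ):
        print("GET " + path)
```

With these values the output is the same eleven GET lines shown above.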

And of course, pxelinux.0 plus any other file needed by the configuration files (eg menu.c32 etc.) need to be present in the document root of the web server (or symlinks to them).

Since pxelinux is running with HTTP support thanks to iPXE, HTTP URLs can be used anywhere a file name would, eg

# ok, this doesn't make much sense
LINUX http://server1.example.com/vmlinuz
INITRD http://server2.example.com/initram.gz

and even if you don't explicitly specify http://server.name, it implicitly assumes that it has to use HTTP anyway (in that case, it automatically prepends the URL it's booting from to the names).
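
The effect is roughly the same as resolving a relative URL against the URL pxelinux.0 was loaded from. The following Python snippet is only an analogy to show which URLs end up being requested; it uses the standard urljoin function, not pxelinux's own code, and the wrapper function name is made up. Names starting with a slash or other corner cases may well behave differently in pxelinux.

```python
from urllib.parse import urljoin

# URL pxelinux.0 was chainloaded from in the example above
BOOT_URL = "http://pxe.example.com/pxelinux.0"


def resolve(name, boot_url=BOOT_URL):
    """Approximate the URL requested for a file named in the config."""
    return urljoin(boot_url, name)


if __name__ == "__main__":
    print(resolve("vmlinuz"))
    print(resolve("pxelinux.cfg/default"))
    print(resolve("http://server1.example.com/vmlinuz"))
```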

Conclusions

With this system it really becomes possible to do whatever one may imagine via PXE, and everything is controlled and managed from a single place.

Further reading (on the interactions between pxelinux and gPXE, but also relevant for iPXE):

Clarifying the relationship between PXELinux, Etherboot and gPXE/iPXE
