Policy routing, multihoming and all that jazz

This is to remind me of how to do the most common tasks involved with multihoming.

Scenario: dual- (or multi-, for that matter) homed Linux box acting as router for one or more local networks. Here we'll assume two local networks, and two upstream ISPs (no dynamic routing - only static defaults). The two ISPs are on the router's eth2 and eth3 respectively, while the internal networks are on eth0 (dev network) and eth1 (R&D network). It's straightforward to extend the sample code shown here to manage more ISPs or more internal LANs. Here's a picture:

[Figure: sample scenario - two LANs (eth0, eth1) behind the Linux router, two upstream ISPs on eth2 and eth3]

The Linux router also does NAT for internally-initiated connections. What we want to achieve is:

  • Load balancing of the traffic among the ISPs (within reason)
  • ISP failover: if one ISP or the link to it fails, stop using it

NAT

Let's take care of the easy things first: since our Linux router will perform NAT on outgoing packets, let's write those rules first:

# SNAT packets going out eth2 to ISP1
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 1.1.1.1
 
# SNAT packets going out eth3 to ISP2
iptables -t nat -A POSTROUTING -o eth3 -j SNAT --to-source 2.2.2.1
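
If an ISP assigns its address dynamically (so there is no fixed address to put in --to-source), MASQUERADE can be used instead, as also noted in the failover script later on; it looks up the interface's current address for every connection, at a small extra cost. A sketch for the same two interfaces:

```shell
# masquerade instead of SNAT when the uplink addresses are dynamic;
# the kernel picks the interface's current address automatically
iptables -t nat -A POSTROUTING -o eth2 -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth3 -j MASQUERADE
```

With static addresses, SNAT remains preferable since it avoids the per-connection address lookup.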

Load balancing

Fortunately, iptables can classify all kinds of traffic with its stateful connection tracking mechanism, even if it's not strictly flow-oriented (think UDP or ICMP). What we want in our setup is that all the packets belonging to a "flow" (also called a connection in iptables speak) use the same ISP (ie, same interface on the router), because otherwise packets will have their source addresses rewritten to different IPs which will most likely confuse the targets and any stateful device along the path.

From iptables' point of view, all packets have a state. For our purposes, the most interesting states are NEW, which means the packet doesn't belong to any existing connection (and thus a new connection will be created in the conntrack table), and ESTABLISHED/RELATED, which identify packets belonging (or related) to an existing connection.

To achieve our goal, we will use the marking mechanism that iptables provides. In iptables, we can mark a single packet or the entire connection to which a packet belongs: it is possible to mark a packet, then assign (ie, save) the packet mark to the connection, or the other way round (assign the connection mark to the packet). Let's make clear that these "markings" do not modify the packets; they only live in the router's memory, in the form of metadata in the connection tracking table. With the userspace conntrack utility, the connection table can be printed, which will show the marks.
Once a packet is marked (and all packets belonging to the same connection will have the same mark), we will route using one or the other ISP based on the packet mark.
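
For instance, assuming the conntrack tool is installed, the connection table can be filtered by mark to see which flows are pinned to which ISP (marks 1 and 2, as used throughout this article):

```shell
# list flows pinned to each ISP, by connection mark
conntrack -L --mark 1
conntrack -L --mark 2
```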

The last piece of the puzzle is how to decide which ISP to use when we see a brand new packet (a packet that creates a new flow, for which no previous flow exists). To load balance among N ISPs, we should ideally send 1/N of all new flows to each ISP. In the 2-ISP case, this means splitting the traffic 50% to ISP1 and 50% to ISP2. With iptables, there are two common ways to achieve that, and both use the statistic module. The statistic match can operate in two modes: the so-called nth mode (matching if the packet is, er, the nth when counting in a round-robin fashion), or random mode (matching each packet with a given probability X).

So here's what we'll do:

  • if the packet is NEW, choose an ISP and mark the packet accordingly. For this, we need as many different mark values as we have ISPs (two in this example). Once the packet is marked, mark the new connection with the same mark.
  • if the packet belongs to an existing connection (state ESTABLISHED/RELATED), mark the packet with the same mark that the connection has (it must have been marked before, when it was NEW).
  • based on the packet mark (regardless of how we obtained it), decide which ISP to use.

Let's break it down and implement it with actual iptables and iproute2 rules.

iptables

Here are the iptables commands to run on the router:

# chain which marks a packet (MARK) and its connection (CONNMARK) with mark 1 (for ISP1)
iptables -t mangle -N MARK-ISP1
iptables -t mangle -A MARK-ISP1 -j MARK --set-mark 1
iptables -t mangle -A MARK-ISP1 -j CONNMARK --save-mark
 
# chain which marks a packet (MARK) and its connection (CONNMARK) with mark 2
iptables -t mangle -N MARK-ISP2
iptables -t mangle -A MARK-ISP2 -j MARK --set-mark 2
iptables -t mangle -A MARK-ISP2 -j CONNMARK --save-mark
 
# real work begins here
 
# do not touch inter-LAN traffic
iptables -t mangle -A PREROUTING -i eth0 -s 192.168.1.0/24 -d 192.168.2.0/24 -j ACCEPT
iptables -t mangle -A PREROUTING -i eth1 -s 192.168.2.0/24 -d 192.168.1.0/24 -j ACCEPT
 
# If the packet is not NEW, there must be a connection for it, so get the connection
# mark and apply it to the packet
 
# packets from dev network
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
# packets from R&D network
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
# on the other hand, if the state is NEW, we have to decide where to send it
# Use the statistics match in nth mode
 
# dev network
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2
 
# same for R&D network
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2

Routing

Now that packets are marked (either because they're new or because they belong to an existing flow), they can be routed based on their mark. To do this, we use iproute2 rules. In Linux it is possible to have multiple routing tables, and that's what we're going to do here. Each routing table has a number associated with it, but it's easier to use names. Thus, we can edit the file /etc/iproute2/rt_tables and add two new values for two new routing tables:

#
# reserved values
#
255     local
254     main
253     default
0       unspec

# add these two
250     ISP1
249     ISP2

Next we initialize the two new routing tables. Each table only needs a default route pointing to the upstream ISP; however, adding the local routes can't hurt:

ip route flush table ISP1
ip route add table ISP1 default dev eth2 via 1.1.1.2
# add local routes too
ip route add table ISP1 1.1.1.0/24 dev eth2 src 1.1.1.1                           
ip route add table ISP1 2.2.2.0/24 dev eth3 src 2.2.2.1                           
ip route add table ISP1 192.168.1.0/24 dev eth0 src 192.168.1.254                           
ip route add table ISP1 192.168.2.0/24 dev eth1 src 192.168.2.254     
 
ip route flush table ISP2
ip route add table ISP2 default dev eth3 via 2.2.2.2
ip route add table ISP2 1.1.1.0/24 dev eth2 src 1.1.1.1                           
ip route add table ISP2 2.2.2.0/24 dev eth3 src 2.2.2.1                           
ip route add table ISP2 192.168.1.0/24 dev eth0 src 192.168.1.254                           
ip route add table ISP2 192.168.2.0/24 dev eth1 src 192.168.2.254

And the rules to use them:

ip rule del from all fwmark 2 2>/dev/null
ip rule del from all fwmark 1 2>/dev/null
ip rule add fwmark 1 table ISP1
ip rule add fwmark 2 table ISP2
ip route flush cache

The above commands (including the iptables commands shown earlier) can be added to some boot-time script (see below for an example which is also smarter), so the tables are initialized automatically at system boot.

To confirm that the rules are present, we can display them:

# ip rule show
0:      from all lookup local
32764:  from all fwmark 0x2 lookup ISP2
32765:  from all fwmark 0x1 lookup ISP1
32766:  from all lookup main
32767:  from all lookup default
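
We can also ask the kernel directly which path a marked packet would take (8.8.8.8 is just an arbitrary external destination; the from/iif pair mimics a client on the dev network):

```shell
# should resolve via table ISP1 (out eth2) and table ISP2 (out eth3) respectively
ip route get 8.8.8.8 from 192.168.1.10 iif eth0 mark 1
ip route get 8.8.8.8 from 192.168.1.10 iif eth0 mark 2
```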

Finally, make sure that the rp_filter option is disabled on the router, otherwise it could drop packets:

# for i in /proc/sys/net/ipv4/conf/*/rp_filter; do echo 0 > "$i"; done

That's it, the router should be working and balancing traffic now. To confirm it, we can use tcpdump and verify that traffic is equally distributed among the links (or at least, that we see "some" traffic on each link).
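
To make that concrete, a quick way to eyeball the distribution is to capture on both uplinks at once (run each in its own terminal):

```shell
# watch outbound flows on each uplink; with balancing working, neither
# interface should stay silent while clients are active
tcpdump -ni eth2
tcpdump -ni eth3
```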

Local traffic

All the above works fine for traffic that originates from the LANs and traverses the router. What about traffic originated by the router itself? This case is a bit more complicated, since routing also has to consider the source IP address that the router's kernel chooses to put in outgoing packets, and that cannot always be controlled.
Suggestions on how to manage local traffic efficiently (ie, using both ISPs) are welcome. For the time being, since local traffic is usually not critical, what can be done is to just add a default route in the router's main routing table pointing directly to one of the ISPs.
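
As a sketch of that stopgap (using ISP1's sample addressing; adjust to taste), the main table simply gets a plain default:

```shell
# router-originated traffic follows the main table; point its default at ISP1
# (no balancing, no failover - just a working default for the box itself)
ip route add default dev eth2 via 1.1.1.2
```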

IPv6

It all works with IPv6 too. The only difference is that with IPv6 there is no need to do NAT (although it won't be long before that changes: one way or another, every vendor and manufacturer seems to be adding NAT66 support to their products). Here's some skeleton code to implement the same logic with IPv6:

# mark for ISP1
ip6tables -t mangle -N MARK-ISP1
ip6tables -t mangle -A MARK-ISP1 -j MARK --set-mark 1
ip6tables -t mangle -A MARK-ISP1 -j CONNMARK --save-mark
 
# mark for ISP2
ip6tables -t mangle -N MARK-ISP2
ip6tables -t mangle -A MARK-ISP2 -j MARK --set-mark 2
ip6tables -t mangle -A MARK-ISP2 -j CONNMARK --save-mark
 
# accept intra-LAN traffic
ip6tables -t mangle -A PREROUTING -i eth0 -s 2001:db8:0:1681::/64 -d 2001:db8:0:1682::/64 -j ACCEPT
ip6tables -t mangle -A PREROUTING -i eth1 -s 2001:db8:0:1682::/64 -d 2001:db8:0:1681::/64 -j ACCEPT
 
ip6tables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
ip6tables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
ip6tables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
ip6tables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2
 
ip6tables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
ip6tables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2
 
# routing stuff
ip -6 route flush table ISP1
ip -6 route add table ISP1 default dev eth2 via 2001:db8:0:1::2   # ISP1's interface
ip -6 route add table ISP1 2001:db8:0:1::/64 dev eth2
ip -6 route add table ISP1 2001:db8:0:2::/64 dev eth3
ip -6 route add table ISP1 2001:db8:0:1681::/64 dev eth0
ip -6 route add table ISP1 2001:db8:0:1682::/64 dev eth1
 
ip -6 route flush table ISP2
ip -6 route add table ISP2 default dev eth3 via 2001:db8:0:2::2   # ISP2's interface
ip -6 route add table ISP2 2001:db8:0:1::/64 dev eth2
ip -6 route add table ISP2 2001:db8:0:2::/64 dev eth3
ip -6 route add table ISP2 2001:db8:0:1681::/64 dev eth0
ip -6 route add table ISP2 2001:db8:0:1682::/64 dev eth1
 
ip -6 rule del from all fwmark 1 2>/dev/null
ip -6 rule del from all fwmark 2 2>/dev/null
ip -6 rule add fwmark 1 table ISP1
ip -6 rule add fwmark 2 table ISP2
ip -6 route flush cache

ISP Failover

In our scenario, either of the two upstream links to the ISPs can fail; in that case, we have to stop using the failed link and send all the traffic out the working one.

There is a nice, albeit poorly documented, program called Link Status Monitor that can periodically check any number of connections and run a user-defined script when it detects status changes. Normally the check is based on pinging a specific IP address, declaring failure if a configurable number of probes fail in a row, if the latency degrades beyond a certain value, and so on. Choosing the right IP to ping is very important: if we ping the ISP's interface facing us but the ISP has a failure further upstream, we will think the ISP is up while traffic sent to it is effectively blackholed, so it may be better to ping some address located a few hops upstream. On the other hand, if we do that and the ping fails, it may be due to a failure of the host we're pinging while the ISP as a whole is still working fine, in which case it should not be disabled. LSM allows the definition of connection groups, so more complex policies can be created (for example, declare the group - ie, the ISP - down only if all its members fail, and so on).

But all this is a matter of taste and policy, so everyone will configure them as they wish. What matters for the purposes of this discussion is that LSM invokes a script every time it detects a status change on one of the links (or ISP, or connection groups, etc.) it's monitoring. The idea is thus that each ISP would get a state file, and the script invoked by LSM would update the state file with the new status every time it detects a change. Then, the main router configuration script would be invoked, which would set up routing to all available providers. Let's assume that we use /var/run/ISP1_state and /var/run/ISP2_state as state files for our upstreams. Then the script invoked by LSM would be something like:

#!/bin/bash
state=${1}
name=${2}
 
echo "$state" > "/var/run/${name}_state"
config_router.sh
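
If LSM itself is not an option, the same state files can be maintained by a minimal ping-based checker run from cron (a rough sketch under that assumption; the probe addresses below are placeholders, ideally something a few hops into each ISP):

```shell
#!/bin/bash
# minimal LSM substitute: probe each ISP through its own interface and
# record up/down in the state file that config_router.sh reads
statedir=/var/run

probe() {  # probe <name> <probe-ip> <out-iface>
  local name=$1 target=$2 iface=$3 state=down
  # three probes, 2-second timeout each, bound to the ISP-facing interface
  if ping -c 3 -W 2 -I "$iface" "$target" >/dev/null 2>&1; then
    state=up
  fi
  echo "$state" > "${statedir}/${name}_state"
}

# probe ISP1 192.0.2.10 eth2     # placeholder upstream probe addresses
# probe ISP2 198.51.100.10 eth3
# config_router.sh               # rebuild the rules after updating state
```

Unlike LSM, this has no hysteresis or connection groups, so a single lost burst of pings will flap the link.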

Obviously one will also want to do other things like send an email containing detailed information about the event (the script called default_script that comes with LSM can be useful here), but the basic functionality is what is shown above. Now the script config_router.sh will contain the commands shown earlier, but it will also check which ISPs are up and configure iptables and routing rules to only use the available ISPs. For example (IPv4 only, adding IPv6 is trivial):

#!/bin/bash
 
# using associative arrays to store the information
 
declare -a isp
declare -A iface ip localip mark status
 
isp=( ISP1 ISP2 )
 
iface["ISP1"]="eth2"
iface["ISP2"]="eth3"
 
ip["ISP1"]="1.1.1.2"
ip["ISP2"]="2.2.2.2"
 
localip["ISP1"]="1.1.1.1"
localip["ISP2"]="2.2.2.1"
 
mark["ISP1"]=1
mark["ISP2"]=2
 
statedir=/var/run
 
upcount=0
for i in "${isp[@]}"; do
 
  # if there's no state file for the ISP, assume it's up
 
  if [ -f "${statedir}/${i}_state" ]; then
    status[$i]=$(< "${statedir}/${i}_state")
  else
    status[$i]="up"
  fi
 
  [ "${status[$i]}" = "up" ] && upcount=$((upcount+1))
done
 
# IPv4
 
# flush everything
iptables -F
iptables -t nat -F
iptables -t mangle -F
 
for i in "${isp[@]}"; do
  iptables -t mangle -X "MARK-${i}" 2>/dev/null
done
 
 
# SNAT for outgoing traffic, use providers that are available
for i in "${isp[@]}"; do
  if [ "${status[$i]}" = "up" ]; then
    iptables -t nat -A POSTROUTING -o "${iface[$i]}" -j SNAT --to-source "${localip[$i]}"
  fi
done
 
# chain to mark traffic for a specific provider
for i in "${isp[@]}"; do
  if [ "${status[$i]}" = "up" ]; then
    iptables -t mangle -N "MARK-${i}"
    iptables -t mangle -A "MARK-${i}" -j MARK --set-mark "${mark[$i]}"
    iptables -t mangle -A "MARK-${i}" -j CONNMARK --save-mark
  fi
done
 
# accept intra-LAN traffic
iptables -t mangle -A PREROUTING -i eth0 -s 192.168.1.0/24 -d 192.168.2.0/24 -j ACCEPT
iptables -t mangle -A PREROUTING -i eth1 -s 192.168.2.0/24 -d 192.168.1.0/24 -j ACCEPT
 
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
c=0
for i in "${isp[@]}"; do
  if [ "${status[$i]}" = "up" ]; then
    iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every ${upcount} --packet "${c}" -j "MARK-$i"
    iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every ${upcount} --packet "${c}" -j "MARK-$i"
    c=$((c+1))
  fi
done
 
# routing
 
for i in "${isp[@]}"; do
  ip route flush table "$i"
  if [ "${status[$i]}" = "up" ]; then
    # default is ISP-specific
    ip route add table "$i" default dev "${iface[$i]}" via "${ip[$i]}"
    # local routes
    ip route add table "$i" 1.1.1.0/24 dev eth2 src 1.1.1.1 
    ip route add table "$i" 2.2.2.0/24 dev eth3 src 2.2.2.1 
    ip route add table "$i" 192.168.1.0/24 dev eth0 src 192.168.1.254 
    ip route add table "$i" 192.168.2.0/24 dev eth1 src 192.168.2.254 
  fi
done
 
for i in "${isp[@]}"; do
  ip rule del from all fwmark "${mark[$i]}" 2>/dev/null
  if [ "${status[$i]}" = "up" ]; then
    ip rule add fwmark "${mark[$i]}" table "$i"
  fi
done
 
ip route flush cache

This is just a skeleton; the important points to remember are:

  • There's no need to delete anything, since the script recreates all the configuration from scratch every time;
  • The script should be idempotent, that is, it should be possible to run it as many times as we wish and no configuration should be duplicated after every run;
  • Since it redoes everything from scratch every time, the script must include any other iptables or iproute2 rules (not related to the task described here) that may be needed for whatever reason.

Conclusions

Adding more upstreams

The setup just described works fine with not just two, but any number of upstream ISPs, provided that the Linux router is configured accordingly (there needs to be a routing table for each ISP defined in /etc/iproute2/rt_tables). If the weight assigned to each ISP is not the same (ie, some should get more or less traffic than the others), then the algorithm that marks new connections using the nth mode of the statistic match should be adapted correspondingly (for example, with three active ISPs, sending two new connections out of every three to ISP1, and the remaining one to ISP2, or whatever); it may even be easier to use the random mode in that case, keeping in mind that each rule only sees the traffic left unmatched by the rules before it, so each probability must be computed on the remaining share (for an even three-way split: 1/3, then 1/2, then 1). Whatever one chooses, this obviously requires changing the sample code shown above.
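
Because each statistic rule in random mode only sees the traffic not matched by the rules before it, the --probability values are conditional rather than absolute. A small hypothetical helper (not part of the setup above) makes the arithmetic concrete:

```shell
#!/bin/bash
# given per-ISP weights, print the --probability value for each successive
# "-m statistic --mode random --probability P" rule: rule i must match
# weight_i out of the traffic still unmatched, ie weight_i / remaining_total
weights_to_probs() {
  local total=0 remaining w
  for w in "$@"; do total=$((total + w)); done
  remaining=$total
  for w in "$@"; do
    awk -v n="$w" -v d="$remaining" 'BEGIN { printf "%.4f\n", n / d }'
    remaining=$((remaining - w))
  done
}

# even three-way split: 1/3 of everything, 1/2 of the rest, all of the rest
weights_to_probs 1 1 1
# 2:1:1 split
weights_to_probs 2 1 1
```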

Drawbacks
  • This setup only provides load balancing and redundancy: while this is good, it does not provide more bandwidth per flow. As should be evident, any single flow will always use one given ISP, so the maximum bandwidth achievable by one flow is that offered by its ISP. Also, depending on the exact traffic pattern produced by users, it may happen that during specific periods of time one ISP is overutilized or underutilized (ie, every second new connection is a huge download, etc.). In general, those conditions should be temporary, and on average one should get a fair balance between the two ISPs.
  • There may be websites or services that do not like having connections to the same page or session supposedly made by the same user but coming from different IPs (perhaps some SSL websites, banks, etc); if that is the case, static routing rules need to be put in place for those targets, so that traffic to them is not load-balanced.
  • The Linux router is obviously a single point of failure. There are ways to use another machine as a hot standby router, which would take over if the first one fails. This is the job of the conntrack tools and the conntrackd daemon, and perhaps will be covered in a future article.

4 Comments

  1. Ravi Trivedi says:

    We have done load balancing on two different links in exactly the same way. However, we have observed an issue when a SYN packet is lost or dropped in the network (ISP1). In that case the user retries, and our load-balancing router sends the retransmitted SYN to the other link (ISP2), since the iptables rules consider it to be in the NEW connection state. However, the masquerade target still uses the IP address of ISP1 as the source, so the second packet goes out on ISP2 but with ISP1's source IP, and ISP2 appears to be dropping these packets.

    Has anyone observed this issue? Can anyone suggest a solution?

  2. Alex says:

    Thanks for your work, it is really helpful. I'm trying to implement site multihoming, which is pretty much like what you have done. The differences are that I have only one internal network and am considering only the ISP failover scenario (primary/backup link); most importantly, I want to use IPv6 network prefix translation as defined in RFC 6296 [http://tools.ietf.org/html/rfc6296], which is supported by Linux kernel 3.7.1 onward, as the translation mechanism instead of IPv4 NAT. Can you please guide me on how to modify your current work for my case?

    • waldner says:

      Regarding the single internal network, I hope the changes are obvious (just remove everything that references eth1, as you only have eth0).

      Since you don't want load balancing but only failover, this makes things much easier as you don't need all the fancy marking stuff, nor multiple routing tables. Also remove the parts that deal with the iptables' statistic match and just point the default route to the ISP you want to be the primary one.

      Now, when you detect failover you just point the default route to the "other" ISP.

      Regarding IPv6 NAT (which I personally find unnecessary), you'll have to do pretty much the same thing that is done for IPv4. Sample failover script follows:

      #!/bin/bash
       
      # using associative arrays to store the information
       
      declare -a isp
      declare -A iface ip ip6 localip localip6 status
       
      isp=( ISP1 ISP2 )
       
      iface["ISP1"]="eth2"
      iface["ISP2"]="eth3"
       
      ip["ISP1"]="1.1.1.2"
      ip["ISP2"]="2.2.2.2"
       
      ip6["ISP1"]="2001:db8:0:1::2"
      ip6["ISP2"]="2001:db8:0:2::2"
      
      localip["ISP1"]="1.1.1.1"
      localip["ISP2"]="2.2.2.1"
      
      localip6["ISP1"]="2001:db8:0:1::1"
      localip6["ISP2"]="2001:db8:0:2::1"
      
      statedir=/var/run
      
      # check which ISP is up
       
      for i in "${isp[@]}"; do
        # if there's no state file for the ISP, assume it's up
        if [ -f "${statedir}/${i}_state" ]; then
          status[$i]=$(< "${statedir}/${i}_state")
        else
          status[$i]="up"
        fi
      done
      
      # if ISP1 is up, use it
      
      if [ "${status[ISP1]}" = 'up' ]; then
      
        outiface=${iface[ISP1]}
        outip=${ip[ISP1]}
        outlocalip=${localip[ISP1]}
        outip6=${ip6[ISP1]}
        outlocalip6=${localip6[ISP1]}
      
      elif [ "${status[ISP2]}" = 'up' ]; then
      
        outiface=${iface[ISP2]}
        outip=${ip[ISP2]}
        outlocalip=${localip[ISP2]}
        outip6=${ip6[ISP2]}
        outlocalip6=${localip6[ISP2]}
      
      else
        # no ISP is up, exit
        echo "No ISP is up, terminating!" >&2
        exit 1
      fi
       
      # IPv4
       
      # flush everything
      iptables -F
      iptables -t nat -F
       
      # do NAT first
      
      # SNAT packets going out; MASQ may also be used instead
      iptables -t nat -A POSTROUTING -o "$outiface" -j SNAT --to-source "$outlocalip"
       
      # routing
       
      ip route del default
      ip route add default dev "${outiface}" via "${outip}"
      ip route flush cache
      
      
      # IPv6
       
      # flush all the rules
      ip6tables -F
      ip6tables -t nat -F
       
      # do NAT first
      
      # SNAT packets going out; MASQ may also be used instead
      ip6tables -t nat -A POSTROUTING -o "$outiface" -j SNAT --to-source "$outlocalip6"
       
      # routing
      
      ip -6 route del default
      ip -6 route add default dev "${outiface}" via "${outip6}"
      ip -6 route flush cache
      

      WARNING: this is untested, but it should give you a starting point. Of course you have to make the necessary adjustments, like adding the custom iptables rules which you surely have, and replacing the IP addresses and interface names with your actual ones.