
On-demand tar copy across the network

Everyone who has worked a while with the command line knows the classic trick to move files between machines using tar and ssh:

source$ tar -cf - -C /src/dir . | ssh user@target 'tar -xf - -C /dest/dir'

(I'll use -C here; if your tar doesn't support it, you can of course always do cd /src/dir && tar ... instead.)

While this works fine, it carries encryption overhead that is unnecessary when the machines we're moving files between are both on the same local LAN. So one could use netcat (or equivalent) instead of ssh, something like the following:

target$ nc -l -p 1234 | tar -xf - -C /dest/dir
source$ tar -cf - -C /src/dir . | nc target 1234

(adapt the netcat syntax to your specific variant, of which there are many; in particular, you may need to add a timeout or an option to handle half-closes, so that the transfer terminates correctly once all the input has been received, if your variant doesn't do that by default).
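
For example, with a netcat variant that supports -q (quit a given number of seconds after EOF on stdin), the sender side could be written as follows (just a sketch; check your variant's man page):

source$ tar -cf - -C /src/dir . | nc -q 1 target 1234

Other variants offer -N (shut down the socket after EOF on input) or a -w timeout for the same purpose.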

However, this requires setting up things in advance on the target machine, before the transfer can be started on the source. It would be nice if the target machine part could be somehow automated, so everything could be controlled entirely from the source machine. Here are some ideas.

xinetd

Most distributions come with xinetd preinstalled and running, so this seems like a good fit. Let's create a new service managed by xinetd (adapt as needed, of course):

service nettar
{
  port = 1234
  type = UNLISTED
  socket_type = stream
  instances = 1
  wait = no
  user = someuser
  server = /usr/local/bin/nettar.sh
  log_on_success += USERID PID HOST EXIT DURATION
  log_on_failure += USERID HOST ATTEMPT
  disable = no
}

It is important to specify wait = no if we want xinetd to pass data to our program on standard input. This is not clearly documented; see this page and this page for more information.
The script /usr/local/bin/nettar.sh looks something like this:

#!/bin/bash
 
# this reads just one line of input
IFS= read -r destdir
 
if ! cd "$destdir"; then
  echo "Error entering destination directory $destdir" >&2
  exit 1
fi
 
# just untar what we get on stdin from xinetd
tar -xf -

So on the source machine we can then do

source$ { echo "/dest/dir"; tar -cf - -C /src/dir . ; } | nc target 1234

and wait for the transfer to complete. Obviously this can all be put inside a wrapper script that does the necessary sanity checks to ensure that the first line of what we send to the target machine is indeed the name of the remote directory.
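
A minimal sketch of such a wrapper (the script name, argument handling and checks here are just illustrative):

#!/bin/bash
 
# nettar-send.sh: send a local directory to the nettar xinetd service
# usage: nettar-send.sh /src/dir targethost /dest/dir
 
srcdir=$1
target=$2
destdir=$3
 
if [ ! -d "$srcdir" ]; then
  echo "Source directory $srcdir does not exist" >&2
  exit 1
fi
 
# the remote script expects an absolute path on the first line
case $destdir in
  /*) ;;
  *) echo "Destination directory must be an absolute path" >&2; exit 1 ;;
esac
 
{ printf '%s\n' "$destdir"; tar -cf - -C "$srcdir" . ; } | nc "$target" 1234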

To transfer to multiple machines at once, process substitution can be used, as in

{ echo "/dest/dir"; tar -cf - -C /src/dir . ; } | tee >(nc target1 1234) >(nc target2 1234) | nc target3 1234  # etc

Assuming of course that our xinetd service is running on all target machines.

Start netcat by ssh

This method is kludgier than the previous one, but has the advantage that it does not require xinetd on the target machine, only ssh (and netcat, of course). The idea is to create a wrapper script that connects to the target machine using ssh, starts netcat there in listen mode, and finally starts the transfer on the source machine. Needless to say, having public key authentication set up for ssh helps significantly here.
A sample script could be something like:

#!/bin/bash
 
#nettar2.sh
 
sourcedir=$1
target=$2
 
targetmachine=${target%%:*}
targetdir=${target#*:}
 
if ssh user@"$targetmachine" "cd '$targetdir' || exit 1; { nc -l -p 1234 | tar -xvf - ; } </dev/null >/dev/null 2>&1 &"; then
  tar -cf - -C "$sourcedir" . | nc "$targetmachine" 1234
else
  echo "Error changing to $targetdir on $targetmachine" >&2
  exit 1 
fi

Here we do an explicit cd on target, to make sure the directory exists and fail immediately if not. Of course, everything can be changed and/or adapted.

We should take care that, after starting netcat on the target, ssh returns instead of hanging; hence all the /dev/null redirections (the local redirections inside the braces override the outer ones).

So with this we can do

source$ nettar2.sh /src/dir target:/dest/dir

and transfer our files at full speed without having to connect manually to the target machine to prepare the transfer.

Conclusion

The tricks described here are a bit kludgy and possibly not always worth the effort, but there are scenarios where they can be employed successfully and make life easier than other more usual methods.

Compression can be added to tar if wanted (you may want to do some test and see if it makes things better or worse, or make it a commandline option to the script).
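
For instance, a sketch with gzip compression added to the xinetd-based transfer (assuming GNU tar, and with the remote script changed to use tar -xzf - accordingly):

source$ { echo "/dest/dir"; tar -czf - -C /src/dir . ; } | nc target 1234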

IPv4 to IPv6 communication (and vice versa): some kludges

Ok, long title. A scenario that may occur sometimes is the need to communicate between an IPv4-only host and an IPv6-only host, or vice versa. There are kludges to do that at the IP level (e.g. NAT64 and friends), but let's assume none of those is available. Let's use a different kludge, one that works at the application level.

If the service that we need to use runs on TCP and we have a dual-stacked intermediate machine that we can access, then we can use SSH's port forwarding mechanism. Here's an example (IPv4 to IPv6):

# the -4 is to emphasize that we're connecting over IPv4
ipv4box$ ssh -4 user@intermediate -L 1234:[2001:db8:0:1::1]:1234

After doing this, connecting to 127.0.0.1:1234 will actually connect to the service at 2001:db8:0:1::1 port 1234 (the chosen port numbers are of course arbitrary, but you get the idea).

The reverse works just the same; if we're IPv6-only (currently very unlikely, but anyway) and the service we need is IPv4-only, then we can do

# the -6 is to emphasize that we're connecting over IPv6
ipv6box$ ssh -6 user@intermediate -L 1234:10.55.12.3:1234

and then use [::1]:1234 to access the service on 10.55.12.3. If the service being accessed is HTTP, most likely you'll need to set up some DNS or /etc/hosts entries to make the browser send the correct Host: header and so on, but overall things should work fine.
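
For instance, on the IPv6-only box an /etc/hosts entry like the following (www.example.com is just a placeholder for the real site name) lets the browser reach the service as http://www.example.com:1234/ while still sending the expected Host: header:

::1   www.example.com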

And so on: -R, -W, ProxyCommand and all the other tricks can of course be used as well.

Since SSH port forwarding works at the TCP level, it just has to move the data in the TCP payload back and forth between an IPv4 socket and an IPv6 one, so there's no complicated protocol translation stuff going on.

Actually, we don't strictly need SSH; any TCP proxy would do the job. It's just that SSH is universally available and has the functionality built in. Let's see how to do the IPv4-to-IPv6 case with socat, for example. On the intermediate machine, we need to run

# TCP4-L accepts options to make it spawn children etc., see the man page
intermediate$ socat TCP4-L:1234 TCP6:[2001:db8:0:1::1]:1234

In this case we can then connect over IPv4 to intermediate:1234 and access the service. In fact, with socat we can even protocol-translate UDP services, since (I believe) socat can move the UDP payload from one "connection" to the other. For example, to send all DNS queries to an IPv6-only server (it doesn't make a lot of sense, but it shows the basic idea):

intermediate# socat UDP4-RECVFROM:53,fork UDP6-SENDTO:[2001:db8:0:1::2]:53

All that being said, the sooner you go dual-stack with native IPv6, the better; that advice applies now more than ever.

A strange packet loss

(not really, as usual, once you know what's going on...)

During a network troubleshooting session, while running the excellent mtr to a remote IP, I needed to run another instance of it to another destination.

So while I was running this:

 Host                                                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. gw.office                                             0.0%   148    0.3   0.4   0.2   3.9   0.7
 2. ge-0-1.transit                                        0.0%   148    1.6   4.7   1.3  74.1   9.5
 3. rtr4.example.com                                      0.0%   148    1.1  15.9   0.9 195.1  38.8
... 

I opened another window and ran another mtr. Right from the beginning, this second mtr (but it could just as easily have been the first one, as we'll see) was showing a ridiculous percentage of packet loss for the first hop:

 Host                                                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. gw.office                                            85.7%     8    0.3   0.3   0.3   0.3   0.0
 2. ge-0-1.transit                                        0.0%     8    1.8   1.7   1.4   2.2   0.3
 3. lon-par.upstream2.example.com                         0.0%     8    1.1   1.1   1.0   1.2   0.1
...

Obviously there was something strange (especially because packets were certainly getting through, given the 0% loss at the following hops), but what?

Before proceeding, let's briefly review how mtr works. The full story is here; the executive summary is that, like any good traceroute-like program, mtr needs to be able to receive ICMP error messages as proof that hosts along the path are alive. The error packets that are returned are of the "time to live exceeded" kind, except for the last hop which sends back a regular echo reply in ICMP mode, and a "port unreachable" ICMP error when in UDP mode.
In the above scenario, I was using ICMP mode, but it would have been the same in UDP mode.

If mtr is expecting an ICMP error message to come back from a certain host but doesn't receive it, it considers this a symptom of packet loss at that hop.

Now, let's go back to our two concurrent mtr instances. As can be seen, the first hop (where the supposed packet loss was happening according to the second instance) is the same for both traces. Since each mtr instance by default sends one packet per second, the office gateway should have been sending back two ICMP error messages every second. But tcpdump showed that only ONE such message per second was arriving at my machine. So, obviously, the mtr instance that wasn't getting its expected ICMP error was declaring packet loss at the first hop.
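
A quick way to see this is to capture only the returning "time to live exceeded" messages (a sketch; adjust the interface name to your setup):

# tcpdump -ni eth0 'icmp[icmptype] == icmp-timxceed'

With both mtr instances running, roughly one such packet per second shows up from the gateway instead of the expected two.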

Now here's the catch: the office gateway is a Linux machine, and by default Linux limits the rate at which certain ICMP packets can be sent. The limit is per-IP. There are a couple of files under /proc/sys/net/ipv4/ that determine which packet types are rate-limited, and the limit rate itself.
It turns out that by default, the ICMP error messages that mtr needs (both "time to live exceeded" and "port unreachable") are rate-limited, and the rate is one packet per second. This is from man 7 icmp:

icmp_ratelimit (integer; default: 1000; since Linux 2.4.10)
       Limit the maximum rates for sending ICMP packets whose type matches
       icmp_ratemask (see below) to specific targets. 0 to disable any
       limiting, otherwise the minimum space between responses in
       milliseconds.

icmp_ratemask (integer; default: see below; since Linux 2.4.10)
       Mask made of ICMP types for which rates are being limited.

       Significant bits: IHGFEDCBA9876543210
       Default mask:     0000001100000011000 (0x1818)

       Bit definitions (see the Linux kernel source file include/linux/icmp.h):

       0 Echo Reply
       3 Destination Unreachable *
       4 Source Quench *
       5 Redirect
       8 Echo Request
       B Time Exceeded *
       C Parameter Problem *
       D Timestamp Request
       E Timestamp Reply
       F Info Request
       G Info Reply
       H Address Mask Request
       I Address Mask Reply

       The bits marked with an asterisk are rate limited by default
       (see the default mask above).

Mystery solved. When running two instances, the instance that gets the ICMP error can be either one, depending on how their packets are interleaved; in general, since they both send packets at regular one-second intervals, one of them will get all the ICMP errors and the other one none, which is what we observed. Here the rate-limited hop was in the local LAN; if the shared, rate-limited hop is further away, latency can add "jitter", so to speak, to the returning ICMP errors, which can then be distributed more evenly between the two mtr instances (both of which will then report packet loss at that hop, even though there is most likely none).

Solution: do not run two or more mtr instances from the same source IP address whose traces share a Linux hop. And preferably do not use a probe interval shorter than one second, even if you're running a single instance, otherwise you'll see strange packet losses popping up if some hops of your trace happen to be Linux machines. This is probably what's happening if you're sending N probes per second and a packet loss of about (N-1) * 100 / N percent is reported.
If you have access to the Linux host in question, another approach could be to remove or loosen the rate limit setting (but think carefully, because the limitation is there for a reason).
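
For instance, a sketch of loosening the limit on the gateway (here allowing one rate-limited ICMP packet every 100 ms instead of every second):

gateway# sysctl -w net.ipv4.icmp_ratelimit=100

or, equivalently,

gateway# echo 100 > /proc/sys/net/ipv4/icmp_ratelimit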

Finally, you can of course do nothing, if you are aware of this fact and can live with it.

Many thanks to Jordi Clariana for the discussion we had about this and the ideas he brought in.

Policy routing, multihoming and all that jazz

This is to remind me of how to do the most common tasks involved with multihoming.

Scenario: dual- (or multi-, for that matter) homed Linux box acting as router for one or more local networks. Here we'll assume two local networks, and two upstream ISPs (no dynamic routing - only static defaults). The two ISPs are on the router's eth2 and eth3 respectively, while the internal networks are on eth0 (dev network) and eth1 (R&D network). It's straightforward to extend the sample code shown here to manage more ISPs or more internal LANs. Here's a picture:

Sample scenario

The Linux router also does NAT for internally-initiated connections. What we want to achieve is:

  • Load balancing of the traffic among the ISPs (within reason)
  • ISP failover: if one ISP or the link to it fails, stop using it

NAT

Let's take care of the easy things first: since our Linux router will perform NAT on outgoing packets, let's write those rules first:

# SNAT packets going out eth2 to ISP1
iptables -t nat -A POSTROUTING -o eth2 -j SNAT --to-source 1.1.1.1
 
# SNAT packets going out eth3 to ISP2
iptables -t nat -A POSTROUTING -o eth3 -j SNAT --to-source 2.2.2.1

Load balancing

Fortunately, iptables can classify all kinds of traffic with its stateful connection tracking mechanism, even if it's not strictly flow-oriented (think UDP or ICMP). What we want in our setup is that all the packets belonging to a "flow" (also called a connection in iptables speak) use the same ISP (ie, same interface on the router), because otherwise packets will have their source addresses rewritten to different IPs which will most likely confuse the targets and any stateful device along the path.

From iptables' point of view, all packets have a state. For our purposes, the most interesting states are NEW, which means the packet doesn't belong to any existing connection (and thus a new connection will be created in the conntrack table), and ESTABLISHED/RELATED, which identify packets belonging (or related) to an existing connection.

To achieve our goal, we will use the marking mechanism that iptables provides. In iptables, we can mark a single packet or the entire connection to which a packet belongs: it is possible to mark a packet, then assign (ie, save) the packet mark to the connection, or the other way round (assign the connection mark to the packet). Let's make clear that these "markings" do not modify the packets; they only live in the router's memory, in the form of metadata in the connection tracking table. With the userspace conntrack utility, the connection table can be printed, which will show the marks.
Once a packet is marked (and all packets belonging to the same connection will have the same mark), we will route using one or the other ISP based on the packet mark.
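
For example, a sketch of inspecting the marks with the conntrack utility mentioned above (the exact output format varies between versions):

# conntrack -L --mark 1

which lists only the connections currently marked with 1, ie those bound to ISP1.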

The last piece of the puzzle is how to decide which ISP to use when we see a brand new packet (a packet that is creating a new flow, and for which no previous flow exists). To load balance among N ISPs, we should ideally try to send 1/N of all the new flows to one ISP, 1/N to another, etc. In the 2-ISP case, this means dividing the traffic 50% to ISP1 and 50% to ISP2. With iptables, there are two common ways to achieve that, and both use the statistics module. The statistics module can operate in two modes: the so-called nth (matches if the packet is, er, the nth when counting in a round-robin fashion), or it can operate in random mode (that is, with a probability X of matching a packet).

So here's what we'll do:

  • if the packet is NEW, choose an ISP and mark the packet accordingly. For this, we need as many different mark values as we have ISPs (two in this example). Once the packet is marked, mark the new connection with the same mark.
  • if the packet belongs to an existing connection (state ESTABLISHED/RELATED), mark the packet with the same mark that the connection has (it must have been marked before, when it was NEW).
  • based on the packet mark (regardless of how we obtained it), decide which ISP to use.

Let's break it down and implement it with actual iptables and iproute2 rules.

iptables

Here are the iptables commands to run on the router:

# chain which marks a packet (MARK) and its connection (CONNMARK) with mark 1 (for ISP1)
iptables -t mangle -N MARK-ISP1
iptables -t mangle -A MARK-ISP1 -j MARK --set-mark 1
iptables -t mangle -A MARK-ISP1 -j CONNMARK --save-mark
 
# chain which marks a packet (MARK) and its connection (CONNMARK) with mark 2
iptables -t mangle -N MARK-ISP2
iptables -t mangle -A MARK-ISP2 -j MARK --set-mark 2
iptables -t mangle -A MARK-ISP2 -j CONNMARK --save-mark
 
# real work begins here
 
# do not touch inter-LAN traffic
iptables -t mangle -A PREROUTING -i eth0 -s 192.168.1.0/24 -d 192.168.2.0/24 -j ACCEPT
iptables -t mangle -A PREROUTING -i eth1 -s 192.168.2.0/24 -d 192.168.1.0/24 -j ACCEPT
 
# If the packet is not NEW, there must be a connection for it, so get the connection
# mark and apply it to the packet
 
# packets from dev network
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
# packets from R&D network
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
# on the other hand, if the state is NEW, we have to decide where to send it
# Use the statistics match in nth mode
 
# dev network
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2
 
# same for R&D network
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2

Routing

Now that packets are marked (either because they're new or because they belong to an existing flow), they can be routed based on their mark. To do this, we use iproute2 rules. In Linux it is possible to have multiple routing tables, and that's what we're going to do here. Each routing table has a number associated with it, but it's easier to use names. Thus, we can edit the file /etc/iproute2/rt_tables and add two new values for two new routing tables:

#
# reserved values
#
255     local
254     main
253     default
0       unspec

# add these two
250     ISP1
249     ISP2

Next we initialize the two new routing tables. Each table only needs a default route pointing to the corresponding upstream ISP; however, adding the local routes can't hurt:

ip route flush table ISP1
ip route add table ISP1 default dev eth2 via 1.1.1.2
# add local routes too
ip route add table ISP1 1.1.1.0/24 dev eth2 src 1.1.1.1                           
ip route add table ISP1 2.2.2.0/24 dev eth3 src 2.2.2.1                           
ip route add table ISP1 192.168.1.0/24 dev eth0 src 192.168.1.254                           
ip route add table ISP1 192.168.2.0/24 dev eth1 src 192.168.2.254     
 
ip route flush table ISP2
ip route add table ISP2 default dev eth3 via 2.2.2.2
ip route add table ISP2 1.1.1.0/24 dev eth2 src 1.1.1.1                           
ip route add table ISP2 2.2.2.0/24 dev eth3 src 2.2.2.1                           
ip route add table ISP2 192.168.1.0/24 dev eth0 src 192.168.1.254                           
ip route add table ISP2 192.168.2.0/24 dev eth1 src 192.168.2.254

And the rules to use them:

ip rule del from all fwmark 2 2>/dev/null
ip rule del from all fwmark 1 2>/dev/null
ip rule add fwmark 1 table ISP1
ip rule add fwmark 2 table ISP2
ip route flush cache

The above commands (including the iptables commands shown earlier) can be added to some boot-time script (see below for an example that is also smarter), so the tables are initialized automatically at system boot.

To confirm that the rules are present, we can display them:

# ip rule show
0:      from all lookup local
32764:  from all fwmark 0x2 lookup ISP2
32765:  from all fwmark 0x1 lookup ISP1
32766:  from all lookup main
32767:  from all lookup default

Finally, make sure that the rp_filter option is disabled on the router, otherwise it could drop packets:

# for i in /proc/sys/net/ipv4/conf/*/rp_filter; do echo 0 > "$i"; done

That's it, the router should be working and balancing traffic now. To confirm it, we can use tcpdump and verify that traffic is equally distributed among the links (or at least, that we see "some" traffic on each link).
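
For example, a couple of quick captures (interface names as in the picture above) should both show outgoing traffic:

# tcpdump -ni eth2 -c 50
# tcpdump -ni eth3 -c 50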

Local traffic

All the above works fine for traffic that originates from the LANs and traverses the router. What about traffic originated by the router itself? This case is a bit more complicated since routing also has to consider the source IP address that the router's kernel chooses to put in outgoing packets, and that cannot always be controlled.
Suggestions on how to manage local traffic efficiently (ie, using the two ISPs) are welcome. For the time being, since local traffic is usually not relevant, what can be done is to just add a default route in the router's main routing table pointing directly to one of the available ISPs. This will not be shown in this example.

IPv6

It all works with IPv6 too. The only difference is that with IPv6 there is no need to do NAT (although it probably won't be long before that becomes common, since one way or another every vendor seems to be adding NAT66 support to their products). Here's some skeleton code to implement the same logic with IPv6:

# mark for ISP1
ip6tables -t mangle -N MARK-ISP1
ip6tables -t mangle -A MARK-ISP1 -j MARK --set-mark 1
ip6tables -t mangle -A MARK-ISP1 -j CONNMARK --save-mark
 
# mark for ISP2
ip6tables -t mangle -N MARK-ISP2
ip6tables -t mangle -A MARK-ISP2 -j MARK --set-mark 2
ip6tables -t mangle -A MARK-ISP2 -j CONNMARK --save-mark
 
# accept intra-LAN traffic
ip6tables -t mangle -A PREROUTING -i eth0 -s 2001:db8:0:1681::/64 -d 2001:db8:0:1682::/64 -j ACCEPT
ip6tables -t mangle -A PREROUTING -i eth1 -s 2001:db8:0:1682::/64 -d 2001:db8:0:1681::/64 -j ACCEPT
 
ip6tables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
ip6tables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
ip6tables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
ip6tables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2
 
ip6tables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 0 -j MARK-ISP1
ip6tables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every 2 --packet 1 -j MARK-ISP2
 
# routing stuff
ip -6 route flush table ISP1
ip -6 route add table ISP1 default dev eth2 via 2001:db8:0:1::2   # ISP1's interface
ip -6 route add table ISP1 2001:db8:0:1::/64 dev eth2
ip -6 route add table ISP1 2001:db8:0:2::/64 dev eth3
ip -6 route add table ISP1 2001:db8:0:1681::/64 dev eth0
ip -6 route add table ISP1 2001:db8:0:1682::/64 dev eth1
 
ip -6 route flush table ISP2
ip -6 route add table ISP2 default dev eth3 via 2001:db8:0:2::2   # ISP2's interface
ip -6 route add table ISP2 2001:db8:0:1::/64 dev eth2
ip -6 route add table ISP2 2001:db8:0:2::/64 dev eth3
ip -6 route add table ISP2 2001:db8:0:1681::/64 dev eth0
ip -6 route add table ISP2 2001:db8:0:1682::/64 dev eth1
 
ip -6 rule del from all fwmark 1 2>/dev/null
ip -6 rule del from all fwmark 2 2>/dev/null
ip -6 rule add fwmark 1 table ISP1
ip -6 rule add fwmark 2 table ISP2
ip -6 route flush cache

ISP Failover

In our scenario, one of the two upstream links to the two ISPs can fail; in that case, we have to stop using it and send all the traffic out the working link.

There is a nice, albeit poorly documented, program called Link Status Monitor that can periodically check any number of connections and run a user-defined script when it detects a status change. Normally the check is based on pinging a specific IP address, declaring failure if a configurable number of probes fail in a row, if the latency changes beyond a certain value, and so on. Choosing the right IP to ping is very important: if we ping the ISP interface facing us but the ISP has a failure further upstream, we will think the ISP is up while traffic sent to it is effectively blackholed, so it may be better to ping some address located a few hops upstream. On the other hand, if we do that and the ping fails, the failure may be in the host we're pinging while the ISP as a whole is still working fine, so it should not be disabled. LSM allows the definition of connection groups, with which more complex policies can be created (for example, declaring the group - ie, the ISP - down only if all its members fail, and so on).

But all this is a matter of taste and policy, so everyone will configure them as they wish. What matters for the purposes of this discussion is that LSM invokes a script every time it detects a status change on one of the links (or ISP, or connection groups, etc.) it's monitoring. The idea is thus that each ISP would get a state file, and the script invoked by LSM would update the state file with the new status every time it detects a change. Then, the main router configuration script would be invoked, which would set up routing to all available providers. Let's assume that we use /var/run/ISP1_state and /var/run/ISP2_state as state files for our upstreams. Then the script invoked by LSM would be something like:

#!/bin/bash
state=${1}
name=${2}
 
echo "$state" > "/var/run/${name}_state"
config_router.sh

Obviously one will also want to do other things like send an email containing detailed information about the event (the script called default_script that comes with LSM can be useful here), but the basic functionality is what is shown above. Now the script config_router.sh will contain the commands shown earlier, but it will also check which ISPs are up and configure iptables and routing rules to only use the available ISPs. For example (IPv4 only, adding IPv6 is trivial):

#!/bin/bash
 
# using associative arrays to store the information
 
declare -a isp
declare -A iface ip localip mark status
 
isp=( ISP1 ISP2 )
 
iface["ISP1"]="eth2"
iface["ISP2"]="eth3"
 
ip["ISP1"]="1.1.1.2"
ip["ISP2"]="2.2.2.2"
 
localip["ISP1"]="1.1.1.1"
localip["ISP2"]="2.2.2.1"
 
mark["ISP1"]=1
mark["ISP2"]=2
 
statedir=/var/run   # where the LSM hook script shown above writes the state files
 
upcount=0
for i in "${isp[@]}"; do
 
  # if there's no state file for the ISP, assume it's up
 
  if [ -f "${statedir}/${i}_state" ]; then
    status[$i]=$(< "${statedir}/${i}_state")
  else
    status[$i]="up"
  fi
 
  [ "${status[$i]}" = "up" ] && upcount=$((upcount+1))
done
 
# IPv4
 
# flush everything
iptables -F
iptables -t nat -F
iptables -t mangle -F
 
for i in "${isp[@]}"; do
  iptables -t mangle -X "MARK-${i}" 2>/dev/null
done
 
 
# SNAT for outgoing traffic, use providers that are available
for i in "${isp[@]}"; do
  if [ "${status[$i]}" = "up" ]; then
    iptables -t nat -A POSTROUTING -o "${iface[$i]}" -j SNAT --to-source "${localip[$i]}"
  fi
done
 
# chain to mark traffic for a specific provider
for i in "${isp[@]}"; do
  if [ "${status[$i]}" = "up" ]; then
    iptables -t mangle -N "MARK-${i}"
    iptables -t mangle -A "MARK-${i}" -j MARK --set-mark "${mark[$i]}"
    iptables -t mangle -A "MARK-${i}" -j CONNMARK --save-mark
  fi
done
 
# accept intra-LAN traffic
iptables -t mangle -A PREROUTING -i eth0 -s 192.168.1.0/24 -d 192.168.2.0/24 -j ACCEPT
iptables -t mangle -A PREROUTING -i eth1 -s 192.168.2.0/24 -d 192.168.1.0/24 -j ACCEPT
 
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j CONNMARK --restore-mark
 
c=0
for i in "${isp[@]}"; do
  if [ "${status[$i]}" = "up" ]; then
    iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode nth --every ${upcount} --packet "${c}" -j "MARK-$i"
    iptables -t mangle -A PREROUTING -i eth1 -m conntrack --ctstate NEW -m statistic --mode nth --every ${upcount} --packet "${c}" -j "MARK-$i"
    c=$((c+1))
  fi
done
 
# routing
 
for i in "${isp[@]}"; do
  ip route flush table "$i"
  if [ "${status[$i]}" = "up" ]; then
    # default is ISP-specific
    ip route add table "$i" default dev "${iface[$i]}" via "${ip[$i]}"
    # local routes
    ip route add table "$i" 1.1.1.0/24 dev eth2 src 1.1.1.1 
    ip route add table "$i" 2.2.2.0/24 dev eth3 src 2.2.2.1 
    ip route add table "$i" 192.168.1.0/24 dev eth0 src 192.168.1.254 
    ip route add table "$i" 192.168.2.0/24 dev eth1 src 192.168.2.254 
 fi
done
 
for i in "${isp[@]}"; do
  ip rule del from all fwmark "${mark[$i]}" 2>/dev/null
  if [ "${status[$i]}" = "up" ]; then
    ip rule add fwmark "${mark[$i]}" table "$i"
  fi
done
 
ip route flush cache

This is just a skeleton; the important points to remember are:

  • There's no need to delete anything, since the script recreates all the configuration from scratch every time;
  • The script should be idempotent, that is, it should be possible to run it as many times as we wish and no configuration should be duplicated after every run;
  • Since it redoes everything from scratch every time, the script must include any other iptables or iproute2 rules (not related to the task described here) that may be needed for whatever reason.

Conclusions

Adding more upstreams

The setup just described works fine not just with two, but with any number of upstream ISPs, provided the Linux router is configured accordingly (there needs to be a routing table for each ISP defined in /etc/iproute2/rt_tables). If the weight assigned to each ISP is not the same (ie, some should get more or less traffic than the others), then the algorithm that marks new connections using the nth mode of the statistic match should be adapted correspondingly (for example, with three active ISPs, sending two new connections out of every three to ISP1 and the remaining one to ISP2, or whatever); it may even be easier to use the random mode in that case, as long as the overall shares assigned to the active ISPs add up to 1. Whatever one chooses, this obviously requires changing the sample code shown above.
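
As a sketch of the random-mode variant with three ISPs and equal weights (only the eth0 rules are shown, eth1 is analogous; MARK-ISP3 is a hypothetical third marking chain defined like the other two; note the -m mark --mark 0 guards, which make each probability conditional on the packet not having been marked by a previous rule, so each ISP ends up with one third of the new connections):

iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m statistic --mode random --probability 0.3333 -j MARK-ISP1
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m mark --mark 0 -m statistic --mode random --probability 0.5 -j MARK-ISP2
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -m mark --mark 0 -j MARK-ISP3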

Drawbacks

  • This setup only provides load balancing and redundancy: while this is good, it does not provide more bandwidth for any single flow. As should be evident, a single flow will always use one given ISP, so the maximum bandwidth achievable is the one offered by that ISP. Also, depending on the exact traffic pattern produced by users, during specific periods of time one ISP may be over- or underutilized (ie, every second new connection is a huge download, etc.). In general, those conditions should be temporary, and on average one should get a fair balance between the two ISPs.
  • There may be websites or services that do not like connections to the same page or session, supposedly made by the same user, coming from different IPs (perhaps some SSL websites, banks, etc.); if that is the case, static routing rules need to be put in place for those targets, so that traffic to them is not load-balanced (see the sketch after this list).
  • The Linux router is obviously a single point of failure. There are ways to use another machine as a hot standby router, which would take over if the first one fails. This is the job of the conntrack tools and the conntrackd daemon, and perhaps will be covered in a future article.
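
As an example of the second point, using the documentation prefix 203.0.113.0/24 as a stand-in for such a picky destination, the simplest approach is probably to add the same specific route to every per-ISP table (and to the main table too, if router-local traffic matters), so that traffic to that destination always leaves through, say, ISP1 regardless of the packet mark; a sketch, to be merged into config_router.sh:

ip route add table ISP1 203.0.113.0/24 dev eth2 via 1.1.1.2
ip route add table ISP2 203.0.113.0/24 dev eth2 via 1.1.1.2

Since the SNAT rules are per outgoing interface, the source address will then consistently be 1.1.1.1 for that destination (of course, if ISP1 goes down, that destination becomes unreachable until the routes are changed).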

OpenVPN LDAP authentication