The strange case of Dr. Multihome and Mr. Nobind

Topology as follows:

[Figure: openvpn-mh, network topology diagram]

The server is multihomed to two ISPs, and OpenVPN is running with this configuration:

port 1194
multihome   
proto udp
dev tap0
ca ca.crt
cert server.crt
key server.key
script-security 2
up up.sh
down down.sh
dh dh1024.pem
server-bridge 10.0.1.251 255.255.255.0 10.0.1.10 10.0.1.20
client-to-client
keepalive 10 120
comp-lzo
persist-key
persist-tun
verb 6

(It could just as well be using routed mode; bridging is not the problem here.)
Both clients (well, all clients, for that matter) run this config:

client
remote 1.1.1.1 1194
remote 2.2.2.2 1194
remote-random
proto udp
dev tap0
ca ca.crt
cert client.crt
key client.key
keepalive 10 120
comp-lzo
persist-key
persist-tun
verb 3

The idea is that clients should connect randomly to one of the addresses on the server, to provide some distribution of the traffic between the two ISPs. On the server, multihome is used so replies to traffic coming in from one ISP go out the same ISP.

Problem: under certain mysterious circumstances, some clients are not able to connect. Better (or worse): the client connects, say to 1.1.1.1, the handshake completes successfully (up to the "Initialization Sequence Completed" message), but as soon as data traffic begins to flow, the client log fills with these messages:

Thu Nov 12 19:58:59 2009 TCP/UDP: Incoming packet rejected from 2.2.2.2:1194[2], expected peer address: 1.1.1.1:1194 (allow this incoming source address/port by removing --remote or adding --float)

Some investigation on the server side, with the debug level raised, shows that the server changes the outgoing interface after the initialization sequence is completed and the first data packet is sent to the client (see the last line below, where "via 1.1.1.1" becomes "via 2.2.2.2"):

...
Thu Nov 12 19:58:55 2009 us=570541 Test-Client/192.168.2.2:1194 UDPv4 READ [77] from 192.168.2.2:1194 (via 1.1.1.1): P_DATA_V1 kid=0 DATA len=76
Thu Nov 12 19:58:55 2009 us=570881 Test-Client/192.168.2.2:1194 TUN WRITE [42]
Thu Nov 12 19:58:55 2009 us=571510 Test-Client/192.168.2.2:1194 TUN READ [42]
Thu Nov 12 19:58:55 2009 us=571790 Test-Client/192.168.2.2:1194 UDPv4 WRITE [77] to 192.168.2.2:1194 (via 2.2.2.2): P_DATA_V1 kid=0 DATA len=76

and given the above, the client rightly complains.

To make things worse, the problem does not always happen. The first client that connects always works fine, regardless of the IP address it connects to. If a different client connects, that works fine as well, again regardless of the IP. But if a client connected to one of the server's IPs disconnects and then reconnects after a short time, and that new connection goes to the server's other IP, the problem happens.
To add to the fun, I have another virtually identical setup in production, and that one works flawlessly.

After some days of unsuccessful troubleshooting, I had made no progress on the problem. Searching the Internet turned up an old message on the mailing list that hints at something similar, but it was not clear whether it applied to the case in question, so it was no help.
Being lost and suspecting a bug or some other obscure corner case (though the config looked straightforward), I asked for help on the openvpn-devel mailing list, where another (different) issue related to multihome was being discussed (and, btw, sorry for hijacking the thread).

Well, the reply surprised me. James suggested adding nobind to the **client** configuration, while I had been looking for something weird on the server. And adding nobind to the client config did indeed do the trick. A quick check of the client config in the working production environment revealed that the clients there were already using nobind.
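
For reference, the fix amounts to a single extra line in the client configuration shown earlier. This is a sketch that assumes everything else stays exactly as it was; the position of nobind within the file is arbitrary.

```
client
remote 1.1.1.1 1194
remote 2.2.2.2 1194
remote-random
# The only change: do not bind to a fixed local address and port,
# so each (re)connection leaves from a dynamically chosen source port.
nobind
proto udp
dev tap0
ca ca.crt
cert client.crt
key client.key
keepalive 10 120
comp-lzo
persist-key
persist-tun
verb 3
```

As far as I understand the defaults, a client without nobind binds its local UDP port (1194 unless told otherwise), which is why the same source port shows up again on every reconnection.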

Here's the official explanation for the behavior:

Using nobind on the client for UDP client connections generates a socket
with a dynamic source port number. This is key because it means that
when the client reconnects, it does so with a new source port number,
and this allows OpenVPN to detect that the initial UDP packet represents
a new connection, and is not part of the old connection.

The problem is that when nobind is not used, the source port on the new
connection is recycled -- it's the same as the old connection. So when
OpenVPN sees the connection-initiating packet, after the client switches
over to the secondary server address, it gets confused because it
doesn't expect sessions from a given source address to change its
destination address mid-session.

The whole thread is available here.

Bottom line: always, always, always use nobind on the clients, even if they are single-homed, unless you're perfectly sure of what you're doing. Lesson learned.

Update (17/03/2010): this has now been included in the man page, which reads:

Note: clients connecting to a --multihome server should always use the --nobind option.