Some notes on macvlan/macvtap

Posted by waldner on 20 March 2014, 9:16 am

There's not a lot of documentation about these interfaces. Here are some notes to summarize what I've been able to gather so far. Surely there's more to it (corrections and/or more information welcome).

macvlan

Macvlan interfaces can be seen as subinterfaces of a main ethernet interface. Each macvlan interface has its own MAC address (different from that of the main interface) and can be assigned IP addresses just like a normal interface.

So with this it's possible to have multiple IP addresses, each with its own MAC address, on the same physical interface. Applications can then bind specifically to the IP address assigned to a macvlan interface, for example. The physical interface to which the macvlan is attached is often referred to as "the lower device" or "the upper device"; here we'll use the term "lower device".

The main use of macvlan seems to be container virtualization (for example LXC guests can be configured to use a macvlan for their networking and the macvlan interface is moved to the container's namespace), but there are other scenarios, mostly very specific cases, like using virtual MAC addresses (see for example this keepalived feature).

A macvlan interface can work in one of four modes, defined at creation time.

VEPA (Virtual Ethernet Port Aggregator) is the default mode. If the lower device receives data from a macvlan in VEPA mode, this data is always sent "out" to the upstream switch or bridge, even if it's destined for another macvlan on the same lower device. Since macvlans are almost always assigned to virtual machines or containers, this makes it possible to see and manage inter-VM traffic on a real external switch (whereas with normal bridging it would not leave the hypervisor), with all the features provided by a "real" switch. However, at the same time this implies that, for VMs to be able to communicate, the external switch should send back inter-VM traffic to the hypervisor out of the same interface it was received from, something an ordinary switch does not normally do (a standard bridge never forwards a frame back out of the port it arrived on). This feature (the so-called "hairpin mode" or "reflective relay") isn't widely supported yet, which means that if using VEPA mode with an ordinary switch, inter-VM traffic leaves the hypervisor but never comes back (unless it's sent back at the IP level by a router somewhere, but then there's nothing special about that, it has always worked that way).
Since there are few switches supporting hairpin mode, VEPA mode isn't used all that much yet. However, it's worth mentioning that Linux's own internal bridge implementation does support hairpin mode in recent versions; assuming eth0 is a port of br0, hairpin mode can be enabled by doing
```
# echo 1 > /sys/class/net/br0/brif/eth0/hairpin_mode
```
or using a recent version of brctl:
```
# brctl hairpin br0 eth0 on
```
or even better, using the bridge program that comes with recent versions of iproute2:
```
# bridge link set dev eth0 hairpin on
```
So a Linux box could very well be used in the role of "external switch" as mentioned above.
Bridge mode: this works almost like a traditional bridge, in that data received on a macvlan in bridge mode and destined for another macvlan of the same lower device is sent directly to the target (if the target macvlan is also in bridge mode), rather than being sent outside. This of course works well with non-hairpin switches, and inter-VM traffic has better performance than VEPA mode, since the external round-trip is avoided. In the words of a kernel developer,

The macvlan is a trivial bridge that doesn't need to do learning as it
knows every mac address it can receive, so it doesn't need to implement
learning or stp. Which makes it simple stupid and and fast.
Private mode: this is essentially like VEPA mode, but with the added feature that no macvlans on the same lower device can communicate, regardless of where the packets come from (so even if inter-VM traffic is sent back by a hairpin switch or an IP router, the target macvlan is prevented from receiving it). I haven't tried, but I suppose that it is the operating mode of the target macvlan that determines whether it receives the traffic or not. This mode is useful, of course, if we really want macvlan isolation.
Passthru mode: this mode was added later, to work around some limitation of macvlans (more details here). I'm not 100% clear on what's the problem passthru mode tries to solve, as I was able to set promiscuous mode, create bridges, vlans and sub-macv{lan,tap} interfaces in KVM guests using a plain macvtap in VEPA mode for their networking (so no need for passthru). Since I'm surely missing something, more information (as usual) is welcome.
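
Going back to the hairpin setting mentioned under VEPA mode: when a Linux bridge plays the "external switch", the flag can also be read back from the same sysfs file that was written to above. A minimal sketch follows; the hairpin_state name and the SYSFS_NET override are made up for this example (the override exists only so the function can be tried against a fake directory tree).

```shell
#!/bin/sh
# Sketch: report whether hairpin mode is on for a bridge port by reading
# the hairpin_mode file under sysfs.
# SYSFS_NET is a hypothetical override used by this example only.
hairpin_state() {
    bridge=$1
    port=$2
    file="${SYSFS_NET:-/sys/class/net}/$bridge/brif/$port/hairpin_mode"
    if [ ! -r "$file" ]; then
        # the port is not part of that bridge, or the file is missing
        echo "unknown"
        return 1
    fi
    case "$(cat "$file")" in
        1) echo "on" ;;
        0) echo "off" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

With the bridge from the example above it would be called as "hairpin_state br0 eth0".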

VEPA, bridge and private modes come from a standard called EVB (edge virtual bridging); a good article which provides more information can be found here.

Curiously (at least, in the case of the three original operating modes), the operating mode is per-macvlan interface rather than global (per-physical device); I guess that it's then more or less mandatory to configure all the macvlans of the same lower device to operate in the same mode, or at least match the macvlan modes so that only intended inter-VM traffic is possible; not sure what would happen, for instance, if a macvlan using VEPA mode tries to communicate with another one using bridge mode, or vice versa. This may well be worth investigating.

Irrespective of the mode used for the macvlan, there's no connectivity from whatever uses the macvlan (eg a container) to the lower device. This is by design, and is due to the way macvlan interfaces "hook into" their physical interface. If communication with the host is needed, the solution is kind of easy: just create another macvlan on the host on the same lower device, and use this to communicate with the guest.
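
The host-side macvlan just described could be set up roughly as follows. This is only a sketch under a few assumptions: the guests' macvlans are in bridge mode too (so traffic between macvlans of the same lower device is delivered directly), and the interface name and address in the usage line are placeholders. The DRY_RUN switch and the function names are this example's own convention; with DRY_RUN=1 the commands are printed instead of executed, which is handy for checking them before running as root.

```shell
#!/bin/sh
# Sketch: give the host its own macvlan on the same lower device,
# so that it can talk to guests using macvlans on that device.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: host_macvlan LOWER NAME CIDR   (needs root unless DRY_RUN=1)
host_macvlan() {
    lower=$1
    name=$2
    cidr=$3
    run ip link add link "$lower" name "$name" type macvlan mode bridge &&
    run ip addr add "$cidr" dev "$name" &&
    run ip link set "$name" up
}
```

For example, "DRY_RUN=1 host_macvlan eth0 macvlan-host 192.168.1.250/24" prints the three ip commands without touching the system.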

The documentation of iproute2 about setting operating mode for macvlans isn't complete, since neither "ip link help" nor the man pages mention how to do that. Fumbling around a bit, it can be seen that the syntax is

```
# ip link add link eth2 macvlan2 type macvlan mode aaa    # hit enter here to force an error message
Error: argument of "mode" must be "private", "vepa", "bridge" or "passthru"
```

Even more undocumented (if possible) is the way to show the operating mode of a macvlan, which turns out to be

```
# ip -d link show macvlan2
27: macvlan2@eth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT
    link/ether 26:8a:3c:07:7d:f4 brd ff:ff:ff:ff:ff:ff
    macvlan  mode vepa
```

Let's hope that all this appears in the documentation soon.
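
In the meantime, a script that needs the operating mode can pick it out of the "ip -d link show" output. A minimal sketch, assuming the output keeps the "macvlan  mode vepa" layout shown above (other iproute2 versions may print it differently); macvlan_mode is a made-up helper name:

```shell
#!/bin/sh
# Sketch: read "ip -d link show IFACE" output on stdin and print the
# macvlan/macvtap operating mode. Exits non-zero if no mode is found.
macvlan_mode() {
    awk '{
        # look for the words "macvlan mode X" or "macvtap mode X"
        for (i = 1; i <= NF - 2; i++)
            if (($i == "macvlan" || $i == "macvtap") && $(i + 1) == "mode") {
                print $(i + 2)
                found = 1
                exit
            }
    }
    END { exit !found }'
}
```

It would be used as "ip -d link show macvlan2 | macvlan_mode".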

The MAC address of the macvlan is normally autogenerated; to explicitly specify one, the following syntax can be used (which also specifies custom name and operating mode at the same time):

```
# ip link add link eth2 FOOMACVLAN address 56:61:4f:7c:77:db type macvlan mode bridge
```
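
When many macvlans with explicit addresses are created from a script, the address can be generated rather than hardcoded. The following is only a sketch, assuming od and /dev/urandom are available; random_mac is a made-up helper name. It forces the first octet to be unicast and locally administered, which is also true of the 56:61:4f:7c:77:db address used above.

```shell
#!/bin/sh
# Sketch: print a random unicast, locally administered MAC address.
random_mac() {
    # six random bytes as hex pairs; word splitting is intended here
    set -- $(od -An -N6 -tx1 /dev/urandom)
    # first octet: clear bit 0 (multicast), set bit 1 (locally administered)
    first=$(( (0x$1 & 0xfe) | 0x02 ))
    printf '%02x:%s:%s:%s:%s:%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}
```

The result can be passed straight to the address argument, eg ip link add link eth2 FOOMACVLAN address "$(random_mac)" type macvlan mode bridge.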

Final note, it's also possible to create a macvlan interface and bridge it (eg brctl addif br0 macvlan2); though it's a bit weird, it does work fine.

macvtap interfaces

A macvtap is a virtual interface based on macvlan (thus tied to another interface) that is only vaguely similar to a regular tap interface: a program can attach to it and read/write frames, but the similarities end there. The most prominent user of macvtap interfaces seems to be libvirt/KVM, which allows guests to be connected to macvtap interfaces. Doing so allows for (almost) bridged-like behavior of guests but without the need to have a real bridge on the host, as a regular ethernet interface can be used as the macvtap's lower device.

Some notes about macvtap (more information is always welcome):

Since it's based on macvlan, macvtap shares the same operating modes it can be in (VEPA, bridge, private and passthru).
Similarly, a guest using a macvtap interface cannot communicate directly with its lower device in the host. In fact, if you run tcpdump on the macvtap interface on the host, no traffic will be seen. Again this is by design, but can be surprising. This link has some details and suggests workarounds for KVM in case this functionality is needed. A quick workaround is to create a macvlan (not macvtap) interface on the host, which will then be visible from the guests. (On a side note, this is also a way to use routed mode for the macvtap guests: put the host's macvlan and all guests on the same IP subnet, configure the guests to use the host macvlan's IP as their default gateway, and have the host do NAT between the macvlan and the physical interface. But then, in this case, it's probably easier to use a real bridge).
Creation of a macvtap interface is not done by opening /dev/net/tun; instead, it looks like the only way to create one is to directly send appropriate messages to the kernel via a netlink socket (at least, that's how iproute2 and libvirt do it; strace and/or the source will show the details, as there seems to be no documentation whatsoever). This makes it a bit more complicated than a normal tun/tap interface.
macvtap interfaces are persistent by default. Once the macvtap interface has been created via netlink, an actual character device file appears under /dev (this does not happen with normal tap interfaces). The device file is called /dev/tapNN, where NN is the interface index of the macvtap (can be seen for example with "ip link show"). It's this device file that has to be opened by programs wanting to use the interface (eg libvirtd/qemu to connect a guest).
One consequence of there being an actual device file for the macvtap interface is that traffic entering the interface can be seen and "stolen" from the intended recipient by simply reading from the device file; doing "cat /dev/tap22" (for example) while a guest VM is using it dumps the raw ethernet frames and prevents the VM from seeing them. On the other hand, neither seeing outgoing traffic nor injecting frames by writing to the device file from the outside seems to be possible.
If a VM is connected to the macvtap, the MAC address of the macvtap interface as seen on the host is the same that is seen by the guest; this is different from regular tap interfaces, where the guest is somehow "behind" the tap interface (the vnetX interfaces on the host have a MAC address which is not the same that the guest uses).
All traffic for guests connected to a macvtap does show up if running tcpdump on the lower device, even in bridge mode and for guest-to-guest traffic. However, as said, tcpdump (on the host) on the macvtap device itself shows no traffic.
If the lower device is a wireless card, macvtap doesn't work (the guest is isolated, nothing enters, nothing exits). Perhaps it's just that it only works with some wireless cards, and I happened to have one that doesn't work. Again, I could not find more information.

As said, creating a macvtap interface via code is a bit complicated, but luckily iproute2 can do it on the command line. To create a macvtap interface called macvtap2, with eth2 as its lower physical interface:

```
# ip link add link eth2 macvtap2 address 00:22:33:44:55:66 type macvtap mode bridge
# ip link set macvtap2 up
# ip link show macvtap2
18: macvtap2@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT qlen 500
    link/ether 00:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff
# ls -l /dev/tap18
crw------- 1 root root 250, 1 May 26 10:51 /dev/tap18
```
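
Since the device file name is derived from the interface index (18 above, hence /dev/tap18), a script can work it out from the ifindex file under /sys/class/net instead of parsing "ip link" output. A minimal sketch; macvtap_dev is a made-up helper name, and the SYSFS_NET override is hypothetical and only there so that the function can be tried against a fake directory tree:

```shell
#!/bin/sh
# Sketch: print the character device that belongs to a macvtap interface,
# based on the interface index found in sysfs.
# SYSFS_NET is a hypothetical override used by this example only.
macvtap_dev() {
    idx_file="${SYSFS_NET:-/sys/class/net}/$1/ifindex"
    if [ ! -r "$idx_file" ]; then
        echo "no such interface: $1" >&2
        return 1
    fi
    echo "/dev/tap$(cat "$idx_file")"
}
```

For the interface created above, "macvtap_dev macvtap2" would print /dev/tap18.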

To delete the interface, the usual command can be used:

```
# ip link del macvtap2
```

Two links which provide good information about macvtap:
http://seravo.fi/2012/virtualized-bridged-networking-with-macvtap
http://virt.kernelnewbies.org/MacVTap.

Filed under linux, networking, shell, tips Tagged iproute2, macvlan, macvtap


8 Comments

Tom Davies says:

May 23, 2018 at 21:27

Hi :)
I read somewhere that wireless doesn't work because frames coming in are then found to have a 'wrong' mac address. The frames contain a mac address of the physical device but appear to have come in through the macvlan's mac address. This is seen as suspicious and the frame gets dropped.

I have no idea if i've understood that correctly or made it comprehensible. I had no understanding at all about networking until very recently and i may have gone backwards from there. The article might have been talking specifically about Wireless Access Points.

I hope this helps or is ignorable!
Regards from
Tom Davies
- waldner says:
  
  May 24, 2018 at 13:29
  
  Looks plausible. Thanks for sharing!
  - Tom Davies says:
    
    May 25, 2018 at 10:32
    
    Thanks :)) Also thanks for an excellent article! I can't find much documentation about macvlan and this is the only article i've seen about macvtap. So this is still the most recent article about these 'new' technologies. It also seems one of the best articles on virtual networking generally.
    Many thanks and regards from Tom :)
Johannes Ernst says:

June 12, 2015 at 19:19

Thanks for documenting this, this is very useful.
Rami Rosen says:

July 18, 2014 at 16:56

Typo:
EVP (edge virtual bridging)
should be:
EVB (edge virtual bridging)

regards,
Rami Rosen
http://ramirose.wix.com/ramirosen
- waldner says:
  
  July 18, 2014 at 17:04
  
  Fixed, thanks.
Al Fansome says:

July 14, 2014 at 21:06

| Creation of a macvtap interface is not done by opening /dev/net/tun; instead, it looks like the only way to create one is to directly send appropriate messages to the kernel via a netlink socket

To create /dev/tapNNN from a shell script:

# ip link add link eth1 name eth1_macvtap type macvtap
# ls /dev/tap*

For scripting, to get the device name:

echo /dev/tap$(< /sys/class/net/eth1_macvtap/ifindex)
- waldner says:
  
  July 14, 2014 at 23:52
  
  It's explained in the text too. Btw, the interface index is also shown in the output of "ip link" or "ip address".
