Linux Network Devices – Bridge & Veth Pair

We continue our tour of common Linux network devices. This post covers the Linux bridge and the veth pair; understanding these two devices will make container networking much easier to follow later on. We have introduced other network devices before; if you are interested, you can read the related post:

  • Linux Network Devices – TUN/TAP

Communication between the two ends of a veth pair

Let’s take a look at the veth pair first. It is a virtual network device that always appears in pairs; you can simply picture it as two network cards connected to each other by a network cable.
[Figure: veth pair]

A veth pair can be created with the ip command. We create veth0 and veth1, bind an IP to each, and bring both to the UP state:

$ ip link add veth0 type veth peer name veth1

$ ip addr add 10.1.1.100/24 dev veth0
$ ip addr add 10.1.1.101/24 dev veth1

$ ip link set veth0 up
$ ip link set veth1 up

Checking the device status with ip a shows that veth0 and veth1 have been created and their IPs bound correctly:

$ ip a
...
20: veth1@veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 36:c9:5b:1a:6d:9b brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.101/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::34c9:5bff:fe1a:6d9b/64 scope link
       valid_lft forever preferred_lft forever
21: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c6:5e:14:55:f5:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.100/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c45e:14ff:fe55:f5b1/64 scope link
       valid_lft forever preferred_lft forever
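
Incidentally, if ethtool is installed, you can confirm which interface a veth device is paired with by looking at its peer_ifindex statistic; here it should point at veth1's index, 20 (output trimmed):

$ ethtool -S veth0
NIC statistics:
     peer_ifindex: 20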

Let’s try setting one end of the veth pair to DOWN and then check the device status again. Below, veth1 is set to DOWN, and veth0 then enters the M-DOWN (LOWERLAYERDOWN) state. Because the two devices of a veth pair face each other, once one end is DOWN it can no longer receive data, so the other end cannot work properly either; the kernel automatically marks it M-DOWN to keep the two ends consistent.

$ ip link set veth1 down
$ ip a
20: veth1@veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 36:c9:5b:1a:6d:9b brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.101/24 scope global veth1
       valid_lft forever preferred_lft forever
21: veth0@veth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether c6:5e:14:55:f5:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.100/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c45e:14ff:fe55:f5b1/64 scope link
       valid_lft forever preferred_lft forever
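
Before continuing, bring veth1 back UP so that both ends are usable again. A quick check with the brief output of ip link (exact column spacing may vary):

$ ip link set veth1 up
$ ip -br link | grep veth
veth1@veth0      UP             36:c9:5b:1a:6d:9b <BROADCAST,MULTICAST,UP,LOWER_UP>
veth0@veth1      UP             c6:5e:14:55:f5:b1 <BROADCAST,MULTICAST,UP,LOWER_UP>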

We said that veth0 and veth1 are logically connected by a network cable, so in theory it should be possible to ping veth1 through veth0. Let’s try:

$ ping -c 1 -I veth0 10.1.1.101
PING 10.1.1.101 (10.1.1.101) from 10.1.1.100 veth0: 56(84) bytes of data.
From 10.1.1.100 icmp_seq=1 Destination Host Unreachable
--- 10.1.1.101 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

As you can see, the ping does not go through. Let’s capture packets on veth0 and take a look:

$ tcpdump -n -i veth0
14:41:23.473479 ARP, Request who-has 10.1.1.101 tell 10.1.1.100, length 28
14:41:24.496485 ARP, Request who-has 10.1.1.101 tell 10.1.1.100, length 28
14:41:25.520479 ARP, Request who-has 10.1.1.101 tell 10.1.1.100, length 28

The reason is that although veth0 and veth1 are on the same network segment, this is their first communication, so there is no MAC entry for 10.1.1.101 in the ARP table yet. An ARP Request has to be sent first, but it gets no response. This is caused by some default kernel configuration restrictions on Ubuntu. Let’s loosen the restrictions first:

$ echo 1 > /proc/sys/net/ipv4/conf/veth0/accept_local
$ echo 1 > /proc/sys/net/ipv4/conf/veth1/accept_local
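
The same settings can also be applied with sysctl (and written to a file under /etc/sysctl.d/ if you want them to survive a reboot). If the ARP requests are still dropped on your system, the reverse-path filter (net.ipv4.conf.*.rp_filter) may also need to be relaxed; that depends on your distribution's defaults.

$ sysctl -w net.ipv4.conf.veth0.accept_local=1
$ sysctl -w net.ipv4.conf.veth1.accept_local=1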

Now you can ping:

$ ping -c 1 -I veth0 10.1.1.101
PING 10.1.1.101 (10.1.1.101) from 10.1.1.100 veth0: 56(84) bytes of data.
64 bytes from 10.1.1.101: icmp_seq=1 ttl=64 time=0.140 ms
--- 10.1.1.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms

Let’s take a look at the packet capture process at veth0 and veth1:

# shell-0
$ tcpdump -n -i veth0 icmp
14:54:33.638414 IP 10.1.1.100 > 10.1.1.101: ICMP echo request, id 12, seq 1, length 64

# shell-1
$ tcpdump -n -i veth1 icmp
14:54:33.638420 IP 10.1.1.100 > 10.1.1.101: ICMP echo request, id 12, seq 1, length 64

# shell-2
$ ping -c 1 -I veth0 10.1.1.101
PING 10.1.1.101 (10.1.1.101) from 10.1.1.100 veth0: 56(84) bytes of data.
64 bytes from 10.1.1.101: icmp_seq=1 ttl=64 time=0.043 ms
--- 10.1.1.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms

From the tiny difference between the timestamps in shell-0 and shell-1, we can see that when the ping command runs, veth0 receives the ICMP Request first and immediately hands it to its peer veth1. However, no ICMP Reply shows up on either device, even though ping reports a successful reply.
The reason is actually very simple: the ICMP Reply goes out through the loopback interface.

$ tcpdump -n -i lo icmp
14:54:33.638441 IP 10.1.1.101 > 10.1.1.100: ICMP echo reply, id 12, seq 1, length 64

Let’s take a look at the flow of data packets:
[Figure: veth pair]

The process is roughly as follows:

  1. First, the ping program constructs an ICMP Request and sends it to the kernel’s network protocol stack through the Socket API;
  2. In ping, we specify the veth0 network card through -I veth0, so the protocol stack will hand the data packet to veth0;
  3. Since there is a logical network cable between veth0 and veth1, the data packet will be delivered directly to veth1;
  4. veth1 does not process the data packet after receiving it, and transfers it to the kernel protocol stack;
  5. After receiving the data packet, the kernel protocol stack sees that the destination 10.1.1.101 is a local IP, so it constructs an ICMP Reply addressed back to 10.1.1.100. Looking up the routing table, it finds that 10.1.1.100 is also a local address, so the reply should go out through the loopback interface (ip route show table local);
  6. After loopback receives the ICMP Reply, it forwards it to the kernel protocol stack;
  7. Finally, the protocol stack hands the data packet to the ping process, and ping successfully receives the ICMP Reply packet;

When configuring IPs for veth0 and veth1, the kernel will automatically add routes in the local table:

$ ip route show table local
local 10.1.1.100 dev veth0 proto kernel scope host src 10.1.1.100
local 10.1.1.101 dev veth1 proto kernel scope host src 10.1.1.101
broadcast 10.1.1.255 dev veth1 proto kernel scope link src 10.1.1.101
broadcast 10.1.1.255 dev veth0 proto kernel scope link src 10.1.1.100
...
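
We can also ask the kernel directly which device it would pick for 10.1.1.101; because of the local table entry, the answer is the loopback interface (output format varies slightly between iproute2 versions):

$ ip route get 10.1.1.101
local 10.1.1.101 dev lo src 10.1.1.101
    cache <local>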

Network between host and container

We know that namespaces are widely used in containers to achieve isolation between containers. The same is true for container networks. Let’s first take a look at the network devices under the host’s default namespace:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:cb:f0:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.31.92/24 brd 192.168.31.255 scope global dynamic noprefixroute enp1s0
       valid_lft 42114sec preferred_lft 42114sec
    inet6 fe80::b5fc:b1f8:2b4:a62d/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

Then we create a netns named ns1 and check the network devices inside it:

$ ip netns add ns1
$ ip netns exec ns1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

You can see that in ns1, there is only one lo device created by default. In addition, all routing rules, firewall rules, etc. are also completely isolated from the default namespace.
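
A quick way to see this isolation is to dump the routes and routing policy inside ns1: the main routing table is still empty and only the default policy rules exist (a minimal check; exact output depends on the iproute2 version):

$ ip netns exec ns1 ip route
$ ip netns exec ns1 ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default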

At this time, if we want to connect the network of ns1 to the host, we can consider using veth pair:

# Create namespace `ns1` (skip if it already exists from the previous step)
$ ip netns add ns1

# Create veth pair
$ ip link add veth0 type veth peer name veth1

# Set netns of `veth1` to `ns1`
$ ip link set veth1 netns ns1

# Initialize the settings of `veth0` and `veth1` respectively
$ ip addr add 10.1.1.100/24 dev veth0
$ ip link set veth0 up
$ ip netns exec ns1 ip addr add 10.1.1.101/24 dev veth1
$ ip netns exec ns1 ip link set veth1 up

At this time, check the default namespace and ns1 network devices. You can see that the veth0 and veth1 devices are configured normally:

$ ip a
...
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:cb:f0:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.31.92/24 brd 192.168.31.255 scope global dynamic noprefixroute enp1s0
       valid_lft 41661sec preferred_lft 41661sec
    inet6 fe80::b5fc:b1f8:2b4:a62d/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
10: veth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c6:5e:14:55:f5:b1 brd ff:ff:ff:ff:ff:ff link-netns ns1
    inet 10.1.1.100/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c45e:14ff:fe55:f5b1/64 scope link
       valid_lft forever preferred_lft forever

$ ip netns exec ns1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
9: veth1@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 36:c9:5b:1a:6d:9b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.1.101/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::34c9:5bff:fe1a:6d9b/64 scope link
       valid_lft forever preferred_lft forever

Then, from inside ns1, try to ping veth0 in the default namespace through the veth1 interface, while capturing packets on veth0:

# shell-0
$ ip netns exec ns1 ping -c 1 10.1.1.100
PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data.
64 bytes from 10.1.1.100: icmp_seq=1 ttl=64 time=0.143 ms
--- 10.1.1.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 1026ms

# shell-1
$ tcpdump -n -i veth0 icmp
08:39:52.182061 IP 10.1.1.101 > 10.1.1.100: ICMP echo request, id 44179, seq 1, length 64
08:39:52.182131 IP 10.1.1.100 > 10.1.1.101: ICMP echo reply, id 44179, seq 1, length 64

Thanks to the nature of the veth pair, veth0 and veth1 can communicate normally even though they sit in different namespaces, and this time we can see both the complete ICMP Request and the ICMP Reply on the veth0 interface.

Remember how, in the earlier example, the ICMP Reply was sent through the loopback interface? In the current scenario veth0 and veth1 are in different namespaces, so when the default namespace builds the reply, its lookup in the local routing table no longer finds 10.1.1.101 (veth1’s address) as a local address, and the reply is therefore sent back out through veth0. We can look at the local routing tables of the two namespaces to verify this:

$ ip route show table local
local 10.1.1.100 dev veth0 proto kernel scope host src 10.1.1.100
broadcast 10.1.1.255 dev veth0 proto kernel scope link src 10.1.1.100
...
$ ip netns exec ns1 ip route show table local
local 10.1.1.101 dev veth1 proto kernel scope host src 10.1.1.101
broadcast 10.1.1.255 dev veth1 proto kernel scope link src 10.1.1.101

Network between containers

We just saw that the host’s default namespace and the container’s namespace ns1 can be connected through a veth pair. Logically, it is as if the host and the container were two independent hosts joined by a network cable.

Normally a host runs more than one container; the host-to-container relationship is one-to-many. Connecting containers to each other could also be done with veth pairs alone, but the configuration quickly becomes complicated, so here we introduce a bridge instead.

The Linux bridge is a virtual Ethernet bridge provided by the kernel. In principle it is similar to a physical switch and works at layer 2. We create a bridge on the host and then plug each container into it through a veth pair, so that containers can talk to each other.
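
Below we use brctl, which comes from the bridge-utils package. On systems without it, the same bridge can be created and ports attached with iproute2 alone; a rough equivalent looks like this (shown only as an alternative, the rest of the article sticks with brctl):

# create and enable the bridge using iproute2 only
$ ip link add name br0 type bridge
$ ip link set br0 up
# attaching a port <dev> is the counterpart of `brctl addif br0 <dev>`
$ ip link set <dev> master br0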

First create and enable bridge br0:

$ brctl addbr br0
$ ip link set br0 up

Then simulate the container network through namespace:

# `ns0` is used as the container network, put one end of the veth pair into `ns0`, and the other end into the switch `br0`
$ ip netns add ns0
$ ip link add veth0 type veth peer name veth0_br
$ ip link set veth0 netns ns0
$ ip netns exec ns0 ip addr add 10.1.1.100/24 dev veth0
$ ip link set veth0_br up
$ ip netns exec ns0 ip link set veth0 up
$ brctl addif br0 veth0_br

# `ns1` is used as the container network, put one end of the veth pair into `ns1`, and the other end into the switch `br0`
$ ip netns add ns1
$ ip link add veth1 type veth peer name veth1_br
$ ip link set veth1 netns ns1
$ ip netns exec ns1 ip addr add 10.1.1.101/24 dev veth1
$ ip link set veth1_br up
$ ip netns exec ns1 ip link set veth1 up
$ brctl addif br0 veth1_br
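
Before testing connectivity, you can confirm that both veth peers are attached to the bridge:

$ brctl show br0
# the "interfaces" column should list veth0_br and veth1_br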

You can see that the two container networks can reach each other normally:

$ ip netns exec ns0 ping -c1 10.1.1.101
PING 10.1.1.101 (10.1.1.101) 56(84) bytes of data.
64 bytes from 10.1.1.101: icmp_seq=1 ttl=64 time=0.058 ms
--- 10.1.1.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
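
Like a physical switch, the bridge learns which MAC address sits behind which port as traffic flows through it; brctl showmacs dumps that forwarding table (entries marked "is local?" are the bridge ports themselves):

$ brctl showmacs br0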

Container accesses external network

In the above example, container networks in different namespaces can be connected to each other through veth pairs and bridges. However, if the container wants to access the external network, some additional configuration is required.

A word about my network environment: the Ubuntu machine used for testing is plugged into a router whose IP is 192.168.31.1 and whose DHCP segment is 192.168.31.0/24. To let the container network reach the external network, we use a simple approach here: give the container’s network card an IP in the same segment as the physical network card.

First create bridge br0:

$ brctl addbr br0
$ ip link set br0 up

Then create the namespace ns0 as the container network, create the veth pair veth0 and veth0_br, put one end into ns0 and plug the other end into the bridge br0 (note that the IP 192.168.31.100 configured for veth0 is in the physical network’s segment).

$ ip netns add ns0

$ ip link add veth0 type veth peer name veth0_br
$ ip link set veth0 netns ns0
$ ip netns exec ns0 ip addr add 192.168.31.100/24 dev veth0
$ ip link set veth0_br up
$ ip netns exec ns0 ip link set veth0 up

$ brctl addif br0 veth0_br

There is one more important piece of configuration: set the default route in ns0 to use the physical network’s gateway, 192.168.31.1.

$ ip netns exec ns0 ip route add default via 192.168.31.1 dev veth0
$ ip netns exec ns0 route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.31.1    0.0.0.0         UG    0      0        0 veth0
192.168.31.0    0.0.0.0         255.255.255.0   U     0      0        0 veth0
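
route -n comes from the legacy net-tools package; the equivalent iproute2 check inside ns0 would look roughly like this:

$ ip netns exec ns0 ip route
default via 192.168.31.1 dev veth0
192.168.31.0/24 dev veth0 proto kernel scope link src 192.168.31.100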

At this point veth0_br has been plugged into br0, but that alone is not enough to reach the external network. We also need to move the host’s IP to the bridge and plug the physical network card into it; logically, the physical router, the physical network card, veth0_br and veth0 then all sit on the same layer-2 network.

$ ip addr add 192.168.31.92/24 dev br0
$ ip addr del 192.168.31.92/24 dev enp1s0
$ brctl addif br0 enp1s0
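
One caveat: deleting the address from enp1s0 also removes the routes that were attached to it, including the host’s own default route, and connectivity drops briefly, so run these commands from a console (or chained on one line) rather than over SSH. Afterwards the host’s default route has to be re-added through br0, assuming 192.168.31.1 is still the gateway:

$ ip route add default via 192.168.31.1 dev br0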

[Figure: veth pair]

With this configuration, the container can access the external network normally.

$ ip netns exec ns0 ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=46 time=37.3 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 2003ms

Today we introduced how to build a single-node container network with a bridge plus veth pairs; we will cover the cross-node scenario in a later post.

If you have any questions about the content of the article or have other technical exchanges, you can follow my official account: Li Ruonian