Continuing our series on common Linux network devices, today we cover the Linux bridge and the veth pair. Understanding these two devices will make containerized networking much easier to follow later on. We have introduced other network devices before; if you are interested, see:
- Linux network devices – TUN/TAP
## Communication between the two ends of a veth pair
Let's take a look at the veth pair first. This is a virtual network device that always appears in pairs; you can simply picture it as two network cards connected by a network cable. A veth pair can be created with the `ip` command. We create `veth0` and `veth1`, bind an IP to each, and set them to the UP state:

```
$ ip link add veth0 type veth peer name veth1
$ ip addr add 10.1.1.100/24 dev veth0
$ ip addr add 10.1.1.101/24 dev veth1
$ ip link set veth0 up
$ ip link set veth1 up
```
Check the device status with `ip a`; `veth0` and `veth1` have been created and bound to their IPs normally:

```
$ ip a
...
20: veth1@veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 36:c9:5b:1a:6d:9b brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.101/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::34c9:5bff:fe1a:6d9b/64 scope link
       valid_lft forever preferred_lft forever
21: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c6:5e:14:55:f5:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.100/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c45e:14ff:fe55:f5b1/64 scope link
       valid_lft forever preferred_lft forever
```
Let's try setting one end of the veth pair to DOWN and then check the device status again. You can see that I only set `veth1` to DOWN, but `veth0` also entered the M-DOWN state. Because the two devices of a veth pair face each other, when `veth1` is set to DOWN it can no longer receive data, and the other end `veth0` cannot work properly either, so the kernel automatically marks it M-DOWN to keep the two ends consistent.

```
$ ip link set veth1 down
$ ip a
20: veth1@veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 36:c9:5b:1a:6d:9b brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.101/24 scope global veth1
       valid_lft forever preferred_lft forever
21: veth0@veth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000
    link/ether c6:5e:14:55:f5:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.100/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c45e:14ff:fe55:f5b1/64 scope link
       valid_lft forever preferred_lft forever
```
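Bringing `veth1` back up clears the NO-CARRIER flag on `veth0`, and both ends return to the UP state:

```
$ ip link set veth1 up
```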
We said that `veth0` and `veth1` are logically connected by a network cable, so in theory it should be possible to ping `veth1` through `veth0`. Let's try:

```
$ ping -c 1 -I veth0 10.1.1.101
PING 10.1.1.101 (10.1.1.101) from 10.1.1.100 veth0: 56(84) bytes of data.
From 10.1.1.100 icmp_seq=1 Destination Host Unreachable

--- 10.1.1.101 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
```
The ping does not go through. Let's capture packets on `veth0` and take a look:

```
$ tcpdump -n -i veth0
14:41:23.473479 ARP, Request who-has 10.1.1.101 tell 10.1.1.100, length 28
14:41:24.496485 ARP, Request who-has 10.1.1.101 tell 10.1.1.100, length 28
14:41:25.520479 ARP, Request who-has 10.1.1.101 tell 10.1.1.100, length 28
```
The reason is that although `veth0` and `veth1` are on the same network segment, this is their first communication, so there is no MAC entry for `10.1.1.101` in the ARP table yet; an ARP Request has to be sent first, but it receives no response. This is caused by some default configuration restrictions in the Ubuntu kernel. Let's loosen those restrictions first:

```
$ echo 1 > /proc/sys/net/ipv4/conf/veth0/accept_local
$ echo 1 > /proc/sys/net/ipv4/conf/veth1/accept_local
```
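On some systems `accept_local` alone is not enough: strict reverse-path filtering can also drop these locally originated packets. If the ping below still fails for you, relaxing `rp_filter` on the interfaces involved is a common companion step (an extra knob depending on your sysctl defaults, not something the original setup needed):

```
# Only needed if strict reverse-path filtering (rp_filter=1) is enabled
$ echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
$ echo 0 > /proc/sys/net/ipv4/conf/veth0/rp_filter
$ echo 0 > /proc/sys/net/ipv4/conf/veth1/rp_filter
```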
Now you can ping:

```
$ ping -c 1 -I veth0 10.1.1.101
PING 10.1.1.101 (10.1.1.101) from 10.1.1.100 veth0: 56(84) bytes of data.
64 bytes from 10.1.1.101: icmp_seq=1 ttl=64 time=0.140 ms

--- 10.1.1.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
```
Let's look at the packet captures on `veth0` and `veth1`:

```
# shell-0
$ tcpdump -n -i veth0 icmp
14:54:33.638414 IP 10.1.1.100 > 10.1.1.101: ICMP echo request, id 12, seq 1, length 64

# shell-1
$ tcpdump -n -i veth1 icmp
14:54:33.638420 IP 10.1.1.100 > 10.1.1.101: ICMP echo request, id 12, seq 1, length 64

# shell-2
$ ping -c 1 -I veth0 10.1.1.101
PING 10.1.1.101 (10.1.1.101) from 10.1.1.100 veth0: 56(84) bytes of data.
64 bytes from 10.1.1.101: icmp_seq=1 ttl=64 time=0.043 ms

--- 10.1.1.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
```
From the tiny difference between the timestamps in shell-0 and shell-1, we can see that when the ping command runs, `veth0` receives the ICMP Request first and immediately delivers it to its peer `veth1`. However, no ICMP Reply is seen on either device, even though the ping clearly succeeded (`1 packets transmitted, 1 received`).
The reason is actually very simple: the ICMP Reply goes through the loopback interface.

```
$ tcpdump -n -i lo icmp
14:54:33.638441 IP 10.1.1.101 > 10.1.1.100: ICMP echo reply, id 12, seq 1, length 64
```
Let's trace the flow of the data packets. The process is roughly as follows:
- First, the ping program constructs an ICMP Request and sends it to the kernel's network protocol stack through the socket API;
- Since we specified the `veth0` interface with `-I veth0`, the protocol stack hands the packet to `veth0`;
- Because there is a logical network cable between `veth0` and `veth1`, the packet is delivered directly to `veth1`;
- `veth1` does not process the packet after receiving it; it simply passes it up to the kernel protocol stack;
- The protocol stack finds that `10.1.1.101` is a local IP, so it constructs an ICMP Reply; after consulting the routing table, it finds that packets to `10.1.1.101` should go through the loopback interface (`ip route show table local`);
- After loopback receives the ICMP Reply, it passes it back to the kernel protocol stack;
- Finally, the protocol stack hands the packet to the ping process, which successfully receives the ICMP Reply.
When IPs were configured for `veth0` and `veth1`, the kernel automatically added routes to the `local` table:

```
$ ip route show table local
local 10.1.1.100 dev veth0 proto kernel scope host src 10.1.1.100
local 10.1.1.101 dev veth1 proto kernel scope host src 10.1.1.101
broadcast 10.1.1.255 dev veth1 proto kernel scope link src 10.1.1.101
broadcast 10.1.1.255 dev veth0 proto kernel scope link src 10.1.1.100
...
```
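We can also ask the kernel directly which device it would use to reach `10.1.1.101`. A quick check with `ip route get` (illustrative output, abbreviated; the exact format varies across iproute2 versions) shows the `local` route resolving to `lo`:

```
$ ip route get 10.1.1.101
local 10.1.1.101 dev lo src 10.1.1.101
    cache <local>
```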
## Network between host and container
We know that containers rely heavily on namespaces to achieve isolation from one another, and the container network is no exception. Let's first look at the network devices in the host's default namespace:

```
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:cb:f0:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.31.92/24 brd 192.168.31.255 scope global dynamic noprefixroute enp1s0
       valid_lft 42114sec preferred_lft 42114sec
    inet6 fe80::b5fc:b1f8:2b4:a62d/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
Then we create a netns `ns1` and check the network devices inside it:

```
$ ip netns add ns1
$ ip netns exec ns1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
```
You can see that inside `ns1` there is only the `lo` device created by default. In addition, all routing rules, firewall rules, and so on are completely isolated from the default namespace.
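Even the loopback device is per-namespace. A quick illustration, assuming a freshly created `ns1` where `lo` is still DOWN (the exact error message and timing may vary with your iputils version):

```
$ ip netns exec ns1 ping -c 1 127.0.0.1
ping: connect: Network is unreachable
$ ip netns exec ns1 ip link set lo up
$ ip netns exec ns1 ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.030 ms
```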
Now, if we want to connect the network of `ns1` to the host, we can use a veth pair:

```
# Create namespace `ns1`
$ ip netns add ns1
# Create the veth pair
$ ip link add veth0 type veth peer name veth1
# Move `veth1` into the netns `ns1`
$ ip link set veth1 netns ns1
# Initialize `veth0` and `veth1` respectively
$ ip addr add 10.1.1.100/24 dev veth0
$ ip link set veth0 up
$ ip netns exec ns1 ip addr add 10.1.1.101/24 dev veth1
$ ip netns exec ns1 ip link set veth1 up
```
Now check the network devices in the default namespace and in `ns1`. You can see that `veth0` and `veth1` are configured normally:

```
$ ip a
...
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:cb:f0:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.31.92/24 brd 192.168.31.255 scope global dynamic noprefixroute enp1s0
       valid_lft 41661sec preferred_lft 41661sec
    inet6 fe80::b5fc:b1f8:2b4:a62d/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
10: veth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c6:5e:14:55:f5:b1 brd ff:ff:ff:ff:ff:ff link-netns ns1
    inet 10.1.1.100/24 scope global veth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c45e:14ff:fe55:f5b1/64 scope link
       valid_lft forever preferred_lft forever

$ ip netns exec ns1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
9: veth1@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 36:c9:5b:1a:6d:9b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.1.101/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::34c9:5bff:fe1a:6d9b/64 scope link
       valid_lft forever preferred_lft forever
```
Then, from `ns1`, try to ping `veth0` in the default namespace through the `veth1` interface, while capturing packets on `veth0`:

```
# shell-0
$ ip netns exec ns1 ping -c 1 10.1.1.100
PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data.
64 bytes from 10.1.1.100: icmp_seq=1 ttl=64 time=0.143 ms

--- 10.1.1.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms

# shell-1
$ tcpdump -n -i veth0 icmp
08:39:52.182061 IP 10.1.1.101 > 10.1.1.100: ICMP echo request, id 44179, seq 1, length 64
08:39:52.182131 IP 10.1.1.100 > 10.1.1.101: ICMP echo reply, id 44179, seq 1, length 64
```
Thanks to the nature of the veth pair, `veth0` and `veth1` can communicate normally even though they are in different namespaces, and on the `veth0` interface we can see both the complete ICMP Request and the ICMP Reply.
Remember how in the earlier example the ICMP Reply went through the loopback interface? In the current scenario, because `veth0` and `veth1` are in different namespaces, when the reply is sent back, the lookup in the `local` routing table no longer treats `10.1.1.101` as a local address, so the packet is returned through `veth0` normally. We can look at the routing tables of the two namespaces to verify this:
```
$ ip route show table local
local 10.1.1.100 dev veth0 proto kernel scope host src 10.1.1.100
broadcast 10.1.1.255 dev veth0 proto kernel scope link src 10.1.1.100
...
$ ip netns exec ns1 ip route show table local
local 10.1.1.101 dev veth1 proto kernel scope host src 10.1.1.101
broadcast 10.1.1.255 dev veth1 proto kernel scope link src 10.1.1.101
```
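And again, `ip route get` in the default namespace confirms the choice (illustrative output; the exact format varies across iproute2 versions). This time the destination resolves to `veth0` rather than `lo`:

```
$ ip route get 10.1.1.101
10.1.1.101 dev veth0 src 10.1.1.100
    cache
```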
## Network between containers
We just saw that the network between the host's default namespace and the container's namespace `ns1` can be opened up with a veth pair; logically, it is as if the host and the container were two independent machines connected by a network cable.
Normally a host runs more than one container, so the host and its containers have a one-to-many relationship. Opening up the network between containers could also be done with veth pairs alone, but the configuration would be rather complicated, so here we introduce the bridge to do it.
Linux's bridge is a virtual Ethernet bridge provided by the kernel. In principle it is similar to a physical switch and works at layer 2. We create a bridge on Linux and then plug each container into the bridge through a veth pair, which gives the containers network connectivity with one another.
First create and enable the bridge `br0`:

```
$ brctl addbr br0
$ ip link set br0 up
```
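As an aside, `brctl` comes from the bridge-utils package; if it is not available, iproute2 alone can manage bridges. An equivalent sketch (where `vethX_br` stands in for the veth ends we attach below):

```
$ ip link add name br0 type bridge
$ ip link set br0 up
# and instead of `brctl addif br0 vethX_br`:
$ ip link set vethX_br master br0
```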
Then simulate the container networks with namespaces (attachment can be verified with `brctl show`, as sketched after this block):

```
# `ns0` is used as a container network: put one end of a veth pair
# into `ns0`, and plug the other end into the switch `br0`
$ ip netns add ns0
$ ip link add veth0 type veth peer name veth0_br
$ ip link set veth0 netns ns0
$ ip netns exec ns0 ip addr add 10.1.1.100/24 dev veth0
$ ip link set veth0_br up
$ ip netns exec ns0 ip link set veth0 up
$ brctl addif br0 veth0_br

# `ns1` is used as another container network: put one end of a veth pair
# into `ns1`, and plug the other end into the switch `br0`
$ ip netns add ns1
$ ip link add veth1 type veth peer name veth1_br
$ ip link set veth1 netns ns1
$ ip netns exec ns1 ip addr add 10.1.1.101/24 dev veth1
$ ip link set veth1_br up
$ ip netns exec ns1 ip link set veth1 up
$ brctl addif br0 veth1_br
```
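Before pinging, we can verify that both veth ends are attached to the bridge. The output below is illustrative (the bridge id on your machine will differ):

```
$ brctl show br0
bridge name     bridge id               STP enabled     interfaces
br0             8000.3a1f4c9d2e6b       no              veth0_br
                                                        veth1_br
```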
You can see that the two container networks can reach each other normally:

```
$ ip netns exec ns0 ping -c 1 10.1.1.101
PING 10.1.1.101 (10.1.1.101) 56(84) bytes of data.
64 bytes from 10.1.1.101: icmp_seq=1 ttl=64 time=0.058 ms

--- 10.1.1.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
```
## Container accesses the external network
In the example above, container networks in different namespaces can reach each other through veth pairs and a bridge. However, if a container wants to access the external network, some additional configuration is required.
Let me first describe my network environment. The Ubuntu machine I use for testing is plugged into a router whose IP is `192.168.31.1`, with a DHCP segment of `192.168.31.0/24`. To let the container network communicate with the external network, we use a simpler method here: give the container's network card an IP in the same segment as the physical network card.
First create the bridge `br0`:

```
$ brctl addbr br0
$ ip link set br0 up
```
Then create the namespace `ns0` as the container network, create the veth pair `veth0` and `veth0_br`, put one end into `ns0`, and plug the other end into the bridge `br0` (note that the IP `192.168.31.100` I configure on `veth0` is in the physical network's segment):

```
$ ip netns add ns0
$ ip link add veth0 type veth peer name veth0_br
$ ip link set veth0 netns ns0
$ ip netns exec ns0 ip addr add 192.168.31.100/24 dev veth0
$ ip link set veth0_br up
$ ip netns exec ns0 ip link set veth0 up
$ brctl addif br0 veth0_br
```
There is another important piece of configuration: set the gateway of the default route in `ns0` to the physical network's gateway `192.168.31.1`:

```
$ ip netns exec ns0 ip route add default via 192.168.31.1 dev veth0
$ ip netns exec ns0 route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.31.1    0.0.0.0         UG    0      0        0 veth0
192.168.31.0    0.0.0.0         255.255.255.0   U     0      0        0 veth0
```
At this point the `veth0_br` interface is plugged into `br0`, but as things stand we still cannot reach the external network. We also need to plug the physical network card into the bridge and move its IP onto the bridge; logically, the physical router, the physical network card, `veth0_br`, and `veth0` will then all be in the same layer-2 domain.

```
$ ip addr add 192.168.31.92/24 dev br0
$ ip addr del 192.168.31.92/24 dev enp1s0
$ brctl addif br0 enp1s0
```
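One caveat worth noting (an observation about this kind of setup, not a step from the original walkthrough): deleting the address from `enp1s0` also removes any host routes that depended on it, so if the host itself loses external connectivity at this point, re-add its default route via `br0`:

```
# Restore the host's default route through the bridge if it was lost
$ ip route add default via 192.168.31.1 dev br0
```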
With this configuration, the container can access the external network normally:

```
$ ip netns exec ns0 ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=46 time=37.3 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
```
Today we introduced how to build a single-node container network with a bridge plus veth pairs; a later article will cover the implementation in a cross-node scenario.
If you have any questions about this article or would like to discuss other technical topics, you can follow my official account: Li Ruonian.