docker host net mtu (by quqi99)

Published: 2023-12-13 08:54:46

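Author: Zhang Hua  Posted: 2021-03-06
Copyright notice: This article may be freely reproduced, provided that the original source, the author information, and this copyright notice are indicated with a hyperlink.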

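When a Docker container is created on a VM that sits on a GRE network, a connectivity failure is evidently caused by the MTU. But why does it also fail when the container uses the host network (--net=host)?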

sudo docker run --rm --net=host --privileged --name=nginx -v /sys/fs/cgroup:/sys/fs/cgroup -d -ti nginx
sudo docker exec -ti nginx bash

Tests with both UDP and TCP worked at some times and failed at others.

# TCP
nc -tlp 8888
nc -vtz 10.5.0.178 8888

# UDP
nc -ulp 8888
nc -vuz 10.5.0.178 8888

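Test results

  • With TCP, the TCP server exits after every test; if 'nc -tlp 8888' is re-run each time, every test succeeds.
  • With UDP, the UDP server does not exit after a test. Re-running "nc -ulp 8888" before each test always works, but without restarting the server the client reports a refused error.

All of the above is actually normal. The customer reported the output below, which is also normal. As for "inverse host lookup failed: Unknown host", it only means the reverse lookup of the host failed; adding the "-n" option to nc fixes it.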

 # nc -uvz xxx 5093
xxx: inverse host lookup failed: Unknown host
(UNKNOWN) [xxx] 5093 (?) open

However, the customer's tcpdump captured the following MTU problem, and this is where the real error lies:

07:19:55.096224 IP 10.30.50.189.33473 > xxx.5093: UDP, bad length 1432 > 1408

So when the nc test shows "5093 (?) open", the network is reachable, but there is an MTU problem.

Packet capture when TCP works

Capture on the VM:

ubuntu@i1:~$ sudo tcpdump -ei ens2 -s 0 port 8888
02:06:28.513860 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 74: i1.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [S], seq 3310591319, win 65340, options [mss 1452,sackOK,TS val 2788776170 ecr 0,nop,wscale 7], length 0
02:06:28.515634 fa:16:3e:22:d6:67 (oui Unknown) > fa:16:3e:54:36:ad (oui Unknown), ethertype IPv4 (0x0800), length 74: juju-c40d4b-ovn-6.cloud.sts.8888 > i1.38146: Flags [S.], seq 432497099, ack 3310591320, win 62342, options [mss 8918,sackOK,TS val 3043405881 ecr 2788776170,nop,wscale 7], length 0
02:06:28.515689 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 66: i1.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [.], ack 1, win 511, options [nop,nop,TS val 2788776172 ecr 3043405881], length 0
02:06:28.515907 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 66: i1.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [F.], seq 1, ack 1, win 511, options [nop,nop,TS val 2788776172 ecr 3043405881], length 0
02:06:28.517663 fa:16:3e:22:d6:67 (oui Unknown) > fa:16:3e:54:36:ad (oui Unknown), ethertype IPv4 (0x0800), length 68: juju-c40d4b-ovn-6.cloud.sts.8888 > i1.38146: Flags [P.], seq 1:3, ack 2, win 488, options [nop,nop,TS val 3043405882 ecr 2788776172], length 2
02:06:28.517723 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 54: i1.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [R], seq 3310591321, win 0, length 0
02:06:28.517860 fa:16:3e:22:d6:67 (oui Unknown) > fa:16:3e:54:36:ad (oui Unknown), ethertype IPv4 (0x0800), length 66: juju-c40d4b-ovn-6.cloud.sts.8888 > i1.38146: Flags [F.], seq 3, ack 2, win 488, options [nop,nop,TS val 3043405882 ecr 2788776172], length 0

ubuntu@i1:~$ ip addr show ens2
2: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1492 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:54:36:ad brd ff:ff:ff:ff:ff:ff
    inet 192.168.21.161/24 brd 192.168.21.255 scope global dynamic ens2

Capture on the physical host:

02:06:28.514218 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 74: 10.5.150.115.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [S], seq 3310591319, win 65340, options [mss 1452,sackOK,TS val 2788776170 ecr 0,nop,wscale 7], length 0
02:06:28.514272 fa:16:3e:64:80:50 (oui Unknown) > fa:16:3e:50:aa:2a (oui Unknown), ethertype IPv4 (0x0800), length 74: juju-c40d4b-ovn-6.cloud.sts.8888 > 10.5.150.115.38146: Flags [S.], seq 432497099, ack 3310591320, win 62342, options [mss 8918,sackOK,TS val 3043405881 ecr 2788776170,nop,wscale 7], length 0
02:06:28.515805 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 66: 10.5.150.115.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [.], ack 1, win 511, options [nop,nop,TS val 2788776172 ecr 3043405881], length 0
02:06:28.515848 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 66: 10.5.150.115.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [F.], seq 1, ack 1, win 511, options [nop,nop,TS val 2788776172 ecr 3043405881], length 0
02:06:28.516053 fa:16:3e:64:80:50 (oui Unknown) > fa:16:3e:50:aa:2a (oui Unknown), ethertype IPv4 (0x0800), length 68: juju-c40d4b-ovn-6.cloud.sts.8888 > 10.5.150.115.38146: Flags [P.], seq 1:3, ack 2, win 488, options [nop,nop,TS val 3043405882 ecr 2788776172], length 2
02:06:28.516071 fa:16:3e:64:80:50 (oui Unknown) > fa:16:3e:50:aa:2a (oui Unknown), ethertype IPv4 (0x0800), length 66: juju-c40d4b-ovn-6.cloud.sts.8888 > 10.5.150.115.38146: Flags [F.], seq 3, ack 2, win 488, options [nop,nop,TS val 3043405882 ecr 2788776172], length 0
02:06:28.517335 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 54: 10.5.150.115.38146 > juju-c40d4b-ovn-6.cloud.sts.8888: Flags [R], seq 3310591321, win 0, length 0

root@juju-c40d4b-ovn-6:~# ip addr show ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8958 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:64:80:50 brd ff:ff:ff:ff:ff:ff
    inet 10.5.0.178/16 brd 10.5.255.255 scope global dynamic ens3

Packet capture when UDP works

VM

02:15:12.943389 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 43: i1.39709 > juju-c40d4b-ovn-6.cloud.sts.8888: UDP, length 1
02:15:12.946069 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 43: i1.39709 > juju-c40d4b-ovn-6.cloud.sts.8888: UDP, length 1

ubuntu@i1:~$ sudo tcpdump -i ens2 udp port 8888 -A -nn
02:46:17.724481 IP 192.168.21.161.50757 > 10.5.0.178.8888: UDP, length 1
E...#.@.@.6.....
....E"..        ...
02:46:17.728605 IP 192.168.21.161.50757 > 10.5.0.178.8888: UDP, length 1
E...#.@.@.6.....
....E"..        ...

Physical host

02:15:12.942802 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 43: 10.5.150.115.39709 > juju-c40d4b-ovn-6.cloud.sts.8888: UDP, length 1
02:15:12.944746 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 43: 10.5.150.115.39709 > juju-c40d4b-ovn-6.cloud.sts.8888: UDP, length 1

Packet capture when UDP fails (this is also normal; the server needs to be restarted)

root@10:/# nc -vuz 10.5.0.178 8888
juju-c40d4b-ovn-6.cloud.sts [10.5.0.178] 8888 (?) : Connection refused

VM

02:16:28.016397 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 43: i1.55573 > juju-c40d4b-ovn-6.cloud.sts.8888: UDP, length 1

Physical host

02:16:28.014916 fa:16:3e:50:aa:2a (oui Unknown) > fa:16:3e:64:80:50 (oui Unknown), ethertype IPv4 (0x0800), length 43: 10.5.150.115.55573 > juju-c40d4b-ovn-6.cloud.sts.8888: UDP, length 1

UDP, bad length 1432 > 1408

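The message "UDP, bad length 1432 > 1408" means that the data length declared in the UDP header (1432 bytes) is greater than the UDP data actually present in the captured packet (1408 bytes). (The tcpdump error message you get is due to IP fragmentation which happens because the multicast datagram length > MTU - https://github.com/the-tcpdump-group/tcpdump/blob/tcpdump-4.7.4/print-udp.c#L694). The customer's ens3 has an MTU of 1442. Where does this 1442 come from?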
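To make the two numbers in that message easier to read, here is a minimal Python sketch that extracts them from a tcpdump line. The regular expression and the function name `parse_bad_length` are illustrative assumptions, not part of tcpdump; the sample line is the one captured by the customer above.

```python
import re

# Matches the tail of a tcpdump line such as "UDP, bad length 1432 > 1408".
BAD_LENGTH = re.compile(r"UDP, bad length (\d+) > (\d+)")


def parse_bad_length(line):
    """Return (declared, present, missing) byte counts, or None if no match.

    declared: UDP data length announced in the UDP header
    present:  UDP data bytes actually found in this packet (first fragment)
    missing:  bytes that must travel in later fragments
    """
    match = BAD_LENGTH.search(line)
    if match is None:
        return None
    declared, present = int(match.group(1)), int(match.group(2))
    return declared, present, declared - present


line = "07:19:55.096224 IP 10.30.50.189.33473 > xxx.5093: UDP, bad length 1432 > 1408"
print(parse_bad_length(line))
```

For the customer's line, 24 bytes of the datagram did not fit into the first fragment.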

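An Ethernet frame carries between 46 and 1500 bytes of payload. The IPv4 header is 20 bytes, which leaves 1480 bytes for the IP payload. The UDP header (source port, destination port, UDP length, UDP checksum) is 8 bytes, so the maximum UDP payload is 1472 bytes. The GRE header is another 8 bytes.
With an MTU of 1442, the first IP fragment can carry 1442 - 20 (IP header) = 1422 bytes, rounded down to a multiple of 8 as fragmentation requires, which gives 1416 bytes; 1416 - 8 (UDP header) = 1408 bytes of UDP data.
For GRE/VXLAN header sizes, see - https://tonydeng.github.io/sdn-handbook/basic/overlay.html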
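The header arithmetic can be checked with a few lines of Python. This is only a sketch under stated assumptions: an IPv4 header without options (20 bytes), and the rule that every fragment except the last carries a multiple of 8 bytes of IP payload. The function name is illustrative.

```python
IP_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8   # source port, destination port, length, checksum


def first_fragment_udp_data(mtu):
    """UDP data bytes that fit into the first IP fragment for a given MTU."""
    # Non-final fragments must carry a multiple of 8 bytes of IP payload.
    ip_payload = (mtu - IP_HEADER) // 8 * 8
    return ip_payload - UDP_HEADER


print(first_fragment_udp_data(1442))  # 1408, the value reported by tcpdump
```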

The MTU can be tested with the following commands (1442 - 28 = 1414, where 28 is the IP header plus the ICMP header):

ping -c 2 -s 1414 -M do 10.5.0.178

root@10:/# traceroute --mtu 10.5.0.178
traceroute to 10.5.0.178 (10.5.0.178), 30 hops max, 65000 byte packets
 1  * F=1492 * *
 2  juju-c40d4b-ovn-6.cloud.sts (10.5.0.178)  2.859 ms  2.169 ms  0.671 ms
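The manual ping test can be turned into a search for the largest MTU that still passes. The sketch below is an assumption-laden illustration: `find_path_mtu` is a plain binary search over a caller-supplied probe, and `ping_probe` wraps the same `ping -M do -s` invocation used above (iputils ping on Linux is assumed). Neither name comes from the article.

```python
import subprocess


def find_path_mtu(probe, low=576, high=9000):
    """Largest MTU in [low, high] for which probe(mtu) returns True.

    Assumes probe(low) succeeds and that every MTU below a working one works.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def ping_probe(host):
    """Build a probe that sends one non-fragmentable ping sized for the MTU."""
    def probe(mtu):
        size = mtu - 28  # 20-byte IP header + 8-byte ICMP header
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", "-M", "do", "-s", str(size), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe


# Simulated path whose MTU is 1442, so the example runs without a network.
print(find_path_mtu(lambda mtu: mtu <= 1442))
```

Against a real host the call would be `find_path_mtu(ping_probe("10.5.0.178"))`.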

UDP is connectionless, so there is no MSS negotiation.

$ cat /proc/sys/net/ipv4/ip_no_pmtu_disc
0
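The same sysctl can be read from a program. The following is a minimal sketch, assuming a Linux host that exposes the file under /proc; only the values 0 and 1 are interpreted, and the function names are illustrative.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/ip_no_pmtu_disc")


def read_pmtu_mode(path=SYSCTL):
    """Return the integer stored in the ip_no_pmtu_disc sysctl file."""
    return int(path.read_text().strip())


def describe_pmtu_mode(value):
    """Give a short description for the two most common values."""
    if value == 0:
        return "PMTU discovery enabled (DF may be set on outgoing packets)"
    if value == 1:
        return "PMTU discovery disabled (DF not set on locally generated packets)"
    return "other mode; see the kernel ip-sysctl documentation"


if SYSCTL.exists():
    print(describe_pmtu_mode(read_pmtu_mode()))
```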

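See - https://zhhuabj.blog.csdn.net/article/details/82346840
According to this article (https://blog.csdn.net/sinat_20184565/article/details/80326262), the ip_no_pmtu_disc setting on the UDP server (for how to set it in Docker, see - https://github.com/hwdsl2/docker-ipsec-vpn-server) decides whether the packets it sends carry the Don't Fragment flag. With ip_no_pmtu_disc=1, path MTU discovery is turned off and DF is not set, so oversized packets are fragmented. With the default ip_no_pmtu_disc=0, packets carry DF=1; when such a packet is larger than the MTU somewhere on the path (see also - https://zhhuabj.blog.csdn.net/article/details/114434188), the actual MTU is reported back to the server, which then has to size its packets to that MTU. Because UDP fragmentation offload is generally off, the splitting has to be done beforehand at the application layer on the server side.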

ubuntu@i1:~$ ethtool -k ens2 |grep udp-fragmentation-offload
udp-fragmentation-offload: off
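Besides the system-wide sysctl, Linux also lets a single socket ask for DF to be set. The sketch below is not from the article; it assumes Linux, and the numeric fallbacks (10, 2, 14) are the values from the kernel's `in.h`, used only when Python's `socket` module does not export the names. With `IP_PMTUDISC_DO`, a datagram larger than the known path MTU is rejected with `EMSGSIZE` instead of being fragmented, and the application can then ask the kernel for the MTU it knows.

```python
import errno
import socket

# Linux-specific constants; fall back to the kernel header values if missing.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)


def open_df_socket(server):
    """Create a connected UDP socket that always sets DF (Linux only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    sock.connect(server)  # IP_MTU can only be queried on a connected socket
    return sock


def send_or_report_mtu(sock, payload):
    """Send payload; if it is too large, return the path MTU known to the kernel."""
    try:
        sock.send(payload)
        return None
    except OSError as exc:
        if exc.errno != errno.EMSGSIZE:
            raise
        return sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
```

A sender could call `send_or_report_mtu(sock, data)` and, when a number comes back, resend the data in pieces of at most that MTU minus 28 bytes.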

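Another option is to raise the MTU of the VM and the container to 1460 (1432 + 28). Why is it currently as low as 1442?

Reproducing the problem

Compared with nc, iperf has a -l option that sets the UDP datagram size (no size is given on the server; the client specifies 1432, so 1432-byte datagrams are sent), which makes the problem easy to reproduce.
Run in the container: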

iperf -c 10.5.0.178 -u -l 1432
iperf -c 10.5.0.178 -u -l 1432 -b 800M  # optional: raise the sending rate

Run on the physical host:

iperf -s -u

Capture in the container

ubuntu@i1:~$ sudo tcpdump -ei ens2 -s 0 port 5001
04:06:53.401372 fa:16:3e:54:36:ad (oui Unknown) > fa:16:3e:22:d6:67 (oui Unknown), ethertype IPv4 (0x0800), length 1450: i1.54000 > juju-c40d4b-ovn-6.cloud.sts.5001: UDP, bad length 1432 > 1408

With the following UDP code there is no problem, but changing the message line in the server and the client (msgFromServer = "Hello UDP Client") to (msgFromServer = "Hello UDP Client" * 100) reproduces the problem.

cat << EOF | sudo tee -a udp_server.py
import socket
localIP     = "0.0.0.0"
localPort   = 5001
bufferSize  = 1024
msgFromServer       = "Hello UDP Client"
bytesToSend         = str.encode(msgFromServer)
# Create a datagram socket
UDPServerSocket = socket.socket(family=socket.AF_INET, type=socket.SOCK_DGRAM)
# Bind to address and port
UDPServerSocket.bind((localIP, localPort))
print("UDP server up and listening")
# Listen for incoming datagrams
while True:
    bytesAddressPair = UDPServerSocket.recvfrom(bufferSize)
    message = bytesAddressPair[0]
    address = bytesAddressPair[1]
    clientMsg = "Message from Client:{}".format(message)
    clientIP  = "Client IP Address:{}".format(address)
    print(clientMsg)
    print(clientIP)
    # Sending a reply to client
    UDPServerSocket.sendto(bytesToSend, address)
EOF
python3 udp_server.py

cat << EOF | sudo tee -a udp_client.py
import socket
msgFromClient       = "Hello UDP Server"
bytesToSend         = str.encode(msgFromClient)
serverAddressPort   = ("192.168.2.139", 5001)
bufferSize          = 1024
# Create a UDP socket at client side
UDPClientSocket = socket.socket(family=socket.AF_INET, type=socket.SOCK_DGRAM)
# Send to server using created UDP socket
UDPClientSocket.sendto(bytesToSend, serverAddressPort)
msgFromServer = UDPClientSocket.recvfrom(bufferSize)
msg = "Message from Server {}".format(msgFromServer[0])
print(msg)
EOF
python3 udp_client.py
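Why the short message works while the message multiplied by 100 does not can be seen by computing how each datagram would be fragmented. This is a rough sketch assuming the MTU of 1442 from the customer's ens3, a 20-byte IPv4 header, and an 8-byte UDP header; `fragment_sizes` is an illustrative name.

```python
IP_HEADER = 20
UDP_HEADER = 8


def fragment_sizes(udp_payload_len, mtu):
    """IP payload bytes carried by each fragment of one UDP datagram."""
    remaining = udp_payload_len + UDP_HEADER
    if remaining <= mtu - IP_HEADER:
        return [remaining]  # fits in a single packet, no fragmentation
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while remaining > 0:
        size = min(per_fragment, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


short_msg = "Hello UDP Client".encode()
long_msg = ("Hello UDP Client" * 100).encode()

print(fragment_sizes(len(short_msg), 1442))  # one packet
print(fragment_sizes(len(long_msg), 1442))   # two fragments
print(fragment_sizes(1432, 1442))            # the iperf -l 1432 case
```

In the iperf case the first fragment holds 1416 bytes of IP payload, that is 8 bytes of UDP header plus 1408 bytes of data, and the remaining 24 bytes go into a second fragment.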

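Solution

The size of this UDP packet is chosen by the UDP application itself.

  • When iperf runs into a UDP packet that is larger than the MSS (https://github.com/esnet/iperf/issues/604), it now warns the user - https://github.com/esnet/iperf/commit/6663be4144873b2751e7ae48b0161663ecf78d00
  • Some software runs "echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc" on the UDP sender (either the server or the client can be the sender). PMTU discovery is then no longer allowed (which of course also affects TCP, since TCP relies on it as well). This amounts to disabling the Don't Fragment (DF) flag (PMTU discovery only works when DF is set), so such software ends up relying on IP fragmentation. See: https://www.zeitgeist.se/2013/11/26/mtu-woes-in-ipsec-tunnels-how-to-fix/

So, to resolve this problem:

  • Either reduce the UDP packet size inside the UDP application itself.
  • Or raise the MTU of the VM (which also requires changing the MTU in the underlying OpenStack).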
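For the first option, one way an application might keep its datagrams under the MTU is to split the data before sending. The sketch below is only an illustration with hypothetical helper names; it assumes a 20-byte IPv4 header and an 8-byte UDP header, and it leaves out everything a real protocol would need on the receiving side (sequence numbers, reassembly, handling of lost or reordered chunks).

```python
import socket


def split_for_mtu(data, mtu, ip_header=20, udp_header=8):
    """Split data into chunks that each fit in one unfragmented UDP packet."""
    max_payload = mtu - ip_header - udp_header
    if max_payload <= 0:
        raise ValueError("MTU too small for an IPv4/UDP packet")
    return [data[i:i + max_payload] for i in range(0, len(data), max_payload)]


def send_in_chunks(sock, data, address, mtu):
    """Send data as several datagrams; the receiver must reassemble them."""
    for chunk in split_for_mtu(data, mtu):
        sock.sendto(chunk, address)


chunks = split_for_mtu(b"Hello UDP Client" * 100, 1442)
print([len(chunk) for chunk in chunks])
```

With an MTU of 1442 each chunk holds at most 1414 bytes, so the 1600-byte test message goes out as two datagrams.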

This is also why some websites that use UDP, such as JD.com, do not work well - see: https://blog.csdn.net/quqi99/article/details/82346840
