
Fix a rabbitmq issue (by quqi99)


Problem
All 3 nodes of the rabbitmq cluster went down. The 3 nodes are lxd containers:
11-lxd-8
69-lxd-2
9-lxd-9
Initial sosreport analysis
A system_limit error is seen on 11-lxd-8:

=ERROR REPORT==== 17-Apr-2020::13:19:44 ===
Too many processes
=ERROR REPORT==== 17-Apr-2020::13:19:44 ===
Ranch listener {acceptor,{0,0,0,0,0,0,0,0},5672} connection process start failure; rabbit_connection_sup:start_link/4 crashed with reason: error:system_limit

However, LimitNOFILE has already been set:

$ grep -r 'Limit' lib/systemd/system/rabbitmq-server.service
LimitNOFILE=65536

However, LimitNOFILE appears to control file_descriptors, not erlang processes.

{file_descriptors,     [{total_limit,65436},      {total_used,6415},      {sockets_limit,58890},      {sockets_used,6413}]},
{processes,[{limit,1048576},{used,118304}]},

The output above shows about 118304 erlang processes in use, still below the limit. However, this sosreport may have been captured after the machine was rebooted. Also, the limit is already very large, so simply raising it does not look like the right fix. Keep looking for clues.
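
If the process limit ever did need raising, the knob is the Erlang emulator's +P flag rather than LimitNOFILE. A minimal sketch, assuming this rabbitmq-server package honours RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in /etc/rabbitmq/rabbitmq-env.conf (the value is an example only):

# /etc/rabbitmq/rabbitmq-env.conf - pass extra emulator args
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+P 2097152"
# restart and confirm the new limit
sudo systemctl restart rabbitmq-server
sudo rabbitmqctl eval 'erlang:system_info(process_limit).'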

'Error while waiting for Mnesia tables' errors are also seen on 69-lxd-2, which suggests the Mnesia database was not synchronized.

var/log/rabbitmq/rabbit@juju-3182a3-69-lxd-2.log.1
=INFO REPORT==== 17-Apr-2020::13:27:26 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left
=WARNING REPORT==== 17-Apr-2020::13:27:56 ===
Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
[rabbit_user,rabbit_user_permission,
rabbit_vhost,rabbit_durable_route,
rabbit_durable_exchange,
rabbit_runtime_parameters,
rabbit_durable_queue]}
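
When every node is down, each node waits on boot for the Mnesia tables of its peers, so the cluster can deadlock like this. A hedged recovery sketch (the exact procedure depends on which node stopped last):

# if no node can come up because each is waiting for the others' Mnesia tables,
# tell one node (ideally the last one to go down) to boot from its own copy of
# the tables - use with care, the other nodes will sync from it when they rejoin
sudo rabbitmqctl force_boot
sudo systemctl restart rabbitmq-server
sudo rabbitmqctl cluster_status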

On all 3 units, 'netstat -s' shows a huge number of TCP resets:

sos_commands/networking/netstat_-s |head -n2
7154216 connection resets received
34025359 resets sent

Deeper analysis

On 9-lxd-9, the command below shows that every openstack component holds rabbitmq TCP connections to 9-lxd-9, about four to five hundred connections per component. Isn't that a bit much? The other two units have far fewer.

cat sos_commands/networking/netstat_-W_-neopa| awk '/:5672/ { print $5 }' | awk -F: '{ a[$1]++ }; END { for (i in a) print i, a[i] }' |sort -n -k 2 -r |more |head -n 80

But with four to five hundred connections from each and every component, it is unlikely that every component is broken. Besides a flood of TCP connections, high CPU on the host machine can also drive up CPU inside the containers, so next check the CPU usage on the host:

$ cat sos_commands/process/ps_auxwww |head -n 1
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
$ cat sos_commands/process/ps_auxwww |head -n1 && cat sos_commands/process/ps_auxwww |sort -n -k3 -r | head -n 3
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nova 1936435 130 0.0 347704 126140 ? Rs 11:40 0:01 /usr/bin/python2 /usr/bin/nova-api-metadata --config-file=/etc/nova/nova.conf --log-file=/var/log/nova/nova-api-metadata.log
libvirt+ 1809025 100 0.0 51373088 236744 ? Sl Mar09 56456:32 qemu-system-x86_64 -enable-kvm -name guest=instance-000099e0,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-26-instance-000099e0/master-key.aes -machine pc-i440fx-

OK, a single VM is using 100% CPU. It looks like the control services (rabbitmq) and the compute services (nova-compute) were installed on the same physical machine, so the CPU or hugepage memory consumed by the VMs directly brought the rabbitmq cluster down.
Note:
Summing the third column of ps output to more than 100% does not necessarily mean the CPU is saturated, because those processes may all be running on one core and the percentage is relative to a single core; use mpstat for an accurate picture.

Workaround

Re-architecting the deployment is not an option, so as a temporary workaround we can use isolcpus to set aside a few CPUs and dedicate them to rabbitmq via taskset (see the sketch after this block).

mpstat -P ALL shows the load on every CPU core
cat /proc/cpuinfo| grep "cpu cores"| uniq
Add isolcpus=1,3 to grub to reserve CPUs 1 and 3, then run update-grub and reboot. Verify with:
1, cat /proc/cmdline |grep isolcpu
2, ps -To 'pid,lwp,psr,cmd' -p 310597
3, ps -eo pid,cmd,psr |awk '{if($3==3) print $0}'
taskset -p 0x8 <pid> binds <pid> to cpu3
taskset -c 1 /etc/init.d/mysql start
systemd manages the affinity for you. See "man systemd.exec" and the CPUAffinity= option.
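
A minimal sketch of the grub change on Ubuntu, assuming /etc/default/grub is in use (the CPU numbers are examples only):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=1,3"
# apply and reboot
sudo update-grub && sudo reboot
# after reboot, pin the rabbitmq beam process onto the isolated CPUs
sudo taskset -cp 1,3 $(pgrep -f beam.smp | head -n1)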

20201212 update - queue_master_locator=min-masters

We may need to configure queue_master_locator=min-masters - https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1890759. Without it, the resulting imbalance can contribute to these two bugs (the control plane, surprisingly, affecting the data plane):
https://bugs.launchpad.net/neutron/+bug/1871850
https://bugs.launchpad.net/neutron/+bug/1869808
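
If the charm does not expose this option, it can be set directly in the rabbitmq configuration; a sketch using the classic Erlang-term config format of 3.6-era packages (file location may differ per deployment):

% /etc/rabbitmq/rabbitmq.config
[
 {rabbit, [
   {queue_master_locator, <<"min-masters">>}
 ]}
].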

# Queue masters are mostly on 11-lxd-8
> cat rabbitmqctl_report | awk -F'\t' '/^Queues on openstack/ {a=1}; a && /rabbit/ {split($1,s,"."); b[s[1]]++}; NF==0 {a=0}; END {for(i in b) {print i, b[i]}}'
<rabbit@juju-0aa49a-9-lxd-9 15
<rabbit@juju-0aa49a-11-lxd-8 15426
<rabbit@juju-0aa49a-10-lxd-8 309

# Most connections are made to 11-lxd-8
> cat rabbitmqctl_report| awk -F'\t' '/^Connections/ {a=1}; a && /rabbit/ {split($1,s,"."); b[s[1]]++}; NF==0 {a=0}; END {for(i in b) {print i, b[i]}}'
<rabbit@juju-0aa49a-9-lxd-9 8763
<rabbit@juju-0aa49a-11-lxd-8 14843
<rabbit@juju-0aa49a-10-lxd-8 6413

# 11-lxd-8 uses the most memory
> cat rabbitmqctl_report| awk '/^Status/ {a=1; print}; a && /total,/; !NF {a=0}'
Status of node 'rabbit@juju-0aa49a-10-lxd-8'
[{total,3536412448},
Status of node 'rabbit@juju-0aa49a-11-lxd-8'
[{total,7030459176},
Status of node 'rabbit@juju-0aa49a-9-lxd-9'
[{total,4256377400},

A crontab job to check rabbitmq memory usage:

#!/bin/bash
# process_memory_checker.sh 
MAILTO=root
process="bin/beam.smp"
mem_percentage=`ps -o %mem,command ax | grep "${process}" | grep -v grep | awk '{print $1}'`
threshold_percentage=25
if (( $(echo "${mem_percentage} > ${threshold_percentage}"|bc -l) ));
then echo "Process ${process} on `hostname -f` is using ${mem_percentage}% of total memory (threshold is ${threshold_percentage}%)." | \mail -s "Memory usage warning for process ${process} on $(hostname -s)" ${MAILTO} 2> /dev/null;exit 1;
fi
echo "No memory issues found for process ${process}"
exit 0
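
A possible cron entry to run the checker periodically (the path and interval are assumptions, not from the original deployment):

# /etc/cron.d/rabbitmq-memory-check
*/5 * * * * root /usr/local/bin/process_memory_checker.sh >> /var/log/process_memory_checker.log 2>&1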

In addition, the customer found that even when memory usage exceeded vm_memory_high_watermark_paging_ratio (https://www.rabbitmq.com/memory.html#paging), memory did not seem to be paged out to disk, so it kept growing. This page (https://stackoverflow.com/questions/21666537/rabbitmq-memory-control-queue-is-full-and-is-not-paging-connection-hangs) says paging only takes effect for "durable=true" queues.
"rabbitmqctl list_queues name durable -p openstack" shows durable=false for every queue. Does that mean we would have to set amqp_durable_queues=True in each component's oslo config (the charm does not seem to support it)?

[oslo_messaging_rabbit]
amqp_durable_queues = True

However, this patch suggests the option above was deprecated - https://bugs.launchpad.net/oslo.messaging/+bug/1433956, yet the option is still present in the master code; this has not been investigated in detail. Also, this page (https://elkano.org/blog/high-availability-rabbitmq-cluster-openstack/) suggests the following configuration:

[oslo_messaging_rabbit]
rabbit_hosts=node1:5672,node2:5672,node3:5672
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
rabbit_ha_queues=true
rabbit_userid = openstack
rabbit_password = openstack_pass
amqp_auto_delete = true
amqp_durable_queues=True

The charm template ./charmhelpers/contrib/openstack/templates/section-rabbitmq-oslo contains the following:

{% if rabbitmq_ha_queues -%}
rabbit_ha_queues = True
rabbit_durable_queues = False
{% endif -%}

Other debugging techniques:

sudo rabbitmqctl -p openstack list_queues |grep -vw 0
# returns a list with info about memory dynamically allocated by the Erlang emulator
rabbitmqctl eval 'erlang:memory().'
rabbitmq-diagnostics memory_breakdown  #for new version

The constant, sudden growth in memory usage may be caused by this bug, see: https://bugs.launchpad.net/oslo.messaging/+bug/1789177

$ grep -r 'no exchange' var/log/rabbitmq/rabbit@juju-0aa49a-10-lxd-8.log |wc -l
3511
$ grep -r 'no exchange' var/log/rabbitmq/rabbit@juju-0aa49a-10-lxd-8.log |head -n1
operation basic.publish caused a channel exception not_found: no exchange 'reply_bee2ebfe01854e0595ddaa7462dc4054' in vhost 'openstack'

This is a very good post about rabbitmq HA (https://blog.csdn.net/zyz511919766/article/details/41896823). In short, in a non-HA rabbitmq cluster, exchanges and bindings (metadata) exist on every node while a queue lives on only one node; in an HA setup, queues are additionally placed on all or some nodes according to the policy. When a consumer that is consuming a queue from some rabbitmq node dies, the reply-xxx queues it created are deleted, but the consumer may never receive the basic.cancel signal (e.g. it happens to be restarting just as reply-xxx is deleted). Since the consumer has not acked the messages, the server keeps sending them, and at that point the 'no exchange' error appears (both the exchange and the reply-xxx queue were created by the consumer, and the exchange is also deleted once the reply-xxx queue is gone).
One thing is still not clear though: after the consumer restarts, won't it reconnect and recreate the exchange and the reply-xxx queue? Before that re-creation happens, however, the server keeps publishing, so during that window we get 'no exchange' errors, which burns a lot of CPU - perhaps this is also why the CPU climbed (lots of exchanges create problems during failover under high load).

$ grep -r '^reply_' sos_commands/rabbitmq/rabbitmqctl_report |grep exchange |head -n1
reply_0003bee270e54c8cb9b78ba3b51ba4e2  exchange        reply_0003bee270e54c8cb9b78ba3b51ba4e2  queue   reply_0003bee270e54c8cb9b78ba3b51ba4e2  []      openstack
$ grep -r '^reply_' sos_commands/rabbitmq/rabbitmqctl_report |grep exchange |wc -l
24991

24991 of them will certainly keep memory growing, so the best fix is to set x-cancel-on-ha-failover so that the consumer is cancelled when its queue disappears. Another approach is to find out which consumers are misbehaving:

$ cat sos_commands/networking/netstat_-W_-neopa| awk '/:5672/ { print $5 }' | awk -F: '{ a[$1]++ }; END { for (i in a) print i, a[i] }' |sort -n -k 2 -r |more |head -n 3
10.160.0.106 96
10.160.0.208 86
10.160.0.75 83

The following policy can also mitigate the problem:

#rabbitmqctl set_policy min-masters-queue -p openstack '.*' '{"queue-master-locator":"min-masters"}' --apply-to queues --priority 10
rabbitmqctl set_policy HA -p openstack '^(?!amq\.).*' '{"queue-master-locator":"min-masters", "ha-mode":"all", "ha-sync-mode":"automatic"}'
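
After applying it, the effective policies and the current queue master placement can be re-checked, e.g.:

rabbitmqctl list_policies -p openstack
rabbitmqctl list_queues -p openstack name policy pid | head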

However, an exchange is only a recorded name and does not itself consume memory; it is the queues that may. Yet 30481 queues only account for about 1.2G, which is not that much either.

$ cat rabbitmq_report.txt |gawk -F'\t' '/^Queues on openstack/ {a=1;next}; a && NF {mem+=$17; n+=1}; !NF {a=0} END {print n, mem}'
30481 1248040000

By comparison, rabbitmqctl status shows queue_procs and queue_slave_procs using far more memory, with queue_procs at about 199G. The memory attributed to queues is what the queue processes consume; it does not include the message bodies (those are counted under binary). When memory runs low, this part is paged out to disk.

  • queue_procs: master queues, with their indices and the messages kept in memory. The more messages pile up in queues, the more memory is usually attributed to this section; however, this depends heavily on the queue properties and on whether messages are published as transient.
  • queue_slave_procs: queue mirrors, with their indices and the messages kept in memory. Reducing the number of mirrors (replicas), or not mirroring queues that hold inherently transient data, reduces the RAM used by mirrors. The more messages pile up in queues, the more memory is usually attributed to this section; again, this depends heavily on the queue properties and on whether messages are published as transient.
      {queue_procs,199051279736},{queue_slave_procs,1243762680},

This page (https://www.rabbitmq.com/monitoring.html#diagnostics-observer) mentions that 'rabbitmq-diagnostics observer' can show the memory usage of processes inside the Erlang VM, much like top, but it is only available from rabbitmq-server 3.8. Could perf be used to look at the call traces of processes inside the Erlang VM? The articles below explain how to check the memory used by Erlang processes:
RabbitMQ and Erlang memory usage analysis - https://blog.csdn.net/jaredcoding/article/details/78115235
RabbitMQ operations experience sharing - https://my.oschina.net/hackandhike/blog/801052

rabbitmqctl status #single node status
rabbitmqctl report #cluster status
erl -setcookie $(cat /var/lib/rabbitmq/.erlang.cookie) -name test@127.0.0.1
help().
erlang:memory().
erlang:system_info(process_limit).  %% maximum number of processes the system can create
erlang:system_info(process_count).  %% number of processes that currently exist
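
Before 3.8, a rough per-process memory view (similar to what observer shows) can be obtained with an eval one-liner; a sketch, not from the original post:

# top 10 Erlang processes by memory (bytes)
sudo rabbitmqctl eval 'lists:sublist(lists:reverse(lists:keysort(2, [{P, M} || P <- erlang:processes(), {memory, M} <- [erlang:process_info(P, memory)]])), 10).'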

GC should be able to reduce queue_procs:

# On 18.04 and newer
sudo rabbitmqctl force_gc
# On 16.04. 
sudo rabbitmqctl eval '[garbage_collect(P) || P <- processes()].'
sudo rabbitmqctl environment |grep background_gc_enabled
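
background_gc is disabled by default; if periodic GC is wanted it can be enabled in the rabbit app config. A sketch in the classic config format (key names as understood, double-check against the installed version):

% /etc/rabbitmq/rabbitmq.config
[
 {rabbit, [
   {background_gc_enabled, true},
   {background_gc_target_interval, 60000}  %% in milliseconds
 ]}
].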

A useful command

The command below reveals that heat has far too many exchanges and queues.
An awk program is a sequence of pattern + action pairs; the one-liner below contains three of them:

  • /Listing queues for vhost openstack/ {a=1; next} - on the 'Listing ...' line, set a=1; 'next' skips the remaining patterns for that line
  • every subsequent line (while a is set) matches a { split($1,b,"."); q[b[1]]+=1 }, which counts the first dot-separated component of $1 in the q array
  • this continues until an empty line (!NF), which matches a && !NF {a=0} and resets a to 0
  • lines after the empty line match none of the three patterns, because a=0
#sudo rabbitmqctl report > rabbitmqctl_report.log
$ cat rabbitmqctl_report.log|awk '/Listing queues for vhost openstack/ {a=1; next} a && !NF {a=0} a { split($1,b,"."); q[b[1]]+=1 } END { for (i in q) {print i, q[i]} }' | sort -k 2 -n |tail
q-metering-plugin 4
q-plugin 4
q-reports-plugin 4
q-server-resource-versions 4
scheduler 4
cinder-volume 6
notifications 7
compute 10
heat-engine-listener 74465
engine_worker 74523
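
To see which of those heat queues are dead weight (no consumers), something like the following can be used; rabbitmqctl delete_queue only exists on newer releases, so the deletion step is just a sketch:

# list heat queues that currently have no consumers
rabbitmqctl list_queues -p openstack name messages consumers | awk '$1 ~ /^(heat-engine-listener|engine_worker)\./ && $3==0 {print $1}'
# on newer rabbitmqctl versions each of them could then be removed with:
# rabbitmqctl delete_queue -p openstack <queue-name>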

20220817 update

A script from roy (https://gist.githubusercontent.com/roylez/a2f61802206d3ab7905f81254651d428/raw/9cb0b26203a293ee69683f9c1f24a7d3bbd8f8a8/rabbit-tell.awk) breaks down the connections to rabbit by client process name; this may be related to a bug where too many metadata processes are spawned - https://bugs.launchpad.net/charms/+source/heat/+bug/1665270

nova 4549:
- nova-api-os-compute 179
- nova-compute 403

The script is as follows:

#!/usr/bin/gawk -f
#
# Usage: ./rabbit-tell.awk <rabbitmqctl_report>
#
# Common RabbitMQ issues that this script can be used to identify
#
# - Partitioned cluster
#
# - High binary/queue memory usage. There are mainly two causes of this:
#
#   + queue depth build up (some clients are disconnected)
#   + rabbit does not do periodic garbage collection by default
#
# - High fd/socket usage. This is usually caused by excessive connections that can be analyzed
# together with connection breakdown
#
# - Unbalanced queue master count on nodes. Usually this means "queue_master_locator"(only in 3.6
# and newer) should be tuned.
#

function round(x) { return sprintf("%.2f", x) }

function format_status_line(label, keys, array, color)
{
    if (color) { printf "\033[%s;1m", color }
    printf "%-15s", label
    for(i in keys) { printf "%23s", array[i] }
    if (color) { printf "%s", "\033[m" }
    print ""
}

function humanize(x)
{
    split("B KB MB GB TB PB",type)
    y=0
    for(i=5;y<1;i--) y = x / (2**(10*i))
    return round(y) type[i+2]
}

function colorize(s, color) { return "\033[" color "m" s "\033[m" }

function print_title(s) { printf "\n\033[4;1m%s\033[m\n\n", toupper(s) }

# remove color junk for 3.8
{ gsub(/\x1B\[[0-9;]*[mK]/,"") }

/^Status of node/ {
    section="node"
    gsub(/'/, "")
    node=substr($4, 8)
    nodes[node]=node
    if ( /\.\.\.$/ ) { ver=38 } else { ver=36 }
}

# 36 format parser {{{
ver==36 && /^$/ { section=0 }

ver==36 && section=="node" && /rabbit,"RabbitMQ",/ {
    match($0, /,"([0-9]+\.[0-9]+\.[0-9]+)/, res)
    version[node]=res[1]
}

ver==36 && section=="node" && /(total|binary|vm_memory_limit|queue_procs|queue_slave_procs|processes|uptime),/ {
    match($0, /,([0-9]+)/, res)
    if ( /total/ ) mem_total[node]=res[1]
    if ( /binary/ ) mem_binary[node]=res[1]
    if ( /vm_memory_limit/ ) mem_limit[node]=res[1]
    if ( /queue_procs|queue_slave_procs/ ) mem_queue[node]+=res[1]
    if ( /uptime/ ) uptime[node]=res[1]
    if ( /processes/ ) {
        match($0, /used,([0-9]+)/, res)
        process_used[node]=res[1]
    }
}

ver==36 && section=="node" && /(total_limit|total_used|sockets_limit|sockets_used),/ {
    match($0, /,([0-9]+)/, res)
    if ( /total_limit/ ) fd_total[node]=res[1]
    if ( /total_used/ ) fd_used[node]=res[1]
    if ( /sockets_limit/ ) sockets_limit[node]=res[1]
    if ( /sockets_used/ ) sockets_used[node]=res[1]
}

ver==36 && /^Cluster status of node/ {
    gsub(/'/, "")
    section="cluster"
    node=substr($5, 8)
}

ver==36 && section=="cluster" && /partitions,/ {
    partitioned[node] = $0 ~ /partitions,\[\]/ ? "NO" : "YES"
}

ver==36 && /^Connections:/ {
    FS="\t"
    section="connection"
}

ver==36 && section=="connection" && /rabbit/ {
    match($22, /connection_name","([^:]*):/, res)
    c=res[1]
    user[$17]++
    uc[c]=$17
    client[c]++
}

ver==36 && /^Channels:/ {
    section="channel"
}

ver==36 && /^Queues on/ {
    match($0, /Queues on (.+):/, res)
    vhost=res[1]
    section="queue"
}

ver==36 && section=="queue" && /\./ {
    match($1, /@([^.]+)\./, res)
    queue_master[res[1]]++
}

ver==36 && section=="queue" && $10 && $10 ~ /[0-9]+/ {
    queue_vhost[$2]=vhost
    queue_messages[$2]=$10
    queue_consumers[$2]=$15
}

ver==36 && /^Exchanges on/ {
    section="exchange"
}
# }}}

# 3.8 formattings {{{
ver==38 && /\.\.\.$/ { section=0 }

ver==38 && /^Listing connections/ {
    FS="\t"
    section="connection"
    next
}

ver==38 && section=="connection" && NF {
    if ( $1 == "pid" ) next
    match($24, /connection_name","([^:]*):/, res)
    c=res[1]
    user[$19]++
    uc[c]=$19
    client[c]++
    if (! $19 ) { print }
}

ver==38 && /^Listing queues for/ {
    match($0, /^Listing queues for vhost (.+) /, res)
    vhost=res[1]
    section="queue"
    next
}

ver==38 && section=="queue" && /\./ {
    if ( $1 == "name" ) next
    match($6, /@([^.]+)\./, res)
    queue_master[res[1]]++
}

ver==38 && section=="queue" && $13 && $13 ~ /[0-9]+/ {
    queue_vhost[$1]=vhost
    queue_messages[$1]=$13
    queue_consumers[$1]=$26
}
# }}}

# outputs {{{
END {
    if (ver==36) {
        print_title("cluster nodes")
        format_status_line("node", nodes, nodes)
        format_status_line("version", nodes, version)
        format_status_line("uptime", nodes, uptime)
        for(n in nodes) { m_limit[n] = humanize(mem_limit[n]) }
        format_status_line("mem_limit", nodes, m_limit)
        for(n in nodes) {
            percent = round(mem_total[n]/mem_limit[n]*100)
            if (int(percent) > 80) { m_total_color=31 }
            m_total[n]="(" percent "%) " humanize(mem_total[n])
        }
        format_status_line("mem_total", nodes, m_total, m_total_color)
        for(n in nodes) {
            percent = round(mem_binary[n]/mem_total[n]*100)
            if (int(percent) > 50) { m_binary_color=31 }
            m_binary[n]="(" percent "%) " humanize(mem_binary[n])
        }
        format_status_line("mem_binary", nodes, m_binary, m_binary_color)
        for(n in nodes) {
            percent = round(mem_queue[n]/mem_total[n]*100)
            if (int(percent) > 50) { m_queue_color=31 }
            m_binary[n]="(" percent "%) " humanize(mem_queue[n])
        }
        format_status_line("mem_queue", nodes, m_binary, m_queue_color)
        for(n in nodes) {
            percent = round(fd_used[n]/fd_total[n]*100)
            if (int(percent) > 50) { fd_color=31 }
            fd[n] = "(" percent "%) " fd_used[n]
        }
        format_status_line("fd", nodes, fd, fd_color)
        for(n in nodes) {
            percent = round(sockets_used[n]/sockets_limit[n]*100)
            if (int(percent) > 50) { sockets_color=31 }
            sockets[n] = "(" percent "%) " sockets_used[n]
        }
        format_status_line("sockets", nodes, sockets, sockets_color)
        for(n in nodes) {
            percent = round(process_used[n]/1048576*100)
            if (int(percent) > 50) { process_color=31 }
            process[n] = "(" percent "%) " process_used[n]
        }
        format_status_line("process_used", nodes, process, process_color)
        for(n in nodes) {
            if ( partitioned[n] == "YES" ) { partition_color=31 }
        }
        format_status_line("partitioned", nodes, partitioned, partition_color)
    }
    print_title("connections breakdown by user/client")
    for(u in user) {
        print u, user[u]": "
        for(c in client) { if(uc[c]==u) {print "- "c,client[c]} }
    }
    print_title("queues with messages")
    if (length(queue_messages)) {
        printf "%-15s%45s%10s%10s\n", "vhost", "queue", "messages", "consumers"
        for(q in queue_messages) {
            printf "%-15s%45s%10s%10s\n", queue_vhost[q], q, queue_messages[q], queue_consumers[q]
        }
    }
    print_title("queue master count by node")
    for(n in queue_master) { print n ": \t" queue_master[n] }
}
# }}}

20221219 - remove a rabbitmq unit

juju deploy cs:rabbitmq-server-117 -n 3 --config min-cluster-size=3 --config source=cloud:xenial-queens --series=xenial rabbitmq-server
juju run-action rabbitmq-server/3 pause --wait
juju run-action --wait rabbitmq-server/0 forget-cluster-node node=rabbit@juju-2649d1-rb-3
juju remove-unit rabbitmq-server/3 --force
juju add-unit rabbitmq-server
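
The cluster membership can then be double-checked from one of the remaining units, e.g.:

juju run --unit rabbitmq-server/0 'rabbitmqctl cluster_status'
juju status rabbitmq-server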

20230627 - rabbitmq cluster brain split

A customer's rabbitmq cluster stretched across a WAN kept going down.
1, One factor seen was inactivity_probe (no response to inactivity probe):

sudo ovs-vsctl set Controller br-tun inactivity_probe=30000
sudo ovs-vsctl set Controller br-int inactivity_probe=30000
sudo ovs-vsctl set Controller br-data inactivity_probe=30000
sudo ovs-vsctl set Controller br-lb inactivity_probe=30000
sudo ovs-vsctl set Controller br-tun max_backoff=5000
sudo ovs-vsctl set Controller br-int max_backoff=5000
sudo ovs-vsctl set Controller br-data max_backoff=5000
sudo ovs-vsctl set Controller br-lb max_backoff=5000

2, There were also MTU problems (br-int: dropped over-mtu packet: 4044 > 1380):

juju config neutron-api global-physnet-mtu
juju config neutron-api path-mtu
juju config neutron-api physical-network-mtus
openstack network show <your network>
openstack network show <your network> |grep mtu
ovs-vsctl --columns=mtu_request list interface
sudo ovs-vsctl set int br-int mtu_request=1500
openstack network set --mtu 9000 <your network>

3, There was also a mysql cluster factor ('Cluster is inaccessible from this instance')

4, But the time (clock) factor is the biggest suspect:

cat sosreport*/date
for i in $(ls sosreport*/sos_commands/date/hwclock); do echo -n "$i "; cat $i; done
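
On live nodes the clock factor can be checked further with standard tools, e.g.:

timedatectl status
chronyc sources -v   # or: ntpq -p, depending on which time daemon is in use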