Copyright notice: This article may be freely reposted; when reposting, please include a hyperlink to the original source, the author information, and this copyright notice (Author: Zhang Hua, published: 2018-11-06)
Problem
The neutron designate logs show the error "Got lower serial for", and newly created zones stay in the PENDING state forever.
Solution
The following logs can be seen in minidns:
var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.016 94713 INFO designate.mdns.handler [req-26d8910d-61d5-4fd6-bc6b-df7acaf36c12 - - - - -] NotFound, refusing. Question was xxx.openstack-au-east-2.oc.xxx.com. IN SOA
var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.024 94713 WARNING designate.mdns.handler [req-fa5518dd-9506-44f8-a2ee-ee7d79ffaa3c - - - - -] ZoneNotFound while handling axfr request. Question was xxx.openstack-au-east-2.oc.xxx.com. IN AXFR: ZoneNotFound: Could not find Zone
According to the code [1], the DB query raises ZoneNotFound, so the SOA record cannot be built and therefore the AXFR response cannot be built either. As a result the minidns master and the bind9 slave cannot perform a zone transfer, the serial number on the bind9 slave is never updated, and eventually "Got lower serial for" shows up when minidns sends NOTIFY.
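The logic in [1] behaves roughly as follows. This is a simplified sketch rather than the actual designate code: ZoneNotFound stands in for designate.exceptions.ZoneNotFound, storage.find_zone stands in for designate's storage API, and dnspython is assumed for the DNS message handling.
import dns.message
import dns.rcode

class ZoneNotFound(Exception):
    """Stand-in for designate.exceptions.ZoneNotFound."""

def handle_axfr(request, storage, context):
    # Simplified sketch of the behaviour described above: if the zone lookup
    # fails, no SOA (and hence no AXFR answer) can be built, so the request
    # is answered with REFUSED and the slave never learns the new serial.
    zone_name = request.question[0].name.to_text()
    response = dns.message.make_response(request)
    try:
        zone = storage.find_zone(context, {'name': zone_name})
    except ZoneNotFound:
        # corresponds to the "ZoneNotFound while handling axfr request" warning
        response.set_rcode(dns.rcode.REFUSED)
        return response
    # ... build the SOA from zone plus the full record set for the transfer ...
    return response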
Near the ZoneNotFound occurrence there is the following log:
[req-b48612b0-ed3e-46d9-8510-6634282ef0a2 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']
So the above DB error was reproduced with the following test program:
import time
from sqlalchemy.engine import create_engine
url = 'mysql+pymysql://root@127.0.0.1:3306/mysql'
engine = create_engine(url, pool_recycle=4).connect()
query = 'SELECT NOW();'
while True:
    print('Q1', engine.execute(query).fetchall())
    engine.execute('SET wait_timeout=2')
    time.sleep(3)
    print('Q2', engine.execute(query).fetchall())
Or use the following oslo.db test program:
mysql -u root -p -e "SET GLOBAL wait_timeout=5, slow_query_log=on, long_query_time=0.0;"
$ cat test.py
import sys
import time
from oslo_config import cfg
from oslo_db import options as db_options
from oslo_db.sqlalchemy import session as db_session
from sqlalchemy.sql.expression import select
_facade = db_session.EngineFacade("mysql://root@localhost/test")
x = _facade.get_session()
print(x.scalar(select([23])))
time.sleep(5)
print(x.scalar(select([23])))
The connection pool's recycle time (pool_recycle: idle connections in the pool are released after this period; the oslo default is 3600) must not be larger than the server-side wait_timeout (2 in the test above; the default is 8 hours). So wait_timeout should be set to a value greater than 3600, or the timeout configured in haproxy should be greater than 3600.
wait_timeout is a MySQL setting used to close idle connections from the server side. When a connection has been idle for longer than wait_timeout, MySQL proactively closes it so the slot can be used by other clients. The value is normally 28800, i.e. 8 hours, and can be checked in MySQL with: show variables like "%timeout%";
In addition, when the server closes a connection, the MySQL client is not told that the connection has been cut, so the application has no idea it is invalid. If the client does not support reconnecting either, the two problems combined can cause the connection pool to hand out stale connections.
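One way to guard against the pool handing out stale connections is to have SQLAlchemy recycle and pre-check them. A minimal sketch, assuming plain SQLAlchemy with pymysql (the URL and values are examples only):
# Keep pool_recycle below the server-side wait_timeout so idle connections are
# refreshed before MySQL drops them, and enable pool_pre_ping so a stale
# connection is detected and replaced before a real query is sent on it.
from sqlalchemy import text
from sqlalchemy.engine import create_engine

engine = create_engine(
    'mysql+pymysql://root@127.0.0.1:3306/mysql',
    pool_recycle=3600,    # must stay below MySQL's wait_timeout
    pool_pre_ping=True,   # cheap liveness check before reusing a pooled connection
)
with engine.connect() as conn:
    print(conn.execute(text('SELECT NOW()')).fetchall())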
Miscellaneous
mysqldump --single-transaction -u root -p designate --skip-extended-insert > /tmp/designate-$(date +%s).sql
For the error 'Lost connection to MySQL server during query':
wait_timeout is a MySQL setting used by the server side to proactively disconnect idle client connections. The default is 8 hours (28800); it can be checked with 'show variables like "%timeout%";' or 'juju config mysql wait-timeout'.
All of the timeouts below should be less than wait_timeout; if any of them is greater than wait_timeout, the error 'Lost connection to MySQL server during query' will occur.
1, the default value of oslo's connection_recycle_time is 3600
2, the timeout in haproxy.cfg
So we need to collect the following info for further analysis (a scripted comparison sketch follows the list below).
1, juju config mysql wait-timeout
2, mysql -unova -p -h<MYSQL_IP> -e 'show variables like "%timeout%"' #eg: run it in nova-cloud-controller/0
3, juju ssh neutron-api/0 -- sudo grep -r 'connection_recycle_time' /etc/neutron/
4, juju ssh neutron-api/0 -- sudo grep -r 'timeout' /etc/haproxy/
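The comparison between wait_timeout and connection_recycle_time can also be scripted. A rough sketch, assuming pymysql and a local neutron.conf (host, credentials and config path are examples only):
# Read MySQL's wait_timeout and compare it with oslo.db's
# connection_recycle_time (default 3600).
import configparser
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='')
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'wait_timeout'")
    wait_timeout = int(cur.fetchone()[1])

cfg = configparser.ConfigParser(strict=False, interpolation=None)
cfg.read('/etc/neutron/neutron.conf')
recycle = cfg.getint('database', 'connection_recycle_time', fallback=3600)

if wait_timeout <= recycle:
    print('wait_timeout (%d) <= connection_recycle_time (%d): expect '
          '"Lost connection to MySQL server during query"' % (wait_timeout, recycle))
else:
    print('wait_timeout (%d) > connection_recycle_time (%d): OK' % (wait_timeout, recycle))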
Some commands
#This will show which openstack service is using the most mysql connections
select user, count(*) from information_schema.processlist group by user;
juju run --application mysql leader-get
juju run --application mysql "mysql -uroot -pChangeMe123 -e \"SELECT IFNULL(usr,'All Users') user,IFNULL(hst,'All Hosts') host,COUNT(1) Connections FROM (SELECT user usr,LEFT(host,LOCATE(':',host) - 1) hst FROM information_schema.processlist WHERE user NOT IN ('system user','root')) A GROUP BY usr,hst WITH ROLLUP;\""
Appendix
All of the OpenStack services show a large number of errors like this:
var/log/mysql/error.log:2019-08-29T16:07:45.335406Z 420821 [Note] Aborted connection 420821 to db: 'keystone' user: 'keystone' host: '10.191.5.49' (Got timeout reading communication packets)
After asking the customer to increase connect_timeout, net_read_timeout, net_write_timeout and interactive_timeout, 'show global status like "aborted%";' shows that the Aborted_clients count is still growing.
show global variables like 'max_conn%';
show global variables like '%timeout%';
show global status like 'aborted%';
show processlist;
At that point we found that at 21:55:48 all of the designate servers received the MySQL error. MySQL's query log shows that between 21:04 and 21:16 MySQL received 40 gnocchi queries, each of which returned about 500MB, and then the warning appeared: "InnoDB: Warning: difficult to find free blocks in the buffer pool (338 search iterations)!"
So it looks like an I/O swamp problem that in turn causes query timeouts. Possible fixes:
1, Increase innodb_buffer_pool_size (controlled via dataset-size in juju); see the check sketch after this list
2, Delete the related data in gnocchi
3, Investigate whether designate keeps retrying when it hits a MySQL timeout and thereby floods the DB
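Before resizing the buffer pool, the pressure on it can be confirmed from the InnoDB status counters. A hedged sketch assuming pymysql (host and credentials are examples only):
# Poll the InnoDB buffer pool counters to see whether free pages are scarce,
# which matches the "difficult to find free blocks in the buffer pool" warning.
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='')
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%'")
    pages = dict(cur.fetchall())
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size'")
    size = int(cur.fetchone()[1])

total = int(pages['Innodb_buffer_pool_pages_total'])
free = int(pages['Innodb_buffer_pool_pages_free'])
print('innodb_buffer_pool_size: %d MB, free pages: %d/%d (%.1f%%)'
      % (size // 1024 // 1024, free, total, 100.0 * free / total))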
20201014 update
amphora-agent writes heartbeats to UDP, then octavia-health-manager reads the heartbeats from UDP and writes them to the DB; finally health_manager reads the heartbeats from the DB via get_stale_amphora to decide whether a failover process should be started.
When the number of health-manager threads (pgrep -af /usr/bin/octavia-health-manager | wc -l, 160) is greater than the number of MySQL connections (ss -tp | grep -c :mysql, 90), the get_stale_amphora call above fails with the "Lost connection to MySQL server during query" error (after ruling out wait-timeout != 3600 as a factor).
When sqlalchemy sees this error it reconnects, which then produces the error mentioned below, "QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10" (http://sqlalche.me/e/3o7r). sqlalchemy's reconnect may not be a replacement: it removes the old connection and then creates a new one, and once max_pool_size + max_overflow = 20 connections are reached there is never a new connection available for it to reconnect with (sqlalchemy uses max_pool_size=10; when more than 5 are in use at once the default max_overflow=10 kicks in, allowing up to 15 connections, but no overflow is permitted beyond 20). This may be a bug - https://github.com/sqlalchemy/sqlalchemy/issues/5308
This situation can probably occur when MySQL fails over, because the new VIP owner does not have the old TCP connections, which causes those old TCP connections to go stale and then triggers reconnects.
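The pool exhaustion itself is easy to reproduce outside of octavia. A minimal sketch using plain SQLAlchemy; the URL is an example and the pool sizes are shrunk so the failure happens quickly (octavia would hit it at 10 + 10 = 20):
# Check out more connections than pool_size + max_overflow allows and never
# return them; the final checkout fails with
# "QueuePool limit of size ... overflow ... reached, connection timed out".
from sqlalchemy.engine import create_engine

engine = create_engine(
    'mysql+pymysql://root@127.0.0.1:3306/mysql',
    pool_size=2,
    max_overflow=1,
    pool_timeout=3,   # seconds to wait for a free connection before giving up
)
held = [engine.connect() for _ in range(3)]   # 2 pooled + 1 overflow, pool is now full
engine.connect()   # raises sqlalchemy.exc.TimeoutError after pool_timeout seconds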
Possible solutions:
1, If MySQL no longer fails over, this problem probably no longer occurs either; this can wait for the following fix
2, As a base level, request a stable backport for the following patch
The number of health-manager eventlet/greenlet threads is determined by the health_update_threads parameter, which defaults to the number of CPU cores. In most cases the connections are short-lived and multiplexed, so there is no problem, but various factors can sometimes turn some of them into long-lived connections. Charms generally apply a 0.25x multiplier to worker threads (with a maximum of 4 inside an LXD container); the octavia charm should also use this multiplier to reduce the number of worker threads - https://bugs.launchpad.net/charm-octavia/+bug/1889731
3, We could additionally reconfigure the oslo.db [database] section to include a pool overflow limit at least as large as the number of threads expected from worker-multiplier (see the sketch after this list)
4, We could additionally open a bug against sqlalchemy that it should allow for connection replacement even when the pool limit is reached
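For option 3, a hedged sketch of what that sizing looks like at the SQLAlchemy level; the oslo.db [database] options max_pool_size, max_overflow and pool_timeout map onto the same engine parameters (numbers and URL are examples only):
# Make the overflow at least as large as the number of worker threads expected
# from worker-multiplier, so every thread can hold a connection at the same time.
import multiprocessing
from sqlalchemy.engine import create_engine

expected_threads = multiprocessing.cpu_count()   # health_update_threads defaults to the CPU count
engine = create_engine(
    'mysql+pymysql://octavia:password@127.0.0.1:3306/octavia',
    pool_size=10,
    max_overflow=max(10, expected_threads),   # at least as many as the expected threads
    pool_timeout=10,
)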
CLI
export MYSQL_PASSWORD=ChangeMe123
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.instances where host='juju-38b529-ovn-6.cloud.sts'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.instance_extra where instance_uuid='0761d1de-7acd-4781-ae8b-f5ba864ab6ec'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova.compute_nodes where hypervisor_hostname='p2-bits-cloud-xxx.maas' or hypervisor_hostname='p2-bits-cloud-xxx.maas'\G\""
juju run --application mysql "mysql -u root -p$MYSQL_PASSWORD -e \"select * from nova_api.request_specs where instance_uuid='0761d1de-7acd-4781-ae8b-f5ba864ab6ec'\G\""
pager less -S
mysql backup using juju
juju run-action --wait mysql/leader backup
juju ssh mysql/leader
sudo cp -r /opt/backups/mysql mysql-backup-$(date +%F)
sudo chown ubuntu -R mysql-backup-$(date +%F)/
juju scp -- -r mysql/leader:~/mysql-backup-$(date +%F) backups/
juju run-action --wait nova-cloud-controller/leader archive-data
bring mysql-innodb-cluster back
juju run-action mysql/leader reboot-cluster-from-complete-outage
juju run-action mysql/1 rejoin-instance
mysql-innodb-cluster
In single-primary mode, an InnoDB cluster has only one read-write server instance (the single primary). The other nodes are replicas of the single primary. If the primary fails, the replicas elect a new single primary. MySQL Router detects this change and redirects client applications to the new primary. Advanced users can also configure the cluster in multi-primary mode.
- MySQL 5.7 introduced Group Replication, which provides automatic primary election across a group of MySQL servers, forming a one-primary-multiple-replica topology. With advanced configuration, a multi-primary topology is also possible.
- MySQL InnoDB Cluster is a high-availability solution built on top of group replication, with MGR as its foundation. MGR (MySQL Group Replication) has two modes, Single-Primary and Multi-Primary. Note: in Multi-Primary mode all nodes are primaries and can be read from and written to at the same time. This looks better, but because of the complexity of multiple primaries there are some usage restrictions in that mode, for example Foreign Keys with Cascading Constraints are not supported.
- MySQL Router is a lightweight transparent middleware that automatically learns the state of the cluster and routes SQL statements to an appropriate MySQL backend for execution. Within an InnoDB cluster, MySQL Router mainly acts as a kind of virtual IP for the database cluster.
- MySQL Shell is the client that provides management of, and access to, the cluster, supporting JS, Python and SQL modes. It can be regarded as the next-generation mysql client.
Installation:
juju add-model mysql
juju add-machine -n 4
juju deploy --to lxd:0 cs:keystone
juju deploy -n 3 --to lxd:1,lxd:2,lxd:3 mysql-innodb-cluster
juju deploy mysql-router keystone-mysql-router
juju add-relation keystone-mysql-router:db-router mysql-innodb-cluster:db-router
juju add-relation keystone-mysql-router:shared-db keystone:shared-db
1, ssh into mysql-innodb-cluster/leader to run 'pkill juju' if the following command hangs
juju run -u mysql-innodb-cluster/leader leader-get mysql.passwd
2, log into the mysql-innodb shell
$ mysql-shell.mysqlsh --uri root@localhost:3306 -p$(juju run -u mysql-innodb-cluster/leader leader-get mysql.passwd)
$mysql-shell.mysqlsh
# we have 3 modes (\sql, \py, \js)
\connect localhost:3306
\js
mysql-js> var cluster = dba.rebootClusterFromCompleteOutage();
mysql-js> cluster.status();
mysql-js> \sql
http://shrenikp.blogspot.com/2018/10/mysql-innodb-cluster-troubleshooting.html
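The same check in mysqlsh's Python mode (\py) looks roughly like this; a hedged sketch, assuming mysqlsh's snake_case Python API names:
mysql-js> \py
mysql-py> cluster = dba.get_cluster()   # or dba.reboot_cluster_from_complete_outage() after an outage
mysql-py> cluster.status()
mysql-py> \sql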
Update: when logging in as root as above, be sure to use sudo
juju run -u mysql/leader leader-get mysql.passwd
juju ssh mysql/leader -- sudo -s
sudo mysql -u root -p
pager less -S
#eg:
juju ssh mysql/leader -- sudo mysql --database=octavia --user=root --password=${DB_PASS}
juju ssh mysql/leader -- "sudo mysql --database=octavia --user=root --password=${DB_PASS} --execute=\"select * from load_balancer;\""
20220505 update
2022-05-04 08:10:05.959 791819 ERROR nova.servicegroup.drivers.db oslo_messaging.rpc.client.RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1040, 'Too many connections to MySQL Router')
juju run -a nova-cloud-controller -- sudo ss -Onatp state connected \'sport = 3306\'
juju run -a nova-cloud-controller -- grep max_total_connections /var/lib/mysql/nova-cloud-controller-mysql-router/mysqlrouter.conf
# /var/lib/mysql/nova-cloud-controller-mysql-router/mysqlrouter.conf
[DEFAULT]
...
max_total_connections = 1024
and make sure unattended-upgrades is disabled in all units running mysql-router - https://bugs.launchpad.net/charm-mysql-router/+bug/1971565
pass=`juju run --unit mysql/0 'leader-get root-password'`
juju ssh mysql/0 "mysql -uroot -p$pass -e \"show status like 'wsrep_cluster%';\""
juju ssh mysql/0 "mysql -uroot -p$pass -e \"select * from information_schema.processlist order by db, time desc;\""
Another script
#!/bin/bash
PASSWD=$(juju run --unit mysql-innodb-cluster/leader leader-get mysql.passwd)
for x in $(seq 0 900)
do
    date
    echo "select * from designate.zone_masters" | juju ssh mysql-innodb-cluster/leader sudo mysql -N -u root -p$PASSWD 2> /dev/null
    sleep 1
done
20220901 - backup db
pass=`juju run --unit mysql/0 'leader-get root-password'`
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases keystone --quick --result-file=/tmp/keystone.sql"
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases neutron --quick --result-file=/tmp/neutron.sql"
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases keystone octavia nova --quick --result-file=/tmp/databases.sql"
Reference
[1] https://github.com/openstack/designate/blob/stable/queens/designate/mdns/handler.py#L233
[2] https://blog.csdn.net/quqi99/article/details/119042087