在亚马逊云买了多台的虚拟主机,一年多没有由于系统的原因出过故障。今天碰见了。
早上接到报警,从业务故障上来看,应该是数据库没有响应了。
SSH连数据库服务器,发现连不上。
重启数据库服务器,一直起不来。
最后用上周的数据库服务器的系统备份snapshot(我们的数据盘和系统盘是分开的)新建一个Volume,替换掉故障系统盘,重新启动服务器,才顺利进入系统。在用新的Volume挂靠服务器的时候,一定要记得,设备名称要和原来的系统Volume的名称一致,服务器才能顺利启动:
?
Root device ?/dev/sda1
?
MYSQL服务启动后,所有服务按照顺序重启一遍,业务恢复。
?
从业务日志上来看,在6月23日22:06分左右出现大量慢sql日志,非常简单的sql语句都会等待10秒以上,随后不久出现大量redis拒绝服务的错误日志。(Redis和数据库服务部署在同一台机器上),最终导致ssh都连接不上,系统处于宕机状态。
?
分析当时的CPU和磁盘IO都处于健康状态。
?
分析Linux系统/var/log/message的日志,发现在Jun 23 08:06:33开始就出现JAVA的OOM
?
Jun 23 08:06:33 App-01 kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jun 23 08:06:33 App-01 kernel: java cpuset=/ mems_allowed=0
Jun 23 08:06:33 App-01 kernel: Pid: 30096, comm: java Not tainted 2.6.32-431.17.1.el6.x86_64 #1
?
然后出现出现各种系统错误日志。
?
分析/var/log/mysqld.log,发现在23点左右出现以下日志:
?
InnoDB: Warning: a long semaphore wait:
--Thread 140539730044672 has waited at btr0cur.cc line 257 for 1775.0 seconds the semaphore:
X-lock on RW-latch at 0x7fd1f7f589c0 '&block->lock'
number of readers 0, waiters flag 0, lock_word: 100000
Last time read locked in file btr0cur.cc line 257
Last time write locked in file /home/mysql/storage/innobase/btr/btr0cur.cc line 362
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
?
=====================================
2015-06-23 23:02:52 7fd1eb5fe700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 1571 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 2215108 srv_active, 0 srv_shutdown, 164283 srv_idle
srv_master_thread log flush and writes: 2379391
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 815307
OS WAIT ARRAY INFO: signal count 905349
Mutex spin waits 13126865, rounds 31704200, OS waits 415905
RW-shared spins 503431, rounds 12970865, OS waits 375385
RW-excl spins 58495, rounds 950616, OS waits 5991
Spin rounds per wait: 2.42 mutex, 25.76 RW-shared, 16.25 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 226393815
Purge done for trx's n:o < 226393810 undo n:o < 0 state: running but idle
History list length 1365
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 226393813, not started
MySQL thread id 18660718, OS thread handle 0x7fd0ec968700, query id 195313941 ip-172-31-7-84.ap-southeast-1.compute.internal 172.31.7.84 admin cleaning up
---TRANSACTION 226393812, not started
MySQL thread id 18660719, OS thread handle 0x7fd1e4e59700, query id 195313940 ip-172-31-7-84.ap-southeast-1.compute.internal 172.31.7.84 admin cleaning up
---TRANSACTION 226393627, not started
MySQL thread id 18660553, OS thread handle 0x7fd1e63ad700, query id 195314129 ip-172-31-7-84.ap-southeast-1.compute.internal 172.31.7.84 admin cleaning up
---TRANSACTION 226393626, not started
MySQL thread id 18660552, OS thread handle 0x7fd1e5cd2700, query id 195314130 ip-172-31-7-84.ap-southeast-1.compute.internal 172.31.7.84 admin cleaning up
---TRANSACTION 226393625, not started
MySQL thread id 18660617, OS thread handle 0x7fd0ec189700, query id 195314208 ip-172-31-7-84.ap-southeast-1.compute.internal 172.31.7.84 admin cleaning up
---TRANSACTION 226393624, not started
MySQL thread id 18660618, OS thread handle 0x7fd0ec148700, query id 195314209 ip-172-31-7-84.ap-southeast-1.compute.internal 172.31.7.84 admin cleaning up
?
?
分析结论:
?
业务高峰期,JAVA/REDIS/MYSQL对服务器内存的需求都到达顶峰(目前JAVA进程的内存没有限制),在所有JAVA进程都没有FULLGC之前,可能会存在一个内存峰值,引发Linux系统开始屠杀进程。
Linux屠杀进程的过程中往往抓个最大的开始搞,mysql,redis就非常容易被搞掉。
因为这些服务还会自动重启,Linux杀着杀着就把自己给杀蒙了,就宕机了。
?
应对策略:
?
1.买两台内存更大的服务器作为资源服务器,MYSQL和REDIS从应用应用服务器迁移到资源服务器。
2.对MYSQL和Redis做高可用架构应对单机宕机的情形。
3.调整inux内核信号量默认设置,让MYSQL能撑的时间长一点。
?
参考文档:
?
http://www.cnblogs.com/GoodGoodWorkDayDayUp/p/3473348.html
http://blog.csdn.net/wulantian/article/details/37560849
?