早上起床,手机收到不少报警内容:
Trigger: Disk I/O is overloaded on nginx_189
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. CPU iowait time (nginx_189:system.cpu.util[,iowait]): 36.59 %
2014-02-28 09-17
我台服务器主要功能是nginx负载从+静态服务器+NFS(文件服务器),也就是把图片等网页静态内容,用nginx来解析出去。开始以为,这IO会不会是因为FNS造成的?打开站长统计平台,看了下最近的流量统计,结果今天的流量和平时差不多,所以,不可能是因为NFS造成的。打开top看下,如:
top - 09:00:13 up 299 days, 15:18, 1 user, load average: 3.42, 6.62, 7.22
Tasks: 676 total, 1 running, 675 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.8%sy, 0.0%ni, 88.9%id, 9.7%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16323624k total, 16095648k used, 227976k free, 429972k buffers
Swap: 4194296k total, 89880k used, 4104416k free, 13756348k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18364 root 20 0 78824 2552 2372 S 2.0 0.0 104:04.11 master
24359 root 20 0 242m 5616 1008 S 1.7 0.0 84:37.57 rsyslogd
5471 www 20 0 563m 43m 968 S 1.3 0.3 244:38.98 nginx
2532 mysql 20 0 6544m 87m 3484 S 1.0 0.6 74:23.93 mysqld
18928 postfix 20 0 78912 3668 2652 S 1.0 0.0 0:04.17 trivial-rewrite
2632 root 20 0 0 0 0 S 0.7 0.0 125:53.99 nfsd
5466 www 20 0 563m 33m 968 S 0.7 0.2 245:24.31 nginx
5467 www 20 0 562m 32m 968 S 0.7 0.2 246:10.91 nginx
18367 postfix 20 0 92796 16m 2436 S 0.7 0.1 66:19.27 qmgr
19024 postfix 20 0 78784 3244 2404 S 0.7 0.0 0:00.11 error
20291 postfix 20 0 78808 3208 2380 S 0.7 0.0 0:00.07 bounce
20308 postfix 20 0 78808 3212 2380 D 0.7 0.0 0:00.06 bounce
20627 root 20 0 15428 1704 956 R 0.7 0.0 0:00.08 top
1267 root 20 0 0 0 0 D 0.3 0.0 28:54.38 jbd2/sda5-8
5473 www 20 0 564m 44m 968 S 0.3 0.3 245:18.47 nginx
5475 www 20 0 561m 41m 968 S 0.3 0.3 245:24.89 nginx
5478 www 20 0 564m 44m 972 S 0.3 0.3 245:15.26 nginx
5480 www 20 0 560m 40m 968 S 0.3 0.3 244:23.20 nginx
18007 postfix 20 0 78976 3588 2688 S 0.3 0.0 0:00.07 smtp
18061 postfix 20 0 78976 3580 2684 S 0.3 0.0 0:00.05 smtp
18116 postfix 20 0 78784 3244 2404 S 0.3 0.0 0:00.11 error
再看vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 21 89968 174376 435440 13724224 0 0 0 2344 10295 14796 1 1 48 51 0
0 20 89968 172764 435472 13724476 0 0 0 2168 11165 14036 1 1 45 53 0
0 15 89968 174144 435500 13724932 0 0 0 2284 10924 14465 1 1 56 41 0
1 14 89968 173416 435524 13725220 0 0 0 2236 11184 14114 1 1 50 48 0
0 23 89968 173664 435548 13725568 0 0 0 2124 10656 14672 1 1 52 46 0
0 21 89968 173424 435572 13725868 0 0 0 2068 9471 12507 1 1 50 48 0
0 21 89968 169780 435592 13726180 0 0 0 2440 11110 13856 1 1 54 44 0
0 20 89968 171068 435644 13726592 0 0 0 3052 11531 16560 1 1 47 51 0
1 21 89968 170284 435672 13726868 0 0 0 2532 10400 18581 1 1 47 50 0
0 4 89968 171284 435696 13727168 0 0 0 2352 10578 19785 1 1 42 55 0
1 18 89968 169640 435740 13727576 0 0 0 2296 11212 19899 1 2 52 46 0
0 22 89968 170520 435752 13727868 0 0 0 1888 10769 20069 1 1 52 46 0
0 22 89968 169844 435772 13728196 0 0 0 2700 11599 20887 1 2 43 54 0
0 26 89968 169388 435804 13728508 0 0 0 1852 10066 22187 1 1 38 60 0
1 33 89968 167868 435828 13728788 0 0 0 2280 10427 17010 1 1 31 67 0
0 28 89968 164760 435856 13729048 0 0 0 2240 9965 18557 1 1 35 63 0
0 23 89968 162884 435872 13729348 0 0 0 1408 10064 18835 1 1 40 58 0
0 30 89968 163520 435884 13729620 0 0 0 1996 9547 17489 1 1 45 53 0
0 4 89968 162344 435924 13729912 0 0 0 1740 10241 17469 1 1 41 56 0
1 22 89968 159048 435936 13730084 0 0 0 6552 9550 14428 1 1 41 56 0
1 21 89968 161040 435980 13730284 0 0 0 1336 9299 19970 1 2 41 57 0
0 1 89968 162676 435988 13730336 0 0 0 2408 4878 8640 0 1 71 28 0
0 23 89968 163336 436004 13730504 0 0 0 2164 8385 29363 1 3 59 36 0
1 20 89968 165344 436020 13730880 0 0 0 1880 10232 13989 1 1 50 48 0
0 15 89968 162088 436032 13731188 0 0 0 1940 11081 11315 1 1 53 45 0
0 20 89968 157988 436068 13731504 0 0 0 2288 11366 14245 1 1 54 44 0
从进程中看到,好像是postfix有问题。我这postfix主要是用来给程序发达邮件用的,如报警,程序外发邮件等。平时postfix进程不会像现在这样异常,这在postf主进程CPU占用高,其它的相关进程,也占用很高,如smtp,qmgr等。当然,也看到了不少error的进程,而且属主是postfix,所以可以判断是postfix出问题了,为了证实,把postfix服务停掉,观察top状态,服务器负载都下去了,而且磁盘写入,上下文切换都降低了,所以,postfix异常是正确的。
这是刚停止postfix以后的vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 88736 600328 448020 13682256 0 0 0 5 0 0 0 0 99 0 0
0 0 88736 600164 448020 13682256 0 0 0 0 3500 725 0 0 99 0 0
0 0 88736 600448 448020 13682256 0 0 0 0 4008 842 0 0 99 0 0
0 0 88736 600544 448020 13682256 0 0 0 0 3197 668 0 0 100 0 0
0 0 88736 600296 448020 13682256 0 0 0 0 2736 613 0 0 100 0 0
0 0 88736 601216 448020 13682256 0 0 0 0 3311 679 0 0 100 0 0
0 0 88736 600860 448020 13682256 0 0 0 0 3268 719 0 0 99 0 0
0 0 88736 600896 448020 13682256 0 0 0 16 2818 698 0 0 100 0 0
0 0 88736 600268 448020 13682256 0 0 0 0 2998 592 0 0 99 0 0
0 0 88736 601060 448020 13682256 0 0 0 0 3163 596 0 0 99 0 0
0 0 88736 600680 448020 13682256 0 0 0 24 3092 657 0 0 99 0 0
0 0 88736 600952 448020 13682256 0 0 0 0 2628 489 0 0 100 0 0
0 0 88736 600080 448020 13682256 0 0 0 0 3192 629 0 0 99 0 0
0 0 88736 600492 448020 13682256 0 0 0 16 3448 634 0 0 99 0 0
0 0 88736 600596 448020 13682256 0 0 0 0 4024 683 1 0 99 0 0
0 0 88736 599820 448020 13682256 0 0 0 0 2931 552 0 0 100 0 0
0 0 88736 600332 448020 13682256 0 0 0 0 3147 665 0 0 99 0 0
0 0 88736 600880 448020 13682256 0 0 0 0 2971 619 0 0 99 0 0
0 0 88736 600096 448020 13682256 0 0 0 0 3233 602 0 0 99 0 0
0 0 88736 600752 448020 13682256 0 0 0 0 2872 518 0 0 100 0 0
1 0 88736 600364 448020 13682256 0 0 0 0 3152 648 0 0 99 0 0
0 0 88736 601008 448020 13682256 0 0 0 0 2251 504 0 0 100 0 0
0 0 88736 601132 448020 13682256 0 0 0 0 2834 624 0 0 99 0 0
0 0 88736 601536 448020 13682256 0 0 0 0 2747 600 0 0 100 0 0
0 0 88736 600904 448020 13682256 0 0 0 0 2758 529 0 0 99 0 0
1 0 88736 601168 448020 13682256 0 0 0 0 2883 627 0 0 100 0 0
因为对postfix不是很熟悉,所以,在网上找了一篇文章:
参考文档:http://www.myhack58.com/Article/60/sort095/2012/35334.htm
按他的思路和方法,我打开maillog日志,发现日志是写入很快,用tail -f可以看是狂刷屏,而且日志文件有几个G大了。日志大概如上:
Feb 28 10:08:27 mail postfix/error[10442]: DC467C1AE4: to=<gmiglicco@comcast.net>, relay=none, delay=12332, delays=12324/3.5/0/4, dsn=4.0.0, status=deferred (delivery temporarily suspended: host mx1.comcast.net[68.87.26.147] refused to talk to me: 554 imta33.westchester.pa.mail.comcast.net comcast 124.172.223.189 Comcast block for spam. Please see http://postmaster.comcast.net/smtp-error-codes.php#BL000000)
Feb 28 10:08:27 mail postfix/error[10943]: 6BA87C6222: to=<tinajones10@yahoo.com>, relay=none, delay=760, delays=751/9.2/0/0.18, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with mta5.am0.yahoodns.net[66.196.118.33] while sending RCPT TO)
Feb 28 10:08:27 mail postfix/error[10397]: 6C12AC5559: to=<fisher.price30@yahoo.com>, relay=none, delay=2988, delays=2979/9.1/0/0.18, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with mta5.am0.yahoodns.net[66.196.118.33] while sending RCPT TO)
Feb 28 10:08:27 mail postfix/error[12092]: 65E5FC155C: to=<chef.jr@hotmail.com>, relay=none, delay=13123, delays=13114/8.6/0/0.89, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with mx3.hotmail.com[65.54.188.126] while sending RCPT TO)
Feb 28 10:08:27 mail postfix/error[10392]: 9F869C19FA: to=<fmdry@aol.com>, relay=none, delay=12457, delays=12446/8.3/0/2.3, dsn=4.7.1, status=deferred (delivery temporarily suspended: host mailin-04.mx.aol.com[64.12.88.132] refused to talk to me: 421 4.7.1 : (DYN:T1) http://postmaster.info.aol.com/errors/421dynt1.html)
Feb 28 10:08:27 mail postfix/error[10423]: 6CE90C24F9: to=<nothingbetter0811@yahoo.com>, relay=none, delay=10968, delays=10959/8.5/0/0.89, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with mta5.am0.yahoodns.net[66.196.118.33] while sending RCPT TO)
Feb 28 10:08:27 mail postfix/error[10942]: 659D0C3636: to=<chdp@yahoo.com>, relay=none, delay=8958, delays=8949/8.6/0/0.89, dsn=4.4.2, status=deferred (delivery temporarily suspended: lost connection with mta5.am0.yahoodns.net[66.196.118.33] while sending RCPT TO)
然后用postcat -q命令查看其中一封邮件的内容: 如 postcat -q 659D0C3636,可以看到,邮件是用test@xxx.com发送的,再仔细分析日志,确实是挺多from test@xxx.com记录。登陆test@xxx.com邮箱,发现有8000多封退信,而且还在增加中。到此,问题所在可以真正找出来了。因为test@xxx.com是以前开发程序时所创建的帐号,有可能是密码设的太简单,被人破解,被用于狂发邮件。果断把test@xxx.com封掉。然后再用postsuper -d ALL 把其它postfix缓存清空,到此故障完全排除。有些排查和解决步骤,那篇文章中出现的,我这里就没写出来了,思路是借鉴他的。
您可以选择一种方式赞助本站
支付宝扫一扫赞助
微信钱包扫描赞助
赏