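For installing and configuring Nagios, see
This post covers how to combine Nagios monitoring with its event handler mechanism to repair failed services remotely. If a service is still unhealthy after the repair attempt, Nagios sends email and SMS alerts.
For the python_action.sh and python_action.py code, see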
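1. First, enable email alerts.
Download sendEmail, unpack it, and copy the sendEmail executable into /usr/bin. The commands below call /usr/local/bin/sendEmail, so adjust that path to match wherever you put the binary.
I never figured out sendmail, so I downloaded sendEmail instead.
Edit /usr/local/nagios/etc/objects/commands.cfg
and replace the original /bin/mail -s part with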
tail /usr/local/nagios/var/nagios.log | /usr/local/bin/sendEmail -f username@163.com -t $CONTACTEMAIL$ -s smtp.163.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu username -xp 123
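This uses the sendEmail client to send mail through the 163.com SMTP service. username is your 163 mailbox name and 123 is its password. $CONTACTEMAIL$ is the destination address, that is, the system administrator's mailbox set in the Nagios configuration. I send the last ten lines of nagios.log as the message body. Because tail is given a file name it ignores its standard input, so the printf text in the configuration below is discarded and only the log lines are mailed.
Here is my configuration: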
# 'notify-host-by-email' command definition
define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | tail /usr/local/nagios/var/nagios.log | /usr/local/bin/sendEmail -f username@163.com -t $CONTACTEMAIL$ -s smtp.163.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu username -xp 123
        }

# 'notify-service-by-email' command definition
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | tail /usr/local/nagios/var/nagios.log | /usr/local/bin/sendEmail -f username@163.com -t $CONTACTEMAIL$ -s smtp.163.com -u "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" -xu username -xp 123
        }

Once this is configured, mail carrying the log lines is sent to the specified mailbox.
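The idea of mailing the tail of the log can also be sketched in Python. This is only an illustration, not part of the setup above: it swaps sendEmail for the standard library, the function names tail_lines and build_alert are made up for this example, and it assumes Python 3. It builds the message but does not send it.

```python
from collections import deque
from email.message import EmailMessage


def tail_lines(path, count=10):
    """Return the last `count` lines of a text file as one string."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the final `count` lines while reading.
        return "".join(deque(handle, maxlen=count))


def build_alert(sender, recipient, notification_type, host, state, log_path):
    """Build a host alert whose body is the tail of the Nagios log."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    # Same subject layout as the notify-host-by-email command.
    message["Subject"] = "** %s Host Alert: %s is %s **" % (
        notification_type, host, state)
    message.set_content(tail_lines(log_path))
    return message
```

To actually deliver the message you would hand it to smtplib, for example by opening smtplib.SMTP("smtp.163.com"), calling login() with the mailbox credentials and then send_message(). Whether the provider accepts plain SMTP logins is something to check for your own account.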
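For SMS alerts, you can register a China Mobile 139 mailbox, which sends a text message to your phone automatically when mail arrives. Set the SMS notification to long-message mode and you can read the alert directly on the phone. Remember to change the email address in the Nagios configuration to the 139 mailbox address.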
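2. Use the Nagios event handler mechanism to monitor a specific process on Linux.
Edit /usr/local/nagios/etc/objects/localhost.cfg
These are the two services I configured: one checks Django's port 8000 over TCP, and the other checks the Django manage.py runserver process over SNMP.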
#check_django_tcp
define service{
        use                     local-service         ; Name of service template to use
        host_name               RedHat-host
        service_description     Django_TCP
        check_command           check_django_tcp!8000
        notifications_enabled   1
        event_handler_enabled   1
        event_handler           python_action
        }

#check_django_snmp
define service{
        use                     local-service         ; Name of service template to use
        host_name               RedHat-host
        service_description     Django_SNMP
        check_command           check_django_snmp!2c!public!.1.3.6.1.4.1.2021.54.101.2!"manage.py runserver"
        notifications_enabled   1
        event_handler_enabled   1
        event_handler           python_action
        }

Note these two settings:
event_handler_enabled 1
event_handler python_action
The first turns event handling on; the second names python_action as the handler.
I defined python_action in commands.cfg.
#'python_action'
define command{
        command_name    python_action
        command_line    $USER1$/python_action.sh "$HOSTNAME$,$SERVICEDESC$,$SERVICESTATE$,$SERVICESTATETYPE$,$SERVICEATTEMPT$"
        }

#'check_django_tcp'
define command{
        command_name    check_django_tcp
        command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
        }

#'check_django_snmp'
define command{
        command_name    check_django_snmp
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -P $ARG1$ -C $ARG2$ -o $ARG3$ -r $ARG4$
        }

python_action.sh is a script I wrote; it calls python_action.py.
Copy python_action.sh and python_action.py to
/usr/local/nagios/libexec/
and change the ownership with chown -R nagios:nagios /usr/local/nagios/*
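The handler command passes the host name, service description, state, state type and attempt number as one comma-joined argument, and python_action.py only acts when the state is CRITICAL and the attempt count has reached 3. A minimal Python 3 sketch of that parsing and decision follows. The names ServiceEvent, parse_event and should_restart are hypothetical, and the sketch assumes that neither the host name nor the service description contains a comma.

```python
from collections import namedtuple

# Field order follows the macros in the python_action command definition.
ServiceEvent = namedtuple(
    "ServiceEvent", "host service state state_type attempt")


def parse_event(service_info):
    """Split the single comma-joined handler argument into named fields."""
    host, service, state, state_type, attempt = service_info.split(",")
    return ServiceEvent(host, service, state, state_type, int(attempt))


def should_restart(event, min_attempt=3):
    """Restart only on CRITICAL, and only from the third check attempt on."""
    return event.state == "CRITICAL" and event.attempt >= min_attempt
```

For example, should_restart(parse_event("RedHat-host,Django_TCP,CRITICAL,SOFT,3")) is true, while the same string with attempt 1 or with state OK is not. Waiting for the third attempt gives the service a chance to recover on its own before a restart is issued.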
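python_action.sh code
#!/bin/bash
# Event handler wrapper: forwards the comma-separated service info to python_action.py
cd /usr/local/nagios/libexec
# Nagios passes the service info as a single argument
if [ $# -ge 1 ]; then
    service_info="$1"
    /usr/bin/python /usr/local/nagios/libexec/python_action.py "$service_info"
fi
python_action.py code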
# -*- coding: utf-8 -*-
# Python 2 script; needs pexpect-2.3, which also provides the pxssh module
import sys
from time import ctime

import pexpect
import pxssh

# Nagios host name -> [IP address, ssh user, ssh password]
machine_name_list = {
    "ubuntu-host": ["192.168.15.67", "root", "123"],
    "localhost": ["172.172.10.100", "root", "123"],
    "RedHat-host": ["192.168.15.67", "root", "123"],
}

# Nagios service description -> command that restarts the service
server_command_list = {
    "Django_TCP": "/usr/bin/python /root/dmdu_manage/manage.py runserver &",
    "SMTP": "/etc/init.d/sendmail restart",
    "Django_SNMP": "/usr/bin/python /root/dmdu_manage/manage.py runserver &",
}


def write_opt_log(service_info='None', command='None'):
    # Append the service info, a timestamp and the command to a log file
    try:
        f = open("service_opt_info.txt", 'a')
        info = [service_info, command]
        print info
        f.write("%s,%s\n" % (info[0], ctime()))
        f.write("%s\n" % (info[1]))
        f.write("\n")
        f.close()
    except Exception, e:
        print "Exception is ", e


def ssh_cmd(hostIP='172.172.10.101', username="root", password="kk", command=""):
    # Log in with pxssh and run the restart command on the remote host
    print "Now connecting %s" % (hostIP)
    print "Please Wait... ...\n"
    s = pxssh.pxssh()
    # auto_prompt_reset is a boolean flag, not a list of prompts
    s.login(hostIP, username, password, login_timeout=30,
            original_prompt="[$#>]", auto_prompt_reset=True)
    print "Start OS\n"
    s.sendline(command)
    s.prompt()
    print s.before
    s.sendline("exit")
    s.prompt()
    print s.before
    #s.logout()
    print "End OS \n"


def pexpect_cmd(hostIP='172.172.10.101', username="root", password="kk", command=""):
    # Alternative to ssh_cmd: run the command through a plain ssh child process
    print "Start OS \n"
    print "Please Wait... ...\n"
    ssh = pexpect.spawn('ssh -l %s %s %s' % (username, hostIP, command))
    r = ''
    try:
        # expect() patterns are regular expressions, so escape ( ) and ?
        i = ssh.expect(['[Pp]assword: ', r'continue connecting \(yes/no\)\?',
                        pexpect.EOF, pexpect.TIMEOUT])
        if i == 1:
            # Accept the host key, then wait for the password prompt
            ssh.sendline('yes')
            ssh.expect('[Pp]assword: ')
            i = 0
        if i == 0:
            ssh.sendline(password)
    except pexpect.EOF:
        ssh.close()
    else:
        r = ssh.read()
        ssh.expect(pexpect.EOF)
        ssh.close()
    print "End OS\n"
    return r


def restart_opt(service_info='None'):
    # service_info is "host,service,state,state_type,attempt"
    info_detail = service_info.split(',')
    hostname = info_detail[0]
    service_desc = info_detail[1]
    service_state = info_detail[2]
    service_state_type = info_detail[3]
    service_attempt = info_detail[4]
    hostIP = machine_name_list[hostname][0]
    username = machine_name_list[hostname][1]
    password = machine_name_list[hostname][2]
    command = server_command_list[service_desc]
    # Restart only when the service is CRITICAL on the third attempt or later
    if service_state == "CRITICAL" and int(service_attempt) >= 3:
        try:
            write_opt_log(service_info, command)
            ssh_cmd(hostIP, username, password, command)
            #pexpect_cmd(hostIP, username, password, command)
        except pxssh.ExceptionPxssh, e:
            print "ExceptionPxssh is", e


if __name__ == '__main__':
    service_info = sys.argv[1]
    restart_opt(service_info)

The script uses the pexpect library, so pexpect-2.3 has to be installed on the monitoring machine; it can be downloaded from the web.
tar -zxvf pexpect-2.3.tar.gz
cd pexpect-2.3
python setup.py install
Then edit pxssh.py with vim. Depending on the install it is at
/usr/local/lib/python2.6/dist-packages/pxssh.py
/usr/lib/python2.6/dist-packages/pxssh.py
Go to line 134. Before the first
self.read_nonblocking(size=10000,timeout=1) # GAS: Clear out the cache before getting the prompt
insert
self.sendline()
time.sleep(0.5)
so that it reads
self.sendline()
time.sleep(0.5)
self.read_nonblocking(size=10000,timeout=1) # GAS: Clear out the cache before getting the prompt
Without this change pxssh reports a timeout error.
Once this is installed, Python scripts that use pxssh can run.
3. Configure SNMP on the monitored host
To monitor a specific process on a Linux server, you can use the following approach.
On the monitored host, edit /etc/snmp/snmpd.conf and find this line:
exec .1.3.6.1.4.1.2021.54
Change it to
exec .1.3.6.1.4.1.2021.54 /bin/sh /root/test.sh
Create /root/test.sh with the following content, assuming the process to watch is Django's manage.py runserver:
#!/bin/bash
/bin/ps x | grep manage.py | awk '{print $6 " " $7;}'
Save and exit, then restart the SNMP service.
On the monitoring machine run
snmpwalk -v 2c -c public 192.168.15.67 .1.3.6.1.4.1.2021.54
and you should see output like this:
root@sifksky:/usr/local/nagios/libexec# snmpwalk -v 2c -c public 192.168.15.67 .1.3.6.1.4.1.2021.54
UCD-SNMP-MIB::ucdavis.54.1.1 = INTEGER: 1
UCD-SNMP-MIB::ucdavis.54.2.1 = STRING: "/bin/sh"
UCD-SNMP-MIB::ucdavis.54.3.1 = STRING: "/root/test.sh"
UCD-SNMP-MIB::ucdavis.54.100.1 = INTEGER: 0
UCD-SNMP-MIB::ucdavis.54.101.1 = STRING: "manage.py runserver"
UCD-SNMP-MIB::ucdavis.54.101.2 = STRING: "manage.py runserver"
UCD-SNMP-MIB::ucdavis.54.102.1 = INTEGER: 0
UCD-SNMP-MIB::ucdavis.54.103.1 = ""
If nothing comes back, make sure the firewall on the monitored host is turned off, or at least allows SNMP (UDP port 161).
Once you see UCD-SNMP-MIB::ucdavis.54.101.1 = STRING: "manage.py runserver", you can monitor the process from Nagios with
check_snmp -H 192.168.15.67 -P 2c -C public -o .1.3.6.1.4.1.2021.54.101.1 -s "manage.py runserver"
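To make the string comparison concrete, here is a rough Python 3 sketch that parses snmpwalk output like the listing above and turns it into a Nagios-style result. It is not how check_snmp is implemented; the names parse_snmpwalk and check_process and the message wording are invented for this example. Only the exit codes 0 (OK) and 2 (CRITICAL) follow the usual Nagios plugin convention.

```python
import re

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# "<name> = <TYPE>: <value>", where the "<TYPE>: " part may be missing.
LINE = re.compile(
    r'^(?P<name>\S+) = (?:(?P<type>[A-Za-z0-9-]+): )?(?P<value>.*)$')


def parse_snmpwalk(text):
    """Map each object name in snmpwalk output to its value, quotes stripped."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            values[match.group("name")] = match.group("value").strip('"')
    return values


def check_process(text, name, expected):
    """Return (exit_code, message) for one object compared to a string."""
    value = parse_snmpwalk(text).get(name)
    if value == expected:
        return OK, "PROCESS OK - %s" % value
    return CRITICAL, "PROCESS CRITICAL - got %r" % (value,)
```

Fed the output above, check_process(output, "UCD-SNMP-MIB::ucdavis.54.101.1", "manage.py runserver") yields exit code 0. If the process is gone and the object is missing or empty, the result is exit code 2, which is the state that triggers the event handler.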
If anything is unclear, please leave a comment.
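PS
The three ways I know of for Nagios to monitor a service:
1) Check the service's port. This works over TCP and needs no software installed on the monitored host.
2) Check the process name via SNMP. SNMP has to be enabled and configured on the monitored host, with a few scripts added.
3) Monitor through the Nagios NRPE plugin. This amounts to installing a small service on the monitored host that talks to Nagios on the monitoring machine over SSL; the service restart scripts also live on the monitored host.
The first two require SSH on the monitored host, so that when a problem appears Nagios can restart the service through the Python script; the alert is sent only if the service is still unhealthy after the restart.
The third does not require SSH, but a plugin has to be installed on the monitored host.
Alerts from the TCP port check arrive slightly sooner than those from the SNMP check.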
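The first method, checking a port, is simple enough to sketch in a few lines of Python 3. This is not the check_tcp plugin itself, just an illustration of the idea; the function name check_port and the message text are made up, and only the exit codes 0 and 2 follow the Nagios plugin convention.

```python
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_port(host, port, timeout=5.0):
    """Try a TCP connection and report the result like a Nagios plugin."""
    try:
        # The connection is closed again as soon as it succeeds.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, "TCP OK - port %d reachable" % port
    except OSError as error:
        # Covers refused connections, timeouts and unreachable hosts.
        return CRITICAL, "TCP CRITICAL - %s" % error
```

Calling check_port("192.168.15.67", 8000) would test the Django port used in this post. A successful connect only shows that something is listening, not that the application behind it is answering correctly.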