You need to download the following:
The global configurables are usually in eddie.cf and are listed below:
The EDDIE configuration follows the standard Python code layout. Just as methods or child objects of a Python object are indicated by indenting them beneath the parent object definition, sub-objects or parameters of a directive object are indicated by indenting them beneath the directive definition. For example, a notification object definition may look like:
N COMMONALERT:
    # Info
    Level 0:
        email(INFO_EMAIL,INFO)
    # Warning
    Level 1:
        email(ALERT_EMAIL,WARN)
    # Alert
    Level 2:
        email(ALERT_EMAIL,ALERT),ticker(ALERT_P)
    # Serious Alert
    Level 3:
        email(ALERT_EMAIL,ALERT),email(SYSSUP_EMAIL,ALERT_P),ticker(ALERT_P)
The parameters and child objects of the parent object, N, are
indented beneath it, and the actions of each Level object are
indented beneath that Level. If you are used to Python coding this
will be second nature to you. If you are not, it will not be hard
to pick up.
DIRECTIVE name:
    argument1=value1
    [argument2=value2
    ...]
where "DIRECTIVE" is the directive name, like PROC or FS, and
"name" is the user-defined name of this directive object. The
arguments customize the directive appropriately. Some arguments
are directive-specific while others are common to all directives.
E.g.:
PROC test:
    procname='syslogd'
    rule=NR
    scanperiod='30s'
    action='COMMONALERT(commonmsg.proc,1)'
This is an example definition of a PROC directive, called 'test'.
It contains the PROC-specific arguments, 'procname' and 'rule'.
'scanperiod' and 'action' are arguments which are common to all
directives. Some arguments are optional while others are required,
and errors will be raised if they are missing. In this example
'procname', 'rule' and 'action' are all required. 'scanperiod' is
optional.
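The distinction between required and optional arguments can be pictured with a short Python sketch. This is not EDDIE's own validation code: the REQUIRED table and the missing_arguments helper are hypothetical, and the required set for PROC is taken only from the example above.

```python
# Hypothetical table of required arguments per directive type.
# Only the PROC entry is taken from the example above.
REQUIRED = {"PROC": ("procname", "rule", "action")}


def missing_arguments(directive, args):
    """Return the required arguments that are absent from args."""
    return [name for name in REQUIRED.get(directive, ()) if name not in args]


test_args = {
    "procname": "syslogd",
    "rule": "NR",
    "scanperiod": "30s",
    "action": "COMMONALERT(commonmsg.proc,1)",
}
print(missing_arguments("PROC", test_args))  # nothing is missing

del test_args["scanperiod"]                  # optional, so still complete
print(missing_arguments("PROC", test_args))

del test_args["rule"]                        # required, so it is reported
print(missing_arguments("PROC", test_args))
```

A real parser would raise an error at this point, as described above, rather than return a list.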
An EDDIE configuration can be kept simple to get basic monitoring started quickly, and made as complicated as required to perform advanced operations. The simple example rules file below monitors basic services on a host. This rules file, named simple.rules, would be placed in the same directory as eddie.cf, and eddie.cf would contain an entry referring to it.
# Process checks
PROC syslogd:
    procname='syslogd'
    rule=NR
    action="email('root', '%(procp)s is not running on %(h)s')"
PROC inetd:
    procname='inetd'
    rule=NR
    action="email('root', '%(procp)s is not running on %(h)s')"
PROC sshd:
    procname='sshd'
    rule=NR
    action="email('root', '%(procp)s is not running on %(h)s')"
# Filesystem checks
FS root:
    fs='/'
    rule="capac>=90"
    action="email('root', '%(fsf)s over 90%% on %(h)s')"
FS varlog:
    fs='/var/log'
    rule="capac>=90"
    action="email('root', '%(fsf)s over 90%% on %(h)s')"
# Service Port checks
SP smtp_port:
    port='smtp'
    protocol='tcp'
    bindaddr='0.0.0.0'
    action="email('root', '%(spprot)s/%(spport)s on %(h)s is not listening')"
SP http_port:
    port='http'
    protocol='tcp'
    bindaddr='0.0.0.0'
    action="email('root', '%(spprot)s/%(spport)s on %(h)s is not listening')"
# System statistics checks
SYS loadaverage:
    rule="loadavg1 > 3.00"
    scanperiod='1m'
    action="email('root', '%(h)s load-average > 3.00')"
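The %(name)s placeholders in the action strings above have the same form as Python's dictionary-based string formatting, which would also explain why a literal percent sign is written as %%. Assuming substitution works that way, the effect can be tried directly in Python. The values in the dictionary are made up for illustration; EDDIE supplies the real ones at run time.

```python
# Hypothetical values; EDDIE fills these in when a directive fires.
variables = {"h": "host1", "fsf": "/var/log", "procp": "sshd"}

# %(name)s is replaced by the value stored under that name,
# and %% becomes a single literal percent sign.
fs_msg = "%(fsf)s over 90%% on %(h)s" % variables
proc_msg = "%(procp)s is not running on %(h)s" % variables

print(fs_msg)    # /var/log over 90% on host1
print(proc_msg)  # sshd is not running on host1
```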
DIRECTIVE name:
    arg1=value1
    arg2=value2
    argn=valuen
where "DIRECTIVE" is the name of the directive itself (see Built-in Directives); "name" is a user-defined name of the directive definition (the directive ID is usually constructed as "DIRECTIVE.name", e.g., "FS.root", and will appear in the logs, console, etc.); "args" are arguments to define what the directive should do and how it should do it. Some arguments are common to all directives and others are specific to that type of directive.
Common Directive Arguments:
Built-in Directives: The built-in directives are as follows:
# alert if cron is not running
PROC cron:
    procname='cron'
    rule=NR
    action='email("alert", "cron is not running on %(h)s")'
# syslog has a memory leak - alert if using over 50MB
PROC syslogmem:
    procname='syslogd'
    rule='vsz > 50000'
    action='email("alert", "syslogd using over 50MB")'
# alert if / over 95% full
FS root:
    fs='/'
    rule='capac > 95'
    action='email("alert", "/ is over 95%% full on %(h)s")'
# alert if /var has less than 100MB available
FS var:
    fs='/var'
    rule='avail < 100*1024'
    action='email("alert", "/var has less than 100MB free on %(h)s")'
# alert if nothing listening on http port
SP http:
    port='http'
    protocol='tcp'
    bindaddr='0.0.0.0'
    action='email("alert", "http port not bound to on %(h)s")'
# alert if nothing listening on port 22 on 10.0.0.5
SP sshport:
    port=22
    protocol='tcp'
    bindaddr='10.0.0.5'
    action='email("alert", "10.0.0.5:22 not bound to")'
# alert if the sshd pid file doesn't exist
PID sshdpid1:
    pid='/var/run/sshd.pid'
    rule=EX
    action='email("alert", "sshd pid file not found on %(h)s")'
# alert if the sshd pid doesn't match the process table
PID sshdpid2:
    pid='/var/run/sshd.pid'
    rule=PR
    action='email("alert", "sshd pid not in process table on %(h)s")'
# Check load average (the hard way)
COM loadavg:
    cmd="uptime | cut -d, -f4 | awk '{print $3}'"
    rule="float(out) > 6.0"
    action='email("alert", "Load on %(h)s is > 6.0")'
# Check number of netscapes running
COM netscapes:
    cmd="ps -ef | grep netscape | wc -l"
    rule="int(out) > 6"
    action='email("alert", "There are %(comout)s netscapes running on %(h)s")'
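The COM rules above read like Python expressions over the command's output. One way to picture this, assuming the rule is evaluated with the name out bound to the output text, is the sketch below. The rule_matches helper is hypothetical and is not EDDIE's code.

```python
def rule_matches(rule, out):
    """Evaluate a rule expression with 'out' bound to the command output."""
    # Only the two conversions used in the examples are made available.
    # eval() remains unsafe for rules that come from untrusted sources.
    namespace = {"__builtins__": {}, "float": float, "int": int, "out": out}
    return bool(eval(rule, namespace))


# float() and int() ignore surrounding whitespace, so the trailing
# newline that shell commands usually print does not get in the way.
print(rule_matches("float(out) > 6.0", "7.25\n"))  # True
print(rule_matches("int(out) > 6", " 3\n"))        # False
```

If the command prints nothing, or prints text that is not a number, float() and int() raise ValueError, so a rule of this kind relies on the command producing clean numeric output.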
# check that 10.0.0.5 is accepting connections on port 80
PORT webcheck:
    host='10.0.0.5'
    port=80
    send=""
    expect=""
    action="email('alert', 'port 80 not responding on 10.0.0.5')"
# check that 10.0.0.5 is accepting connections on port 25
PORT smtpcheck:
    host='10.0.0.5'
    port=25
    send="\n"
    expect="220.*"
    action="email('alert', 'port 25 not responding on 10.0.0.5')"
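The expect value "220.*" suggests that the response is compared against a regular expression. The sketch below assumes that, and also assumes the match is anchored at the start of the response. Both points are assumptions, and response_ok is a hypothetical helper rather than part of EDDIE. No network connection is made; the response text is passed in directly.

```python
import re


def response_ok(expect, response):
    """Return True if response matches the expect pattern from its start."""
    return re.match(expect, response) is not None


# An SMTP greeting starts with the reply code 220.
print(response_ok("220.*", "220 mail.example.com ESMTP\r\n"))  # True
print(response_ok("220.*", "421 Service not available\r\n"))   # False
# An empty pattern matches any response, including an empty one.
print(response_ok("", ""))                                     # True
```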
# alert if eth0 interface has disappeared
IF ethexists:
    name='eth0'
    rule=NE
    action="email('alert', 'eth0 has disappeared on %(h)s')"
# alert if input packet errors are greater than 10%
IF ierrs:
    name='hme0'
    rule="100.0*ierrs/ipkts > 10.0"
    action="email('alert', 'input packet error > 10%% on hme0')"
# alert if any UDP input errors
IF udpinerr:
    rule="udpInErrors > 0"
    action="email('alert', '%(h)s has UDP input errors')"
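The ierrs rule above is plain arithmetic on two interface counters. The following Python lines work through it with made-up counter values; the function name is hypothetical.

```python
def input_error_percent(ierrs, ipkts):
    """Input errors as a percentage of input packets."""
    # Starting with 100.0 keeps the whole calculation in floating point.
    return 100.0 * ierrs / ipkts


pct = input_error_percent(ierrs=150, ipkts=1200)
print(pct)         # 12.5
print(pct > 10.0)  # True, so the action would fire
```

Note that the expression divides by ipkts, so it fails with a division error when an interface has not yet received any packets.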
System stats from '/usr/bin/uptime':
uptime - time since last boot (string)
users - number of logged on users (int)
loadavg1 - 1 minute load average (float)
loadavg5 - 5 minute load average (float)
loadavg15 - 15 minute load average (float)
System counters from '/usr/bin/vmstat -s' (see vmstat(1M)):
ctr_swap_ins - (long)
ctr_swap_outs - (long)
ctr_pages_swapped_in - (long)
ctr_pages_swapped_out - (long)
ctr_total_address_trans_faults_taken - (long)
ctr_page_ins - (long)
ctr_page_outs - (long)
ctr_pages_paged_in - (long)
ctr_pages_paged_out - (long)
ctr_total_reclaims - (long)
ctr_reclaims_from_free_list - (long)
ctr_micro_hat_faults - (long)
ctr_minor_as_faults - (long)
ctr_major_faults - (long)
ctr_copyonwrite_faults - (long)
ctr_zero_fill_page_faults - (long)
ctr_pages_examined_by_the_clock_daemon - (long)
ctr_revolutions_of_the_clock_hand - (long)
ctr_pages_freed_by_the_clock_daemon - (long)
ctr_forks - (long)
ctr_vforks - (long)
ctr_execs - (long)
ctr_cpu_context_switches - (long)
ctr_device_interrupts - (long)
ctr_traps - (long)
ctr_system_calls - (long)
ctr_total_name_lookups - (long)
ctr_toolong - (long)
ctr_user_cpu - (long)
ctr_system_cpu - (long)
ctr_idle_cpu - (long)
ctr_wait_cpu - (long)
Process/memory stats from '/usr/bin/vmstat' (see vmstat(1M)):
procs_running - number of processes running (int)
procs_blocked - number of processes blocked (int)
procs_waiting - number of processes waiting (int)
mem_swapfree - amount of free swap (kB) (int)
mem_free - amount of free RAM (kB) (int)
Linux:
loadavg1 - 1min load average (float)
loadavg5 - 5min load average (float)
loadavg15 - 15min load average (float)
ctr_uptime - uptime in seconds (float)
ctr_uptimeidle - idle uptime in seconds (float)
ctr_cpu_user - total cpu in user space (int)
ctr_cpu_nice - total cpu in user nice space (int)
ctr_cpu_system - total cpu in system space (int)
ctr_cpu_idle - total cpu in idle thread (int)
ctr_cpu%d_user - per cpu in user space (e.g., cpu0, cpu1, etc) (int)
ctr_cpu%d_nice - per cpu in user nice space (e.g., cpu0, cpu1, etc) (int)
ctr_cpu%d_system - per cpu in system space (e.g., cpu0, cpu1, etc) (int)
ctr_cpu%d_idle - per cpu in idle thread (e.g., cpu0, cpu1, etc) (int)
ctr_pages_in - pages read in (int)
ctr_pages_out - pages written out (int)
ctr_pages_swapin - swap pages read in (int)
ctr_pages_swapout - swap pages written out (int)
ctr_interrupts - number of interrupts received (int)
ctr_contextswitches - number of context switches (int)
ctr_processes - number of processes started (I think?) (int)
boottime - time of boot (epoch) (int)
HP-UX:
System stats from '/usr/bin/uptime':
uptime - (string)
users - (int)
loadavg1 - (float)
loadavg5 - (float)
loadavg15 - (float)
System counters from '/usr/bin/vmstat -s' (see vmstat(1)):
ctr_swap_ins - (long)
ctr_swap_outs - (long)
ctr_pages_swapped_in - (long)
ctr_pages_swapped_out - (long)
ctr_total_address_trans_faults_taken - (long)
ctr_page_ins - (long)
ctr_page_outs - (long)
ctr_pages_paged_in - (long)
ctr_pages_paged_out - (long)
ctr_reclaims_from_free_list - (long)
ctr_total_page_reclaims - (long)
ctr_intransit_blocking_page_faults - (long)
ctr_zero_fill_pages_created - (long)
ctr_zero_fill_page_faults - (long)
ctr_executable_fill_pages_created - (long)
ctr_executable_fill_page_faults - (long)
ctr_swap_text_pages_found_in_free_list - (long)
ctr_inode_text_pages_found_in_free_list - (long)
ctr_revolutions_of_the_clock_hand - (long)
ctr_pages_scanned_for_page_out - (long)
ctr_pages_freed_by_the_clock_daemon - (long)
ctr_cpu_context_switches - (long)
ctr_device_interrupts - (long)
ctr_traps - (long)
ctr_system_calls - (long)
ctr_Page_Select_Size_Successes_for_Page_size_4K - (long)
ctr_Page_Select_Size_Successes_for_Page_size_16K - (long)
ctr_Page_Select_Size_Successes_for_Page_size_64K - (long)
ctr_Page_Select_Size_Successes_for_Page_size_256K - (long)
ctr_Page_Select_Size_Failures_for_Page_size_16K - (long)
ctr_Page_Select_Size_Failures_for_Page_size_64K - (long)
ctr_Page_Select_Size_Failures_for_Page_size_256K - (long)
ctr_Page_Allocate_Successes_for_Page_size_4K - (long)
ctr_Page_Allocate_Successes_for_Page_size_16K - (long)
ctr_Page_Allocate_Successes_for_Page_size_64K - (long)
ctr_Page_Allocate_Successes_for_Page_size_256K - (long)
ctr_Page_Allocate_Successes_for_Page_size_64M - (long)
ctr_Page_Demotions_for_Page_size_16K - (long)
# alert if 1 minute load average > 2
SYS loadavg1:
    rule="loadavg1 > 2.0"
    action="email('alert', '%(h)s has loadavg1 > 2.0')"
# Email all entries from /var/log/messages to alert every 12 hours.
LOGSCAN messages:
    file='/var/log/messages'
    regex='.*'
    scanperiod='12h'
    action='email("alert", "%(h)s:%(logscanfile)s", "-- Logscan matched %(logscanlinecount)d lines: --\n%(logscanlines)s")'
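The LOGSCAN example above collects the lines that match a regular expression and reports their count and text. A rough Python equivalent is sketched below. The matching_lines helper and the sample log lines are hypothetical, and whether EDDIE searches anywhere in the line or only from its start is not stated here; the sketch searches anywhere.

```python
import re


def matching_lines(regex, lines):
    """Return the lines in which the regular expression is found."""
    pattern = re.compile(regex)
    return [line for line in lines if pattern.search(line)]


log = [
    "Jan  1 00:00:01 host1 sshd[42]: Accepted password for fred",
    "Jan  1 00:00:02 host1 su: FAILED su for root by fred",
]
hits = matching_lines("FAILED", log)

# Build a body in the same shape as the action string above.
print("-- Logscan matched %d lines: --\n%s" % (len(hits), "\n".join(hits)))
```

With regex='.*', as in the directive above, every line matches, so the whole of the newly scanned text is reported.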
POP3TIMING pop3test:
    server='pop3.domain.com'
    user='fred'
    password='foo'
    action="email('mary', 'host=%(pop3timinghost)s, username=%(pop3timingusername)s, connecttime=%(pop3timingconnecttime)s, authtime=%(pop3timingauthtime)s, listtime=%(pop3timinglisttime)s, retrtime=%(pop3timingretrtime)s')"
N name:
    Level 0:
        action1, [action2, ...]
    Level 1:
        action1, [action2, ...]
    Level n:
        action1, [action2, ...]
Levels should range from 0 to 9, with 9 being the most critical. E.g.:
N COMMONALERT:
    # Info
    Level 0:
        email(INFO_EMAIL,INFO)
    # Warning
    Level 1:
        email(ALERT_EMAIL,WARN)
    # Alert
    Level 2:
        email(ALERT_EMAIL,ALERT),ticker(ALERT_P)
    # Serious Alert
    Level 3:
        email(ALERT_EMAIL,ALERT),email(SYSSUP_EMAIL,ALERT_P),ticker(ALERT_P)
Actions are defined like function calls, and multiple actions are
separated by commas. See Actions for more information.
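A notification object can be thought of as a table from level number to a list of actions. The Python sketch below models only that idea and is not how EDDIE implements it: the email and ticker functions are stand-ins that record their arguments, and the addresses are invented.

```python
# Stand-ins for the email and ticker actions: each just records
# what it was asked to do instead of sending anything.
sent = []


def email(address, message):
    sent.append(("email", address, message))


def ticker(message):
    sent.append(("ticker", message))


# Level number -> list of actions, mirroring the shape of an N object.
COMMONALERT = {
    0: [lambda msg: email("info@example.com", msg)],
    1: [lambda msg: email("alert@example.com", msg)],
    2: [lambda msg: email("alert@example.com", msg), ticker],
}


def notify(notification, level, message):
    """Run every action defined for the given level, in order."""
    for action in notification[level]:
        action(message)


notify(COMMONALERT, 2, "sshd is not running on host1")
print(sent)
```

Level 2 lists two actions, so a single call produces both an email and a ticker entry, in the order they were written.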
M groupname:
    MSG msgname1: "string1"
                  "string2"
    MSG msgname2: "string3"
                  "string4"
or
M groupname:
    M subgroupname1:
        MSG msgname1: "string1"
                      "string2"
        MSG msgname2: "string3"
                      "string4"
    M subgroupname2:
        MSG msgname3: "string5"
                      "string6"
E.g.:
# Define common messages. These are used by the COMMONALERT notification object
M commonmsg:
    # Define a subgroup of messages to be used by PROC directive actions
    M proc:
        # Warning-level message for email
        MSG WARN: "Warning: %(procp)s on %(h)s not running"
                  "The %(procp)s process on %(h)s is not running"
        # Warning-level message for paging or tickertape
        MSG WARN_P: "Warn: The %(procp)s daemon on %(h)s is not running."
                    ""
        # Alert-level message for email
        MSG ALERT: "Alert: %(procp)s on %(h)s not running"
                   """ALERT: The %(procp)s daemon on %(h)s is not running.
%(problemage)s
%(problemfirstdetect)s
"""
        # Alert-level message for paging or tickertape
        MSG ALERT_P: "ALERT: The %(procp)s daemon on %(h)s is not running."
                     ""
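Message groups nest, so a message is reached by a dotted group path such as commonmsg.proc followed by a message name such as WARN. The Python sketch below models that lookup with nested dictionaries. It is an illustration only: the lookup helper is hypothetical, and treating the two strings of a message as a subject and a body is an assumption.

```python
# Nested dictionaries standing in for M groups; each MSG is stored
# as a pair of strings (assumed here to be subject and body).
messages = {
    "commonmsg": {
        "proc": {
            "WARN": (
                "Warning: %(procp)s on %(h)s not running",
                "The %(procp)s process on %(h)s is not running",
            ),
        },
    },
}


def lookup(path, msgname):
    """Follow a dotted group path such as 'commonmsg.proc' to one message."""
    group = messages
    for part in path.split("."):
        group = group[part]
    return group[msgname]


subject, body = lookup("commonmsg.proc", "WARN")
values = {"procp": "syslogd", "h": "host1"}  # made-up values
print(subject % values)  # Warning: syslogd on host1 not running
print(body % values)     # The syslogd process on host1 is not running
```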
By default every directive is shown in the Console output in the format "<ID> - <state>". This can be changed with the console directive argument, or the directive can be hidden from the Console entirely by setting this argument to None.
Substitution variables available to the console argument string are:
Directive examples:
# check root filesystem usage
FS rootfs:
    fs='/'
    rule="capac > 95"
    action='email("root", "%(fsf)s at %(fscapac)s%%")'
    console='%(state)s %(fscapac)s%%'
# email me load average every 5mins
SYS loadavg5:
    rule="1"
    action="email('chris', '%(h)s loadavg5: %(sysloadavg5).02f')"
    scanperiod='5m'
    console="loadavg5=%(sysloadavg5).02f"
# store root filesystem data in RRD (don't show on Console)
FS root_rrd:
    fs='/'
    rule="1"
    scanperiod='5m'
    action='elvinrrd("fs-%(h)s_root", "used=%(fsused)s", "size=%(fssize)s")'
    console=None
Console example:
$ telnet localhost 33343
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Eddie Console Gateway
FS.rootfs - ok 33%
SYS.loadavg5 - loadavg5=0.14
Connection closed by foreign host.
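Judging from the console output above, the console string replaces the "<state>" part of the default "<ID> - <state>" line. The Python sketch below reproduces the two lines shown under that assumption. The console_line helper is hypothetical and the variable values are copied from the sample output.

```python
def console_line(directive_id, state, variables, console="%(state)s"):
    """Format one console line, or return None when console is None."""
    if console is None:
        return None  # the directive is hidden from the console
    values = dict(variables, state=state)
    return "%s - %s" % (directive_id, console % values)


print(console_line("FS.rootfs", "ok", {"fscapac": 33},
                   console="%(state)s %(fscapac)s%%"))
# FS.rootfs - ok 33%

print(console_line("SYS.loadavg5", "ok", {"sysloadavg5": 0.14},
                   console="loadavg5=%(sysloadavg5).02f"))
# SYS.loadavg5 - loadavg5=0.14

print(console_line("FS.root_rrd", "ok", {}, console=None))
# None
```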
The format for specifying time is either:
© Chris Miles 2001