Difference between revisions of "Watchdog (AMC-IP)"
From Zenitel Wiki
Revision as of 15:07, 25 August 2008
Contents
Hardware watchdog
Software application watchdog
A System Monitoring Daemon (amc_ip_cmd_srv) continuously checks that the AlphaCom software (amcd) is running by receiving an alive signal from amcd. The same amc_ip_cmd_srv daemon also sends alive signals to the Hardware Watchdog circuit. If the alive signal from amcd is missing, amc_ip_cmd_srv will first try a soft reset of the AlphaCom software and some system processes (see list below). If the AlphaCom system fails to start again, amc_ip_cmd_srv will stop toggling the hardware watchdog, and a full system reset will occur.
A software reset is performed using the Linux runlevel system. The resources that are stopped and started again are, in order:
* AMC System including amcd, rtpd, sipd
* cron job
* snmpd
* inetd (telnet and ftp)
* ntp (Network Time Protocol)
* apache (AlphaWeb)
* Log daemons
* Networking
The resources are started again in the reverse order.
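The stop-then-start ordering can be illustrated with a small sketch. This is not the actual runlevel configuration; the resource labels and the function name `soft_reset_plan` are assumptions made for illustration, and only the ordering comes from the list above.

```python
# Illustrative only: labels are informal names taken from the list above,
# not real init script names.
STOP_ORDER = [
    "AMC system (amcd, rtpd, sipd)",
    "cron",
    "snmpd",
    "inetd (telnet and ftp)",
    "ntp",
    "apache (AlphaWeb)",
    "log daemons",
    "networking",
]


def soft_reset_plan(stop_order):
    """Return (action, resource) steps: stop in order, then start in reverse."""
    stops = [("stop", name) for name in stop_order]
    starts = [("start", name) for name in reversed(stop_order)]
    return stops + starts


if __name__ == "__main__":
    for action, name in soft_reset_plan(STOP_ORDER):
        print(f"{action:5s} {name}")
```

In this sketch networking is the last resource stopped and the first one started, and the AMC system is the first stopped and the last started.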
amcd monitors its own resources, such as rtp and sip. If any of these resources fails, amcd will itself try to soft reset it. If amcd cannot get the resources running again, it stops sending alive signals to amc_ip_cmd_srv, and the monitoring process described above takes over.
Timing
After a system reset, amcd gets some extra time before it is considered 'dead': 25 seconds.
The soft reset process is triggered when no signal is received from amcd within 5 seconds. If amcd has not reported back within another 20 seconds after the soft reset was initiated, the Hardware Watchdog toggling is stopped. The hard reset should kick in within 3 seconds.
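The timing rules can be read as a small state machine. The sketch below is one possible interpretation, not the amc_ip_cmd_srv implementation: the class name `AmcdMonitor`, its methods, and the returned state strings are hypothetical, and it assumes that the 25 second allowance applies until the first alive signal after a system reset.

```python
from dataclasses import dataclass
from typing import Optional

BOOT_GRACE_S = 25.0       # extra time for amcd after a system reset
ALIVE_TIMEOUT_S = 5.0     # no alive signal for this long triggers a soft reset
SOFT_RESET_WAIT_S = 20.0  # time allowed for amcd to report back after soft reset


@dataclass
class AmcdMonitor:
    """Hypothetical model of the monitoring decisions; times are in seconds."""
    boot_time: float = 0.0
    last_alive: Optional[float] = None
    soft_reset_at: Optional[float] = None
    toggling: bool = True  # whether the hardware watchdog is still being toggled

    def on_alive(self, now: float) -> None:
        # An alive signal from amcd cancels any pending soft-reset wait.
        self.last_alive = now
        self.soft_reset_at = None

    def tick(self, now: float) -> str:
        if not self.toggling:
            return "hard_reset_pending"
        if self.soft_reset_at is not None:
            if now - self.soft_reset_at >= SOFT_RESET_WAIT_S:
                self.toggling = False  # hardware watchdog will now time out
                return "stop_toggling"
            return "soft_reset_in_progress"
        # Assumption: the boot allowance lasts until the first alive signal.
        if self.last_alive is None:
            deadline = self.boot_time + BOOT_GRACE_S
        else:
            deadline = self.last_alive + ALIVE_TIMEOUT_S
        if now >= deadline:
            self.soft_reset_at = now
            return "soft_reset"
        return "ok"
```

With these numbers, an amcd that goes silent is soft reset 5 seconds after its last alive signal, and toggling stops 20 seconds after that if it stays silent.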
Logging
All controlled resets are logged to the System Log. A soft reset is logged with the message:
Trace Event from Interface AMC_CMD_SERVER: AMC reported dead, trying a soft reset and set STBY REQ
If the soft reset did not work, the message logged before the hardware reset is:
Trace Event from Interface AMC_CMD_SERVER: AMC dead, resetting the AMC_IP card
In addition, a system process state dump (ps) is written to the file '/opt/nvram/amc_reset_ps_dump.txt' before a hard reset.
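When reviewing an exported System Log, the two messages above can be used to tell soft resets from hard resets. The following is a minimal sketch: the function name `classify_reset` is hypothetical, and the surrounding log line layout (timestamps and so on) is not specified here, so the code only checks whether a line contains one of the two messages.

```python
PREFIX = "Trace Event from Interface AMC_CMD_SERVER: "
SOFT_RESET_MSG = PREFIX + "AMC reported dead, trying a soft reset and set STBY REQ"
HARD_RESET_MSG = PREFIX + "AMC dead, resetting the AMC_IP card"


def classify_reset(line):
    """Return 'soft_reset', 'hard_reset', or None for an unrelated log line."""
    if SOFT_RESET_MSG in line:
        return "soft_reset"
    if HARD_RESET_MSG in line:
        return "hard_reset"
    return None


def count_resets(lines):
    """Count soft and hard reset messages in an iterable of log lines."""
    counts = {"soft_reset": 0, "hard_reset": 0}
    for line in lines:
        kind = classify_reset(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

A hard reset count above zero suggests checking the ps dump file mentioned above for the process state just before the reset.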
Recovery from a failure situation
(Recovery from a failure situation if the watchdog detects a failure e.g.:)
- Restart the process
- Restart with OS
- Power down/up.
System architecture
The watchdog on the AMC-IP board comprises two main elements: the watchdog circuit and the system monitoring software running on the CPU. The watchdog circuit requires an alive signal to be toggled by the CPU regularly. The system monitor toggles this signal at 100 Hz as long as all required processes are up and running. If the system monitor detects irregularities or hanging processes, it will attempt to restart these processes. If this doesn't work, it will stop toggling the alive signal, and the entire AMC-IP board is reset.
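The interaction between the toggled alive signal and the watchdog circuit can be modelled in a few lines. This is a simulation sketch under stated assumptions, not a description of the real circuit: it assumes the circuit reacts to level changes (edges) rather than to a steady level, and it uses 3000 ms as the timeout, taken from the "within 3 seconds" figure in the Timing section. The names `WatchdogCircuit` and `simulate` are hypothetical.

```python
TOGGLE_PERIOD_MS = 10   # 100 Hz toggling by the system monitor
HW_TIMEOUT_MS = 3000    # assumed upper bound before the board is reset


class WatchdogCircuit:
    """Toy model: asserts reset when no edge is seen for timeout_ms."""

    def __init__(self, timeout_ms=HW_TIMEOUT_MS, now_ms=0):
        self.timeout_ms = timeout_ms
        self.level = False
        self.last_edge_ms = now_ms

    def drive(self, level, now_ms):
        # Only a change of level counts; a stuck signal does not feed the watchdog.
        if level != self.level:
            self.level = level
            self.last_edge_ms = now_ms

    def reset_asserted(self, now_ms):
        return now_ms - self.last_edge_ms >= self.timeout_ms


def simulate(healthy_until_ms, end_ms):
    """Toggle at 100 Hz while healthy; return the reset time in ms, or None."""
    circuit = WatchdogCircuit()
    level = False
    for now_ms in range(TOGGLE_PERIOD_MS, end_ms + 1, TOGGLE_PERIOD_MS):
        if now_ms <= healthy_until_ms:
            level = not level
            circuit.drive(level, now_ms)
        if circuit.reset_asserted(now_ms):
            return now_ms
    return None
```

Times are kept as integer milliseconds so that the comparison against the timeout is exact.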