Watchdog (AMC-IP)

System architecture

The watchdog on the AMC-IP board consists of two main elements: the watchdog circuit and the system monitoring software running on the CPU. The watchdog circuit requires an alive signal to be toggled regularly by the CPU. The system monitor toggles this signal at 1 Hz as long as all required processes are up and running. If the system monitor detects irregularities or hanging processes, it attempts to restart these processes. If this doesn't work, it stops toggling the alive signal, and the entire AMC-IP board is reset.
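The supervision cycle described above can be sketched as a small model. This is only an illustration, not the actual system monitor: the class name, the probe and restart callbacks, and the toggle function are all hypothetical, and the real timing rules (see Timing) are left out.

```python
from typing import Callable, Dict


class SystemMonitorSketch:
    """Toy model of one supervision cycle; call tick() once per second."""

    def __init__(
        self,
        probes: Dict[str, Callable[[], bool]],
        restart: Callable[[str], bool],
        toggle_alive: Callable[[], None],
    ) -> None:
        self.probes = probes              # hypothetical: process name -> health check
        self.restart = restart            # hypothetical: True if the restart worked
        self.toggle_alive = toggle_alive  # hypothetical: flips the alive line
        self.gave_up = False

    def tick(self) -> bool:
        """Return True if the alive signal was toggled in this cycle."""
        if self.gave_up:
            return False  # stay silent so the watchdog circuit resets the board
        for name, is_healthy in self.probes.items():
            if not is_healthy() and not self.restart(name):
                self.gave_up = True
                return False
        self.toggle_alive()
        return True


# Example run with a process that fails and cannot be restarted.
toggles = []
health = {"amcd": True}
monitor = SystemMonitorSketch(
    probes={"amcd": lambda: health["amcd"]},
    restart=lambda name: False,
    toggle_alive=lambda: toggles.append("toggle"),
)
first = monitor.tick()    # healthy: the alive signal is toggled
health["amcd"] = False
second = monitor.tick()   # restart fails: toggling stops
third = monitor.tick()    # the monitor stays silent from now on
```

Once the monitor has given up, it never toggles again, which is what lets the watchdog circuit time out and reset the board.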

Hardware watchdog

The Hardware watchdog can operate in two timing modes:

* 60s/10s: after first start-up (or a mode change) the tickle timeout before reset is 60 seconds, then 10 seconds.
* 3s/3s: after first start-up (or a mode change) the tickle timeout before reset is 3 seconds, then 3 seconds.

The bootloader sets the hardware watchdog to 60s/10s mode before launching the Linux boot. The watchdog is not tickled during the Linux boot process, which therefore has to complete within the 60-second timeout.

When the System Monitoring Process (see below) is launched, the mode is changed to 3s/3s. While the AlphaCom system is running, the watchdog is tickled once every second.
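The two timing modes and the mode change can be modelled with a simulated clock. This is a sketch under one assumption about the mode description: the first value applies until the first tickle after start-up or a mode change, and the second value applies after every tickle from then on. The class and method names are hypothetical.

```python
# Mode name -> (timeout after start-up or mode change, timeout after each tickle).
MODES = {
    "60s/10s": (60.0, 10.0),
    "3s/3s": (3.0, 3.0),
}


class HardwareWatchdogModel:
    """Simulated watchdog circuit; all times are seconds on a fake clock."""

    def __init__(self, mode: str, now: float = 0.0) -> None:
        self.set_mode(mode, now)

    def set_mode(self, mode: str, now: float) -> None:
        # A mode change restarts the window with the initial timeout (assumption).
        self.initial_timeout, self.steady_timeout = MODES[mode]
        self.deadline = now + self.initial_timeout

    def tickle(self, now: float) -> None:
        self.deadline = now + self.steady_timeout

    def expired(self, now: float) -> bool:
        return now >= self.deadline


wd = HardwareWatchdogModel("60s/10s", now=0.0)   # set by the bootloader
boot_ok = not wd.expired(40.0)                   # Linux boot took 40 s, no tickles
wd.set_mode("3s/3s", now=40.0)                   # system monitor has started
for t in (41.0, 42.0, 43.0):
    wd.tickle(t)                                 # tickled once each second
still_ok = not wd.expired(45.9)
reset_fired = wd.expired(46.0)                   # 3 s after the last tickle
```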

Software application watchdog

A System Monitoring Daemon (CMD server, or amc_ip_cmd_srv) continuously checks that the AlphaCom software (AMCd) is running by receiving an alive signal from amcd. The same amc_ip_cmd_srv daemon also sends alive signals to the hardware watchdog circuit. If the alive signal from amcd is missing, amc_ip_cmd_srv first tries a soft reset of the AlphaCom software and some system processes (see list below). If the AlphaCom system fails to start again, amc_ip_cmd_srv stops toggling the hardware watchdog, and a full system reset occurs.

A software reset is performed using the Linux runlevel system. The resources are stopped in the following order:

* AMC System including amcd, rtpd, sipd
* cron job
* snmpd
* inetd (telnet and ftp)
* ntp (Network Time Protocol)
* apache (AlphaWeb)
* Log daemons
* Networking

The resources are started again in the reverse order.
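The stop and start ordering can be illustrated as follows. This is a sketch only: on the real system the ordering is handled by the Linux runlevel system, and the stop and start callbacks here are hypothetical stand-ins.

```python
from typing import Callable, List

# Stop order as listed above.
STOP_ORDER: List[str] = [
    "AMC System (amcd, rtpd, sipd)",
    "cron job",
    "snmpd",
    "inetd (telnet and ftp)",
    "ntp",
    "apache (AlphaWeb)",
    "Log daemons",
    "Networking",
]


def soft_reset(stop: Callable[[str], None], start: Callable[[str], None]) -> None:
    """Stop every resource in list order, then start them in reverse order."""
    for resource in STOP_ORDER:
        stop(resource)
    for resource in reversed(STOP_ORDER):
        start(resource)


events = []
soft_reset(
    stop=lambda r: events.append(("stop", r)),
    start=lambda r: events.append(("start", r)),
)
```

With this ordering, Networking is the last resource to go down and the first to come back, while the AMC System is stopped first and started last.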

The amcd monitors its own resources, such as rtp and sip. If any of these resources fails, the amcd itself tries to soft reset it. If the amcd is not able to get the resources back up and running, it stops sending alive signals to amc_ip_cmd_srv, and the monitoring process described above takes over.

Timing

After a system reset, the amcd is given extra time before it is considered 'dead': 25 seconds.

The soft reset process is triggered when no signal has been received from amcd within 5 seconds. If the amcd does not report back within another 20 seconds after the soft reset was initiated, toggling of the hardware watchdog is stopped. The hard reset should then kick in within 3 seconds.
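The timing rules above can be written as a small decision function. This is a sketch under stated assumptions: the function name, its parameters and the returned action strings are hypothetical, and the 25-second figure is treated here as a start-up grace period that replaces the normal 5-second limit.

```python
from typing import Optional

ALIVE_TIMEOUT_S = 5.0        # silence from amcd that triggers a soft reset
STARTUP_GRACE_S = 25.0       # extra time for amcd after a system reset
SOFT_RESET_WAIT_S = 20.0     # wait after a soft reset before giving up
HW_WATCHDOG_TIMEOUT_S = 3.0  # 3s/3s mode of the hardware watchdog


def next_action(
    silent_for: float,
    since_soft_reset: Optional[float] = None,
    after_system_reset: bool = False,
) -> str:
    """Decide what the monitor does while amcd is silent.

    silent_for: seconds since the last alive signal from amcd.
    since_soft_reset: seconds since a soft reset was initiated, or None.
    """
    if since_soft_reset is not None:
        if since_soft_reset >= SOFT_RESET_WAIT_S:
            return "stop_hw_toggling"
        return "wait"
    limit = STARTUP_GRACE_S if after_system_reset else ALIVE_TIMEOUT_S
    return "soft_reset" if silent_for >= limit else "ok"


# Upper bound from the last amcd alive signal to the hard reset, in seconds.
WORST_CASE_S = ALIVE_TIMEOUT_S + SOFT_RESET_WAIT_S + HW_WATCHDOG_TIMEOUT_S
```

In normal operation this gives 5 + 20 + 3 = 28 seconds as the longest time between the last alive signal from amcd and the hard reset.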

Logging

All controlled resets are logged to the System Log. A soft reset is logged with the message:

 Trace Event from Interface AMC_CMD_SERVER: AMC reported dead, trying a soft reset and set STBY REQ

If the soft reset did not work, the message logged before the hardware reset is:

Trace Event from Interface AMC_CMD_SERVER: AMC dead, resetting the AMC_IP card

In addition, a system process state dump (ps) is written to the file '/opt/nvram/amc_reset_ps_dump.txt' before a hard reset.
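When going through the System Log, the two messages above can be told apart with a simple substring match. The helper below is a hypothetical sketch; it assumes the log lines contain the message text exactly as quoted above and makes no assumption about timestamps or other prefixes.

```python
from typing import Optional

SOFT_RESET_MARKER = "AMC reported dead, trying a soft reset"
HARD_RESET_MARKER = "AMC dead, resetting the AMC_IP card"


def classify_reset(line: str) -> Optional[str]:
    """Return "soft", "hard", or None for an unrelated log line."""
    if SOFT_RESET_MARKER in line:
        return "soft"
    if HARD_RESET_MARKER in line:
        return "hard"
    return None


soft_line = (
    "Trace Event from Interface AMC_CMD_SERVER: "
    "AMC reported dead, trying a soft reset and set STBY REQ"
)
hard_line = (
    "Trace Event from Interface AMC_CMD_SERVER: "
    "AMC dead, resetting the AMC_IP card"
)
kinds = [classify_reset(soft_line), classify_reset(hard_line)]
```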

Watchdog diagram

With reference to the above drawing, there is a hierarchy of monitoring:

The AMCd (intercom) application is monitoring:

M100d (Application for interfacing to Philips M100)
SIPd (Application that handles all SIP calls)
RMd (Application for interfacing to Ringmaster system)
BILLINGd (Application for call billing)

If the AMCd detects that one of these sub-daemons has failed, the AMCd restarts it.

AMCd also monitors:

TCPserverd (Which handles IP station connections)
RTPdeamon (IP Audio connections)

If one of these fails, the AMCd will restart itself, and a larger software restart will be executed by the CMD server, as described below.

The software watchdog process (CMD server) updates the hardware watchdog. The CMD server monitors the AMCd and the HAIPd (Redundant server application), and it checks whether regular tasks such as Web and console are getting CPU resources.

If AMCd or HAIPd fails, it is restarted by the CMD server.

A restart of AMCd results in a restart of all the sub-daemons mentioned above. If a restart of AMCd is not successful, the CMD server will, after 60 seconds, reboot the Linux OS and stop updating the hardware watchdog. HAIPd is restarted without a full reboot.

If "regular" tasks are starved of CPU resources over a longer time period, it is assumed that an underlying process is looping and using all CPU while still being able to update the software watchdog, and the AMCd is restarted. If this doesn't help, the Linux OS is rebooted.
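The monitoring hierarchy described in this section can be summarised as a lookup from each component to the component that supervises it. This is an illustrative sketch: the table and function names are hypothetical, and the component names are taken as written in the lists above.

```python
from typing import Dict, List

# Component -> the component that monitors it.
SUPERVISOR: Dict[str, str] = {
    "M100d": "AMCd",
    "SIPd": "AMCd",
    "RMd": "AMCd",
    "BILLINGd": "AMCd",
    "TCPserverd": "AMCd",
    "RTPdeamon": "AMCd",
    "AMCd": "CMD server",
    "HAIPd": "CMD server",
    "CMD server": "Hardware watchdog",
}


def monitoring_chain(component: str) -> List[str]:
    """Follow the supervisors upwards until the top of the hierarchy."""
    chain = [component]
    while chain[-1] in SUPERVISOR:
        chain.append(SUPERVISOR[chain[-1]])
    return chain


sip_chain = monitoring_chain("SIPd")
haip_chain = monitoring_chain("HAIPd")
```

Every chain ends at the hardware watchdog, which is the last line of defence if the CMD server itself stops updating it.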