==System availability==

'''System Availability'''
*The percentage of time that the system can perform its intended function
<br>
System Availability = 1 – System Downtime (with downtime expressed as a fraction of total time)
<br>
'''Downtime per year'''

{|border="1"
|width="100px"|'''Availability'''
|width="100px"|'''Nines'''
|colspan="2" width="200px"|'''Downtime'''
|-
|90%||1||36.5||days/year
|-
|99%||2||3.65||days/year
|-
|99.9%||3||8.76||hours/year
|-
|99.99%||4||52||minutes/year
|-
|99.999%||5||5||minutes/year
|}
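The conversion behind this table is simple. As an illustration, the following Python sketch (names and rounding are our own, not from any Zenitel tool; it assumes a 365-day year) turns an availability percentage into downtime per year:

<pre>
# Sketch: yearly downtime from an availability percentage.
# Assumes a 365-day year; results match the table above to rounding.

HOURS_PER_YEAR = 365 * 24  # 8760 hours

def downtime_per_year(availability_pct: float) -> str:
    """Return yearly downtime for a given availability percentage."""
    hours = (1 - availability_pct / 100) * HOURS_PER_YEAR
    if hours >= 48:
        return f"{hours / 24:.2f} days/year"
    if hours >= 1:
        return f"{hours:.2f} hours/year"
    return f"{hours * 60:.1f} minutes/year"

for pct in (90, 99, 99.9, 99.99, 99.999):
    print(f"{pct}%: {downtime_per_year(pct)}")
# 90%: 36.50 days/year, 99%: 3.65 days/year, 99.9%: 8.76 hours/year,
# 99.99%: 52.6 minutes/year, 99.999%: 5.3 minutes/year
</pre>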

==System Downtime==
'''Many events cause system downtime:'''
*HW fault
*Software fault
*Vandalism
*Extreme conditions (fire, flooding, etc.)
*Power outage
*IP network failure
*Planned system maintenance
<br>
<br>
'''System Downtime = ∑ P * S * MTTR''' (summed over all downtime events)
{|border="0"
|width="60px"|P
|= Probability of the event taking place
|-
|S||= Severity of the event
|-
| ||= Percentage of service affected by the fault
|-
|MTTR||= Mean Time To Repair
|-
| ||= mean time to detect the fault + mean time to fix the fault
|}
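A minimal Python sketch of this sum (the event figures below are invented for illustration, not Zenitel data):

<pre>
# Sketch: System Downtime = sum of P * S * MTTR over all events.
# P    = expected number of occurrences per year
# S    = share of the service affected by the fault (0..1)
# MTTR = mean time to detect + mean time to fix, in hours

events = [
    # (name, P, S, MTTR) - invented figures
    ("HW fault",       0.05, 1.00, 4.0),
    ("Software fault", 0.10, 0.50, 1.0),
    ("Power outage",   0.02, 1.00, 2.0),
]

downtime_hours = sum(p * s * mttr for _, p, s, mttr in events)
availability = 1 - downtime_hours / (365 * 24)
print(f"Expected downtime: {downtime_hours:.2f} hours/year")  # 0.29
print(f"System availability: {availability:.5%}")             # ~99.99669%
</pre>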

==HW failure==
<br>
'''MTBF'''
*The probability of HW faults is calculated using MTBF figures
*MTBF ≠ System Availability
<br>
'''MTBF calculations'''
*Empirical method
*MIL-HDBK-217
*Telcordia
<br>
'''Empirical methods'''
*Based on statistics from the field
<br>
'''MIL-HDBK-217 and Telcordia'''
*All components are entered in a database with set environmental conditions
*Usually give lower MTBF figures than empirical methods
**Do not reflect real usage conditions
**Use worst-case environmental conditions
<br>
'''More components give a lower MTBF''' (see the sketch below)
<br>
'''MTBF and single points of failure'''
<br>
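A short Python sketch (component figures are invented) of why more components lower the MTBF: for components in series, failure rates add, so the system MTBF is the reciprocal of the summed rates:

<pre>
# Sketch: series-system MTBF from per-component MTBF figures.
# Failure rates (1/MTBF) add for components in series, so each
# extra component can only lower the combined MTBF.

component_mtbf_hours = [200_000, 350_000, 500_000]  # invented figures

failure_rate = sum(1 / mtbf for mtbf in component_mtbf_hours)
print(f"System MTBF: {1 / failure_rate:,.0f} hours")  # ~101,000

# Adding a fourth component increases the total failure rate:
failure_rate += 1 / 400_000
print(f"With one more component: {1 / failure_rate:,.0f} hours")  # ~81,000
</pre>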
==Other failures==
'''Software fault'''
*Automatic watchdog functions
*Automatic recovery functions
*Maturity of the system
*Structured software design and test
<br>
'''Vandalism and extreme conditions'''
*Robustness to vandalism and extreme conditions
*IP and IK class
*IP security functions to hinder denial of service (DoS) attacks
<br>
'''Power outage'''
*UPS and redundant power supplies
<br>
'''IP network failure'''
*Network service level
*Redundancy and switchover functions
<br>
'''Planned system maintenance'''
*Expansion, adding users, etc.
*Ability to do maintenance without service interruptions

==Redundancy==

Redundancy is about parallelism and removing single points of failure.
<br>
<br>
Redundancy usually gives lower MTBF figures
*It requires more components
<br>
Redundancy usually provides significantly higher service availability
*A single failure shall have minimal or no impact on service availability

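The two statements above fit together, as a small Python sketch shows (the per-unit availability is an invented figure): redundant units mean more hardware that can fail, yet the service itself fails only when every unit is down at once.

<pre>
# Sketch: service availability of N redundant (parallel) units.
# The service is down only if all units are down at the same time.

unit_availability = 0.999  # invented: ~8.76 hours downtime/year per unit

for n in (1, 2, 3):
    service = 1 - (1 - unit_availability) ** n
    downtime_min = (1 - service) * 365 * 24 * 60
    print(f"{n} unit(s): {service:.7f} ({downtime_min:.1f} min/year)")
# 1 unit(s): 0.9990000 (525.6 min/year)
# 2 unit(s): 0.9999990 (0.5 min/year)
# 3 unit(s): 1.0000000 (0.0 min/year)
</pre>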
==STENTOFON System Availability==

'''Redundancy'''
*Control room redundancy and parallel call handling
*Power supply redundancy
*Alternative AlphaNet routing
*Control card AMC-IP redundancy
<br>
'''Reduced MTTR'''
*AlphaNet supervision
*Station supervision and tone test
*Network monitoring (SNMP, Syslog, OPC)
<br>
'''Software failures and recovery'''
*HW watchdog
*SW process watchdog
*Automatic recovery
<br>
'''System maintenance'''
*Centralized and remote firmware upgrade
*Hot insert and removal of cards
*Control card AMC-IP redundancy

==MTBF figures==
[http://en.wikipedia.org/wiki/MTBF MTBF] (Mean Time Between Failures) figures try to give a measure of the reliability of a system. There are two ways in which MTBF figures can be obtained:
* '''Calculation.''' There are a number of different standards for the calculation of MTBF figures, of which MIL-HDBK-217 is probably the most widely used.
* '''Statistical information.''' MTBF figures can also be calculated from data gathered about failures in real installations. For each electronic unit, the number of hours in operation is reported at the moment of failure. When this is done over a sufficiently large quantity of the same units, a realistic 'real-life' MTBF can be calculated. The MTBF of newly designed items can still be estimated this way by comparing them to existing items of similar complexity.

Zenitel mainly uses statistical information, in line with common practice.
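A minimal Python sketch of the statistical method (the field data below is invented; counting hours from units still in service is one common convention, not necessarily Zenitel's exact procedure):

<pre>
# Sketch: 'real-life' MTBF from field statistics.
# Observed MTBF = total operating hours / number of failures.
# Hours from units still in service also count as operating time.

failed_unit_hours  = [12_000, 45_500, 30_200, 8_700]   # hours at failure
running_unit_hours = [20_000, 15_000, 60_000, 41_000]  # still in service

total_hours = sum(failed_unit_hours) + sum(running_unit_hours)
mtbf = total_hours / len(failed_unit_hours)
print(f"Observed MTBF: {mtbf:,.0f} hours")  # 58,100 hours
</pre>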

List of [[MTBF figures]]

[[Category:AMC Software]]