Fault Management - an overview (2023)

Management of Traditional Applications

Rick Sturm, ... Julie Craig, in Application Performance Management (APM) in the Digital Enterprise, 2017


Fault management and performance management are the two domains that account for most of the daily activities of frontline staff and application specialists. While fault management looks at an application to determine whether it is “broken” (i.e., it has failed or has another serious problem), performance management is concerned with keeping an application running “well.” That is, performance management seeks to ensure that an application is operating within the parameters specified in the relevant service-level agreement (SLA) and operational-level agreement (OLA). This is largely a question of the speed at which the application does its work. An absolute requirement for application performance management (APM) is a set of tools to collect and analyze data about the application. Some of the collected data will be archived to allow a historical perspective to be taken when analyzing problems. SLAs are discussed in Appendix A, “Service-Level Management.”




Integrated Fault and Security Management

Ehab Al-Shaer, Yan Chen, in Information Assurance, 2008

Passive Approach

Passive fault management techniques typically depend on monitoring agents to detect and report network abnormalities using alarms or symptom events. These events are then analyzed and correlated to identify the root faults. Various event correlation models have been proposed, including rule-based analysis systems [10], model-based systems [11], case-based diagnosis systems, and model-traversing techniques. Other techniques have also been introduced to improve the performance, accuracy, and resilience of fault localization. In Appleby and Goldszmidt [5], a model-based event correlation engine is designed for multilayer fault diagnosis. In Kliger et al. [3], a coding approach is applied to the deterministic model to reduce reasoning time and improve system resilience. A novel incremental, event-driven fault reasoning technique is presented in Steinder and Sethi [1, 4] that improves the robustness of fault localization by analyzing lost, positive, and spurious symptoms.

The above techniques were developed based on passively received symptoms. If the evidence (symptoms) is collected correctly, the fault reasoning results can be accurate. However, in real systems, symptom loss or spurious symptoms (observation noise) are unavoidable. Even with a good strategy [1] to deal with observation noise, such techniques have limited resilience to noise because of their underlying passive approach, which might also increase the fault detection time.
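The noise-tolerant correlation idea can be illustrated with a minimal sketch: each candidate fault maps to the symptoms it is expected to cause, and faults are ranked by the fraction of expected symptoms actually observed, so a lost symptom lowers a score rather than ruling a fault out, and a spurious symptom simply matches no rule. The fault model and symptom names below are invented for illustration, not taken from the cited systems.

```python
# Codebook: each candidate fault maps to the symptoms it is expected to cause.
CODEBOOK = {
    "link_down":    {"loss_of_signal", "route_flap", "high_latency"},
    "disk_full":    {"write_error", "service_restart"},
    "cpu_overload": {"high_latency", "queue_growth"},
}

def localize(observed, min_coverage=0.5):
    """Rank faults by the fraction of their expected symptoms observed.

    A coverage threshold below 1.0 tolerates lost symptoms; spurious
    symptoms match no rule and are simply ignored.
    """
    scores = {}
    for fault, expected in CODEBOOK.items():
        coverage = len(expected & observed) / len(expected)
        if coverage >= min_coverage:
            scores[fault] = coverage
    return sorted(scores, key=scores.get, reverse=True)

# One expected symptom lost ("route_flap") and one spurious ("fan_alarm"):
ranked = localize({"loss_of_signal", "high_latency", "fan_alarm"})
```

Despite the noise, the true fault still ranks first; a purely deterministic matcher would have rejected it outright.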




Network Management Architecture and Design

James Farmer, ... Weyl Wang, in FTTx Networks, 2017

Fault Management

Within the ISO framework, fault management is responsible for detecting, correlating, and providing the necessary interfaces for correcting failures within the managed devices. A failure may be defined as an event within the network that causes the system to operate outside its normal operating conditions. A failure may be transient or persistent, so the management system must be able to detect either condition in all operating environments.

Upon detection and correction of a failure condition, it is critical that the management system record all events surrounding the failure in a permanent log. Once the system has been restored to normal operation, each failure condition should be evaluated in detail to make sure all events leading up to the failure are well understood. Any corrective action that could prevent the conditions from recurring should be put in place.

Events and alarms are typically displayed within the management system as a sorted table listing each of the conditions the system has detected.

Fig. 14.5 shows an example alarm and event table in an FTTx NMS/EMS.


Figure 14.5. Alarm and event table.

The key attributes of an alarm or event include the ID of the condition, the severity of the problem as defined by the operator (critical/major/minor), the source of the condition including the device name and type, the time the condition was received by the management system and finally the number of times this condition has been reported by the network element.

In today’s environment of distributed management systems, it is important for a network operator to acknowledge (ACK) each alarm they are working to resolve. This tells the operator’s staff that a colleague has already begun taking action on the condition received by the system. Enforcing this discipline is critical to avoid one of two undesirable outcomes: either two people start working on the same problem at cross-purposes, or everyone assumes someone else has it, so no one works on it.
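The alarm attributes and the acknowledgment discipline described above can be sketched as a simple record: the first operator to take an alarm claims it, and later attempts are rejected. The field names are assumptions for illustration, not taken from any particular NMS.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alarm:
    alarm_id: str                     # ID of the condition
    severity: str                     # operator-defined: critical / major / minor
    source_device: str                # device name
    device_type: str                  # device type
    received_at: str                  # time the condition reached the NMS
    count: int = 1                    # times the element re-reported the condition
    acked_by: Optional[str] = None    # operator who claimed the alarm, if any

    def acknowledge(self, operator: str) -> bool:
        """Claim the alarm; return False if a colleague already has it."""
        if self.acked_by is not None:
            return False
        self.acked_by = operator
        return True

a = Alarm("A-1042", "critical", "olt-3", "OLT", "2017-01-05T09:14:00Z")
first = a.acknowledge("alice")     # True: alice now owns the alarm
second = a.acknowledge("bob")      # False: avoids duplicated effort
```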

Alarm and event notification is critical to a fault management system because it enables automatic reporting to staff. Notifications are typically set up per alarm/event type, severity level, frequency, and device type, and delivered through email, SMS text message, voice message, or system alarm. These features allow indications of various conditions and escalated problems within a managed system to be communicated to staff easily.
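Notification routing of this kind amounts to matching each alarm against configured rules and fanning out to the channels of every rule that matches. The rule structure and channel names below are illustrative assumptions.

```python
# Each rule matches on severity and device type and names its delivery channels.
RULES = [
    {"severity": {"critical"},        "device_types": {"OLT", "ONT"},
     "channels": ["sms", "email"]},
    {"severity": {"major", "minor"},  "device_types": {"OLT", "ONT"},
     "channels": ["email"]},
]

def channels_for(alarm):
    """Collect every channel whose rule matches the alarm."""
    out = []
    for rule in RULES:
        if (alarm["severity"] in rule["severity"]
                and alarm["device_type"] in rule["device_types"]):
            out.extend(rule["channels"])
    return out

chans = channels_for({"severity": "critical", "device_type": "OLT"})
```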




Enterprise Integration

Max Schubert, in Nagios 3 Enterprise Network Monitoring, 2008

Integration with Trouble Ticketing Systems

We recommend that you only allow your fault management software to create trouble tickets if the alerts triggering the tickets are so application specific and free from false positives (like a passive check) that you can ensure the event requires attention by a real person. On the other hand, we highly recommend you take advantage of help desk systems that allow you to associate fault manager event IDs and information with your trouble tickets. For example, it would make sense to have your help desk software have the capability to acknowledge an open alert in Nagios, or to have the help desk software clear an alert as soon as the issue is resolved by the help desk person working on the event in question.

Core Nagios exposes a large number of macros that enable scripts to use the information Nagios gathers from hosts, services, and other sources. Macros can be employed by event handlers that execute scripts to email, send text messages, or call external programs. Trouble ticketing systems often provide programs that can be used to inject tickets into the service desk system from the command line; many also allow end users to open new tickets by sending email to a dedicated address. Nagios’ event handlers can call these external programs or send email to a trouble ticketing system to open new tickets, providing URLs in the ticket body that link the ticket back to the originating Nagios event. Again, be sure any events set up to automatically open new trouble tickets only do so for events that require immediate human attention (Figure 6.6).


Figure 6.6. Nagios Opens a Trouble Ticket
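An event handler along these lines can be sketched as follows. Nagios passes macro values such as $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$, and $SERVICESTATETYPE$ to the handler as arguments; the ticket CLI name ("ticket-cli") and the Nagios base URL are hypothetical placeholders for whatever your help desk system and installation provide.

```python
NAGIOS_URL = "http://nagios.example.com/nagios"   # assumed base URL

def handle_event(host, service, state, state_type):
    """Return a ticket-creation command for hard critical events, else None.

    Only a HARD CRITICAL state warrants a ticket; soft states may still
    recover on their own, and opening tickets for them invites false positives.
    """
    if state != "CRITICAL" or state_type != "HARD":
        return None
    body = (f"{service} on {host} is CRITICAL.\n"
            f"Source event: {NAGIOS_URL}/cgi-bin/status.cgi?host={host}")
    return ["ticket-cli", "create",
            "--summary", f"{host}/{service} CRITICAL",
            "--body", body]

# Nagios would supply these values via its macros:
cmd = handle_event("web1", "HTTP", "CRITICAL", "HARD")
# A soft or non-critical state opens no ticket:
noop = handle_event("web1", "HTTP", "WARNING", "HARD")
```

In a real handler, `cmd` would be passed to `subprocess.run` to invoke the ticketing program; the URL in the body links the ticket back to the originating event, as the text recommends.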




Control and Management

Rajiv Ramaswami, ... Galen H. Sasaki, in Optical Networks (Third Edition), 2010

8.5.8 Client Layers

We will describe some of the performance and fault management features in the client layer protocols described in Chapter 6. The performance and fault management mechanisms of SONET/SDH and the electronic layer of OTN have already been discussed. Since SONET/SDH and OTN provide constant bit rate service, they use bit error rate (BER) and loss of signal as performance measures. Network elements are informed of error and fault events through defect indicators (see Subsection 8.5.4). They also carry trace information in their overhead.

Protocols that provide packet transport services such as Ethernet or MPLS have performance measures that are packet oriented, such as packet loss rate, packet delay, and packet delay variation (jitter). To detect if a connection (link or path) is up, “hello” or continuity check messages are sent periodically through the connection between the end nodes. If these messages are not received, then it is assumed that the connection is down. Remote defect indicators and AIS signals are used by one end of a link to inform the other end that it has detected a failure or error. Management occurs at different levels. At the lowest level, individual links are managed, while at the highest level end-to-end connections are managed. In the middle level, segments of an end-to-end connection can be managed such as when a segment goes through another network operator. In addition, end-to-end management can be customer oriented or service provider oriented.
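Continuity-check supervision of the kind described above can be sketched as a timer: the far end is declared down after several consecutive missed intervals. The miss threshold of 3 is a common choice in such protocols, assumed here rather than mandated by the text.

```python
class ContinuityMonitor:
    """Declare a connection down when hello messages stop arriving."""

    def __init__(self, interval=1.0, miss_limit=3):
        self.interval = interval        # expected hello period (seconds)
        self.miss_limit = miss_limit    # consecutive misses tolerated
        self.last_seen = 0.0

    def on_hello(self, now):
        self.last_seen = now            # continuity check received

    def is_down(self, now):
        # Down once more than miss_limit intervals have elapsed silently.
        return (now - self.last_seen) > self.miss_limit * self.interval

mon = ContinuityMonitor()
mon.on_hello(10.0)
up = not mon.is_down(12.5)    # 2.5 s of silence: still within tolerance
down = mon.is_down(14.0)      # 4 s of silence exceeds 3 intervals: down
```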




Keeping the System Up and Running

BARBARA MIREL, in Interaction Design for Complex Problem Solving, 2004

What Fault-Management Problems Do Users Have and How Complex Are They?

The design team finds that the same prototypical problems plague IT specialists across organizations. The team lists these problem types on a new section of the whiteboard, as shown in Table 4.1.

TABLE 4.1. Tasks within the scope of the VizAppManager

Troubleshooting poor performance, availability of resources, overload, and congestion: performed in reaction to some alert (e.g., a user complaint, trouble ticket, or alarm going off).

Assuring acceptable performance, availability, and capacity before problems arise: conducted to detect and prevent deteriorating conditions.

Managing unexpected consequences of new deployments: carried out in anticipation of problems by monitoring, analyzing, and troubleshooting new factors in the system (a re-engineered network, new system hardware, newly deployed applications, or software updates). This is probably the situation with the highest incidence of system problems.

Deliberately, the team limits the scope of its tool to solving just these problems. Findings from contextual interviews show that technical specialists also manage and prevent faults for reliability purposes. For example, they tune applications and other components (performed to improve system components and software code), and they determine health-of-the-system metrics (conducted to identify baseline measures for alarm thresholds). But these situations are handled by a different group of specialists, not those in charge of assuring functionality. Because reliability specialists are not the VizAppManager team's targeted users, design team members decide that support for tuning, optimizing, and other reliability activities is outside the scope of the program.

The design team characterizes the conditions of the fault-management problems that tier-two and tier-three analysts confront as follows:

Systems and networks have highly sophisticated functions, structural relations, and multilayered architectures that users cannot adequately represent to themselves in any one model.

Real-time systems are in perpetual flux due to human interactions and internal dependencies.

Relevant data are dispersed across numerous sources and tools.

A vast number of data elements are required for investigations.

Continuously changing workplace situations influence users’ interpretations of overall performance.

Infrastructure problems ultimately have solutions that are not provisional. “Good enough” is not acceptable; in some cases goals are clear, and problems, once effectively formulated (not always an easy task), have fairly well-structured solutions.

Notably, as this list shows, design team members frame these traits of complexity as system conditions, information elements, and data demands. They do not phrase them in ways that put the focus on users’ experiences. Subtly, this verbal presentation lends itself to a ready breakdown into task and graphic objects.

The traits the team identifies are common across situations and work sites. Degrees of complexity, however, vary from one troubleshooting situation to the next. Sometimes, IT analysts encounter common or familiar problems, and through experience, they have assimilated standard methods and fixes for investigating and repairing them.

In other cases, however, entangled infrastructure conditions or unapparent chain reactions confound analysts. Problems elude a clear definition and are difficult to trace to a source. One example is an intermittent slowdown with no obvious cause, and another is Benkei's entangled faults. The VizAppManager team realizes that its single application-monitoring tool must support all these degrees of complexity.

From contextual inquiry findings, the team estimates that roughly 80% of tier-two and tier-three troubleshooting involves common problems. The other 20% are abnormal, often intermittent, and may take days or weeks to solve. As we see later, because of this distribution, the team devotes most of its efforts to creating support for the 80%. Unfortunately, team members fail to realize that a good deal of complexity still resides in solving these familiar problems, and this complexity is shared by the other 20%, as well.




Network Survivability

Bjorn Jager, ... David Tipper, in Information Assurance, 2008

4.3.3 Basic Network Management Concepts

When studying the survivability of communication networks, it is also useful to look at the general framework in which traffic management is implemented. Network management has become an indispensable tool of communication networks since it is responsible for ensuring continuous and secure functioning of a network. Generally, network management is divided into several functional areas: performance management, configuration management, and fault management. These key areas are implemented by control modules that operate in an integrated way to manage the network, including functions that support traffic management and restoration survivability techniques.

As an example, we can look at how fault management fulfills its goals. The key functions of fault management, summarized by the mnemonic RRRR [28], are, in prioritized order:


Restore services.


Root cause identification of failures.


Repair failed components.


Report the incidents.

To implement the highest priority task, that is, restoring services, fault management uses the functionality provided by configuration management and performance management. Upon detection of a network failure by the fault management system, or detection of QoS degradation by the performance management system, the failed traffic connections are identified by the configuration management system; new paths are searched for if needed; the best path, i.e., the one with the lowest cost, is selected for each failed connection; and the traffic is rerouted by establishing new connections along the selected paths. In parallel with the service restoration process, a repair process should begin, with the network manager performing root cause identification followed by the detailed repair or replacement of the failed components. Once the failed components have been repaired, they can be put back in service, and a normalization or reversion process might occur, which consists of moving the traffic from its current routes back to its original prefailure routes. Furthermore, all incidents in the process are monitored and reported for billing and management purposes.

The steps in the service recovery process are sometimes denoted as DeNISSE(RN) [29], which is derived from the restoration process' major steps:


Detection of the failure.


Notification of the failure to the entities responsible for restoration.


Identification of failed connections.


Search for new paths.


Selection of least-cost paths for the failed connections.


Establishment of the new paths by signaling and rerouting.


Report for billing and management.



These steps are summarized in Figure 4.3.


FIGURE 4.3. Major steps in traffic management during restoration of service.
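The identification, search, selection, establishment, and report steps can be sketched as a small pipeline over a toy topology. Path search here is a trivial precomputed table standing in for a real routing algorithm; the topology, connection names, and costs are invented for illustration.

```python
# Connections currently routed over each link (illustrative data).
CONNECTIONS = {"c1": ["A-B", "B-C"], "c2": ["A-D", "D-C"]}
# Candidate replacement paths with costs (lower is better).
BACKUP_PATHS = {"c1": [(["A-D", "D-C"], 2), (["A-E", "E-C"], 3)]}

def restore(failed_link):
    report = {"failed_link": failed_link, "rerouted": {}, "unrestored": []}
    # Identification: which connections traverse the failed link?
    affected = [c for c, links in CONNECTIONS.items() if failed_link in links]
    for conn in affected:
        # Search + Selection: pick the least-cost path avoiding the failure.
        candidates = [(p, cost) for p, cost in BACKUP_PATHS.get(conn, [])
                      if failed_link not in p]
        if not candidates:
            report["unrestored"].append(conn)
            continue
        best, _ = min(candidates, key=lambda pc: pc[1])
        # Establishment: reroute by installing the new path.
        CONNECTIONS[conn] = best
        report["rerouted"][conn] = best
    return report  # Report: fed to billing and management

result = restore("B-C")
```

Detection and notification are assumed to have happened upstream (they trigger the `restore` call), matching the division of labor among the management systems described above.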




Network Environments, Managing

Ray Hunt, John Vargo, in Encyclopedia of Information Systems, 2003

II.A.2. Fault Management

This function is required to detect abnormal network behavior. Fault management follows a sequence of actions: error detection, error diagnosis, and error recovery.

Error detection monitors events such as alarm signals from network devices (when thresholds are exceeded or in the event of hardware failure), deterioration of performance, or application failures. Error detection facilities also include an error log for future analysis.

Error diagnosis involves the analysis of detected errors in an effort to determine the cause of an error and a course of action to rectify it. Recent approaches to error diagnosis include the use of artificial intelligence techniques such as deductive reasoning.

Error recovery involves a range of measures proportional to the error's magnitude. Simple errors may require the fine-tuning of a device on the network, whereas more serious errors may mandate the replacement of a faulty device. Persistent performance failures are usually an indicator of poor network health. Remedying such problems typically involves reconfiguration of the problematic section of the network.
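The detection, diagnosis, and recovery sequence can be sketched end to end for a threshold-crossing event. The thresholds, the doubling rule in diagnosis, and the recovery actions are invented for illustration; diagnosis in particular is a stand-in for the reasoning techniques mentioned above.

```python
ERROR_LOG = []  # detection facilities keep a log for future analysis

def detect(metric, value, threshold):
    """Error detection: flag and log a threshold crossing."""
    if value > threshold:
        event = {"metric": metric, "value": value, "threshold": threshold}
        ERROR_LOG.append(event)
        return event
    return None

def diagnose(event):
    """Error diagnosis: map the detected error to a course of action."""
    if event["value"] > 2 * event["threshold"]:
        return "replace_device"
    return "fine_tune_device"

def recover(action):
    """Error recovery: a measure proportional to the error's magnitude."""
    return {"fine_tune_device": "parameters adjusted",
            "replace_device": "faulty device swapped out"}[action]

event = detect("link_utilization", 0.95, 0.8)
outcome = recover(diagnose(event))
```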




NFV Management and Orchestration

Zonghua Zhang, Ahmed Meddahi, in Security in Network Functions Virtualization, 2017

1.2.3 Element management system (EMS)

The purpose of element management systems (EMSs) is to provide FCAPS (Fault, Configuration, Accounting, Performance and Security) management functionality for a VNF. The EM exchanges information with the VNFM through an open reference point (VeEm-Vnfm). The tasks related to EM functions are:

configuration for the network functions provided by the VNF;

fault management for the network functions provided by the VNF;

accounting for the usage of VNF functions;

collecting performance measurement results for the functions provided by the VNF;

security management for the VNF functions.

The EM may be aware of virtualization and can collaborate with the VNFM to perform those functions that require the exchange of information regarding the NFVI resources associated with a VNF. A 1:1 mapping between VNF and EMS is not required; a single EMS may manage many VNFs.




Asynchronous Transfer Mode

Jean Walrand, Pravin Varaiya, in High-Performance Communication Networks (Second Edition), 2000


A very important feature of ATM networks is that they can make a number of management and control decisions to discriminate among connections and to provide the variety of QoS that different applications need. The decisions are divided into three groups. When a request is made for a connection with a particular QoS, the network must determine whether to accept or reject the request, depending on the resources then available. (Recall that QoS involves three sets of parameters: delay, cell loss, and source traffic rate.) If the resources are insufficient to meet the request, the network may negotiate with the user the traffic parameters in the requested service class.

Once the connection is admitted, the network must assign a route or path to the virtual channel that carries the connection. It must inform the switches and other network elements along the path that this virtual channel must be allocated certain resources so that the agreed-on QoS is met.

Lastly, the network must monitor the data transfer to make sure that the source also conforms to the QoS specification and to drop its cells as appropriate. (This is called traffic policing.) The network may also ask a source to slow down its transmissions.
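Traffic policing of this kind can be sketched with a leaky-bucket policer: cells conforming to the contracted rate pass, and excess cells are dropped. This is a deliberate simplification of ATM's actual conformance algorithm (the GCRA); the rate and burst parameters are invented.

```python
class LeakyBucketPolicer:
    """Drop cells that exceed the contracted rate plus a tolerated burst."""

    def __init__(self, rate, burst):
        self.rate = rate      # contracted cells per second
        self.burst = burst    # tolerated burst depth (cells)
        self.level = 0.0      # current bucket fill
        self.last = 0.0       # time of the previous cell

    def conforms(self, now):
        # Drain the bucket at the contracted rate, then try to add a cell.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.burst:
            self.level += 1
            return True       # conforming cell passes into the network
        return False          # nonconforming cell is dropped

p = LeakyBucketPolicer(rate=10.0, burst=2)
burst_results = [p.conforms(0.0), p.conforms(0.0), p.conforms(0.0)]
later = p.conforms(1.0)       # after the bucket drains, cells pass again
```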

In addition, the network carries a number of information flows to monitor its operations and to detect and identify the location of congested or failed devices.

The BISDN standard so far is silent about how these decisions are to be carried out. (However, recent ITU recommendations deal with OAM, performance monitoring and protection switching at the ATM layer to complement SONET protection switching.) We shall discuss potential solutions in Chapter 8. The ATM Forum specifies frame formats that the network should use to carry its monitoring information and to interact with users. We review these next.

The network uses operation and maintenance information flows for the following functions:

fault management,

traffic and congestion control,

network status monitoring and configuration, and

user/network signaling.

These functions, like the other network functions, are organized into layers, called the BISDN reference model.

Figure 6.11 shows the layer arrangement of all network functions, including those of operation and management. The layers in the user plane comprise the functions required for the transmission of user information. For instance, for an Internet Protocol over ATM, these layers could be HTTP/TCP/IP/AAL5.


FIGURE 6.11. The BISDN model layer arrangement of network functions, including the operation and management functions.

The layers in the control plane are the functions needed to set up, supervise, and release a virtual circuit connection. These functions, implemented by signaling protocols such as PNNI, are needed only for switched virtual connections and are absent in a network that implements only permanent virtual connections. (In a permanent virtual circuit connection, the path or route assigned to a source and destination and the VCI for that route are fixed.)

The layer management plane contains management functions specific to individual layers. Layer management also handles the operations and maintenance flows specific to each layer. The protocols used for these functions include ILMI and SNMP.

Finally, plane management consists of the functions that supervise the operations of the whole network. Plane management has no layered structure.

6.5.1 Fault Management

Consider a virtual circuit connection over an ATM network and assume that the connection is implemented by a SONET network. We know from section 5.2 that SONET establishes transmission paths for the ATM layer. The transmission is over optical fibers. The transmitters in SONET are all synchronized to the same master clock. This synchronization enables the time-division multiplexing of different bit streams. This multiplexing is done byte-by-byte.

The physical layer (SONET) is decomposed into three sublayers: section, line, and path. The section layer transmits bits between any two devices where light is converted into electronic signals or vice versa. For instance, there is a section between two successive regenerators or between a regenerator and a multiplexer. The line layer transports bits between multiplexers where SONET signals are added to or dropped from the transmission. Finally, the path layer transports user information. Thus, a path goes across a number of lines (or links) that are switched by the SONET demultiplexers and multiplexers, and a line consists of a number of sections. Each layer inserts and strips its own overhead information, which it uses to monitor the transmission functions for which it is responsible. (See Figure 6.12.)


FIGURE 6.12. Operation and maintenance flows for a virtual circuit connection over SONET.

Each of the three sublayers uses overhead bytes in the SONET frames to supervise its operations. The overhead bytes are said to carry a flow of operation and maintenance information. The flow carried by the section overhead bytes is called F1. The flows carried by the line and path overhead bytes are F2 and F3, respectively. The virtual circuit connection is carried by a virtual path connection. Accordingly, the network uses a flow of cells to supervise the virtual path connection and a flow of cells to supervise the virtual circuit connection. These two flows are called F4 and F5, respectively.

The format of the F4 and F5 cells depends on whether the cells monitor the segment across the user-network interface or the end-to-end connection (see Figure 6.13). The cell formats are shown in Figure 6.14. Note that the F5 cells have the same VPI/VCI as the user cells of the connection they monitor. The F5 cells are distinguished from the user cells by the PT field. Similarly, the F4 cells have the same VPI as the user cells and are distinguished by their VCI.


FIGURE 6.13. A segment indicates a connection across the user-network interface. An end-to-end connection is between the source and destination user equipment.


FIGURE 6.14. Format of OAM cells.

The main function of the OAM cells is to detect and manage faults. Fault-management OAM cells have the leading 4 bits of the cell payload set to 0001. The next 4 bits, the function type (FT) field, indicate the type of function performed by the cell: alarm indication signal (AIS), signaled by FT = 0000; far end receive failure (FERF), signaled by FT = 0001; and loopback cell, signaled by FT = 1000. The AIS cells are sent along the VPC (virtual path connection) or VCC (virtual circuit connection) by a network device that detects an error condition along the connection. Those cells are then sent along to the destination of the connection. When the equipment at the end of that connection receives the AIS, it sends back FERF cells to the other end of the connection. As shown in Figure 6.15, the AIS and FERF cells specify the type of failure as well as the failure location.


FIGURE 6.15. Function-specific fields in AIS and FERF cells (above) and in loopback cells (below).
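Classifying a fault-management OAM cell from the bit layout just described is straightforward: the leading 4 bits of the payload give the OAM type (0001 for fault management) and the next 4 bits the function type. The sketch below follows those FT codes; treating any other OAM type (such as 0010, performance management) as out of scope here.

```python
# FT codes for fault-management cells, per the text above.
FUNCTIONS = {0b0000: "AIS", 0b0001: "FERF", 0b1000: "loopback"}

def classify_oam(first_payload_byte):
    """Return the fault-management function of an OAM cell, or None."""
    oam_type = (first_payload_byte >> 4) & 0xF   # leading 4 bits
    function = first_payload_byte & 0xF          # next 4 bits (FT field)
    if oam_type != 0b0001:
        return None                              # not a fault-management cell
    return FUNCTIONS.get(function, "unknown")

kind = classify_oam(0b0001_0001)   # type 0001 (fault mgmt) + FT 0001
```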

A loopback cell contains a field that specifies whether the cell should be looped back, a correlation tag, a loopback location identification, and a source identification. These loopback cells are used as shown in Figure 6.16.


FIGURE 6.16. Loopback at the end of connection (above) and at the segment (below).

The device that requests a loopback (we call it the source) inserts a loopback cell and selects a value for the correlation tag. The device can specify where the loopback should take place. The device sets the loopback indication field of the cell to 1 to indicate that the cell must be looped back. When the device where the loopback must occur receives the cell, it sets its loopback indication field to 0 and sends the cell back to the source. The source compares the correlation tag of the cell it receives with the value it selected. This correlation tag prevents a device from getting confused by other loopback cells.
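The loopback exchange can be sketched as follows: the source inserts a cell with the loopback indication set to 1 and a freshly chosen correlation tag; the loopback point flips the indication to 0 and returns the cell; and the source accepts only returned cells carrying its own tag. The dictionary field names are assumptions standing in for the cell fields described above.

```python
import random

def make_loopback_cell(source_id, loopback_point):
    """Source inserts a loopback cell with a fresh correlation tag."""
    return {"loopback_indication": 1,
            "correlation_tag": random.getrandbits(32),
            "loopback_location": loopback_point,
            "source_id": source_id}

def loop_back(cell):
    """Behavior of the device where the loopback occurs."""
    returned = dict(cell)
    returned["loopback_indication"] = 0   # mark the cell as looped back
    return returned

def source_accepts(sent, received):
    # The tag check prevents confusion with other devices' loopback cells.
    return (received["loopback_indication"] == 0 and
            received["correlation_tag"] == sent["correlation_tag"])

sent = make_loopback_cell("switch-1", "segment-end")
ok = source_accepts(sent, loop_back(sent))
```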

One other OAM function may be mentioned. Performance management, indicated by OAM cell type 0010, consists of forward monitoring, backward monitoring, and reporting. In forward monitoring, for example, a block of cells of one connection is bounded by OAM cells, which include, among other information, the size of the block. The receiving node can compare the number of received cells with the block size to detect missing or inserted cells.

6.5.2 Traffic and Congestion Control

The objectives of traffic and congestion control are to guarantee the contracted quality of service to virtual connections. The operations that the network performs are the subject of Chapters 8 and 9.

6.5.3 Network Status Monitoring and Configuration

The OAM functions described above do not provide diagnostic, monitoring, and configuration services across the user-network interfaces. That is the purpose of the Integrated Local Management Interface or ILMI protocol, Version 4.0. ILMI uses the Simple Network Management Protocol (SNMP) and a management information base (MIB). The situation is illustrated in Figure 6.17.


FIGURE 6.17. The Integrated Local Management Interface (ILMI) protocol is designed to supervise the connections across user-network interfaces.

The figure shows a private ATM network connected by a private ATM switch to a public ATM network. Each connection across two interfaces is supervised by two ATM Interface Management Entities (IMEs): one for each of the ATM devices. Two such IMEs are said to be adjacent, and the ILMI specifies the structure of the Management Information Base (MIB) that contains the attributes of the connection supervised by the adjacent entities. The ATM devices may be workstations with ATM interfaces that send ATM cells to an ATM switch, or ATM switches, or IP routers that transfer their packets within ATM cells to an ATM switch.

ILMI 4.0 describes four MIB modules. The Textual Conventions MIB defines common textual conventions. The Link Management Module defines the objects for each ATM interface and the methods to detect ILMI connectivity between IMEs. It is further described below. The Address Registration MIB supports procedures for the user and the network to know each other's ATM address. The Service Registry MIB helps to locate ATM network services such as LAN Emulation Configuration Servers (LECS).

Important objects of the link management MIB are summarized in Figure 6.18. As the figure indicates, one MIB is defined per IME. The contents of an IME MIB are the attributes of the physical layer (which implements the bit way), the ATM traffic, the VPCs, and VCCs that go across that UNI. The figure indicates representative attributes. The ATM statistics attributes are now deemphasized. The MIB also includes ABR attributes. The ABR virtual path and virtual channel are tuned on a per-connection basis via these attributes. Examples of ABR attributes are ICR (initial cell rate), which is an upper bound on the source's transmission rate; RIF (rate increment factor), which controls the allowed increase in the source transmission rate; and RDF (rate decrement factor), which controls the required decrease in that rate. These attributes are set using ABR resource management (RM) cells, as discussed in section 8.4.2.


FIGURE 6.18. Structure of the ILMI link management MIB that contains the attributes of the connection supervised by adjacent interface management entities.
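The ABR attributes just mentioned can be illustrated with a sketch of how they shape a source's allowed cell rate (ACR): RIF scales additive increases and RDF drives multiplicative decreases, in the spirit of the ATM Forum ABR rules. The parameter values, and the exact update formulas, are illustrative simplifications rather than the normative conformance definition.

```python
def adjust_acr(acr, pcr, mcr, rif, rdf, congested):
    """One feedback step on the allowed cell rate (simplified)."""
    if congested:
        acr = acr - acr * rdf        # multiplicative decrease via RDF
    else:
        acr = acr + rif * pcr        # additive increase scaled by RIF
    return max(mcr, min(acr, pcr))   # keep within [MCR, PCR]

# Start at the initial cell rate (ICR) and react to RM-cell feedback:
acr = 10_000.0                       # ICR, cells/s (assumed value)
acr = adjust_acr(acr, pcr=100_000, mcr=1_000, rif=1 / 16, rdf=1 / 4,
                 congested=False)    # no congestion: rate rises
up = acr
acr = adjust_acr(acr, pcr=100_000, mcr=1_000, rif=1 / 16, rdf=1 / 4,
                 congested=True)     # congestion reported: rate falls
down = acr
```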

6.5.4 User/Network Signaling

The basic signaling functions between the network and a user are as follows:

the user requests a switched virtual connection,

the network indicates whether the request is accepted or not, and

the network indicates error conditions with a connection.

We have discussed the UNI signaling protocol above.



