All you telecom engineers out there must have already heard of fault management, right? Well, those of you who haven’t yet heard of it and need to understand what it’s about, worry no more! I’ll help you. Let’s start by defining the term.
First, What’s Fault Management All About?
ISO (International Organization for Standardization) defined a network management framework of which fault management is a component. That framework is FCAPS (Fault management, Configuration, Accounting, Performance, and Security). In other words, fault management relates to network management. In the same vein, a network management system must include a fault management system. The latter finds network problems and takes action to correct them. It also identifies and prevents potential or known problems that may occur in the future.
Therefore, the prime goal of fault management is to maintain network connectivity at all times. And by managing faults in a network, applications and services that rely on that network remain up and running. Most importantly, those applications and services stay accessible and properly functioning.
However, you know networks fail and go down, don’t you? So, what we all want are applications and services with fault tolerance and downtime minimization mechanisms in place. That’s when fault management systems come into play! Take a look at the next section for an explanation of what fault management systems do exactly.
How Can You Fight Network Faults?
Fault management systems are tools for preventing network faults and limiting their impact. Fault tolerance is their guiding principle, and minimizing downtime is their main concern. But let’s first understand where those faults come from.
Network faults originate from events in the network that have an impact on service delivery. Those events may only interfere with service delivery, but, worse than that, they can diminish or block service delivery. Hardware failure, connectivity loss, and power outages are three examples of network faults.
Upon fault detection, a fault management system notifies the network administrator by triggering alarms. This means that a fault management system embeds an alarm system. Think of an alarm as a notification that can be viewed in the fault management system itself. Additionally, the network administrator can receive an alarm via email or SMS.
A fault management system should monitor fault-prone areas more frequently and more thoroughly. In other words, the more often an area of the network experiences faults, the more intensely that area should be monitored.
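To make that idea concrete, here is a minimal Python sketch of adjusting the polling interval to an area's recent fault count. The function name, the default intervals, and the inverse scaling rule are all assumptions chosen for illustration, not something a particular product prescribes.

```python
def polling_interval(recent_faults, base_interval_s=300.0, min_interval_s=30.0):
    """Return the number of seconds to wait between checks of a network area.

    recent_faults: faults seen in that area over some recent window (assumed input).
    The more faults, the shorter the interval, down to min_interval_s.
    """
    if recent_faults < 0:
        raise ValueError("recent_faults must not be negative")
    # Hypothetical rule: every recorded fault shrinks the interval proportionally.
    interval = base_interval_s / (1 + recent_faults)
    return max(min_interval_s, interval)
```

With these defaults, a quiet area is checked every five minutes, while an area with many recent faults is checked every 30 seconds at most.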
Sometimes, fault management systems can solve a fault automatically, with no manual action from the network administrator. They can even use programs or scripts to prevent some faults from occurring! Pretty cool, isn’t it? After all, not every problem that affects the operation of a network is major or requires special attention. Many problems just require a trivial automatic fix performed in no time. As a result, IT teams can focus on major problems that are more difficult to fix.
Next, we’ll see how exactly a fault management system works.
More About Fault Management Systems
First and foremost, a fault management system must have a clear picture of the network topology. That topology contains a map of every device and node connected to the network. This allows the fault management system to oversee every point of the network that may cause downtime.
But how does a fault management system work? It frequently queries devices and nodes to evaluate whether the hardware is behaving well or not. Then it collects the information retrieved from those queries and analyzes it. Its goal is to catch any network performance problem that requires a solution. Sometimes, devices and nodes send information on performance problems to the fault management system on their own initiative.
Fault management systems keep networks operational with the features that follow.
Thresholds
Thresholds defined in fault management systems are based on prior knowledge of conditions that led to faults. Therefore, thresholds are a warning mechanism to prevent potential faults. Let me give you some examples of thresholds:
- A certain limit in the capacity of a node’s processing. If going over that limit led to faults in the past, a threshold shall be defined for the limit.
- A particular link utilization. A link is a connection between two nodes in a network. Sometimes, traffic over a link is high enough to cut access to the whole network. That happens when a link’s traffic consumes all of the available bandwidth. And if a given amount of traffic caused problems in the past, then that amount shall become a threshold. To calculate link utilization, divide the average traffic over the link by the total link capacity, then compare the result with the threshold. You can calculate utilization per millisecond, second, minute, hour, and so on. Some tools use a weighted average, meaning more recent values weigh more than older ones.
- The network utilization. Modern networks consist of many links. So, the average link utilization in a network may be a threshold.
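The utilization thresholds above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the weighted average is an exponentially weighted one (a common choice, but just one of several ways to make recent values weigh more), and the smoothing factor `alpha` is an assumption.

```python
def link_utilization(avg_traffic_bps, capacity_bps):
    """Average traffic over the link divided by the total link capacity."""
    if capacity_bps <= 0:
        raise ValueError("capacity_bps must be positive")
    return avg_traffic_bps / capacity_bps


def weighted_utilization(samples, alpha=0.5):
    """Exponentially weighted average of utilization samples, oldest first.

    A higher alpha (0 < alpha <= 1) gives recent samples more weight.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ewma = samples[0]
    for sample in samples[1:]:
        ewma = alpha * sample + (1 - alpha) * ewma
    return ewma


def network_utilization(link_utilizations):
    """Average link utilization across all links in the network."""
    if not link_utilizations:
        raise ValueError("link_utilizations must not be empty")
    return sum(link_utilizations) / len(link_utilizations)


def exceeds_threshold(value, threshold):
    """True when a measured utilization reaches or passes its threshold."""
    return value >= threshold
```

For example, a 1 Gbps link carrying 800 Mbps on average has a utilization of 0.8, which would trip a threshold of 0.75.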
To sum up, you must have a proper network infrastructure layout. If you can’t, at least determine improvement areas. Frequently, a single bottleneck in the infrastructure is a major source of faults in your network.
Constant Network Monitoring
A fault management system constantly monitors the status of a network.
Continuous Network Scan for Threats
For instance, viruses can lead to faults in networks. So, fault management systems must be able to detect them and either act accordingly or sound an alarm.
Alarm Notifications
Network administrators receive fault event notifications sent by the fault management system. The same system may automatically solve those faults, but some faults demand major action.
Fault Location Tracing
A fault management system needs to trace the locations of faults. One of the main reasons for that is to adjust the intensity of monitoring for the most faulty areas. By doing so, the fault management system can better prevent faults in those areas.
Automatic Correction of Fault Conditions
If it doesn’t require much effort, a fault management system can automatically prevent faults. It does so by correcting the conditions that may cause those faults. To achieve that, the system executes programs or scripts to perform minor fixes that are neither complex nor time-consuming. The same programs or scripts also enable the fault management system to automatically solve actual faults.
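Here is a rough Python sketch of how such automatic fixes could be dispatched. The event types, the handler functions, and the `RUNBOOK` table are hypothetical, and the handlers only return a description instead of touching a real system.

```python
def restart_service(event):
    # Placeholder fix: a real handler would run a restart script here.
    return f"restarted {event['service']}"


def rotate_logs(event):
    # Placeholder fix: a real handler would free disk space here.
    return f"rotated logs on {event['device']}"


# Hypothetical table mapping known minor conditions to their trivial fixes.
RUNBOOK = {
    "service_down": restart_service,
    "disk_nearly_full": rotate_logs,
}


def handle(event):
    """Apply an automatic fix if one is known; otherwise escalate to a human."""
    action = RUNBOOK.get(event["type"])
    if action is None:
        return ("escalate", None)
    return ("auto_fixed", action(event))
```

Anything without an entry in the table falls through to the network administrator, which keeps the automatic path limited to fixes that are known to be simple.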
Detailed Logging
A fault management system creates detailed logs of system status and the preventive or reactive actions it took. From the perspective of fault prevention, logging with details is extremely important.
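As an illustration, a log entry could be a small JSON record appended to a file, one per line. The field names and the two helper functions below are assumptions made for this sketch.

```python
import json
from datetime import datetime, timezone


def make_log_entry(device, status, action, kind):
    """Build one log record; kind is 'preventive' or 'reactive'."""
    if kind not in ("preventive", "reactive"):
        raise ValueError("kind must be 'preventive' or 'reactive'")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "device": device,
        "status": status,
        "action": action,
        "kind": kind,
    }


def append_entry(path, entry):
    """Append the record to a JSON Lines file, one record per line."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
```

Keeping the status and the action in the same record makes it easier to see later which conditions preceded a fault and which fixes actually helped.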
Now you know how a fault management system works and what its main features are. The next step is to distinguish between active and passive fault management systems. Let’s take a look …
Active vs. Passive Fault Management
Fault management and fault management systems can be active or passive.
Active fault management systems use strategies such as ping or port status checks to query devices and nodes. That lets them determine the status of those devices and nodes routinely. It’s an active approach to fault management. That is to say that the identification and correction of conditions that potentially lead to future faults are proactive.
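A port status check is easy to sketch with Python’s standard `socket` module. The function name and the default timeout are assumptions; a real system would run checks like this on a schedule across many devices.

```python
import socket


def is_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open` with a device’s address and the port of a service it should expose tells you whether that service is reachable right now. A failed check is a signal worth investigating, not proof of a fault on its own.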
On the other hand, passive fault management systems monitor the network for actual fault events that have already occurred. It’s more of a corrective than a preventive approach. To clarify, it may only discover faults when it’s too late to prevent them.
Now, focusing on the passive fault management approach, what’s the process of detecting a fault and solving it? Check it out in the next section!
The Fault Management Cycle
The fault management workflow is cyclical and continuous. It starts with fault detection, follows some steps until fault resolution, and ends where it began: fault detection. This is the general fault management cycle, as you may find below in more detail. However, any fault management system may implement a specific process that goes beyond the basic steps below.
Fault Detection
Consider that a fault management system is monitoring a network. At some point, it discovers an interruption in the service delivery or that the service delivery performance is deficient.
Let the investigation begin! Go to the next step.
Fault Diagnosis and Isolation
The fault management system determines the source of the fault and its location in the network topology.
OK. So, the system already knows where smoke is coming from. But you know a bad thing never comes alone. What if there are a bunch of fault events all related to each other? It’s time for some alarm grouping!
Fault Event Correlation and Aggregation, Plus Alarming
A single fault can trigger multiple alarms. But that could overwhelm the network administrators. And that’s why fault management systems combine related fault events and conduct a root cause analysis on them. Only after that do those systems fire an aggregated alarm for network administrators.
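One simple way to picture alarm grouping is a topology-based heuristic: if a device and something upstream of it are both alarming, blame the upstream device. The Python sketch below assumes a tree-shaped topology given as a hypothetical `upstream` mapping (device to its parent). Real root cause analysis is usually more involved than this.

```python
from collections import defaultdict


def root_cause(device, upstream, alarming):
    """Return the farthest-upstream alarming ancestor of device, or device itself."""
    root = device
    node = upstream.get(device)
    while node is not None:
        if node in alarming:
            root = node
        node = upstream.get(node)
    return root


def aggregate(alarming, upstream):
    """Group alarming devices by their presumed root cause."""
    groups = defaultdict(list)
    for device in sorted(alarming):
        groups[root_cause(device, upstream, alarming)].append(device)
    return dict(groups)
```

With this grouping, a failed switch and the hosts behind it produce one aggregated alarm for the switch instead of one alarm per device.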
The network fault buzzer sounded! Now what?
Restoration of Service
Once the alarm is out to the network administrator, the fault management system automatically performs a quick and simple fix. It executes programs or scripts to get the service up and running again as soon as possible.
Service automatically restored, available, and working? Check. But what if the kind of fault demands a less quick and more complex fix?
Fault Resolution
Depending on the complexity of the fault, automatic restoration of service may not be possible. In those cases, the network administrator or a competent technician performs a manual intervention.
In this last step of the workflow, someone manually solves the fault. The resolution may be a correction, a repair, or a replacement.
At this point, you may be wondering what you need to do to put things into practice. Allow me to show you the way in the next section.
Ok, but How Can I Start With Fault Management?
You can either develop your own fault management system or buy one. If you’re going to develop your own, I must say that agile methodologies are appropriate. You can start by working on the most important root causes and observed signs of fault. Or focus on an area of your network. Or even on a type of device or node.
After that, here are the main steps that building a fault management system specifically comprises:
- Define diagnostic goals for the system.
- Know and involve subject matter experts capable of providing reference knowledge.
- List possible root causes and observed signs of fault and prioritize them by impact and frequency.
- Specify thresholds.
- Test the system in a simulated environment with simulated values.
- Test the system in the real environment with live data.
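The prioritization step in the list above can be sketched as a simple scoring exercise. The `RootCause` class, the 1-to-5 impact scale, and the impact-times-frequency score are assumptions for illustration; your subject matter experts may prefer a different weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootCause:
    name: str
    impact: int     # hypothetical scale: 1 (minor) to 5 (severe)
    frequency: int  # hypothetical unit: observed occurrences per month


def prioritize(causes):
    """Sort root causes so the highest impact-times-frequency score comes first."""
    return sorted(causes, key=lambda c: c.impact * c.frequency, reverse=True)
```

The top of the sorted list is a reasonable place to start a first agile iteration.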
Now, imagine you’ve got an IoT network. That demands special care, as you’re about to find out next.
And What About IoT, Blockchain, and Cloud Computing?
FCAPS is useful for setting a straightforward common ground when talking about network management with corporate management. And it still applies today. However, IoT solutions didn’t exist when FCAPS was defined, and IoT sensors are likely to generate measurements that can be confused with faults. Fault detection, diagnosis, and isolation are thus vital in IoT networks to ensure accurate data sets.
FCAPS is quite appropriate for centralized single-provider environments. But in a blockchain, how do you know which provider is responsible for a fault? In a blockchain, fault management takes place by sharing the state of each vendor’s network across the entire blockchain. In case of a fault, the whole blockchain receives a data log.
When applications are on a cloud, they move from server to server according to load. As a result, fault detection is tougher with virtualized servers. But there’s more:
- Different tenants may experience a fault that originated from the same source (an overloaded server or an overloaded link).
- Also, the high number of devices, nodes, and links contributes to the likelihood of fault occurrence.
- The constant addition, upgrade, or replacement of devices contributes to configuration errors and, consequently, opportunities for faults.
- And a change in one device can affect others.
Keep in Mind: Scan, Detect, and Solve
An operational fault management system is one of the most important assets against actual or potential faults in a network. That system can smell the fault and go after its source. It does it nonstop.
Once the fault management system gets to the source of the fault, it studies the fault. As a result, it suggests a solution to those in charge. It may even automatically execute restoration programs or scripts to instantly fix the fault.
Nobody wants to offer service over a network that’s down! So, set up your fault management system to prevent and react to fault events in your network.
This post was written by Sofia Azevedo. Sofia has most recently taught college-level courses in IT, ICT, information systems, and computer engineering. She is fond of software development methods and processes. She started her career at Philips Research Europe and Nokia Siemens Networks as a software engineer. Sofia has also been a product owner, working in the development of software for domains such as telecom, marketing, and logistics.