Performance and fault management (Data Communications and Networking) (2024)

Performance management means ensuring the network is operating as efficiently as possible, whereas fault management means preventing, detecting, and correcting faults in the network circuits, hardware, and software (e.g., a broken device or improperly installed software). Fault management and performance management are closely related because any faults in the network reduce performance. Both require network monitoring, which means keeping track of the operation of network circuits and devices to ensure they are functioning properly and to determine how heavily they are used.

A Day in the Life: Network Policy Manager

All large organizations have formal policies for the use of their networks (e.g., wireless LAN access, passwords, server space). Most large organizations have a special policy group devoted to the creation of network policies, many of which are devoted to network security. The job of the policy manager is to steer each policy through the policy-making process and ensure that all policies are in the best interests of the organization as a whole. Although policies are focused inside the organization, they are influenced by events both inside and outside the organization. The policy manager spends a significant amount of time working with outside organizations such as the U.S. Department of Homeland Security, CIO and security officer groups, and industry security consortiums. The goal is to make sure all policies (especially security policies) are up to date and provide a good balance between costs and benefits.

A typical policy begins with networking staff writing a summary containing the key points of the proposed policy. The policy manager takes the summary and uses it to develop a policy that fits the structure required for organizational policies (e.g., date, rationale, scope, responsible individuals, and procedures). The policy manager works with the originating staff to produce an initial draft of the proposed policy. Once everyone in the originating department and the policy office is satisfied with the policy, it is provided to an advisory committee of network users and network managers for discussion. Their suggestions are then incorporated in the policy, or an explanation is provided as to why the suggestions will not be incorporated.

After several iterations, a policy becomes a draft policy and is posted for comment from all users within the organization. Comments are solicited from interested individuals and the policy may be revised. Once the draft is finalized, the policy is then presented to senior management for approval. Once approved, the policy is formally published, and the organization charged with implementing the policy begins to use it to guide their operations.

Network Monitoring

Most large organizations and many smaller ones use network management software to monitor and control their networks. One function provided by these systems is to collect operational statistics from the network devices. For small networks, network monitoring is often done by one person, aided by a few simple tools (discussed later in this topic). These tools collect information and send messages to the network manager’s computer.

In large networks, network monitoring becomes more important. Large networks that support organizations operating 24 hours a day are often mission critical, which means a network problem can have serious business consequences. For example, consider the impact of a network failure for a common carrier such as AT&T or for the air traffic control system. These networks often have a dedicated network operations center (NOC) that is responsible for monitoring and fixing problems. Such centers are staffed by a set of skilled network technicians who use sophisticated network management software. When a problem occurs, the software immediately detects the problem and sends an alarm to the NOC.

Network Management Salaries

MANAGEMENT FOCUS

Network management is not easy, but it doesn’t pay too badly. Here are some typical jobs and their respective salaries.

Network Vice President	$150,000
Network Manager	$90,000
Telecom Manager	$77,000
LAN Administrator	$70,000
WAN Administrator	$75,000
Network Designer	$80,000
Network Technician	$60,000
Technical Support Staff	$50,000
Trainer	$50,000

Staff members in the NOC diagnose the problem and can sometimes fix it from the NOC (e.g., restarting a failed device). Other times, when a device or circuit fails, they must change routing tables to route traffic away from the device and inform the common carrier or dispatch a technician to fix or replace it.

Figure 13.2 shows the NOC at Indiana University. The NOC is staffed 24 hours a day, 7 days a week to monitor the networks at Indiana University. The NOC also has responsibility for managing portions of several very high-speed networks including the Abilene Network of Internet2 (see Management Focus Box 13.5).

The parameters monitored by a network management system fall into two distinct categories: physical network statistics and logical network information. Gathering statistics on the physical network parameters includes monitoring the operation of the network’s modems, multiplexers, circuits linking the various hardware devices, and any other network devices. Monitoring the physical network consists of keeping track of circuits that may be down and tracing malfunctioning devices. Logical network parameters include performance measurement systems that keep track of user response times, the volume of traffic on a specific circuit, the destination of data routed across various networks, and any other indicators showing the level of service provided by the network.

Some types of management software operate passively, collecting the information and reporting it back to the central NOC. Others are active, in that they routinely send test messages to the servers or application being monitored (e.g., an HTTP Web page request) and record the response times. One common type of monitoring approach is the network weather map, which displays the usage of all major circuits in the network in real time.
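The active approach described above can be sketched in a few lines of Python. This is only an illustration, not how any particular network management product works: the function names `measure_response_time` and `poll` are hypothetical, and `fake_probe` is a stand-in that sleeps briefly instead of contacting a real server.

```python
import time
from typing import Callable, List


def measure_response_time(probe: Callable[[], object]) -> float:
    """Send one test message and return the elapsed time in seconds."""
    start = time.perf_counter()
    probe()  # e.g., an HTTP request to the monitored Web server
    return time.perf_counter() - start


def poll(probe: Callable[[], object], samples: int) -> List[float]:
    """Send several test messages in a row and record each response time."""
    return [measure_response_time(probe) for _ in range(samples)]


def fake_probe() -> None:
    """Hypothetical stand-in for a real test message; simulates server delay."""
    time.sleep(0.01)


times = poll(fake_probe, samples=3)
print(f"average response time: {sum(times) / len(times):.3f} s")
```

To monitor a real Web server, the probe could be replaced with something like `lambda: urllib.request.urlopen(url, timeout=5).read()`, where `url` is whatever page the organization chooses to test.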

Figure 13.2 The Global Research Network Operations Center at Indiana University.

Performance tracking is important because it enables the network manager to be proactive and respond to performance problems before users begin to complain. Poor network reporting leads to an organization that is overburdened with current problems and lacks time to address future needs. Management requires adequate reports if it is to address future needs.

Failure Control Function

Failure control requires developing a central control philosophy for problem reporting, whether the problems are first identified by the NOC or by users calling in to the NOC or a help desk. Whether problem reporting is done by the NOC or the help desk, the organization should maintain a central telephone number for network users to call when any problem occurs in the network. As a central troubleshooting function, only this group or its designee should have the authority to call hardware or software vendors or common carriers.

Many years ago, before the importance (and cost) of network management was widely recognized, most networks ignored the importance of fault management. Network devices were "dumb" in that they did only what they were designed to do (e.g., routing packets) but did not provide any network management information.

For example, suppose a network interface card fails and begins to transmit garbage messages randomly. Network performance immediately begins to deteriorate because these random messages destroy the messages transmitted by other computers, which need to be retransmitted. Users notice a delay in response time and complain to the network support group, which begins to search for the cause. Even if the network support group suspects a failing network card (which is unlikely unless such an event has occurred before), locating the faulty card is very difficult and time consuming.

Internet2 Weather Map

MANAGEMENT FOCUS

The Abilene network is an Internet2 high-performance backbone that connects regional gigapops to provide high-speed network services to over 220 Internet2 university, corporate, and affiliate member institutions in all 50 states, the District of Columbia, and Puerto Rico. The current network is primarily an OC-192c (10 Gbps) backbone employing optical transport technology and advanced high-performance routers.

The network is monitored 24 hours a day, seven days a week from the network operations center (NOC) located on the campus of Indiana University in Indianapolis. The NOC oversees problem, configuration, and change management; network security; performance and policy monitoring; reporting; quality assurance; scheduling; and documentation. The NOC provides a structured environment that effectively coordinates operational activities with all participants and vendors related to the function of the network.

The NOC uses multiple network management software packages running across several platforms. Figure 13.3 shows one of the tools used by the NOC that is available to the general public: the Internet2 Weather Map. Each of the major circuits connecting the major Abilene gigapops is shown on the map. Each link has two parts, showing the utilization of the circuits to and from each pair of gigapops. The links are color-coded to quickly show the utilization of the link. Figure 13.3 is not in color so it is difficult to read, but if you visit the Abilene Web site (the URL is listed below), you can see that circuits with very low utilization are different shades of blue, which turn to green and then yellow and orange as utilization increases to 10 percent of capacity. Once utilization climbs above 30 percent, the link is shown in deeper shades of red and then purple. If you look back at the photo in Figure 13.2 you’ll see the weather map displayed on the large screen in the NOC.

The link from the Chicago gigapop to the New York City gigapop, for example, indicates that over the last few minutes, an average of 546 Mbps has been transmitted, giving a 10 percent utilization. The link from New York City to Chicago shows that over the last few minutes, an average of 6.2 Gbps has been transmitted, giving a 70 percent utilization.

Technical Reports

TECHNICAL FOCUS

Technical reports that are helpful to network managers are those that provide summary information, as well as details that enable the managers to improve the network. Technical details include:

• Circuit use

• Usage rate of critical hardware such as host computers, front-end processors, and servers

• File activity rates for database systems

• Usage by various categories of client computers

• Response time analysis per circuit or per computer

• Voice versus data usage per circuit

• Queue-length descriptions, whether in the host computer, in the front-end processor, or at remote sites

• Distribution of traffic by time of day, location, and type of application software

• Failure rates for circuits, hardware, and software

• Details of any network faults

Problem prioritizing helps ensure that critical problems get priority over less important ones. For example, a network support staff member should not work on a problem on one client computer if an entire circuit with dozens of computers is waiting for help. Moreover, a manager must know whether problem-resolution objectives are being met. For example, how long is it taking to resolve critical problems?

Management reports are required to determine network availability, product and vendor reliability (mean time between failures), and vendor responsiveness. Without them, a manager has nothing more than a "best guess" estimate for the effectiveness of either the network’s technicians or the vendor’s technicians.

The purposes of the trouble log are to record problems that must be corrected and to keep track of statistics associated with these problems. For example, the log might reveal that there were 37 calls for software problems (3 for one package, 4 for another package, and 30 for a third software package), 26 calls for cable modem problems evenly distributed between two vendors, 49 calls for client computers, and 2 calls to the common carrier that provides the network circuits. These data are valuable when the design and analysis group begins redesigning the network to meet future requirements.
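A trouble log kept in electronic form makes these tallies easy to produce. The sketch below rebuilds the call counts from the example above and summarizes them; the package and vendor names are hypothetical placeholders, and a real log would carry many more fields per entry.

```python
from collections import Counter

# Each entry is (category, detail). The counts mirror the example in the text;
# "package A", "vendor X", and so on are made-up labels for illustration.
trouble_log = (
    [("software", "package A")] * 3
    + [("software", "package B")] * 4
    + [("software", "package C")] * 30
    + [("cable modem", "vendor X")] * 13
    + [("cable modem", "vendor Y")] * 13
    + [("client computer", "")] * 49
    + [("common carrier", "")] * 2
)

# Calls per problem category, most frequent first.
calls_by_category = Counter(category for category, _ in trouble_log)
for category, count in calls_by_category.most_common():
    print(f"{category}: {count}")

# Within software problems, find the package generating the most calls.
software_by_package = Counter(
    detail for category, detail in trouble_log if category == "software"
)
print(software_by_package.most_common(1))
```

A summary like this quickly shows that one software package accounts for most of the software calls, which is the kind of pattern the design and analysis group would want to see.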

Performance and Failure Statistics

There are many different types of failure and recovery statistics that can be collected. The most obvious performance statistics are those discussed above: how many packets are being moved on what circuits and what the response time is. Failure statistics also tell an important story.

One important failure statistic is availability, the percentage of time the network is available to users. It is calculated as the number of hours per month the network is available divided by the total number of hours per month (i.e., 24 hours per day × 30 days per month = 720 hours).

Elements of a Trouble Report

TECHNICAL FOCUS

When a problem is reported, the trouble log staff members should record the following:

• Time and date of the report

• Name and telephone number of the person who reported the problem

• The time and date of the problem (and the time and date of the call)

• Location of the problem

• The nature of the problem

• When the problem was identified

• Why and how the problem happened

The downtime includes times when the network is unavailable because of faults and routine maintenance and network upgrades. Most network managers strive for 99 to 99.5 percent availability, with downtime scheduled after normal working hours.
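To see what these availability targets mean in practice, the calculation can be turned around to give a monthly downtime budget. The sketch below uses the 720-hour month from the definition of availability; the function names are illustrative only.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the convention used in the text


def availability(hours_down: float, total_hours: float = HOURS_PER_MONTH) -> float:
    """Percentage of the period during which the network was available."""
    return 100.0 * (total_hours - hours_down) / total_hours


def allowed_downtime(target_percent: float, total_hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime per period that still meet the availability target."""
    return total_hours * (1.0 - target_percent / 100.0)


print(f"{availability(3.6):.1f} percent available")
print(f"{allowed_downtime(99.0):.1f} hours allowed at 99 percent")
print(f"{allowed_downtime(99.5):.1f} hours allowed at 99.5 percent")
```

A 99 percent target therefore leaves about 7.2 hours per month for faults, maintenance, and upgrades combined, and a 99.5 percent target leaves about 3.6 hours.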

The mean time between failures (MTBF) is the number of hours or days of continuous operation before a component fails. Obviously, devices with higher MTBF are more reliable.

When faults occur and devices or circuits go down, the mean time to repair (MTTR) is the average number of minutes or hours until the failed device or circuit is operational again. The MTTR is composed of three separate elements: MTTRepair = MTTDiagnose + MTTRespond + MTTFix.

The mean time to diagnose (MTTD) is the average number of minutes until the root cause of the failure is correctly diagnosed. This is an indicator of the efficiency of problem management personnel in the NOC or help desk who receive the problem report.

The mean time to respond (MTTR, not to be confused with the mean time to repair, which shares the abbreviation) is the average number of minutes or hours until service personnel arrive at the problem location to begin work on the problem. This is a valuable statistic because it indicates how quickly vendors and internal groups respond to emergencies. Compilation of these figures over time can lead to a change of vendors or internal management policies or, at the minimum, can exert pressure on vendors who do not respond to problems promptly.

Finally, after the vendor or internal support group arrives on the premises, the last statistic is the mean time to fix (MTTF). This figure tells how quickly the staff is able to correct the problem after they arrive. A very long time to fix in comparison with the time of other vendors may indicate faulty equipment design, inadequately trained customer service technicians, or even the fact that inexperienced personnel are repeatedly sent to fix problems.

For example, suppose your Internet connection at home stops working. You call your ISP, and they fix it over the phone in 15 minutes. In this case, the MTTRepair is 15 minutes, and it is hard to separate the different parts (diagnose, respond, and fix). Suppose instead that you call your ISP and spend 60 minutes on the phone with them, and they can’t fix it over the phone; the technician arrives the next day (18 hours later) and spends one hour fixing the problem. In this case MTTRepair = 1 hour (diagnose) + 18 hours (respond) + 1 hour (fix) = 20 hours.
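When the three parts are recorded separately for each incident, the averages fall out directly. The following is a minimal sketch under that assumption; the `Incident` record and the second incident in `history` are hypothetical, while the first incident is the ISP example above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    diagnose_hours: float  # time until the root cause is correctly diagnosed
    respond_hours: float   # time until service personnel arrive
    fix_hours: float       # time on site until the problem is corrected

    @property
    def repair_hours(self) -> float:
        """Total time to repair for this one incident."""
        return self.diagnose_hours + self.respond_hours + self.fix_hours


def mean_times(incidents: List[Incident]) -> Dict[str, float]:
    """Average each component, and the total, across all incidents."""
    return {
        "MTTDiagnose": mean([i.diagnose_hours for i in incidents]),
        "MTTRespond": mean([i.respond_hours for i in incidents]),
        "MTTFix": mean([i.fix_hours for i in incidents]),
        "MTTRepair": mean([i.repair_hours for i in incidents]),
    }


isp_visit = Incident(diagnose_hours=1, respond_hours=18, fix_hours=1)
print(isp_visit.repair_hours)  # 20, matching the example in the text

# A second, made-up incident so that the averages are meaningful.
history = [isp_visit, Incident(diagnose_hours=2, respond_hours=4, fix_hours=3)]
print(mean_times(history))
```

Keeping the components separate is what lets a manager tell whether a long repair time comes from slow diagnosis, a slow vendor response, or a slow fix.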

Management Reports

TECHNICAL FOCUS

Management-oriented reports that are helpful to network managers and their supervisors provide summary information for overall evaluation and for network planning and design. Details include:

• Graphs of daily/weekly/monthly usage, number of errors, or whatever is appropriate to the network

• Network availability (uptime) for yesterday, the last 5 days, the last month, or any other specific period

• Percentage of hours per week the network is unavailable because of network maintenance and repair

• Fault diagnosis

• Whether most response times are less than or equal to 3 seconds for online real-time traffic

• Whether management reports are timely and contain the most up-to-date statistics

• Peak volume statistics as well as average volume statistics per circuit

• Comparison of activity between today and a similar previous period

The MTBF can be influenced by the original selection of vendor-supplied equipment. The MTTD relates directly to the ability of network personnel to isolate and diagnose failures and can often be improved by training. The MTTR (respond) can be influenced by showing vendors or internal groups how good or bad their response times have been in the past. The MTTF can be affected by the technical expertise of internal or vendor staff and the availability of spare parts onsite.

Another set of statistics that should be gathered are those collected daily by the network operations group, which uses network management software. These statistics record the normal operation of the network, such as the number of errors (retransmissions) per communication circuit. Statistics also should be collected on the daily volume of transmissions (characters per hour) for each communication circuit, each computer, or whatever is appropriate for the network. It is important to closely monitor usage rates, the percentage of the theoretical capacity that is being used. These data can identify computers/devices or communication circuits that have higher-than-average error or usage rates, and they may be used for predicting future growth patterns and failures. A device or circuit that is approaching maximum usage obviously needs to be upgraded.

Such predictions can be accomplished by establishing simple quality control charts similar to those used in manufacturing. Programs use an upper control limit and a lower control limit with regard to the number of blocks in error per day or per week. Notice how Figure 13.4 identifies when the common carrier moved a circuit from one microwave channel to another (circuit B), how a deteriorating circuit can be located and fixed before it goes through the upper control limit (circuit A) and causes problems for the users, or how a temporary high rate of errors (circuit C) can be encountered when installing new hardware and software.

Figure 13.4 Quality control chart for circuits
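A control chart of this kind can be approximated with a short program. The text does not say how the control limits are chosen, so the sketch below assumes the common manufacturing convention of the baseline mean plus or minus three standard deviations; the error counts are invented for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def control_limits(baseline: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits: baseline mean -/+ k standard deviations.

    The k = 3 default is an assumption, not a rule taken from the text.
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    return max(0.0, center - k * spread), center + k * spread


def out_of_control(daily_errors: List[float], limits: Tuple[float, float]) -> List[int]:
    """Indexes of the days whose error count falls outside the limits."""
    lower, upper = limits
    return [
        day for day, errors in enumerate(daily_errors)
        if not lower <= errors <= upper
    ]


# Hypothetical blocks in error per day during normal operation.
baseline = [20, 22, 19, 21, 23, 20, 18, 21]
limits = control_limits(baseline)

# A circuit that is steadily deteriorating.
recent = [21, 24, 27, 31, 36]
print(limits)
print(out_of_control(recent, limits))
```

With these numbers the limits work out to 16 and 25 blocks in error per day, and the last three days of the deteriorating circuit are flagged. In practice the upward trend on the first two days would already be a reason to investigate before the upper limit is crossed.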

Improving Performance

The topics on LANs, BNs, MANs, and WANs discussed several specific actions that could be taken to improve network performance for each of those types of networks. There are also several general activities to improve performance that cut across the different types of networks.

Policy-Based Management

A new approach to managing performance is policy-based management. With policy-based management, the network manager uses special software to set priority policies for network traffic that take effect when the network becomes busy. For example, the network manager might say that order processing and videoconferencing get the highest priority (order processing because it is the lifeblood of the company and videoconferencing because poor response time will have the greatest impact on it). The policy management software would then configure the network devices using the QoS capabilities in TCP/IP and/or ATM to give these applications the highest priority when the devices become busy.
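The effect of such a policy on a busy device can be pictured with a toy model. The sketch below is not how policy management software or QoS mechanisms are actually implemented; it simply assumes a strict-priority queue in which waiting packets from higher-priority applications are always sent first. The `POLICY` table, its priority numbers, and the class name are all hypothetical.

```python
import heapq
from itertools import count
from typing import List, Tuple

# Lower number = higher priority. Applications and levels are illustrative
# assumptions, not settings from any particular product.
POLICY = {
    "order processing": 0,
    "videoconferencing": 0,
    "email": 2,
    "web browsing": 3,
}
DEFAULT_PRIORITY = 5  # anything not named in the policy goes last


class PriorityScheduler:
    """Toy model of a busy device that sends high-priority traffic first."""

    def __init__(self) -> None:
        self._queue: List[Tuple[int, int, str, str]] = []
        # The arrival counter keeps first-in, first-out order within a priority.
        self._arrival = count()

    def enqueue(self, application: str, packet: str) -> None:
        priority = POLICY.get(application, DEFAULT_PRIORITY)
        heapq.heappush(self._queue, (priority, next(self._arrival), application, packet))

    def dequeue(self) -> Tuple[str, str]:
        _, _, application, packet = heapq.heappop(self._queue)
        return application, packet


scheduler = PriorityScheduler()
scheduler.enqueue("email", "p1")
scheduler.enqueue("web browsing", "p2")
scheduler.enqueue("order processing", "p3")
scheduler.enqueue("videoconferencing", "p4")
print([scheduler.dequeue() for _ in range(4)])
```

Although the e-mail and Web packets arrived first, the order processing and videoconferencing packets leave the queue ahead of them.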

Server Load Balancing

Load balancing, as the name suggests, means to allocate incoming requests for network services (e.g., Web requests) across a set of equivalent servers so that the work is spread fairly evenly across all devices. With load balancing, a separate load-balancing server (sometimes called a virtual server), or a router or switch with special load-balancing software, allocates the requests among a set of identical servers using a simple round-robin formula (requests go to each server one after the other in turn) or more complex formulas that track how busy each server actually is. If a server crashes, the load balancer stops sending requests to it and the network continues to operate without the failed server.
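The simple round-robin formula, including the behavior when a server crashes, can be sketched as follows. The class and the server names are hypothetical, and a real load balancer would detect failures itself rather than being told about them.

```python
from typing import List, Set


class RoundRobinBalancer:
    """Hands requests to each server in turn, skipping servers marked as failed."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        self.failed: Set[str] = set()
        self._next = 0  # index of the server whose turn comes next

    def mark_failed(self, server: str) -> None:
        self.failed.add(server)

    def mark_recovered(self, server: str) -> None:
        self.failed.discard(server)

    def pick(self) -> str:
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server not in self.failed:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
print([balancer.pick() for _ in range(4)])  # each server in turn, then back to web1

balancer.mark_failed("web2")
print([balancer.pick() for _ in range(3)])  # web2 no longer receives requests
```

The more complex formulas mentioned above would replace the fixed rotation in `pick` with a choice based on how busy each server currently is.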

Service-Level Agreements

More organizations establish service-level agreements (SLAs) with their common carriers and Internet service providers. An SLA specifies the exact type of performance and fault conditions that the organization will accept. For example, the SLA might state that network availability must be 99 percent or higher and that the MTBF for T1 circuits must be 120 days or more. In many cases, the SLA includes maximum allowable response times. The SLA also states what compensation the service provider must provide if it fails to meet the SLA. Some organizations are also starting to use an SLA internally to define relationships between the networking group and its organizational "customers."

Inside a Service-Level Agreement

TECHNICAL FOCUS

There are many elements to a solid service-level agreement (SLA) with a common carrier. Some of the important ones include:

• Network availability, measured over a month as the percentage of time the network is available (e.g., [total hours - hours unavailable]/total hours) should be at least 99.5 percent

• Average round-trip permanent virtual circuit (PVC) delay, measured over a month as the time it takes a message to travel over the PVC from sender to receiver and back, should be less than 110 milliseconds, although some carriers will offer discounted services for SLA guarantees of 300 milliseconds or less

• PVC throughput, measured over a month as the number of inbound packets received at the destination divided by the number of outbound packets sent over the PVC (not counting packets over the committed information rate, which are discard eligible), should be above 99 percent and ideally 99.99 percent

• Mean time to respond, measured as a monthly average of the time from inception of trouble ticket until repair personnel are on site, should be 4 hours or less

• Mean time to fix, measured as a monthly average of the time from the arrival of repair personnel on-site until the problem is repaired, should be 4 hours or less
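Once a month's measurements are in hand, checking them against thresholds like those listed above is mechanical. The sketch below hard-codes the example thresholds from this list; the record layout and the sample figures are hypothetical, and an actual SLA would define its own terms and measurement methods.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonthlyMeasurements:
    availability_pct: float        # percentage of time the network was available
    round_trip_delay_ms: float     # average round-trip PVC delay
    throughput_pct: float          # packets received as a percentage of packets sent
    mean_time_to_respond_h: float  # trouble ticket opened until personnel on site
    mean_time_to_fix_h: float      # personnel on site until problem repaired


def sla_violations(m: MonthlyMeasurements) -> List[str]:
    """Return the SLA elements that were missed this month (empty if none)."""
    violations = []
    if m.availability_pct < 99.5:
        violations.append("availability below 99.5 percent")
    if m.round_trip_delay_ms >= 110:
        violations.append("round-trip delay not under 110 ms")
    if m.throughput_pct <= 99.0:
        violations.append("throughput not above 99 percent")
    if m.mean_time_to_respond_h > 4:
        violations.append("mean time to respond over 4 hours")
    if m.mean_time_to_fix_h > 4:
        violations.append("mean time to fix over 4 hours")
    return violations


good_month = MonthlyMeasurements(99.7, 85.0, 99.95, 2.5, 3.0)
bad_month = MonthlyMeasurements(99.2, 130.0, 99.4, 6.0, 3.5)
print(sla_violations(good_month))
print(sla_violations(bad_month))
```

A report of this kind gives the organization a factual basis for claiming whatever compensation the SLA specifies when the provider misses a target.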