Reliability

Summary

Reliability Overview

Software reliability is about how well an application provides its services. It is also about producing correct results and detecting and recovering from errors in order to avoid failures. More formally, software reliability is defined as the Mean Time Between Failures (MTBF): the average amount of time the application runs before a failure occurs. For example, the application's requirements specification may state that the application should provide a reliability of one failure for every 30 days of 24-hour operation (one failure every 720 hours). If testing shows that the application runs for 1900 hours with two failures, then the MTBF is 1900 / 2 = 950 hours. In this example, a reliability of 950 hours exceeds the stated requirement. An important notion of reliability is that software errors occur while the application is available and executing.
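The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the definition above; the function names mtbf_hours and meets_requirement are hypothetical and chosen for this sketch.

```python
def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean Time Between Failures: total operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return operating_hours / failures


def meets_requirement(observed_mtbf: float, required_mtbf: float) -> bool:
    """True when the observed MTBF is at least the required MTBF."""
    return observed_mtbf >= required_mtbf


# Requirement from the example: one failure per 30 days of 24-hour operation.
required = 30 * 24                  # 720 hours
# Test result from the example: 1900 hours of operation with two failures.
observed = mtbf_hours(1900, 2)      # 950.0 hours

print(observed, meets_requirement(observed, required))  # prints: 950.0 True
```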

The cost of application failure is often too high. In addition to not being able to sell and make money from unreliable applications, unreliable systems are often hard to maintain and improve.

The reliability of an enterprise application depends very much on the reliability of its individual components. These components include hardware, software modules, operating system services, and the human processes required to provide the expected business service. Because all components of an enterprise application are connected, a failure in one component can affect the reliability of many other components.

Application failures occur for many reasons:

It may seem that the solution to reliability resides strictly in software and hardware technology. In fact, the biggest improvement in reliability comes from having well-trained people, quality software engineering processes, and a serious ongoing commitment to reliability.

Designing For Reliability

As a design concept, reliability is about an application's ability to run without failures. Designing for reliability requires an in-depth look at how a particular application service is provided, an evaluation of failure scenarios, and a search for design alternatives. You simply cannot add reliability onto an application just before deployment.

There is a close relationship between reliability and availability. While reliability is about how long an application runs between failures, availability is about an application's ability to immediately begin handling all service requests and to recover quickly if a failure occurs (and hence to maximize the time when the application is available for use). A reliable application produces fewer failures from which to recover, and hence increases availability.
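One common way to quantify this relationship, which the text above does not state and which is brought in here as an assumption, is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to recover from a failure. The sketch below uses that formula; the function name and the recovery times are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the application is usable, assuming the common
    steady-state model: uptime / (uptime + recovery time).
    MTTR (mean time to recover) is an assumption of this sketch."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# With the same (hypothetical) 2-hour recovery time, the application with
# the longer time between failures is available for a larger share of time.
less_reliable = availability(mtbf_hours=720, mttr_hours=2)
more_reliable = availability(mtbf_hours=950, mttr_hours=2)
print(more_reliable > less_reliable)  # prints: True
```

Under the same model, shortening the recovery time raises availability even when the MTBF stays the same, which is why reliability and availability are related but not identical.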

Good reliability design engineering would attempt to do the following:

  1. Follow Windows application design guidelines.
  2. Add a reliability section in the application's requirements specification document.
  3. Use good architectural infrastructure.
  4. Build management and monitoring tools around the application.
  5. Use redundancy.
  6. Use built-in application health checks.
  7. Use consistent and robust error handling.

Follow Windows application design guidelines

A large amount of design information for reliable applications is found in a Microsoft design document entitled The Application Specification for Microsoft Windows 2000.

Add a reliability section in the application's requirements specification document

The design specification of an application should specify the expected reliability profile of the final operating application. Such a profile might be as simple as "the application should have a reliability MTBF of 1000 hours", or it might be a complicated description of expected usage patterns, fault categories and errors, quantitative reliability metrics, and expected quality assurance proofs.

Note that a reliability specification always contains uncertainty. First, it is difficult to estimate the actual required MTBF that satisfies both the anticipated business service level and the available budget. Second, the usage pattern of the software can change over time, which distorts the original reliability assumptions and may require the original reliability metrics to be reconsidered.

Reliability requirements and measurements do not generally consider the consequences of failures. This is an important point because, while some failures may have no consequences, others can cause data corruption or even loss of service. Therefore, when analyzing reliability problems, always prioritize maintenance with proper consideration of failure consequences.

Use good architectural infrastructure

In addition to having a unique set of design problems resulting from specific business requirements, applications also share a common and recurring set of infrastructure design problems such as the ones listed below:

If you have to write all of this infrastructure yourself, write it so that it can be reused in other applications. Otherwise, Microsoft .NET Application Blocks provide great 'starter kits' for most of the above issues. For example, .NET offers the Data Access Block, the Exception Management Block, Caching, and many others. By using these blocks (whether as provided or with some modifications), you can spend most of your time addressing business-related design issues and be confident that the underlying infrastructure is reliable and reusable.

Build management and monitoring tools around the application

During the design phase, it is very important to include automated management information as part of the coded infrastructure of the application. Such information might include the response time of each call, resource consumption statistics, system traces (ideally per user) with thread and other user metrics, and error warnings. This type of automated monitoring is a key best practice: it enables ongoing analysis, trend identification, and isolation of application problems before they become significant.

With built-in monitoring you can provide immediate failure notification and thereby speed recovery and avoid application failure. I have personally found that trace statements provide invaluable information for analyzing system performance and identifying/resolving problems. Traces and other built-in management information become more important as the way an application is used changes over time due to more users, larger workloads and increased integration with other system resources.
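As a minimal sketch of built-in tracing, the Python decorator below records the response time of each call and logs any error before re-raising it. The logger name, the decorator name monitored, and the lookup_order function are hypothetical and exist only for illustration; a real application would route these records to its own monitoring infrastructure.

```python
import functools
import logging
import time

logger = logging.getLogger("app.monitoring")  # hypothetical logger name


def monitored(func):
    """Log the duration of each call and any exception it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Record the failure with its traceback, then let it propagate.
            logger.exception("call to %s failed", func.__name__)
            raise
        finally:
            # Runs on success and on failure, so every call is timed.
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.1f ms", func.__name__, elapsed_ms)
    return wrapper


@monitored
def lookup_order(order_id: int) -> dict:
    """Hypothetical business call used only to demonstrate the decorator."""
    return {"order_id": order_id, "status": "shipped"}
```

Calling logging.basicConfig(level=logging.INFO) once at startup makes the timing records visible on the console; without any logging configuration, only the error records are shown.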

Note: Windows Management Instrumentation can be very helpful in building management and monitoring tools around the application. See WMI in Manageability.

Use redundancy

Redundant software, hardware, and data help an application remain reliable when failure conditions are detected, as follows:

Use built-in application health checks

Applications may need to run self-validation checks, either on a schedule or on a continuous basis. For example, a method may attempt to simulate a user action that exercises all major parts of the system, and then report on the outcome. If no results are obtained within a predefined amount of time, appropriate alarms must be raised.
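The idea of a timed self-check can be sketched as follows. This is an illustration under stated assumptions: probe stands for whatever simulated user action the application defines, and raise_alarm stands for its alerting mechanism; both names, like the two function names, are hypothetical.

```python
import concurrent.futures


def run_health_check(probe, timeout_seconds: float) -> str:
    """Run probe() in a worker thread; return 'ok', 'failed', or 'timeout'."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        future.result(timeout=timeout_seconds)
        return "ok"
    except concurrent.futures.TimeoutError:
        # No result within the predefined amount of time.
        return "timeout"
    except Exception:
        # The probe ran but reported a problem by raising.
        return "failed"
    finally:
        # Do not block on a hung probe; note that the worker thread itself
        # cannot be cancelled and keeps running until the probe returns.
        executor.shutdown(wait=False)


def check_and_alarm(probe, timeout_seconds: float, raise_alarm) -> str:
    """Run one health check and call raise_alarm(status) unless it passed."""
    status = run_health_check(probe, timeout_seconds)
    if status != "ok":
        raise_alarm(status)
    return status
```

A scheduler or background loop (not shown) would call check_and_alarm at the chosen interval.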

Use consistent and robust error handling

A well-designed distributed application should respond to all errors in a consistent manner. Errors in a distributed system may occur at any layer, so managing them requires a well-thought-out solution. This involves clear identification of the error condition and of where it happened, and a graceful resolution of the problem so that the application can continue running. In many cases, 'Retry on Error' logic can be integrated into the client-side code.
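One possible shape for client-side 'Retry on Error' logic is sketched below. The text does not prescribe a retry policy, so the choices here are assumptions: only errors marked as transient are retried, the number of attempts is capped, and the wait doubles after each failure. TransientError and call_with_retry are hypothetical names.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a dropped connection."""


def call_with_retry(operation, max_attempts: int = 3,
                    base_delay: float = 0.1, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry, doubling the delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the final failure
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Any other exception type propagates immediately, so non-recoverable errors are reported once and consistently rather than retried. The sleep parameter exists so that tests can substitute a function that does not actually wait.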

Testing for Reliability

Testing for reliability is about testing every feature in the application until all failures are discovered and removed before the system is deployed. However, in a complex application you will most likely not be able to find all possible failures, because of the many combinations of alternate pathways through the application. Nonetheless, you can still test the most likely usages under normal scenarios and verify that the application provides the requested service.

The following testing concepts are especially relevant to creating reliable applications:

Best Practices for Reliability

Software reliability extends across the entire development cycle. Given the major processes of a software lifecycle (requirements, design, development, testing, installation, maintenance, and uninstallation), how reliable is each process? The following best practices are recommended for creating highly reliable applications: