Reliability

Summary

Reliability Overview
Designing For Reliability
Testing for Reliability
Best Practices for Reliability

Reliability Overview

Software reliability is about how well an application provides its services. It is also about providing correct results and detecting and recovering from errors in order to avoid failures. More formally, software reliability is defined as the Mean Time Between Failures (MTBF): the average amount of time the application runs before a failure occurs. For example, the application's requirements specification may state that the application should experience no more than one failure for every 30 days of 24-hour operation (one failure every 720 hours). If testing shows that the application runs for 1900 hours with two failures, then the MTBF is 1900 / 2 = 950 hours. In this example, an MTBF of 950 hours exceeds the stated requirement. An important notion of reliability is that software errors occur while the application is available and executing.
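
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the calculation; the function names mtbf_hours and meets_requirement are made up for this sketch and do not come from any library.

```python
def mtbf_hours(operating_hours, failures):
    """Mean Time Between Failures: operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return operating_hours / failures


def meets_requirement(observed_mtbf, required_mtbf):
    """A higher MTBF is better, so the observed value must reach the target."""
    return observed_mtbf >= required_mtbf


required = 30 * 24               # one failure per 30 days of 24-hour operation
observed = mtbf_hours(1900, 2)   # 1900 test hours with two failures

print(required, observed, meets_requirement(observed, required))
# prints: 720 950.0 True
```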

The cost of application failure is often too high. Unreliable applications are hard to sell and make money from, and unreliable systems are often hard to maintain and improve.

Reliability of an enterprise application depends very much on the reliability of its individual components. These components include hardware, software modules, operating system services and human processes required to provide the expected business service. Because all components of an enterprise application are connected, a failure in one component can affect the reliability of many other components.

Application failures occur for many reasons:

Weak code.
Inadequate testing.
Hardware failures.
Change management problems.
Lack of ongoing monitoring and analysis.
Operations error.
Unusual events (security failures, connectivity, etc.)
Unhandled errors.
and so on.

It may seem that the solution to reliability resides strictly in software and hardware technology. In fact the biggest improvement to reliability comes from having well-trained people, quality software engineering processes, and a serious ongoing commitment to reliability.

Designing For Reliability

As a design concept, reliability is about an application's ability to run without failures. Designing for reliability requires taking an in-depth look at how a particular application service is provided, evaluating failure scenarios, and finding design alternatives. You simply cannot bolt reliability onto an application just before deployment.

There is a close relationship between reliability and availability. While reliability is about how long an application runs between failures, availability is about an application's ability to immediately begin handling all service requests and to recover quickly if a failure occurs (and hence maximize the time when the application is available for use). Obviously, a reliable application has fewer failures from which to recover, which increases availability.
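
One common way to put a number on this relationship is the steady-state availability ratio MTBF / (MTBF + MTTR), where MTTR is the mean time to recover from a failure. MTTR is not discussed in this article, so treat the sketch below as an assumption-laden illustration rather than part of the original guidance.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the application is usable.

    mtbf_hours: average running time between failures.
    mttr_hours: average time to recover after a failure (an assumed metric,
                not one defined in this article).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Fewer failures (higher MTBF) and faster recovery (lower MTTR) both push
# the ratio toward 1.0.
print(availability(950, 2) > availability(720, 2))
print(availability(950, 1) > availability(950, 2))
```

Both comparisons print True, which matches the point made above: improving reliability or shortening recovery each raises availability.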

Good reliability design engineering would attempt to do the following:

Follow Windows application design guidelines.
Add a reliability section in the application's requirements specification document.
Use good architectural infrastructure.
Build management and monitoring tools around the application.
Use redundancy.
Use built-in application health checks.
Use consistent and robust error handling.

Follow Windows application design guidelines

A large amount of design information for reliable applications is found in a Microsoft design document entitled The Application Specification for Microsoft Windows 2000.

Add a reliability section in the application's requirements specification document

The design specification of an application should specify the expected reliability profile of the final operating application. Such a profile might be as simple as "the application should have a reliability MTBF of 1000 hours", or it might be a complicated description of expected usage patterns, fault categories and errors, quantitative reliability metrics, and expected quality assurance proofs.

It must be noted that a reliability specification always contains uncertainty. First, it is difficult to estimate the actual required MTBF that satisfies both the anticipated business service level and the available budget. Second, the usage pattern of the software can change over time. This distorts the original reliability assumptions and may require reconsideration of the original reliability metrics.

Reliability requirements and measurements do not generally consider the consequences of failures. This is an important point to note because while some failures may have no consequences, other failures can cause data corruption or even loss of service. Therefore, when analyzing reliability problems, always be sure to prioritize maintenance with proper consideration of failure consequences.

Use good architectural infrastructure

In addition to having a unique set of design problems resulting from specific business requirements, applications also share a common and recurring set of infrastructure design problems such as the ones listed below:

Data access
Exception management
Caching
Security authentication
Thread pooling
Network load balancing
Asynchronous messaging and method calling
Publish-subscribe event handling
Component management
Distributed installation and version control
Message instrumentation and monitoring

If you have to write this infrastructure yourself, write it so that it can be reused in other applications. Otherwise, Microsoft .NET Application Blocks provide great 'starter kits' for most of the above issues. For example, .NET offers the Data Access Block, the Exception Management Block, Caching, and many others. By using these blocks (whether as provided or with some modifications), you can spend most of your time addressing business-related design issues and be confident that the underlying infrastructure is reliable and reusable.

Build management and monitoring tools around the application

During the design phase, it is very important to include automated management information as part of the coded infrastructure of the application. Such information might include response times of each call, resource consumption statistics, system traces (ideally per user) with thread and other user metrics, and error warnings. This type of automated monitoring is a key best practice: it enables ongoing analysis, trend identification, and isolation of application problems before they become significant.

With built-in monitoring you can provide immediate failure notification and thereby speed recovery and avoid application failure. I have personally found that trace statements provide invaluable information for analyzing system performance and identifying/resolving problems. Traces and other built-in management information become more important as the way an application is used changes over time due to more users, larger workloads and increased integration with other system resources.
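
As a rough illustration of building trace statements into the code, the sketch below wraps a call so that its response time and outcome are logged on every invocation. It uses Python's standard logging module. The decorator name monitored, the logger name app.monitoring, and the message layout are assumptions chosen for this example.

```python
import functools
import logging
import time

# Hypothetical logger name; route it to whatever sink operations staff watch.
logger = logging.getLogger("app.monitoring")


def monitored(func):
    """Log the response time and outcome of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # logger.exception records the traceback along with the message.
            logger.exception("call=%s status=error elapsed_ms=%.2f",
                             func.__name__, elapsed_ms)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("call=%s status=ok elapsed_ms=%.2f",
                    func.__name__, elapsed_ms)
        return result
    return wrapper
```

Decorating a service method with @monitored leaves its behavior unchanged: return values and exceptions pass through, and each call adds one trace line that can later be analyzed for trends.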

Note: Windows Management Instrumentation can be very helpful in building management and monitoring tools around the application. See WMI in Manageability.

Use redundancy

Redundant software, hardware, and data help maintain reliability when failure conditions are detected, as follows:

For software components, redundancy might consist of double or even triple-redundant software components running in parallel. Closely related to this concept of software redundancy is the use of .NET platform services such as clustering, load balancing, data replication, and protecting complex operations with transactions.
For hardware components, redundant hardware is one of the most common strategies for providing a reliable system. This includes the use of RAID for data storage, multiple network interface cards, and redundant power supplies.
For data redundancy, multiple databases can be synchronized through data replication. This not only provides reliability benefits, but when coupled with network load balancing (NLB) actually enhances performance.
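
At the software level, the simplest form of redundancy is to try an equivalent replica when one fails. The sketch below shows that idea only. The name call_with_failover and the (name, handler) pair layout are hypothetical, and real clustering or load-balancing services do considerably more than this.

```python
def call_with_failover(replicas, request):
    """Try each redundant replica in order and return the first success.

    replicas: list of (name, handler) pairs, where handler is any callable
              that services the request (hypothetical structure).
    Returns a (replica_name, result) tuple.
    """
    errors = []
    for name, handler in replicas:
        try:
            return name, handler(request)
        except Exception as exc:
            # Remember the failure and fall through to the next replica.
            errors.append((name, repr(exc)))
    raise RuntimeError("all %d replicas failed: %s" % (len(replicas), errors))
```

The returned replica name lets the caller record which component actually served the request, which ties redundancy back to monitoring: a primary that keeps failing over should show up in the logs.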

Use built-in application health checks

Applications may need to run self-validation checks, either on a schedule or continuously. For example, a method may attempt to simulate a user action that exercises all major parts of the system, and then report on the outcome. If no results are obtained within a predefined amount of time, appropriate alarms must be raised.
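
A minimal sketch of such a check follows, assuming the simulated user action is available as a callable named probe and that raise_alarm is whatever notification hook the application uses. Both names are hypothetical. The probe runs on a worker thread so that a hung probe cannot block the checker itself.

```python
import threading


def run_health_check(probe, timeout_seconds, raise_alarm):
    """Run probe() and raise an alarm if it fails or does not finish in time.

    probe:        callable simulating a user action; truthy result means healthy.
    raise_alarm:  callable taking a message string (hypothetical hook).
    Returns True when the application looks healthy, otherwise False.
    """
    outcome = {}

    def target():
        try:
            outcome["ok"] = bool(probe())
        except Exception as exc:
            outcome["ok"] = False
            outcome["error"] = repr(exc)

    # A daemon thread will not keep the process alive if the probe hangs.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)

    if worker.is_alive():
        raise_alarm("health check timed out after %.2f s" % timeout_seconds)
        return False
    if not outcome.get("ok"):
        detail = outcome.get("error", "probe reported an unhealthy state")
        raise_alarm("health check failed: %s" % detail)
        return False
    return True
```

A scheduler or background loop would call run_health_check at whatever interval the requirements specification calls for.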

Use consistent and robust error handling

A well-designed distributed application should respond to all errors in a consistent manner. Errors in a distributed system may occur at any layer, and this requires a well-thought-out solution for managing errors. This involves clear identification of the error condition and where it happened, and a graceful resolution of the problem so that the application may continue running. In many cases, 'Retry on Error' logic can be integrated into the client-side code.
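
The 'Retry on Error' idea can be sketched as follows. The exponential backoff, the default of three attempts, and the choice of which exception types count as transient are assumptions made for this example. The function name retry_on_error is likewise hypothetical.

```python
import time


def retry_on_error(operation, attempts=3, base_delay=0.1,
                   retry_on=(ConnectionError, TimeoutError),
                   sleep=time.sleep):
    """Call operation() and retry it when a transient error occurs.

    Only exception types listed in retry_on are retried; anything else is
    treated as a genuine failure and propagates immediately. The delay
    doubles after each failed attempt (exponential backoff).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
```

Limiting retries to errors known to be transient matters: retrying a request that failed validation, for example, only delays the inevitable error and hides it from monitoring.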

Testing for Reliability

Testing for reliability is about testing every feature in the application until all failures are discovered and removed before the system is deployed. However, in a complex application you will most likely not be able to find all possible failures, because of the many combinations of alternate pathways through the application. Nonetheless, you can still test the most likely usages under normal scenarios and verify that the application provides the requested service.

The following testing concepts are especially relevant to creating reliable applications:

Use Component Stress Testing
Stress testing is about simulating large workloads to evaluate how the system behaves under peak usage conditions. With component stress testing you isolate each component (or service) and create test harnesses that exercise each method exposed by that component. The idea here is to test methods on each separate component using conditions that would far exceed what the overall application would experience. For example, you could call each method with a loop that loops 1,000,000 times and observe results to discover if there are any memory-related or thread-related issues.
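
The loop described above might look like the harness below. It is a sketch that covers only the memory side: it uses Python's standard tracemalloc module to compare traced memory before and after the loop. The name stress_method and the growth threshold are assumptions, and thread-related issues would need a separate, concurrent harness.

```python
import tracemalloc


def stress_method(method, iterations=1_000_000, growth_limit_bytes=1_000_000):
    """Call method(i) repeatedly and report failures and memory growth.

    growth_limit_bytes is an arbitrary threshold chosen for this sketch;
    tune it to the component under test.
    """
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    failures = 0
    for i in range(iterations):
        try:
            method(i)
        except Exception:
            failures += 1  # count the failure and keep stressing
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    return {
        "iterations": iterations,
        "failures": failures,
        "growth_bytes": growth,
        "peak_bytes": peak,
        "leak_suspected": growth > growth_limit_bytes,
    }
```

Memory that is still allocated after the loop ends is a hint, not proof, of a leak: a component that legitimately caches results will also show growth, so the report should be read alongside knowledge of the component.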

The Web Application Stress Tool and Application Center Test are two tools that help you stress-test your components.
Use Integration Stress Testing
After stress-testing each component in isolation, you should stress-test the entire application with all its components and supporting services. Integration stress-testing is largely concerned with interactions with other services, processes and data structures.

With integration stress-testing you need to know usage scenarios, simulate what users are trying to do, and more importantly know all the ways a user goes through the application. This means that test scripts should attempt to exercise the application according to the intended usage.
Use Real-World Testing
Software that is reliable in isolation and in a controlled environment may not be so reliable in real-world deployment. A real-world test ensures that the application can happily coexist with other concurrent applications. Real-world testing often discovers failures due to interactions with other applications.

Therefore, the test plan should ensure that the application runs without failures in the final target environment, that is, in the server space and with the final configuration.
Use Random Destruction Testing
One of the easiest ways to test for reliability is to test with random input. This kind of testing attempts to crash the application by providing erroneous and non-logical input. Note that this erroneous input is not just simple keystrokes that provide strings where numbers are expected. Erroneous and non-logical input covers a wide range, including simple keystrokes, simulated program events, simulated caches, corrupted database tables, or any other input that can be forced into the application. Such testing, while difficult and time-consuming, helps improve reliability by forcing failures at different parts of the system and invoking (and thereby verifying) error handling.

The point with random testing is that it can be highly automated because it is completely outside the concern of how the underlying application is supposed to work. If the application does not break it passes the test, and if the application breaks it fails the test.
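
A very small harness in this spirit is sketched below. It knows nothing about what the target is supposed to do: a clean rejection (one of the exception types the caller lists as expected) counts as a pass, and any other exception is recorded as a crash. The names random_input and fuzz and the particular mix of input kinds are assumptions for this example, and parse_quantity is a made-up target.

```python
import random


def random_input(rng):
    """Produce one erroneous or non-logical value of a randomly chosen kind."""
    kind = rng.choice(["none", "int", "float", "text", "bytes", "list"])
    if kind == "none":
        return None
    if kind == "int":
        return rng.randint(-2**63, 2**63)
    if kind == "float":
        return rng.choice([float("nan"), float("inf"), -1e308, rng.random()])
    if kind == "text":
        return "".join(chr(rng.randint(0, 0x2FFF))
                       for _ in range(rng.randint(0, 40)))
    if kind == "bytes":
        return bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 40)))
    return [rng.randint(-5, 5) for _ in range(rng.randint(0, 10))]


def fuzz(target, expected_errors=(ValueError, TypeError), runs=1000, seed=0):
    """Feed random inputs to target and return the list of crashes found."""
    rng = random.Random(seed)  # fixed seed so a failing run can be replayed
    crashes = []
    for _ in range(runs):
        value = random_input(rng)
        try:
            target(value)
        except expected_errors:
            pass  # the target rejected the input cleanly
        except Exception as exc:
            crashes.append((value, exc))
    return crashes


def parse_quantity(value):
    """Hypothetical target: accepts only integers from 0 to 1000."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("quantity must be an integer")
    if not 0 <= value <= 1000:
        raise ValueError("quantity out of range")
    return value
```

An empty list from fuzz(parse_quantity) means the target passed. Each entry in a non-empty list pairs the offending input with the exception it caused, which is usually enough to reproduce the failure.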

Obviously, you will need a test harness to drive chaotic, high-stress and non-logical inputs into the program. A relevant tool by Microsoft is called Injector, which can inject faults into any API without requiring access to the source code.
Obtain Windows Certification
Windows certification helps validate reliable and manageable applications. See http://msdn.microsoft.com/certification for more information.

Best Practices for Reliability

Software reliability extends across the entire development cycle. Given the major processes of a software lifecycle (requirements, design, development, testing, installation, maintenance, and un-installation), how reliable is each process? The following best practices are recommended for creating highly reliable applications:

Focus on Reliability
Be focused on how services are implemented and look for possible problems within the application.
Invest in People
Developers and operations staff should be properly trained in architecture and lifecycle management practices. Create a team culture where reliability is considered critical. Training and education should be provided where possible.
Use a Robust Operating System
For example, Windows servers often contain the following features:
- Workload Balancing
- Storage management
- Memory Protection
- Diagnostic and Recovery tools.
- Installation protection
- etc.
Remove Failure Points from the Application Design
If a failure occurs, remove the failure and do not just treat the symptom. For example, if an application leaks memory, do not just restart it to clear memory but rather address the memory leak issue directly.
Provide Ongoing Reliability Monitoring
All mission-critical applications should provide monitoring data. There are several excellent tools for monitoring and analyzing the health of the application:
- Windows Management Instrumentation (WMI).
- Performance Monitor (PerfMon).
- Application Center 2000.
- Systems Management Server (SMS).
Invest in Quality Software Engineering Processes
In general, reliability derives more from the software lifecycle and its supporting culture than it does from attempting to write perfect code. In other words, reliable applications result from embracing a reliability culture. Excellent standards and software engineering processes must become part of the business culture. This requires a substantial investment in attitude, training, and budget. This commitment involves the following:
- Coding standards and coding reviews
  Coding standards and programming practices should be shared across the business. Completed source code should be reviewed by peers for correctness, performance, and reliability issues.
- Use a development lifecycle methodology
  The use of a consistent, repeatable, formal software development method always leads to more reliable systems. Having a formal specification process promotes detailed analysis leading to discovery and innovation.
- Develop recovery procedures
  You should have documented procedures that describe how to quickly and correctly recover from failures. These procedures should be fully practiced before the application goes into production. Operation manuals should always document configuration, troubleshooting tips, and the required steps for problem identification and resolution.
- Use proven change control procedures
  Change control (or configuration management) is a complex procedural issue and varies considerably depending on the environment. For example, parallel development, where each developer works on his or her own branch off of the main branch, helps keep the main branch clean and always ready for a build without having to ask everybody to check in or undo the checkout of their files.
Deploy Changes Very Carefully
Adding a changed component to an existing application can cause failures in a couple of ways: the component may have an error, or the component may actually reveal errors in existing components. Only careful integration testing will reveal these errors. A reliable application is only possible when all components, services, and configurations are synchronized.
Pay Attention to Budget
Very high reliability can cost more than it is worth. Certain kinds of failure may have no consequence to the business or customer.