Reliability
Summary
Software reliability is about how well an application provides its services.
Software reliability is also about providing correct results and handling error
detection and recovery in order to avoid failures. More formally, software
reliability is defined as the Mean Time Between Failures (MTBF). MTBF is
the average amount of time the application runs until a failure occurs.
For example, the application's requirements specification may specify that the
application should provide a reliability of one failure for every 30 days of
24-hour operation (one failure every 720 hours). If testing shows that the
application runs for 1900 hours with two failures, then the MTBF is 1900 / 2 =
950 hours. In this example, a reliability of 950 hours exceeds the stated requirements.
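The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the function name `mtbf` and the variable names are illustrative, and the figures are the ones quoted in the text.

```python
def mtbf(total_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operating hours divided by failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return total_hours / failures


required_mtbf = 30 * 24          # one failure per 30 days of 24-hour operation
measured_mtbf = mtbf(1900, 2)    # 1900 hours of testing with two failures

print(required_mtbf)                    # 720
print(measured_mtbf)                    # 950.0
print(measured_mtbf >= required_mtbf)   # True
```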
An important notion of reliability is that software errors occur while the
application is available and executing.
The cost of application failure is often too high. In addition to not being
able to sell and make money from unreliable applications, unreliable systems are
often hard to maintain and improve.
Reliability of an enterprise application depends very much on the reliability
of its individual components. These components include hardware, software
modules, operating system services and human processes required to provide the
expected business service. Because all components of an enterprise application
are connected, a failure in one component can affect the reliability of many
other components.
Application failures occur for many reasons:
- Weak code.
- Inadequate testing.
- Hardware failures.
- Change management problems.
- Lack of ongoing monitoring and analysis.
- Operations error.
- Unusual events (security failures, connectivity, etc.)
- Unhandled errors.
- and so on.
It may seem that the solution to reliability resides strictly in software and
hardware technology. In fact the biggest improvement to reliability comes from
having well-trained people, quality software engineering processes, and a
serious ongoing commitment to reliability.
As a design concept, reliability is about an application's ability to run
without failures. Designing for reliability requires taking an in-depth look at
how a particular application service is provided, evaluating failure scenarios,
and finding design alternatives. You simply cannot add reliability onto an
application just before deployment.
There is a close relationship between reliability and availability. While
reliability is about how long an application runs between failures, availability
is about an application's ability to immediately begin handling all service
requests and to recover quickly if a failure occurs (and hence maximize
the time when the application is available for use). Obviously, a reliable
application has fewer failures from which to recover and hence increases
availability.
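One common way to make this relationship concrete is the steady-state availability ratio MTBF / (MTBF + MTTR), where MTTR is the mean time to recover. The article does not define MTTR, so the two-hour recovery time below is an assumption chosen purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the application is usable, given MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed recovery time of 2 hours per failure (hypothetical figure).
for mtbf_hours in (720, 950):
    print(mtbf_hours, availability(mtbf_hours, mttr_hours=2.0))
```

With the recovery time held constant, the longer MTBF yields the higher availability, which is the point made in the paragraph above.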
Good reliability design engineering would attempt to do the following:
- Follow Windows application design guidelines.
- Add a reliability section in the application's requirements specification document.
- Use good architectural infrastructure.
- Build management and monitoring tools around the application.
- Use redundancy.
- Use built-in application health checks.
- Use consistent and robust error handling.
A large amount of design information for reliable applications is found in a
Microsoft design document entitled The Application Specification for Microsoft
Windows 2000.
The design specification of an application should specify the expected
reliability profile of the final operating application. Such a profile might be as
simple as "the application should have a reliability MTBF of 1000 hours",
or it might be a complicated description of expected usage patterns, fault
categories and errors, quantitative reliability metrics, and expected quality
assurance proofs.
It must be noted that a reliability specification always contains uncertainty.
First, it is difficult to estimate the actual required MTBF that satisfies both
the anticipated business service level and the available budget. Second, the
usage pattern of the software can change over time. This distorts the
original reliability assumptions and may require reconsideration of the original
reliability metrics.
Reliability requirements and measurements do not generally consider the
consequences of failures. This is an important point to note because while some
failures may have no consequences, other failures can cause data corruption or
even loss of service. Therefore, when analyzing reliability problems, always be
sure to prioritize maintenance with proper consideration of failure
consequences.
In addition to having a unique set of design problems resulting from specific
business requirements, applications also share a common and recurring set of
infrastructure design problems such as the ones listed below:
- Data access
- Exception management
- Caching
- Security authentication
- Thread pooling
- Network load balancing
- Asynchronous messaging and method calling
- Publish-subscribe event handling
- Component management
- Distributed installation and version control
- Message instrumentation and monitoring
If you have to write this infrastructure yourself, write it so that it can be
reused over and over in other applications. Otherwise, Microsoft .NET
Application Blocks provide great 'starter kits' for most of the above
issues. For example, the blocks include the Data Access Block, the Exception
Management Block, Caching, and many others. By using these blocks (whether as
provided or with some modifications), you can spend most of your time addressing
business-related design issues and be confident that the underlying infrastructure
is reliable and reusable.
During the design phase, it is very important to include automated management
information as part of the coded infrastructure of the application. Such
information might include response times of each call, resource consumption
statistics, system traces (ideally per user) with thread and other user metrics,
and error warnings. This type of automated monitoring is a key best practice
that enables ongoing analysis, trend identification, and isolation of
application problems before they become significant.
With built-in monitoring you can provide immediate failure notification and
thereby speed recovery and avoid application failure. I have personally found
that trace statements provide invaluable information for analyzing system
performance and identifying/resolving problems. Traces and other built-in
management information become more important as the way an application is used
changes over time due to more users, larger workloads and increased integration
with other system resources.
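As a rough illustration of building response-time and error traces into the code itself, the sketch below wraps a method in a timing decorator. The logger name `app.monitoring`, the decorator `monitored`, and the method `lookup_order` are all hypothetical; a real application would route these records to whatever monitoring system it uses.

```python
import functools
import logging
import time

logger = logging.getLogger("app.monitoring")  # hypothetical logger name


def monitored(func):
    """Log the response time of every call; log failures with a traceback."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.exception("%s failed after %.1f ms", func.__name__, elapsed_ms)
            raise  # monitoring must not swallow the error
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s completed in %.1f ms", func.__name__, elapsed_ms)
        return result
    return wrapper


@monitored
def lookup_order(order_id: int) -> dict:  # hypothetical business method
    return {"order_id": order_id, "status": "shipped"}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    lookup_order(42)
```

Keeping the instrumentation in a decorator means each method's business logic stays unchanged while every call still leaves a trace.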
Note: Windows Management Instrumentation (WMI) can be very helpful in building
management and monitoring tools around the application. See WMI in Manageability.
Redundant software, hardware, and data help an application remain reliable when
failure conditions are detected, as follows:
- For software components, redundancy might consist of double or
even triple-redundant software components running in parallel. Closely
related to this concept of software redundancy is the use of .NET platform services
such as clustering, load balancing, data replication, and protecting complex
operations with transactions.
- For hardware components, redundant hardware is one of the most
common strategies for providing a reliable system. This includes the use of
RAID for data storage, multiple network interface cards, and redundant power
supplies.
- For data redundancy, multiple databases can be synchronized through data replication.
This not only provides reliability benefits, but when coupled with network
load balancing (NLB) actually enhances performance.
Applications may need to run self-validation checks, either on a schedule or on a
continuous basis. For example, a method may attempt to simulate a user action
that exercises all major parts of the system, and then report on the outcome. If
no results are obtained within a predefined amount of time, appropriate alarms
must be raised.
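A minimal sketch of such a check, assuming the simulated user action is a callable that returns True when healthy. The names `run_health_check`, `simulated_user_action`, and `slow_check` are hypothetical, and the alarm is reduced to a printed message.

```python
import concurrent.futures
import time


def run_health_check(check, timeout_seconds: float) -> str:
    """Run `check` in a worker thread; return "ok", "failed", or "timeout"."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        healthy = future.result(timeout=timeout_seconds)
        return "ok" if healthy else "failed"
    except concurrent.futures.TimeoutError:
        return "timeout"            # no result within the predefined time
    except Exception:
        return "failed"             # the check itself raised an error
    finally:
        # Do not block on a hung check; the worker thread keeps running
        # in the background because Python cannot cancel a running thread.
        pool.shutdown(wait=False)


def simulated_user_action() -> bool:   # hypothetical end-to-end check
    return True


def slow_check() -> bool:              # hypothetical check that takes too long
    time.sleep(0.3)
    return True


for check, limit in ((simulated_user_action, 1.0), (slow_check, 0.05)):
    status = run_health_check(check, timeout_seconds=limit)
    if status != "ok":
        print(f"ALARM: {check.__name__} reported {status}")
```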
A well-designed distributed application should respond to all errors in a
consistent manner. Errors in a distributed system may actually occur at any
layer, and this requires a well-thought-out solution for managing errors. This
involves clear identification of the error condition, where it happened, and a
graceful resolution of the problem so that the application may continue running.
In many cases, 'Retry on Error' logic can be integrated into the client-side
code.
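One possible shape for 'Retry on Error' logic is sketched below, assuming that transient failures can be told apart from permanent ones. The exception class `TransientError`, the helper `call_with_retry`, and the flaky service call are hypothetical names introduced only for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, attempts: int = 3, base_delay: float = 0.1):
    """Call `operation`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise                  # out of retries: surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_service_call() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("connection reset")
    return "ok"


result = call_with_retry(flaky_service_call, attempts=3, base_delay=0.01)
print(result, calls["count"])   # ok 3
```

Only errors marked as transient are retried here; any other exception propagates immediately, so a genuine defect is not hidden behind repeated attempts.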
Testing for reliability is about testing every feature in the application
until all failures are discovered and removed before the system is deployed.
However, in a complex application you will - most likely - not be able to find
all possible failures due to the different combinations of alternate pathways
through the application. Nonetheless, you can still test the most likely usages
under normal scenarios and verify that the application provides the requested
service.
The following testing concepts are especially relevant to creating reliable
applications:
- Use Component Stress Testing
Stress testing is about simulating large workloads to evaluate how the
system behaves under peak usage conditions. With component stress testing
you isolate each component (or service) and create test harnesses that
exercise each method exposed by that component. The idea here is to test
methods on each separate component using conditions that would far exceed
what the overall application would experience. For example, you could call
each method with a loop that loops 1,000,000 times and observe results to
discover if there are any memory-related or thread-related issues.
The Web Application Stress Tool and Application Center Test are some tools
that help you stress-test your component.
- Use Integration Stress Testing
After stress-testing each component in isolation, you could stress-test the
entire application with all its components and supporting services.
Integration stress-testing is largely concerned about interactions with
other services, processes and data structures.
With integration stress-testing you need to know usage scenarios, simulate
what users are trying to do, and more importantly know all the ways a user
goes through the application. This means that test scripts should attempt to
exercise the application according to the intended usage.
- Use Real-World Testing
Software that is reliable in isolation and in a controlled environment may
not be so reliable in real-world deployment. A real-world test ensures that
the application can happily coexist with other concurrent applications. Real-world
testing often discovers failures due to interactions with other
applications.
Therefore, the test plan should ensure that the application will run in the
final target environment with no failures. This involves making sure that
the application will run in the server space with the final configuration.
- Use Random Destruction Testing
One of the easiest ways to test for reliability is to test with random
input. This kind of testing attempts to crash the application by providing
erroneous and non-logical input. Note that this erroneous input
is not just simple keystrokes that provide strings where numbers are
expected. Erroneous and non-logical input includes a wide range of inputs,
including simple keystrokes, simulated program events, simulated caches,
corrupted database tables, or any other input that can be forced into the
application. Such testing, while difficult and time-consuming, helps improve
reliability by forcing failures at different parts of the system and
invoking (and thereby verifying) error handling.
The point with random testing is that it can be highly automated because it
is completely outside the concern of how the underlying application is
supposed to work. If the application does not break it passes the test, and
if the application breaks it fails the test.
Obviously, you will need a test harness to drive chaotic, high-stress and
non-logical inputs into the program. A relevant tool by Microsoft is called Injector,
which can inject faults into any API without requiring access to the source
code.
- Obtain Windows Certification
Windows certification helps validate reliable and manageable applications.
See http://msdn.microsoft.com/certification
for more information.
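The random destruction testing described above can be sketched as a small harness that feeds random strings to a component and records anything other than a clean rejection. The component `parse_quantity` and the helpers `random_input` and `fuzz` are hypothetical; the sketch also assumes that raising ValueError is the component's agreed way of rejecting bad input.

```python
import random
import string


def parse_quantity(text: str) -> int:
    """Hypothetical component under test: parse a quantity from 1 to 1000."""
    value = int(text.strip())
    if not 1 <= value <= 1000:
        raise ValueError(f"quantity out of range: {value}")
    return value


def random_input(rng: random.Random) -> str:
    """Build a short string of arbitrary printable characters."""
    length = rng.randint(0, 12)
    return "".join(rng.choice(string.printable) for _ in range(length))


def fuzz(target, iterations: int = 10_000, seed: int = 0) -> list:
    """Feed random input to `target`; collect inputs that crash it."""
    rng = random.Random(seed)          # fixed seed makes failures reproducible
    crashes = []
    for _ in range(iterations):
        data = random_input(rng)
        try:
            target(data)
        except ValueError:
            pass                       # rejecting bad input cleanly is a pass
        except Exception as exc:       # anything else is a reliability defect
            crashes.append((data, repr(exc)))
    return crashes


crashes = fuzz(parse_quantity)
print(f"{len(crashes)} unexpected failures")
```

The harness needs no knowledge of what the component is supposed to compute, which is why this kind of test is easy to automate.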
Software reliability extends across the entire development cycle.
Given the major processes of a software lifecycle (requirements, design,
development, testing, installation, maintenance, and un-installation), how
reliable is each process? The following best practices are recommended for
creating highly reliable applications:
- Focus on Reliability
Be focused on how services are implemented and look for possible
problems within the application.
- Invest in People
Developers and operations staff should be properly trained in architecture
and lifecycle management practices. Create a team culture where reliability
is considered critical. Training and education should be provided where
possible.
- Use a Robust Operating System
For example, Windows servers often contain the following features:
- Workload Balancing
- Storage management
- Memory Protection
- Diagnostic and Recovery tools.
- Installation protection
- etc.
- Remove Failure Points from the Application Design
If a failure occurs, remove the failure and do not just treat the symptom.
For example, if an application leaks memory, do not just restart it to clear
memory but rather address the memory leak issue directly.
- Provide Ongoing Reliability Monitoring
All mission-critical components should provide monitoring data. There are several
excellent tools for monitoring and analyzing the health of the application:
- Windows Management Instrumentation (WMI).
- Performance Monitor (PerfMon).
- Application Center 2000.
- Systems Management Server (SMS).
- Invest in Quality Software Engineering Processes
In general, achieving reliability derives more from the software lifecycle
and its supporting culture than it does from attempting to write
perfect code. In other words, reliable applications result from embracing a
reliability culture. Excellent standards and software engineering processes
must become part of the business culture. This requires a substantial
investment in attitude, training, and budget. This commitment involves the
following:
- Coding standards and coding reviews
Coding standards and programming practices should be shared across the
business. Completed source code should be reviewed by peers for
correctness, performance, and reliability issues.
- Use a development lifecycle methodology
The use of a consistent, repeatable, formal software development method
always leads to more reliable systems. Having a formal specification
process promotes detailed analysis leading to discovery and innovation.
- Develop recovery procedures
You should have documented procedures that describe how to quickly and
correctly recover from failures. These procedures should be fully
practiced before the application goes into production. Operation manuals
should always document configuration, troubleshooting tips, and the
required steps for problem identification and resolution.
- Use proven change control procedures
Change control (or configuration management) is a complex procedural
issue and varies considerably depending on the environment. For
example, parallel development, where each developer works on his/her own
development branch off of the main branch, helps keep the main branch
clean and always ready for a build without having to ask everybody to
check in or undo the checkout of their files.
- Deploy Changes Very Carefully
Adding a changed component to an existing application can cause failures in
a couple of ways: the component may have an error, or the component may
actually reveal errors in existing components. Only careful integration
testing will reveal these errors. A reliable application is only possible
when all components, services, and configurations are synchronized.
- Pay Attention to Budget
Very high reliability can cost more than it is worth. Certain kinds of
failure may have no consequence to the business or customer.