Availability

Summary

Overview
Planning Availability Levels
Designing for Availability
Testing for Availability
Best Practices for Availability

Overview

If your application needs to be available 24/7, you must design for high availability. Designing for high availability means understanding how design choices can help you maximize application availability and how testing can validate that the application meets its high-availability requirement.

All applications are typically available for usage at least some of the time, but Web-based or business-critical applications are expected to be available round-the-clock. In general, availability is not easy to implement and typically requires a more complex architectural infrastructure than the previous generation of client-server applications.

Applications can generally be divided into three categories with respect to availability:

The application can fail with no consequence.
The application can tolerate unplanned downtime but may require varying recovery strategies.
The application must provide high availability using standby replication strategies to guarantee instant and transparent recovery with virtually no downtime.

An application may not be available for the following reasons:

Inadequate testing.
Change-management problems (changes to source files, testing documents, etc.)
Lack of ongoing monitoring and analysis.
Users' errors.
Weak code.
Lack of quality software engineering practices.
Interactions with external systems or applications.
Different operating conditions (usage level changes, peak overloads).
Unusual events (security failure, viruses).
Hardware failures (network failures, servers, power, CPUs, etc.)
Environmental problems (fire, flood, overheating, etc.)

In general, 80% of these failures are software-related, 10% are hardware-related, and the remaining 10% are due to environmental and other miscellaneous problems.

Availability can be quantified as a percentage calculation based on how often the application is actually available for use when compared to the total, planned available runtime. The calculation for availability uses the following measures:

Name	Acronym	Calculation	Definition
Mean Time Between Failure	MTBF	Hours / Failure Count	Average length of time the application runs before failure.
Mean Time to Recover	MTTR	Repair Hours / Failure Count	Average length of time needed to repair and restore service after a failure.

Therefore,

Availability (%) = (MTBF / (MTBF + MTTR)) x 100

The sections that follow discuss designing, testing, and best practices for creating a highly-available distributed application.

Planning Availability Levels

To decide what level of availability is appropriate for your application, consider the following questions:

Who are the customers and what are their expectations?
How much downtime is acceptable?
Do internal company processes depend on the service?
What are the schedule and the budget?

Designing for availability is difficult. Because of the wide variety of application architectures, no single availability solution works for every situation. For example, a comprehensive, fault-tolerant, fully redundant, load-balanced availability solution may be suitable for a business- or mission-critical application, but may be excessive for an application that can accept some downtime. Ultimately, the availability design for an application depends on a combination of business-specific requirements, application-specific data, and the available budget. Having said this, what is a good availability number for different kinds of applications? The following table gives an approximate idea:

Category	Failure Count per Year	Downtime per Year (Hours)	Average Time to Repair (Hours)	Availability
Non-commercial	10	88	10	99.00%
Commercial	5	44	9	99.50%
Business-Critical	4	9	2.25	99.90%
Mission-Critical	4	1	0.25	99.99%
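
The Availability column can be reproduced from the Failure Count and Downtime columns. The following is a minimal Python sketch, assuming a planned runtime of 8,760 hours per year (24 hours x 365 days) and taking MTBF as hours of operation per failure. The function names are illustrative and do not come from any library.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: planned runtime is round-the-clock for one year


def mtbf(failure_count, downtime_hours, planned_hours=HOURS_PER_YEAR):
    """Mean Time Between Failure: hours of operation per failure."""
    return (planned_hours - downtime_hours) / failure_count


def mttr(failure_count, downtime_hours):
    """Mean Time to Recover: repair hours per failure."""
    return downtime_hours / failure_count


def availability_percent(failure_count, downtime_hours, planned_hours=HOURS_PER_YEAR):
    """Availability = MTBF / (MTBF + MTTR), expressed as a percentage."""
    between = mtbf(failure_count, downtime_hours, planned_hours)
    recover = mttr(failure_count, downtime_hours)
    return 100.0 * between / (between + recover)


if __name__ == "__main__":
    # (category, failures per year, downtime hours per year) taken from the table
    rows = [
        ("Non-commercial", 10, 88),
        ("Commercial", 5, 44),
        ("Business-Critical", 4, 9),
        ("Mission-Critical", 4, 1),
    ]
    for name, failures, downtime in rows:
        print(f"{name}: {availability_percent(failures, downtime):.2f}%")
```

Rounded to two decimal places, the four rows come out at 99.00%, 99.50%, 99.90%, and 99.99%, which matches the Availability column.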

To see what these numbers imply, consider the scenario of upgrading a non-commercial application to become a mission-critical application:

Non-Commercial

As a starting point, your non-commercial application fails 10 times each year, giving a total downtime of about 88 hours per year. This is a typical application. It runs most of the time, and when it fails, some skilled staff identify the problem, devise a fix, restore data and restart the application.

Commercial

A new business strategy requires that the application be upgraded to commercial standards. The first step is to apply some architectural analysis and re-engineer some components to cut the failure count in half, to 5. For example, if some of these errors were due to poor error handling, you re-engineer the error-handling process and even resolve a few recoverable error conditions. You also look at the supporting infrastructure used by your application. Even with the failure count reduced to 5, each failure must still be resolved in under 9 hours. You decide to shorten the repair time by creating a troubleshooting document and providing hands-on failure training.

Business-Critical

Making the transition from commercial availability at 99.50% to business-critical availability at 99.90% is much more difficult. Assume that with intensive analysis and component re-engineering you are able to reduce the failure count to just 4 failures per year. The downtime must still be reduced from 44 hours to just 9 hours a year (roughly an 80% reduction). This is where industrial-strength availability engineering becomes critical.

Making the transition from commercial availability to business-critical availability requires a full commitment to a culture of reliability and availability. This takes the form of staff training, rigorous software engineering practices, appropriate certifications, and the right technologies. To begin, you may decide to:

Use transactions to handle all critical application logic in order to avoid data corruption.
Conduct a complete security review with an eye on protecting the data and the application.
Implement advanced server functionality to provide memory protection, file protection, network load balancing (NLB), component load balancing (CLB) and rich distributed component services.
Install shared RAID disks for data stores.

Reducing recovery time is now critical. You may decide to:

Start by using two clusters, each with multiple servers and each configured as a failover cluster. This will require enhanced application code that reconnects database connections and restarts transactions.
Add retry-on-failure logic to the client code.
Provide graceful degradation when certain resources, and hence certain operations, are unavailable, rather than failing the whole application.
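
The last two items, retry on failure and graceful degradation, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the choice of ConnectionError as the "transient" error, and the doubling delay between attempts are all assumptions, not part of the original guidance.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on a connection error, wait and try again.

    The delay doubles after each failed attempt (an assumed policy).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * (2 ** attempt))


def with_fallback(operation, fallback, **retry_options):
    """Graceful degradation: return a fallback value instead of failing."""
    try:
        return call_with_retry(operation, **retry_options)
    except ConnectionError:
        return fallback
```

A caller might wrap a non-essential lookup as with_fallback(fetch_prices, cached_prices), where fetch_prices and cached_prices are hypothetical names, so that the page still renders with older data while the back-end service is down.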

Mission-Critical

Moving your application to mission-critical availability means that the application must perform its services with only 1 hour of downtime in a single year. Achieving this availability is non-trivial, especially when all errors must be resolved in 15 minutes (or less).

The main technique for increasing availability is redundancy. This means you may have to:

Implement full redundancy including:
- Cloned business servers.
- Cloned database servers.
- Network infrastructure.
Ensure all servers are stateless.
Install backup power supplies.
Install environmental protection systems.

Designing for Availability

Where reliability is concerned with the question "Does it work?", availability is concerned with the question "How long does it take to fix?" Designing for availability is about anticipating, detecting, and automatically resolving software/hardware failures before they result in service errors.

While availability engineering is about reducing unplanned downtime, reducing planned downtime is also important. Such planned downtime may include maintenance changes, OS upgrades, backups, or any other activity that temporarily disables the application.

The following topics discuss some availability design ideas:

Avoid Older Traditional Approach to Availability
Reduce Unplanned Downtime with Clusters
Use Network Load Balancing (NLB)
Use RAID for Data Stores
Reduce Planned Downtime
Isolate Mission Critical Applications
Use Queuing

Avoid Older Traditional Approach to Availability

Avoid these traditional methods of providing high availability:

Using multiple CPUs. This approach usually has problems because disk IO and memory speed generally constrain the server.
Using a duplicate system with fully replicated components and always on standby waiting to take over processing if the primary server fails. This has several problems:
- Cost of idle hardware with no improvement in performance.
- Service transfer difficulty because the current job on the failed server is generally lost.
- Doubled maintenance with occasional benefit.
- Increased network traffic from data replication with occasional benefit.

While these traditional methods still have uses and may be quite effective in certain cases, there are newer approaches to availability that make use of reduced hardware costs and advances in distributed computing architecture.

Reduce Unplanned Downtime with Clusters

Clustering is the premier technology for creating high-availability applications. What is clustering? Clustering is linking a group of independent systems so that they work together as a single system. Many physical servers are linked such that if one fails, the running application is swapped over to another server and continues running as if nothing had happened. Under Microsoft Windows, clustering is hardware independent, and additional servers can be added to handle increased workload. A client interacts with a cluster as if the cluster were a single server, even though in reality a cluster is a collection of independent servers.

A cluster consists of multiple servers that are physically networked together and logically connected using clustering software. The clustering software allows these independent servers to act as if they were a single server. In the event of a failure (CPU, disk I/O, memory, storage, network card, application component, etc.), the workload is transparently moved to another server, current client processes are switched over, and the failed service is restarted. All of this happens automatically with no apparent downtime.

In general, cluster software can provide failover support for applications, file and print services, databases, and messaging systems.

To take advantage of clustering, an application generally needs to:

Maintain configuration and state information on shared disk storage.
Be written to reinstate database connections and restart transactions.
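
The second requirement can be sketched as follows in Python. This is a hedged illustration, not an actual cluster API: run_transaction, connect, and work are hypothetical names, the connection object is assumed to expose a commit() method, and a failed node is assumed to surface as a ConnectionError.

```python
def run_transaction(connect, work, max_restarts=3):
    """Run work(conn) as one transaction, restarting it after a failover.

    connect: callable that opens a fresh connection to the cluster.
    work: callable that performs the whole unit of work on a connection;
          it must be safe to run again from the beginning.
    """
    for attempt in range(max_restarts + 1):
        try:
            conn = connect()      # reinstate the database connection
            result = work(conn)   # redo the unit of work from the start
            conn.commit()
            return result
        except ConnectionError:
            if attempt == max_restarts:
                raise
            # Uncommitted work on the failed node is lost; loop to restart it.
```

The important design point is that the unit of work is restarted as a whole rather than resumed halfway, which is why it has to be written so that running it twice does not corrupt data.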

The following figure illustrates clustering before and after a server has failed:

Use Network Load Balancing (NLB)

Network load balancing (NLB) is, alongside clustering, the other premier technology for creating high-availability applications. NLB distributes traffic across a cluster of servers, allowing multiple machines to appear as a single server to clients. NLB increases availability by redistributing the workload if a server in the cluster fails. The following figure illustrates NLB before and after a server has failed:

As illustrated above, NLB automatically detects a failed server and redirects client traffic to the remaining functioning servers, all while maintaining continuous, unbroken client service. NLB is very important for creating high-availability applications:

Continuous application service
The customer's experience is not interrupted by unplanned server downtime. The workload is automatically redistributed when one of the cluster servers fails.
Incremental server additions
You can add servers to the cluster one at a time, avoiding expensive initial costs for creating highly available applications. Cluster changes immediately cause automatic redistribution of the workload.
Offline maintenance
Servers within the cluster can be individually taken off-line without affecting availability.
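
The three properties above (continued service after a failure, incremental additions, and offline maintenance) can be illustrated with a toy load balancer in Python. This is a sketch under stated assumptions: the LoadBalancer class is hypothetical, and round-robin selection is used only for simplicity, not as a description of how any particular NLB product distributes traffic.

```python
class LoadBalancer:
    """Toy round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._next = 0

    def add_server(self, server):
        """Incremental addition: the new server starts receiving traffic."""
        self.servers.append(server)

    def mark_down(self, server):
        """Failure detected, or server taken offline for maintenance."""
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def route(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server not in self.down:
                return server
        raise RuntimeError("no healthy servers available")
```

After mark_down("b") on a balancer built from ["a", "b", "c"], every call to route() returns "a" or "c", so clients keep being served while "b" is repaired or updated.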

Use RAID for Data Stores

RAID stands for Redundant Array of Independent Disks. RAID is a way to use multiple hard disks so that data is stored in multiple places. The benefit of RAID is that any disk failure automatically transfers control to a mirrored or reconstructable data image while allowing the application to run uninterrupted. The failed disk(s) can then be replaced with no interruption to the application. RAID provides one of the cheapest methods for increasing data-access fault-tolerance.
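
To make "reconstructable" concrete: parity-based RAID levels such as RAID 5 store an XOR parity block alongside the data blocks, so the contents of any single lost block can be recomputed from the survivors. The Python sketch below shows only that arithmetic on in-memory byte strings; the function names are illustrative, and real RAID controllers work on disk stripes, not Python objects.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity(data_blocks):
    """Parity block stored alongside the data blocks."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity_block):
    """Recompute the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


if __name__ == "__main__":
    disks = [b"ABCD", b"EFGH", b"IJKL"]
    p = parity(disks)
    # Simulate losing the second disk and rebuilding it from the rest.
    print(rebuild([disks[0], disks[2]], p) == disks[1])
```

Because XOR is its own inverse, XOR-ing the parity block with the surviving data blocks yields exactly the missing block, which is what allows a failed disk to be replaced and repopulated without losing data.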

Reduce Planned Downtime

One of the best ways to avoid planned downtime is to use rolling updates. For example, if you need to update a component on a cluster server, simply move the server's resource groups to another server, take the server offline for maintenance, perform the update, and then bring the server online. During the server's downtime, other cluster servers handle the workload and the application experiences no downtime.
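
The rolling-update procedure can be sketched as a simple loop. The Python below is an illustrative model only: rolling_update and its update callback are hypothetical names, and "moving resource groups" is reduced to removing the server from the set that is currently serving.

```python
def rolling_update(servers, update, min_online=1):
    """Update servers one at a time, never dropping below min_online.

    update: callable(server, serving) that performs the maintenance;
            `serving` is the set of servers handling the workload meanwhile.
    """
    online = set(servers)
    if len(online) - 1 < min_online:
        raise ValueError("not enough servers to update without downtime")
    for server in servers:
        online.remove(server)               # move workload away, take offline
        update(server, frozenset(online))   # perform the update
        online.add(server)                  # bring the server back online
    return online
```

If the update callback raises an exception, the loop stops with that server still offline, so a bad update is not rolled out to the remaining servers.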

Isolate Mission Critical Applications

A high-availability application should not be put at risk by other applications. For mission-critical applications it is extremely important that dependencies on data and system components from other applications be eliminated by using an entirely separate physical backbone for each mission-critical application. 'Using an entirely separate physical backbone' means, for example, not sharing a database, not sharing the same network infrastructure, not sharing the same front-end or back-end servers, and so on.

There are also other physical isolation techniques ranging from application-centric thread isolation to built-in latency using queues to limit external system dependency.

Use Queuing

With queuing, an application communicates with other applications by sending and receiving asynchronous messages. Compared with synchronous messaging, queuing offers a very useful strategy for guaranteed delivery, because it does not matter whether the necessary connectivity currently exists. Queuing is also very useful during periods of heavy workload that could otherwise stress the system and possibly cause failures.

The immediate benefit is that queuing removes a point of failure from your application. Queuing therefore improves availability by increasing the number of routes available for successful message delivery.
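
The store-and-forward behavior described above can be sketched in Python. This is a simplified, in-memory illustration: StoreAndForwardQueue is a hypothetical class, an unavailable receiver is assumed to raise ConnectionError, and a real message-queuing product would also persist messages so that they survive a restart of the sender.

```python
from collections import deque


class StoreAndForwardQueue:
    """Sender-side queue: send() always succeeds, delivery happens later."""

    def __init__(self):
        self._pending = deque()

    def send(self, message):
        """Queue the message and return immediately, connected or not."""
        self._pending.append(message)

    def deliver(self, handler):
        """Try to hand queued messages to the receiver, oldest first.

        Returns the number of messages delivered in this pass.
        """
        delivered = 0
        while self._pending:
            message = self._pending[0]
            try:
                handler(message)
            except ConnectionError:
                break  # receiver unavailable: keep the message queued
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

The sender never blocks on the receiver: messages sent while the receiver is down simply wait in the queue and are delivered, in order, on a later deliver() pass.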

Distributed File System (DFS)

A Distributed File System is a logical file structure applied to multiple servers and file shares. DFS improves availability by being able to point to redundant file copies, which increases the likelihood of accessing a needed file even if the primary data store is down.

Testing for Availability

Testing for availability means running the application for a predefined period of time while collecting information about failures and time to repair them. This information is then used to calculate availability levels and compare them with the original or predicted availability levels.

From the above it can be seen that availability testing is primarily concerned with measuring and minimizing actual repair time, whereas reliability testing is primarily concerned with finding defects and reducing the number of failures. It is worth considering the availability formula again:

Availability (%) = (MTBF / (MTBF + MTTR)) x 100

As the Mean Time To Recover (MTTR) goes to zero, percentage availability goes to 100%. This is the essential idea of availability testing: reduce and eliminate downtime. When working to reduce and eliminate downtime, remember that a software defect found after deployment generally costs ten times more to fix than one found before deployment.
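
The measurements collected during an availability test run can be turned into MTBF, MTTR, and an availability percentage using availability = MTBF / (MTBF + MTTR). The Python sketch below assumes the test produces a list of (failed_at, restored_at) timestamps; the function name and the returned dictionary keys are illustrative.

```python
from datetime import datetime, timedelta


def measure_availability(test_start, test_end, outages):
    """Summarize a test run.

    outages: list of (failed_at, restored_at) datetime pairs that fall
             inside the test window and do not overlap.
    """
    total_hours = (test_end - test_start).total_seconds() / 3600
    down_hours = sum((end - start).total_seconds() for start, end in outages) / 3600
    failures = len(outages)
    if failures == 0:
        return {"mtbf_hours": None, "mttr_hours": 0.0, "availability_pct": 100.0}
    mtbf = (total_hours - down_hours) / failures  # hours of operation per failure
    mttr = down_hours / failures                  # repair hours per failure
    return {
        "mtbf_hours": mtbf,
        "mttr_hours": mttr,
        "availability_pct": 100.0 * mtbf / (mtbf + mttr),
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1)  # arbitrary example date
    run = measure_availability(
        start,
        start + timedelta(hours=100),
        [
            (start + timedelta(hours=10), start + timedelta(hours=11)),
            (start + timedelta(hours=50), start + timedelta(hours=51)),
        ],
    )
    print(run)
```

In the example run, two one-hour outages in a 100-hour window give an MTBF of 49 hours, an MTTR of 1 hour, and an availability of about 98%. Halving the repair time while keeping the same two failures raises the result to about 99%, which is the effect availability testing is trying to achieve.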

The following is a collection of testing concepts that are especially relevant to creating highly-available applications:

Test the Change Control Process
A business- or mission-critical application should not go into production until you can repeatedly perform error-free change control.
Test Catastrophic Failure
Create outages of a catastrophic nature and test the recovery process. For example, you could unplug network equipment and note how the system responds. Catastrophic testing validates the correctness of the recovery procedures and provides confidence for the recovery team.
Test the Failover Technologies
Pick a piece of hardware, say a disk controller, and unplug it. Then observe how fast the support team is able to identify and replace the failed hardware. Or you could pull the plug on one of the front-end or back-end servers and observe the clustering failover technology as it restarts the failed application on another server. The application should not only stay online, but every user process should also complete correctly.
Test the Monitoring Technology
Analyze the Windows Management Instrumentation (WMI) data using the intended monitoring reports and ensure that you can plainly see all resource consumption data and especially all test outages.
Test Help-Desk Procedures
The help-desk must be fully trained and ready to handle customer inquiries and failure scenarios.
Test for Resource Conflicts
Availability engineering requires an in-depth understanding of an application's interactions with other processes and systems. Therefore, you must try to evaluate all the ways in which some other application might interfere with the intended service. You must also test for conflicts, and possibly consider other design alternatives.

Applications often run slowly because they compete for system resources such as CPU, disk I/O, and network bandwidth. A slow application is technically available, but not many users would want to wait for it. It might as well have failed.

Best Practices for Availability

Availability engineering is all about delivering application services in spite of failures. The following best practices are recommended for creating highly-available applications:

Use Clustering
Clustering is a key technology for high-availability applications because it provides instant failover services in the event of a failure.
Use Network Load Balancing
NLB enhances availability by detecting service failures and re-routing all traffic away from failed servers to still-running servers.
Provide Vigilant Monitoring
Continuous monitoring of operational workload and failure data is essential to discovering trends and improving service. Typically, the most important data is provided by Windows Management Instrumentation (WMI).

You should install monitoring tools and establish availability metrics for all business- and mission-critical applications. This will help identify potential problems before they become serious failures. Typical resources that should be monitored include:
- Physical and Logical Disk.
- Memory.
- CPU utilization.
- Network traffic.
- Work queues.
- Security errors.
- Application requests
- Transaction throughput
- Data access requests.
- etc.
Establish a Help Desk
The help desk should be responsible for gathering problem information, determining problem history, and starting immediate problem resolution. The help desk should be centrally involved in every failure recovery scenario.
Test the Recovery Plan
A well-documented recovery plan is only paper until it has been proven under rigorous test conditions. The recovery plan should be tested several times: pull a power plug or disconnect a network cable, and observe the response while following the recovery plan.
Choose Good Infrastructure
Standardized hardware is essential for achieving high availability. It ensures that replacement parts are easily available, reduces the parts inventory, reduces training requirements, etc.
- Memory
  Enhance memory error detection and correction by using only error-correcting code (ECC) memory.
- Physical Data Storage
  Use at least two servers running within a cluster with a Fibre Channel connection to a shared RAID array. The RAID array maintains data availability during a disk failure, while server failures are covered by the cluster service. The NTFS file system should also be used because it provides data integrity and security.
- Database
  It is recommended that replication services be used across multiple databases. Because the replication process can queue user changes, the application can keep running even if one database goes down.
- Networking
  Every server should have two network interface cards (NICs) to provide isolated public and private connectivity. Only front-end servers (i.e., application- or business-logic servers) should have public addresses. This prevents direct public access to back-end servers (i.e., database servers). You should also implement multiple network paths to business- and mission-critical servers.
- Front-End Servers
  Use clustering for front-end servers along with Network Load Balancing (NLB) to smooth the workload and automatically redistribute network traffic in the event of a server failure. All front-end servers should deliver the same services and share the workload. With clustering and NLB, additional servers can be added to handle increased workload with no downtime. Use dedicated network interface cards so that servers can send heartbeats (periodic signals) to other servers and to monitoring tools to verify the application's status.
- Back-End Servers
  Apply the same recommendations as for front-end servers (back-end servers provide two data services: database and shared file storage).
- Synchronize all Clocks
  Operating systems and applications running on different servers must be time-synchronized. Otherwise, process and file-creation time stamps can get out of sync, possibly confusing some other manual or automatic process.
- Use Data Backups
  Anticipate the possibility of server data corruption or loss. Maintaining data integrity is very important. One strategy for maintaining data integrity is to create a full backup of the primary database and then incrementally test the source server for data corruption. You could also combine the backup with transaction logs so that you can use SQL Server's Database Consistency Checker (DBCC) to detect and repair corruption.
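
The heartbeat mechanism mentioned under Front-End Servers can be sketched as a small monitor that records when each server last reported in and flags those that have been silent for too long. The Python below is an illustration under assumptions: HeartbeatMonitor is a hypothetical class, the timeout value is arbitrary, and the clock is injectable only so the behavior can be tested without waiting.

```python
import time


class HeartbeatMonitor:
    """Track periodic signals from servers and report the silent ones."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def beat(self, server):
        """Record a heartbeat received from a server."""
        self.last_seen[server] = self.clock()

    def failed_servers(self):
        """Servers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A monitoring tool could call failed_servers() on a schedule and hand any names it returns to the failover or help-desk procedures described above.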