Saturday, February 16, 2008

HACMP Basics

History

IBM's HACMP has existed for almost 15 years. It did not originate at IBM: IBM bought it from CLAM, which was later renamed Availant and is now called LakeViewTech. Until August 2006, all development of HACMP was done by CLAM. Nowadays IBM does its own development of HACMP in Austin, Poughkeepsie and Bangalore.

IBM's high availability solution for AIX, High Availability Cluster Multi Processing (HACMP), consists of two components:

• High Availability: the process of ensuring that an application is available for use through duplicated and/or shared resources (eliminating Single Points Of Failure, or SPOFs).
• Cluster Multi-Processing: multiple applications running on the same nodes with shared or concurrent access to the data.

A high availability solution based on HACMP provides automated failure detection, diagnosis, application recovery and node reintegration. With an appropriate application, HACMP can also provide concurrent access to the data for parallel processing applications, thus offering excellent horizontal scalability.

What needs to be protected?

Ultimately, the goal of any IT solution in a critical environment is to provide continuous service and data protection. High availability is just one building block in achieving the continuous operation goal.
High availability depends on the availability of the hardware, the software (the OS and its components), the application and the network components.

The main objective of HACMP is to eliminate Single Points of Failure (SPOFs):

“…A fundamental design goal of (successful) cluster design is the elimination of single points of failure (SPOFs)…”

Eliminating Single Points of Failure (SPOFs)

Each cluster component is eliminated as a single point of failure as follows:

• Node: using multiple nodes
• Power source: using multiple circuits or uninterruptible power supplies
• Network adapter: using redundant network adapters
• Network: using multiple networks to connect nodes
• TCP/IP subsystem: using non-IP networks to connect adjoining nodes and clients
• Disk adapter: using redundant disk adapters or multiple adapters
• Disk: using multiple disks with mirroring or RAID
• Application: adding a node for takeover; configuring an application monitor
• Administrator: adding a backup administrator or a very detailed operations guide
• Site: adding an additional site

Cluster Components

Here are the recommended practices for important cluster components.

Nodes

HACMP supports clusters of up to 32 nodes, with any combination of active and standby nodes. While it is possible to have all nodes in the cluster running applications (a configuration referred to as "mutual takeover"), the most reliable and available clusters have at least one standby node: one node that is normally not running any applications, but is available to take them over in the event of a failure on an active node.

Additionally, it is important to pay attention to environmental considerations. Nodes should not have a common power supply, which may happen if they are placed in a single rack.
Similarly, building a cluster of nodes that are actually logical partitions (LPARs) within a single footprint is useful as a test cluster, but should not be considered for availability of production applications.

Nodes should be chosen that have sufficient I/O slots to install redundant network and disk adapters, that is, twice as many slots as would be required for single-node operation. This naturally suggests that processors with small numbers of slots should be avoided. Use of nodes without redundant adapters should not be considered best practice; blades are an outstanding example of this. And, just as every cluster resource should have a backup, the root volume group in each node should be mirrored or be on a RAID device.

Nodes should also be chosen so that when the production applications are run at peak load, there are still sufficient CPU cycles and I/O bandwidth to allow HACMP to operate. The production application should be carefully benchmarked (preferable) or modeled (if benchmarking is not feasible), and nodes chosen so that they will not exceed 85% busy, even under the heaviest expected load.

Note that the takeover node should be sized to accommodate all possible workloads: if there is a single standby backing up multiple primaries, it must be capable of servicing multiple workloads. On hardware that supports dynamic LPAR operations, HACMP can be configured to allocate processors and memory to a takeover node before applications are started. However, these resources must actually be available, or acquirable through Capacity Upgrade on Demand. The worst-case situation (for example, all the applications on a single node) must be understood and planned for.

Networks

HACMP is a network-centric application. HACMP networks not only provide client access to the applications but are also used to detect and diagnose node, network and adapter failures. To do this, HACMP uses RSCT, which sends heartbeats (UDP packets) over ALL defined networks.
By gathering heartbeat information on multiple nodes, HACMP can determine what type of failure has occurred and initiate the appropriate recovery action. Being able to distinguish between certain failures, for example the failure of a network and the failure of a node, requires a second network. Although this additional network can be IP-based, it is possible that the entire IP subsystem could fail within a given node. Therefore, in addition there should be at least one, ideally two, non-IP networks.

Failure to implement a non-IP network can potentially lead to a partitioned cluster, sometimes referred to as "split brain" syndrome. This situation can occur if the IP network(s) between nodes become severed or, in some cases, congested. Since each node is in fact still very much alive, HACMP on each node would conclude that the other nodes are down and initiate a takeover. After takeover has occurred, the application(s) could potentially be running simultaneously on both nodes. If the shared disks are also online to both nodes, the result could be data divergence (massive data corruption). This is a situation which must be avoided at all costs.

The most convenient way of configuring non-IP networks is to use disk heartbeating, as it removes the distance problems of RS232 serial networks. Disk heartbeat networks only require a small disk or LUN. Be careful not to put application data on these disks: although it is possible to do so, you don't want any conflict with the disk heartbeat mechanism!
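
To make the reasoning about heartbeats and split brain concrete, here is a minimal Python sketch of the decision logic described above. It is an illustration only, not how RSCT or HACMP is implemented: the class and function names, the network labels and the exact rules are assumptions chosen to mirror the text (a peer that is silent everywhere is only declared down when at least two networks exist and one of them is non-IP).

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "heartbeats seen on every network"
    NETWORK_FAILURE = "peer is alive; at least one network is silent"
    NODE_FAILURE = "peer is silent on every network, including a non-IP one"
    AMBIGUOUS = "cannot tell a node failure from a network failure"


@dataclass(frozen=True)
class NetworkStatus:
    name: str             # hypothetical network label, not a real HACMP name
    is_ip: bool           # False for a non-IP network such as disk heartbeat
    heartbeat_seen: bool  # True if the peer's heartbeat arrived recently


def diagnose_peer(statuses):
    """Classify a peer node from the heartbeat status of each network."""
    if not statuses:
        raise ValueError("at least one heartbeat network is required")
    seen = [s.heartbeat_seen for s in statuses]
    if all(seen):
        return Diagnosis.HEALTHY
    if any(seen):
        # The peer still answers somewhere, so it must not be taken over.
        return Diagnosis.NETWORK_FAILURE
    # Silent everywhere: only trust this if a second, non-IP path also agrees.
    has_non_ip = any(not s.is_ip for s in statuses)
    if len(statuses) >= 2 and has_non_ip:
        return Diagnosis.NODE_FAILURE
    return Diagnosis.AMBIGUOUS


# IP network severed, disk heartbeat still alive: no takeover, no split brain.
severed_ip = [
    NetworkStatus("ether_a", is_ip=True, heartbeat_seen=False),
    NetworkStatus("diskhb_a", is_ip=False, heartbeat_seen=True),
]
print(diagnose_peer(severed_ip).name)  # NETWORK_FAILURE

# Same outage in an IP-only cluster: the node cannot know what happened.
ip_only = [NetworkStatus("ether_a", is_ip=True, heartbeat_seen=False)]
print(diagnose_peer(ip_only).name)  # AMBIGUOUS
```

The AMBIGUOUS case is the one that leads to a partitioned cluster: with only IP paths, a node that acts on silence alone may start applications that are still running on its peer.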

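Returning to the node sizing guidance in the Nodes section, the worst-case check can be written down as simple arithmetic. The sketch below is a rough illustration under stated assumptions: the function names are hypothetical, peak loads and node capacity are expressed in the same arbitrary unit (for example, a benchmark score), and loads are assumed to add up linearly, which real workloads may not do. The 85% ceiling is the figure given above.

```python
def worst_case_utilization(peak_loads, node_capacity):
    """Fraction of the takeover node used if every workload lands on it."""
    if node_capacity <= 0:
        raise ValueError("node_capacity must be positive")
    return sum(peak_loads) / node_capacity


def takeover_node_is_adequate(peak_loads, node_capacity, ceiling=0.85):
    """True if the node stays at or below the ceiling in the worst case."""
    return worst_case_utilization(peak_loads, node_capacity) <= ceiling


# One standby (capacity 100 units) backing up three primaries.
print(takeover_node_is_adequate([30, 25, 20], 100))  # True  (75% busy)
print(takeover_node_is_adequate([40, 30, 20], 100))  # False (90% busy)
```

If the check fails, the options named in the text apply: a larger standby, or additional processors and memory acquired through dynamic LPAR operations or Capacity Upgrade on Demand before the applications start.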