Cluster failure reason Post: 302885478


Operating Systems AIX Cluster failure reason Post 302885478 by bakunin on Sunday 26th of January 2014 07:45:03 PM

01-26-2014


Quote:

Originally Posted by Tharsan

Ya actually the final aim is not to use the IBM cluster software (HACMP, powerHA)

So, basically you want to rewrite HACMP, yes? Why not, but be warned: there is a reason good cluster software does not come by the dozen.

Let us see: a cluster is a device for keeping some "service" available even when machines fail. So, what is a service?

A "service" is an application you can reach under a certain network address, therefore you need:

one (or more) network addresses,
some filesystems with data,
some processes serving said service.

This, bound into a group, is called a resource group in HACMP terminology.
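To make the idea concrete, here is a minimal Python sketch of a resource group as a plain data structure. The class, the field names and the order of the steps are my own assumptions for illustration; they are not taken from HACMP.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceGroup:
    """Everything that has to move together when a service changes nodes.

    Hypothetical structure for illustration, not HACMP's internal format.
    """
    name: str
    service_addresses: List[str] = field(default_factory=list)
    filesystems: List[str] = field(default_factory=list)
    start_command: str = ""
    stop_command: str = ""

    def acquire_steps(self) -> List[str]:
        # Assumed order: data first, then addresses, then the processes.
        steps = [f"mount {fs}" for fs in self.filesystems]
        steps += [f"bring up address {ip}" for ip in self.service_addresses]
        if self.start_command:
            steps.append(f"run {self.start_command}")
        return steps

    def release_steps(self) -> List[str]:
        # Releasing is the mirror image: stop processes, drop addresses,
        # unmount filesystems.
        steps = []
        if self.stop_command:
            steps.append(f"run {self.stop_command}")
        steps += [f"take down address {ip}" for ip in self.service_addresses]
        steps += [f"unmount {fs}" for fs in self.filesystems]
        return steps
```

The point of bundling the three kinds of resources is that a takeover always handles them as one unit, never piecemeal.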

You also need some device (say, a script, or whatever) telling you when the service is failing. Just checking some processes is problematic, because in a big software package a certain process may have to stop and another may have to start as part of normal operation. Therefore you need, for every resource group, a customised way of telling whether everything is good or not - a so-called application monitor. In its simplest form it will indeed check some processes, but it can be much more sophisticated than that.
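A minimal sketch of such a monitor in Python, under my own assumptions: the check is any callable returning True for "healthy", and the service is only declared failed after several consecutive bad checks, so that a single hiccup does not trigger a takeover. The names ApplicationMonitor, run_check and max_failures are hypothetical and do not come from HACMP.

```python
import subprocess
from typing import Callable, Sequence


def run_check(cmd: Sequence[str], timeout_s: float = 10.0) -> bool:
    """Run an external check command; healthy means exit status 0 in time."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A hanging or missing check script counts as a failed check.
        return False


class ApplicationMonitor:
    """Declare a service failed after max_failures consecutive bad checks."""

    def __init__(self, check: Callable[[], bool], max_failures: int = 3):
        self._check = check
        self._max_failures = max_failures
        self._failures = 0

    def poll(self) -> bool:
        """Run one check. Return True if the service should be declared failed."""
        if self._check():
            self._failures = 0
        else:
            self._failures += 1
        return self._failures >= self._max_failures
```

A process-based monitor would pass something like lambda: run_check(["/path/to/check_script"]) as the check, where the script path is a placeholder; a more sophisticated one could log in to the application and run a test transaction instead.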

This was the "internal" supervision, taking place on one node. You also need an "external" supervision, where the passive node checks whether the active node is still alive. This is done via heartbeats, but the result is not always easy to interpret: if the service is not reachable via, say, the network, this could mean that the node is failing or that the connecting network is failing. Taking over in the first case corrects the problem, while doing so in the second achieves nothing. HACMP therefore uses network heartbeats, serial heartbeats and heartbeats over shared disks (classically SCSI or SSA, nowadays FC networks) in parallel.
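The reason for several parallel heartbeat paths can be shown with a small decision function. This is only a sketch of the reasoning above, not HACMP's actual algorithm; the path names, the PeerState values and the timeout are assumptions.

```python
from enum import Enum
from typing import Mapping


class PeerState(Enum):
    ALIVE = "alive"
    PATH_FAILURE = "path failure"
    PRESUMED_DEAD = "presumed dead"


def classify_peer(last_seen: Mapping[str, float], now: float,
                  timeout_s: float) -> PeerState:
    """Judge the peer from the time of the last heartbeat on each path.

    last_seen maps a path name (e.g. "network", "serial", "disk") to the
    timestamp of the most recent heartbeat received over that path.
    """
    if not last_seen:
        raise ValueError("at least one heartbeat path is required")
    silent = [path for path, t in last_seen.items() if now - t > timeout_s]
    if not silent:
        return PeerState.ALIVE
    if len(silent) < len(last_seen):
        # The peer still answers on some path, so the silent paths are
        # broken, not the node. Taking over would achieve nothing.
        return PeerState.PATH_FAILURE
    # Silent on every independent path: only now is a takeover justified.
    return PeerState.PRESUMED_DEAD
```

With only one path the middle case cannot be detected at all, which is exactly the ambiguity described above.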

The cluster state which has to be avoided at all costs is the "split brain" condition: both nodes thinking they are primary and the other has failed. To avoid this you need some means of shutting down a node as fast as possible. shutdown will be too slow, halt -q will be better, and something like cat /etc/hosts > /dev/kmem (not possible any more since AIX 5.3 ML 1) would be best (fastest). Because you need to be able to trigger it from outside, HACMP has the DMS (dead-man switch), a kernel extension which takes down the system very fast under certain conditions. While most of HACMP consists of scripts calling other scripts, this part is kernel software. You will have to create such a thing too.
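The principle of a dead-man switch is simple: a timer that healthy cluster software must keep resetting, and that halts the node when it runs out. The following Python sketch only models that logic in user space, which is precisely what a real implementation must not rely on; as said above, the real thing has to live in the kernel. The class name, the on_expire callback and the injectable clock are my own assumptions.

```python
import time
from typing import Callable


class DeadManSwitch:
    """Fire on_expire once if reset() is not called within timeout_s."""

    def __init__(self, timeout_s: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._on_expire = on_expire   # e.g. a routine that halts the node
        self._clock = clock           # injectable so the logic can be tested
        self._deadline = clock() + timeout_s
        self._fired = False

    def reset(self) -> None:
        """Called regularly by the cluster software while it is healthy."""
        self._deadline = self._clock() + self._timeout_s

    def tick(self) -> bool:
        """Called periodically; returns True once the switch has fired."""
        if not self._fired and self._clock() >= self._deadline:
            self._fired = True
            self._on_expire()
        return self._fired
```

If the cluster software hangs and stops calling reset(), the node takes itself down before the other node can take over the resources, so both never run the service at the same time.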

So much for now, off the top of my head. There is probably much more to say than what came to my mind right now, so just ask. I suggest reading the IBM Redbooks about HACMP. Implementing cluster software is a laudable effort, because even if you fail you will come to appreciate the problems it poses. If you succeed, all the better.

I hope this helps.

bakunin
