Cluster failure reason

01-24-2014

Yes, exactly. There are more possible failure modes than one brain can imagine, but only a finite list of things your cluster is supposed to provide and of resources it uses to run.

Corona688

01-24-2014

Make sure you are pinging the persistent IP and NOT the service IP, because the service IP will move between the nodes, whereas the persistent IP is bound to its node.

You can check the cluster services and the cluster state, and write a wrapper script that sends an email if any of them goes south.
And of course take into consideration all the valuable suggestions given by forum members.
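
A minimal sketch of such a wrapper, in Python. Everything specific in it is an assumption for illustration: the address 192.0.2.11, the mail addresses and the SMTP host "localhost" are placeholders, and the single ping check should be replaced by whatever commands report service and cluster state on your nodes.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Hypothetical checks as (label, command). The address is a placeholder
# for a node's persistent IP, not a value from this thread.
CHECKS = [
    ("persistent IP of node A", ["ping", "-c", "1", "192.0.2.11"]),
]


def run_checks(checks, runner=subprocess.run):
    """Return the labels of all checks whose command fails or cannot run."""
    failed = []
    for label, command in checks:
        try:
            result = runner(command, capture_output=True, timeout=30)
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            ok = False
        if not ok:
            failed.append(label)
    return failed


def build_alert(failed, sender, recipient):
    """Build the alert mail listing every failed check."""
    msg = EmailMessage()
    msg["Subject"] = "Cluster check failed: " + ", ".join(failed)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The following checks failed:\n" + "\n".join(failed))
    return msg


def main():
    failed = run_checks(CHECKS)
    if failed:
        # Placeholder addresses and mail host; adjust to your site.
        msg = build_alert(failed, "cluster@example.com", "admin@example.com")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
```

Calling main() from cron every few minutes would be one way to use it. Keeping the checks in a plain list makes it easy to add one entry per service or resource the cluster is supposed to provide.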

ibmtech

01-24-2014

The backbone of high availability is the nodes in a cluster checking each other, so you should look into heartbeats. They are usually implemented through shared disk space, such as a rather small, concurrently accessible VG: the nodes write a few bits there, and from the freshness of that data each node can decide which of the others are still up and alive.
Additionally, there is heartbeating via network interfaces. Some setups even use, or used, serial interfaces and the like.
This is an important part of HACMP/PowerHA and other cluster technologies.
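
The freshness idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an ordinary file stands in for the shared disk area (a real disk heartbeat does not go through a filesystem like this), the function names are made up, and the clocks of both nodes are assumed to be roughly in sync.

```python
import os
import time


def write_heartbeat(path, node_name, now=None):
    """Record this node's latest timestamp in the shared heartbeat file."""
    now = time.time() if now is None else now
    tmp = f"{path}.{node_name}.tmp"
    with open(tmp, "w") as f:
        f.write(f"{node_name} {now}\n")
        f.flush()
        os.fsync(f.fileno())
    # Replace the old file in one step so a reader never sees a half-written one.
    os.replace(tmp, path)


def peer_is_alive(path, max_age, now=None):
    """Return True if the peer's heartbeat is no older than max_age seconds."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            _name, stamp = f.read().rsplit(None, 1)
        age = now - float(stamp)
    except (OSError, ValueError):
        # Missing or unreadable heartbeat counts as "not alive".
        return False
    return age <= max_age
```

Each node would call write_heartbeat() on its own file at a fixed interval and peer_is_alive() on the other node's file, with max_age set to a few missed intervals.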

Have a look here:
Heartbeating in HACMP - AIX 6.1 Information Center

zaxxon

01-26-2014

Quote:

Originally Posted by Tharsan

Ya actually the final aim is not to use the IBM cluster software (HACMP, powerHA)

So, basically you want to rewrite HACMP, yes? Why not, but be warned: there is a reason good cluster software does not come by the dozen.

Let us see: a cluster is a device for making some "service" available even in cases of machines failing. So, what is a service?

A "service" is an application you can reach under a certain network address, therefore you need:

one (or more) network addresses,
some filesystems with data,
some processes serving said service.

This, bound into a group, is called a resource group in HACMP terminology.
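
As a rough illustration, such a group could be modelled like this in Python. The class and method names are hypothetical, and the ordering (filesystems, then addresses, then processes, with release in reverse) is an assumption of this sketch, not a statement about how HACMP does it.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    """The three ingredients of a service, bundled so they move together."""
    name: str
    addresses: list = field(default_factory=list)
    filesystems: list = field(default_factory=list)
    processes: list = field(default_factory=list)

    def acquire_order(self):
        """Steps to bring the group up; processes start last."""
        return (
            [("mount", fs) for fs in self.filesystems]
            + [("add_address", ip) for ip in self.addresses]
            + [("start", proc) for proc in self.processes]
        )

    def release_order(self):
        """Steps to take the group down: the acquire steps undone in reverse."""
        undo = {"mount": "unmount", "add_address": "remove_address", "start": "stop"}
        return [(undo[action], target)
                for action, target in reversed(self.acquire_order())]
```

A takeover then amounts to running release_order() on the failing node (if it is still reachable) and acquire_order() on the node taking over.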

You also need some device (say, a script, or whatever) telling you when the service is failing. Just checking some processes is problematic, because in a big software package a certain process may have to stop and another start as part of normal operation. Therefore you need, for every resource group, a customised way of telling whether everything is good or not: a so-called application monitor. In its simplest form it will indeed check some processes, but it can be much more sophisticated than that.
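
One way to go beyond a process check is to ask the service itself. The Python sketch below is an assumption-laden example, not HACMP's monitor: it treats "a TCP connection can be opened" as the health probe and only reports an outage after several consecutive failures. The names and the failure limit are made up for illustration.

```python
import socket


def service_answers(host, port, timeout=5.0):
    """Return True if a TCP connection to the service can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class ApplicationMonitor:
    """Report an outage only after max_failures consecutive failed probes."""

    def __init__(self, probe, max_failures=3):
        self.probe = probe              # callable returning True when healthy
        self.max_failures = max_failures
        self.failures = 0

    def check(self):
        """Run the probe once; return True while the service counts as healthy."""
        if self.probe():
            self.failures = 0
        else:
            self.failures += 1
        return self.failures < self.max_failures
```

For example, ApplicationMonitor(lambda: service_answers("192.0.2.10", 5432)) would watch a service on a placeholder address and port. Tolerating a few failed probes keeps a short restart inside the application from triggering a takeover.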

This was the "internal" supervision, taking place on one node. You also need an "external" supervision, where the passive node checks whether the active node is still alive. This is done via heartbeats, but it is not always easy to tell: if the service is not reachable via, say, the network, this could mean that the node is failing or that the connecting network is failing. Taking over in the first case corrects the problem, while doing so in the second achieves nothing. HACMP therefore uses network heartbeats, serial heartbeats and heartbeats through shared disks (classically SCSI or SSA, nowadays FC networks) in parallel.
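
The reasoning behind parallel heartbeat paths can be written down as a small decision function. This Python sketch is only an illustration: the path names, the result labels and the rule "peer is down only if every path is silent" are assumptions of the example.

```python
def assess_peer(last_seen, now, timeout):
    """Classify the peer from the age of the last heartbeat on each path.

    last_seen maps a path name (for example "network", "serial", "disk")
    to the time of the last heartbeat received on it, or None if none
    has been received yet. Returns (verdict, silent_paths).
    """
    alive = [path for path, seen in last_seen.items()
             if seen is not None and now - seen <= timeout]
    silent = [path for path in last_seen if path not in alive]
    if not alive:
        # No path carries a heartbeat: the node itself is the likely culprit.
        return "peer_down", silent
    if silent:
        # The peer still answers somewhere, so only a path has failed.
        return "path_failure", silent
    return "healthy", silent
```

Only "peer_down" would justify a takeover; "path_failure" calls for an alert instead, because moving the service would achieve nothing.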

The cluster state which has to be avoided at all costs is the "split brain" condition: both nodes thinking they are primary and the other is failing. To avoid this you need some means of shutting down a node as fast as possible. shutdown will be too slow, halt -q will be better, and something like cat /etc/hosts > /dev/kmem (not possible any more since AIX 5.3 ML 1) would be best (fastest). Because you need to be able to trigger it from outside, HACMP has the DMS (dead man switch), a kernel extension which takes down the system very fast under certain conditions. While most of HACMP consists of scripts calling other scripts, this part is kernel software. You will have to create such a thing too.
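
The principle of a dead man switch is a countdown that something healthy has to keep resetting. The Python sketch below shows only that principle in user space; it is not how the HACMP DMS is built, and a real one belongs in the kernel precisely because a hung user process cannot be trusted to halt the node. The class name and the on_expire callback are hypothetical.

```python
import threading


class DeadManSwitch:
    """Call on_expire unless reset() is called again within timeout seconds."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire      # e.g. a function that halts the node
        self._lock = threading.Lock()
        self._timer = None

    def reset(self):
        """Restart the countdown; call this from the healthy code path."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self.on_expire)
            self._timer.daemon = True
            self._timer.start()

    def disarm(self):
        """Stop the countdown, for example during a planned shutdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

The cluster manager would call reset() on every successful pass of its main loop. If the loop hangs, the countdown runs out and on_expire fires without needing any cooperation from the stuck code.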

So far, off the top of my head. There is probably much more to say than what came to my mind right now, so just ask. I suggest reading the IBM Redbooks about HACMP. Implementing cluster software is a laudable effort, because even if you fail you will get to appreciate the problems it poses. If you succeed, all the better.

I hope this helps.

bakunin
