Cluster failure reason


 
# 8  
Old 01-24-2014
Yes, exactly. There are more possible failure modes than one brain can imagine, but only a finite list of things your cluster is supposed to provide and of resources it needs to run.
# 9  
Old 01-24-2014
Make sure you are pinging the persistent IP and NOT the service IP, because the service IP will jump between the nodes, whereas the persistent IP is bound to one node.

You can check the cluster services and the cluster state, and write a wrapper script that sends an email if any of those go south.
And of course take into consideration all the valuable suggestions given by the other forum members.
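
A minimal sketch of such a wrapper, assuming Python is available on the monitoring host. The function names (ping_ok, run_checks, build_alert) and the mail addresses are made up for illustration; which checks you plug in (a ping of the persistent IP, a cluster state query, and so on) is up to you.

```python
import subprocess
from email.message import EmailMessage


def ping_ok(address, timeout_s=5):
    """Return True if one ICMP echo request to `address` is answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def run_checks(checks):
    """Run each named check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


def build_alert(failed, sender, recipient):
    """Build an alert mail for the failed checks, or None if all passed."""
    if not failed:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Cluster check failed: " + ", ".join(failed)
    msg.set_content("The following checks failed:\n" + "\n".join(failed))
    return msg
```

Run from cron, the script would call run_checks with something like {"persistent IP": lambda: ping_ok("192.0.2.10")} (a placeholder address) and hand a non-None result of build_alert to smtplib.SMTP(...).send_message().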
This User Gave Thanks to ibmtech For This Post:
# 10  
Old 01-24-2014
The backbone of high availability is the nodes in a cluster checking each other, so you should look into heartbeats. They are usually implemented via shared disk space, such as a rather small, concurrently accessible VG: the nodes write some bits there, and from the freshness of that data each node can decide which nodes are still up and alive.
Additionally there is heartbeating via network interfaces. Some even use or used serial interfaces etc.
This is an important part of HACMP/PowerHA and other Cluster Technologies.
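
To illustrate the freshness idea only, here is a small sketch. It uses an ordinary file on a shared filesystem instead of raw sectors on a shared disk, and it assumes the nodes' clocks are synchronised; both are simplifications of mine and not how HACMP does it. The function names and the file layout are invented.

```python
import os
import tempfile
import time


def write_heartbeat(path, node_name, now=None):
    """Atomically record that `node_name` was alive at time `now`."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{node_name} {now:.3f}\n")
        handle.flush()
        os.fsync(handle.fileno())
    # Readers see either the old or the new record, never a half-written one.
    os.replace(tmp_path, path)


def peer_is_alive(path, max_age_s, now=None):
    """Return True if the peer's record exists and is fresh enough."""
    now = time.time() if now is None else now
    try:
        with open(path) as handle:
            _name, stamp = handle.read().split()
        written_at = float(stamp)
    except (OSError, ValueError):
        return False  # missing or unreadable record: treat as no heartbeat
    return now - written_at <= max_age_s
```

Each node would call write_heartbeat on its own file every few seconds and peer_is_alive on the other node's file; max_age_s should cover several missed writes so that one slow write does not look like a dead node.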

Have a look here:
Heartbeating in HACMP - AIX 6.1 Information Center
This User Gave Thanks to zaxxon For This Post:
# 11  
Old 01-26-2014
Quote:
Originally Posted by Tharsan
Ya actually the final aim is not to use the IBM cluster software (HACMP, powerHA)
So, basically you want to rewrite HACMP, yes? Why not, but be warned: there is a reason good cluster software is not coming in the dozens.

Let us see: a cluster is a device for making some "service" available even in cases of machines failing. So, what is a service?

A "service" is an application you can reach under a certain network address, therefore you need:

one (or more) network addresses,
some filesystems with data,
some processes serving said service.

This, bound into a group, is called a resource group in HACMP terminology.
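
If you write your own software, a resource group could be modelled roughly like this. This is only a sketch: the class, the step names and the order in which the parts are brought up (filesystems, then addresses, then processes) are my assumptions for illustration, not HACMP's actual behaviour.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceGroup:
    """Everything that has to move together when a service changes nodes."""
    name: str
    service_addresses: List[str] = field(default_factory=list)
    filesystems: List[str] = field(default_factory=list)
    processes: List[str] = field(default_factory=list)

    def acquire_steps(self):
        """Steps to bring the group up on a node (order is an assumption)."""
        steps = [("mount", fs) for fs in self.filesystems]
        steps += [("add_address", addr) for addr in self.service_addresses]
        steps += [("start", proc) for proc in self.processes]
        return steps

    def release_steps(self):
        """Steps to take the group down: the acquire steps undone in reverse."""
        undo = {"mount": "unmount", "add_address": "remove_address", "start": "stop"}
        return [(undo[action], target)
                for action, target in reversed(self.acquire_steps())]
```

Releasing in exactly the reverse order of acquiring keeps the two paths consistent, which matters because a takeover is a release on one node followed by an acquire on the other.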

You also need some device (say, a script, or whatever) telling you when the service is failing. Just checking some processes is problematic, because in a big software package it can be part of normal operation that a certain process stops and another one starts. Therefore you need, for every resource group, a customised way of telling whether everything is good or not - a so-called application monitor. In its simplest form it will indeed check some processes, but it can be much more sophisticated than that.
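
A sketch of a monitor that probes the service itself instead of looking at the process list. The TCP probe, the class name and the threshold of three consecutive failures are assumptions for illustration; a real probe would be specific to the application (for example a test query against a database).

```python
import socket


def tcp_service_answers(host, port, timeout_s=2.0):
    """Return True if a TCP connection to the service can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


class ApplicationMonitor:
    """Declare a service failed only after several consecutive bad probes."""

    def __init__(self, probe, failures_allowed=3):
        self.probe = probe
        self.failures_allowed = failures_allowed
        self.consecutive_failures = 0

    def poll(self):
        """Run one probe and return 'ok', 'suspect' or 'failed'."""
        if self.probe():
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_allowed:
            return "failed"
        return "suspect"
```

Requiring several bad probes in a row avoids reacting to a single slow answer, at the price of a longer detection time.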

This was the "internal" supervision, taking place on one node. You also need an "external" supervision, where the passive node checks if the active node is still alive. This is done via heartbeats, but the result is not always easy to interpret: if the service is not reachable via, say, the network, this could mean that the node is failing or that the connecting network is failing. Taking over in the first case corrects the problem, while doing so in the second will achieve nothing. HACMP therefore uses network heartbeats, serial heartbeats and heartbeats through shared disks (classically SCSI or SSA, nowadays FC networks) in parallel.
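
The reasoning behind parallel heartbeat paths can be sketched as a small decision function. The path names and the three result labels are invented here, and the rule is deliberately simplified: this is not HACMP's actual algorithm.

```python
def assess_peer(paths_alive):
    """Classify the peer from per-path heartbeat status.

    `paths_alive` maps a path name to True (heartbeats arriving) or False.
    Returns a (verdict, silent_paths) tuple.
    """
    if not paths_alive:
        raise ValueError("at least one heartbeat path is required")
    silent = sorted(name for name, alive in paths_alive.items() if not alive)
    if not silent:
        return "healthy", []
    if len(silent) == len(paths_alive):
        # No path hears the peer any more: presume the node is down.
        return "peer_down", silent
    # The peer is still seen on some path, so only a path has failed:
    # taking over would achieve nothing.
    return "path_failure", silent
```

With a single path the function can only ever return "healthy" or "peer_down", which is exactly the ambiguity described above; only a second, independent path makes the "path_failure" verdict possible.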

The cluster state which has to be avoided at all costs is the "split brain" condition: both nodes thinking they are primary and the other is failing. To avoid this you need some means of shutting down a node as fast as possible. shutdown will be too slow, halt -q will be better and something like cat /etc/hosts > /dev/kmem (not possible any more since AIX 5.3 ML 1) would be best (fastest). Because you need to be able to trigger it from outside, HACMP has the DMS (dead-man switch), a kernel extension which takes down the system really fast under certain conditions. While most of HACMP consists of scripts calling other scripts, this part is kernel software. You will have to create such a thing too.
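
The logic of a dead-man switch, shown as a user-space sketch only. The real DMS is a kernel extension, and a Python process cannot give the same guarantees; the class name, the timeout and the injectable clock are assumptions made so the idea can be tried out safely.

```python
import time


class DeadManSwitch:
    """Fire a halt action if the switch is not reset within `timeout_s`."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire  # e.g. a function that halts the node
        self.clock = clock
        self.last_reset = clock()
        self.fired = False

    def reset(self):
        """Called regularly by healthy cluster software to prove it is alive."""
        self.last_reset = self.clock()

    def check(self):
        """Called from a timer; fires the halt action once if resets stopped."""
        if not self.fired and self.clock() - self.last_reset > self.timeout_s:
            self.fired = True
            self.on_expire()
        return self.fired
```

The important property is that the node takes itself down when its own cluster software stops resetting the switch, so a hung node cannot keep holding resources while the other node takes over.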

So far, off the top of my head. There is probably much more to say than what came to my mind right now, so just ask. I suggest reading the IBM Redbooks about HACMP. Implementing a cluster software is a laudable effort, because even if you fail you will get to appreciate the problems it poses. If you succeed, all the better.

I hope this helps.

bakunin