Crash dump and Panic message : RSCT Dead Man Switch Timeout for HACMP; halting non-responsive node


# 1  
Crash dump and Panic message : RSCT Dead Man Switch Timeout for HACMP; halting non-responsive node

Dear all,
I have two AIX systems:
- Model: P770
- OS version: AIX 6.1
- Patch level: 6100-07-04-1216
- HA version: HACMP v6.1.0.8
- Hosts: A, B

Last Wednesday, my B system suddenly went down with a crash dump. About one minute later, the A system also went down with a crash dump. I checked the A system's dump using the kdb command and found the following:
PANIC MESSAGES:
RSCT Dead Man Switch Timeout for HACMP; halting non-responsive node
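For reference, I read the panic string from the dump roughly like this (the vmcore path below is just an example from my setup; the actual dump device or file may differ):

Code:
# open the system dump image with the kernel debugger
kdb /var/adm/ras/vmcore.0 /unix
# at the (0)> prompt, 'stat' prints the panic string and system status
(0)> stat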


So I concluded that this message was the cause of the A system going down. Is that judgment correct?
I also tried to look up what the dead man switch is, but I did not fully understand it.
Why did the dead man switch bring the A system down?

Could anybody explain this to me?
# 2  
When the crashed AIX node restarts, issue the command:

Code:
errpt -J KERNEL_PANIC

to look for any AIX error log entries that were created when the node crashed. If this command produces output like:

Code:
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
225E3B63   0821085101 T S PANIC          SOFTWARE PROGRAM ABNORMALLY TERMINATED

... then try to run:

Code:
errpt -a

...to get details for the event.
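If the full output is long, you can also limit the detailed report to the single identifier from the summary line, for example:

Code:
# detailed report for just the PANIC entry
errpt -a -j 225E3B63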

Hope this helps.
# 3  
Hi Neo,
Thank you for your reply.
As you suggested, the errpt log from the A-system node is below.
When I checked the logs, the dead man switch was triggered by heavy I/O, which seems to be what caused the system to crash.
Please confirm this for me.

Code:
[A-system]:root>errpt -a 
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   1218142419 I O grpsvcs        Group Services daemon started
97419D60   1218142419 I O topsvcs        Topology Services daemon started
A6DF45AA   1218142219 I O RMCdaemon      The daemon is started.
A2205861   1218142119 P S SYSPROC        Excessive interrupt disablement time
67145A39   1218142019 U S SYSDUMP        SYSTEM DUMP
F48137AC   1218141919 U O minidump       COMPRESSED MINIMAL DUMP
225E3B63   1218141919 T S PANIC          SOFTWARE PROGRAM ABNORMALLY TERMINATED
9DBCFDEE   1218142119 T O errdemon       ERROR LOGGING TURNED ON
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
AB59ABFF   1218141019 U U LIBLVM         Remote node Concurrent Volume Group fail
90EDB0A5   1218141019 P S topsvcs        Dead Man Switch being allowed to expire.
3D32B80D   1218141019 P S topsvcs        NIM thread blocked

[A-system]:root>errpt -j 90EDB0A5   
LABEL:          TS_DMS_EXPIRING_EM
IDENTIFIER:     90EDB0A5

Date/Time:       Wed Dec 18 14:10:39 KST 2019
Sequence Number: 106262
Machine Id:      00C25B674C00
Node Id:         MESDB01
Class:           S
Type:            PEND
WPAR:            Global
Resource Name:   topsvcs

Description
Dead Man Switch being allowed to expire.
If a TS_DMS_RESTORED_TE error appears after this, that will indicate this
condition has been recovered from.  Otherwise, a DMS-triggered node failure
should be expected to occur after the time indicated in the Detail Data.

Probable Causes
Topology Services has detected blockage that puts it in danger of suffering
a sundered network.  This is due to all viable NIM processes experiencing
blockage, or the daemon's main thread being hung for too long.

User Causes
Excessive I/O load is causing high I/O interrupt traffic
Excessive memory consumption is causing high memory contention

        Recommended Actions
        Reduce application load on the system
        Change (relax) Topology Services tunable parameters
        Call IBM Service if problem persists

Failure Causes
Problem in Operating System prevents processes from running
Excessive I/O interrupt traffic prevents processes from running
Excessive virtual memory activity prevents Topology Services from making progress

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Change (relax) Topology Services tunable parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,5004
ERROR ID
6Z0PvE0DHPyR/8YX./cU08....................
REFERENCE CODE

Time remaining until DMS triggers (in msec)
        3000
DMS trigger interval (in msec)
       64000
[A-system]:root>errpt -j  3D32B80D
LABEL:          TS_NIM_ERROR_STUCK_
IDENTIFIER:     3D32B80D

Date/Time:       Wed Dec 18 14:10:19 KST 2019
Sequence Number: 106261
Machine Id:      00C25B674C00
Node Id:         MESDB01
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   topsvcs

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
The system clock was set forward

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention
The system clock was manually set forward

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,7916
ERROR ID
6BUfAx.vGPyR/bxj1/cU08....................
REFERENCE CODE

Thread which was blocked
send thread
Interval in seconds during which process was blocked
          37
Interface name
rmndhb_lv_01.1_2

# 4  
Well, it appears to me that you have a performance issue (a resource issue) causing the failure.

It could be a disk error, a bus error, or a file system error causing the I/O problem; or, as the message says, it could be a memory issue.

So you should probably look in your syslog files and boot messages for errors related to the recommended actions above:

Code:
  
        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters

I would start at the top (look for errors in log files) first.
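To examine I/O and memory activity, something like this would be a reasonable starting point (the intervals and counts below are only examples):

Code:
# scan the error log for disk, adapter, or file system errors
errpt | more
# watch paging (pi/po) and I/O wait (wa) for one minute
vmstat 5 12
# per-disk throughput, service times, and queue statistics
iostat -D 5 12
# overall memory usage snapshot
svmon -G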
# 5  
Could it be that whatever supplies the common disk used for the heartbeat failed? In that case, both nodes would be unable to keep updating the shared disk, and the usual response is to terminate all services to avoid getting in the way, i.e. to panic/abort. We've had an Oracle RAC database cluster do this before. Not pretty, but it is the best course of action to avoid damage.




Robin
# 6  
Hi Neo and rbattel,
My systems are using GPFS (for an archive directory) and raw devices with Oracle RAC on top of HACMP.

I think the reason the A-system brought itself down is that when the B-system went down due to a system bug, all of the Oracle sessions from the B-system node moved to the A-system, which caused huge I/O on the A-system. Sync on the A-system slowed down, and as a result the system went down when the dead man switch limit was reached.
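If sar data collection was enabled on the A-system, I should be able to confirm the I/O spike around the failover time with something like this (the sa18 file name assumes data for the 18th of the month was collected):

Code:
# buffer/cache I/O history from the daily sar data file
sar -b -f /var/adm/sa/sa18
# per-disk activity for the same day
sar -d -f /var/adm/sa/sa18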
# 7  
You lost all heartbeats from node 1 to node 2 - thats the reason for the crash. This might happen when your system is simply too busy - but since you should have both heartbeat on disk and heartbeat via network, you should think that there is time enough to send at least one every couple of seconds, Your cluster heartbeat settings might be too tight - giving it more time for the heartbeat might help preventing this issue in the future.
Just out of curiosity - using GPFS and HACMP and RAC on the same systems appears to me to be a completely unnecessary setup, as you are running essentially 3 different cluster products on a system when RAC alone would suffice. Why ?
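To see how tight your current settings actually are, the Topology Services status dump shows the heartbeat interval and sensitivity per network, e.g.:

Code:
# long status for Topology Services, including heartbeat
# interval/sensitivity and NIM state per network
lssrc -ls topsvcs

In HACMP 6.1 the failure detection rate is normally relaxed per network module through smit hacmp (Extended Configuration), rather than by editing anything by hand.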
