Hi Neo,
Thank you for your reply.
As you mentioned, the errpt log on the A-system node is as follows.
When I checked the logs, the Dead Man Switch was allowed to expire because of heavy I/O, which appears to have caused the system to crash.
Please confirm this for me.
Code:
[A-system]:root>errpt -a
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
AFA89905 1218142419 I O grpsvcs Group Services daemon started
97419D60 1218142419 I O topsvcs Topology Services daemon started
A6DF45AA 1218142219 I O RMCdaemon The daemon is started.
A2205861 1218142119 P S SYSPROC Excessive interrupt disablement time
67145A39 1218142019 U S SYSDUMP SYSTEM DUMP
F48137AC 1218141919 U O minidump COMPRESSED MINIMAL DUMP
225E3B63 1218141919 T S PANIC SOFTWARE PROGRAM ABNORMALLY TERMINATED
9DBCFDEE 1218142119 T O errdemon ERROR LOGGING TURNED ON
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
90EDB0A5 1218141019 P S topsvcs Dead Man Switch being allowed to expire.
3D32B80D 1218141019 P S topsvcs NIM thread blocked
[A-system]:root>errpt -j 90EDB0A5
LABEL: TS_DMS_EXPIRING_EM
IDENTIFIER: 90EDB0A5
Date/Time: Wed Dec 18 14:10:39 KST 2019
Sequence Number: 106262
Machine Id: 00C25B674C00
Node Id: MESDB01
Class: S
Type: PEND
WPAR: Global
Resource Name: topsvcs
Description
Dead Man Switch being allowed to expire.
If a TS_DMS_RESTORED_TE error appears after this, that will indicate this
condition has been recovered from. Otherwise, a DMS-triggered node failure
should be expected to occur after the time indicated in the Detail Data.
Probable Causes
Topology Services has detected blockage that puts it in danger of suffering
a sundered network. This is due to all viable NIM processes experiencing
blockage, or the daemon's main thread being hung for too long.
User Causes
Excessive I/O load is causing high I/O interrupt traffic
Excessive memory consumption is causing high memory contention
Recommended Actions
Reduce application load on the system
Change (relax) Topology Services tunable parameters
Call IBM Service if problem persists
Failure Causes
Problem in Operating System prevents processes from running
Excessive I/O interrupt traffic prevents processes from running
Excessive virtual memory activity prevents Topology Services from making progress
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Change (relax) Topology Services tunable parameters
Call IBM Service if problem persists
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,5004
ERROR ID
6Z0PvE0DHPyR/8YX./cU08....................
REFERENCE CODE
Time remaining until DMS triggers (in msec)
3000
DMS trigger interval (in msec)
64000
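To put the detail data into perspective: assuming the "DMS trigger interval" is the full timeout and "Time remaining" is what was left when the event was logged, the topsvcs daemon had already been unable to reset the switch for roughly 61 seconds. A minimal sketch of that arithmetic (the two values are taken from the detail data above):

```shell
# Values from the TS_DMS_EXPIRING_EM detail data, in msec
interval_ms=64000    # DMS trigger interval
remaining_ms=3000    # time remaining until DMS triggers when the event was logged

# Rough estimate of how long topsvcs had already been blocked
blocked_ms=$((interval_ms - remaining_ms))
echo "topsvcs blocked for approx ${blocked_ms} ms ($((blocked_ms / 1000)) s)"
```

A blockage that long is consistent with the "Excessive I/O load" user cause listed in the report, since the daemon only needed to run once within the 64-second window to reset the switch.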
[A-system]:root>errpt -j 3D32B80D
LABEL: TS_NIM_ERROR_STUCK_
IDENTIFIER: 3D32B80D
Date/Time: Wed Dec 18 14:10:19 KST 2019
Sequence Number: 106261
Machine Id: 00C25B674C00
Node Id: MESDB01
Class: S
Type: PERM
WPAR: Global
Resource Name: topsvcs
Description
NIM thread blocked
Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
The system clock was set forward
User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention
The system clock was manually set forward
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,7916
ERROR ID
6BUfAx.vGPyR/bxj1/cU08....................
REFERENCE CODE
Thread which was blocked
send thread
Interval in seconds during which process was blocked
37
Interface name
rmndhb_lv_01.1_2
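When a long `errpt -a` listing is dominated by repeated entries (like the LIBLVM lines above), a grep filter can help isolate the Topology Services and dump-related events that matter for the DMS timeline. This sketch runs the filter over a captured sample so it works anywhere; on the node itself you would pipe `errpt` output directly (the sample lines are copied from the listing above, not new data):

```shell
# Captured sample of the errpt summary lines relevant to the DMS event
errpt_sample='90EDB0A5 1218141019 P S topsvcs Dead Man Switch being allowed to expire.
3D32B80D 1218141019 P S topsvcs NIM thread blocked
AB59ABFF 1218141019 U U LIBLVM Remote node Concurrent Volume Group fail
67145A39 1218142019 U S SYSDUMP SYSTEM DUMP'

# On the real node: errpt | grep -cE 'topsvcs|SYSDUMP|PANIC'
topsvcs_count=$(printf '%s\n' "$errpt_sample" | grep -c 'topsvcs')
echo "topsvcs entries in sample: ${topsvcs_count}"
```

This keeps the two topsvcs events (the DMS expiry warning and the blocked NIM thread) in view next to the subsequent system dump, which is the sequence you would expect if the DMS fired.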