rg_move: Failure occurred

10-25-2010

Hi,

I have a two-node cluster with 5 RGs: 3 of them on node2 and 2 on node1.

Cluster verification completes without errors.

When I try to move an RG or take it offline, I get this error for every RG:

Code:

rg_move: Failure occurred while processing Resource Group RG_XXX. Manual intervention required.

Then RG_XXX goes into the ERROR state and the cluster becomes UNSTABLE.

I tried unmounting everything, varying off the VGs, and running smit hacmp -> Problem Determination Tools -> Recover From HACMP Script Failure, but it didn't help.

I had to reboot both nodes to get the cluster back to the STABLE state.

Then I started up all RGs successfully.

Then I manually ran the stop script for RG_XXX and all applications stopped successfully. I then unmounted the filesystems, ran varyoffvg, and tried to take RG_XXX offline via HACMP, but I got the same error as above.

What could be the problem that prevents every RG from being taken offline or moved to another node?
Could you give me some hints on where to look or what to check, please?

Thank you.

phobus

10-25-2010

Did you check whether /tmp/hacmp.out contains more detailed information about the actions you issued (if that is still the current path to it)?

zaxxon

10-25-2010

I got this from hacmp.out. Every event above the ERROR banner completed with exit code 0.

Code:

Calling /HACMP_V5/events/after_node_down_local with parameters: node_down_local 1
Oct 25 11:10:51 EVENT COMPLETED: after_node_down_local 0

Oct 25 11:10:51 EVENT COMPLETED: node_down_local 0



***************************
Oct 25 2010 11:10:51 !!!!!!!!!! ERROR !!!!!!!!!!
***************************
Oct 25 2010 11:10:51 rg_move: Failure occurred while processing Resource Group RG_XXX. Manual intervention required.

Oct 25 11:10:52 POST EVENT COMMAND: after_rg_move node2 2 RELEASE

Calling /HACMP_V5/events/after_rg_move with parameters: rg_move 1 node2 2 RELEASE
Oct 25 11:10:52 EVENT COMPLETED: after_rg_move node2 2 RELEASE 0

Oct 25 11:10:52 EVENT COMPLETED: rg_move node2 2 RELEASE 0


Oct 25 11:10:52 POST EVENT COMMAND: after_rg_move_release node2 2

Calling /HACMP_V5/events/after_rg_move_release with parameters: rg_move_release 1 node2 2
Oct 25 11:10:52 EVENT COMPLETED: after_rg_move_release node2 2 0

Oct 25 11:10:52 EVENT COMPLETED: rg_move_release node2 2 0

                        HACMP Event Summary
Event: TE_RG_MOVE
Start time: Mon Oct 25 11:10:27 2010

End time: Mon Oct 25 11:10:52 2010

Action:         Resource:                       Script Name:
----------------------------------------------------------------------------
Releasing resource group:       RG_XXX      node_down_local
Search on: Mon.Oct.25.11:10:28.MESZ.2010.node_down_local.RG_XXX.ref
Releasing resource:     All_servers     stop_server
Search on: Mon.Oct.25.11:10:29.MESZ.2010.stop_server.All_servers.RG_XXX.ref
Resource offline:       All_nonerror_servers    stop_server
Search on: Mon.Oct.25.11:10:49.MESZ.2010.stop_server.All_nonerror_servers.RG_XXX.ref
Releasing resource:     All_service_addrs       release_service_addr
Search on: Mon.Oct.25.11:10:50.MESZ.2010.release_service_addr.All_service_addrs.RG_XXX.ref
Resource offline:       All_nonerror_service_addrs      release_service_addr
Search on: Mon.Oct.25.11:10:51.MESZ.2010.release_service_addr.All_nonerror_service_addrs.RG_XXX.ref
Error encountered with group:   RG_XXX      node_down_local
Search on: Mon.Oct.25.11:10:51.MESZ.2010.node_down_local.RG_XXX.ref
----------------------------------------------------------------------------

Oct 25 11:10:52 EVENT START: event_error 1 TE_RG_MOVE


Oct 25 11:10:52 PRE EVENT COMMAND: before_event_error 1 TE_RG_MOVE

Calling /HACMP_V5/events/before_event_error with parameters: event_error 1 TE_RG_MOVE
Oct 25 11:10:52 EVENT COMPLETED: before_event_error 1 TE_RG_MOVE 0

WARNING: Cluster node0102 Failed while running event [RG], exit status was 1
Check hacmp.out on this node for errors.
FFDC event log collection saved to /tmp/ibmsupt/hacmp/eventlogs.2010.10.25.11.10

Oct 25 11:10:52 POST EVENT COMMAND: after_event_error 1 TE_RG_MOVE

Calling /HACMP_V5/events/after_event_error with parameters: event_error 0 1 TE_RG_MOVE
Oct 25 11:10:52 EVENT COMPLETED: after_event_error 1 TE_RG_MOVE 0

Oct 25 11:10:52 EVENT COMPLETED: event_error 1 TE_RG_MOVE 0


Oct 25 11:16:27 EVENT START: config_too_long 360 TE_RG_MOVE


Oct 25 11:16:28 PRE EVENT COMMAND: before_config_too_long 360 TE_RG_MOVE

Calling /HACMP_V5/events/before_config_too_long with parameters: config_too_long 360 TE_RG_MOVE
Oct 25 11:16:28 EVENT COMPLETED: before_config_too_long 360 TE_RG_MOVE 0

FFDC event log collection saved to /tmp/ibmsupt/hacmp/eventlogs.2010.10.25.11.16
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 360 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 390 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 420 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 450 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 480 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 540 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 600 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 660 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 720 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 780 seconds. Please check cluster status.
WARNING: Cluster node0102 has been running recovery program 'TE_RG_MOVE' for 900 seconds. Please check cluster status.

The first line, "Calling /HACMP_V5/events/after_node_down_local with parameters: node_down_local 1", shows the script being run with parameter 1. Is that OK?

Thank you for your support.

phobus

10-26-2010

I think the "1" in "node_down_local 1" is just an intermediate status while the RG is being brought down, i.e. not up yet. It does not look like the problem to me, since that event completed with a parameter or status of 0 in the very next lines.
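
If you want to check this across the whole log rather than by eye, here is a minimal Python sketch that pulls the trailing number from every "EVENT COMPLETED" line. It assumes the last field on those lines is the event's exit status, which is what the excerpt suggests but which I have not confirmed against the documentation. The function names and the sample list are made up for illustration.

```python
import re

# Matches e.g. "Oct 25 11:10:51 EVENT COMPLETED: node_down_local 0".
# Assumption: the last whitespace-separated field is the exit status.
COMPLETED = re.compile(
    r"EVENT COMPLETED:\s+(?P<event>\S+)(?P<args>.*?)\s+(?P<rc>-?\d+)\s*$"
)

def completed_events(lines):
    """Yield (event_name, exit_status) for every EVENT COMPLETED line."""
    for line in lines:
        m = COMPLETED.search(line)
        if m:
            yield m.group("event"), int(m.group("rc"))

def failed_events(lines):
    """Return only the events whose exit status is non-zero."""
    return [(name, rc) for name, rc in completed_events(lines) if rc != 0]

# Lines copied from the excerpt posted above.
sample = [
    "Oct 25 11:10:51 EVENT COMPLETED: after_node_down_local 0",
    "Oct 25 11:10:51 EVENT COMPLETED: node_down_local 0",
    "Oct 25 11:10:52 EVENT COMPLETED: rg_move node2 2 RELEASE 0",
]
print(failed_events(sample))
```

To run it on a real log, pass an open file instead of the list, for example `failed_events(open("/tmp/hacmp.out"))`. An empty result would match what phobus reported: no event script itself returned a non-zero status.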

While the cluster is in an undefined state, I would check which VGs are still active (you should know which network and disk resources are part of RG_XXX) and which adapters are still up, to get a clue as to which of them might have a problem. Also check whether the scripts you issued for that RG have written a log somewhere (if they log at all).
There are also 2 lines pointing to FFDC event log collections under /tmp/ibmsupt/hacmp, which could be investigated with one of the fc* commands in /usr/sbin/rsct/bin, such as fcreport. You might have to check some documentation for that; google for "aix ffdc" and you'll find some IBM documentation sites.

Is there anything related in the errpt that might be helpful?

Did this cluster work when it was tested? Have there been changes to the hardware or software with no cluster tests run afterwards?

What version of HA/CMP do you use?

Sadly, hacmp.out does not show the detailed information I hoped to see. You can check and set its verbosity via
smitty hacmp -> Problem Determination -> HACMP Log Viewing and Management -> Change/Show HACMP Log File Parameters -> Select your node ...

This is the path on an HA/CMP 5.3 cluster node. Later versions should have a similar path from what I saw.
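
One more thing that may help in finding the details: the event summary prints a "Search on:" string under each entry, including the "Error encountered with group" one. Below is a rough Python sketch that collects those strings for the error entries so they can be searched for in the full hacmp.out. It assumes the "Search on:" strings mark the matching detailed section of the log and that the error line has the columns "group, script name", as in the excerpt. The function name is hypothetical.

```python
def error_search_strings(lines):
    """Return (group, script, search_string) for each error entry in an event summary."""
    results = []
    pending = None  # error entry still waiting for its "Search on:" line
    for line in lines:
        text = line.strip()
        if text.startswith("Error encountered with group:"):
            fields = text.split(":", 1)[1].split()
            # Assumption: the columns are "<group> <script name>".
            pending = (fields[0], fields[1]) if len(fields) >= 2 else None
        elif text.startswith("Search on:") and pending:
            results.append(pending + (text.split(":", 1)[1].strip(),))
            pending = None
        else:
            # Any other line ends the current entry.
            pending = None
    return results

# Lines copied from the event summary posted above.
summary = [
    "Resource offline:       All_nonerror_service_addrs      release_service_addr",
    "Search on: Mon.Oct.25.11:10:51.MESZ.2010.release_service_addr.All_nonerror_service_addrs.RG_XXX.ref",
    "Error encountered with group:   RG_XXX      node_down_local",
    "Search on: Mon.Oct.25.11:10:51.MESZ.2010.node_down_local.RG_XXX.ref",
]
for group, script, ref in error_search_strings(summary):
    print(group, script, ref)
```

Searching hacmp.out for the returned string should, if the assumption holds, land on the node_down_local section for RG_XXX, which is where the failing step would show up once the log is verbose enough.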

zaxxon

10-27-2010

Thanks for helping.
I will try to do what you suggested above.

phobus
