We have a two-node Oracle RAC cluster running RHEL 5 (2.6.18-92.1.10.el5PAE) on HP DL360 hardware.
We have a node eviction issue: Oracle evicts a node, and it is not very consistent.
Oracle Support has looked at all the log files and said this is not an Oracle issue but rather a system or OS issue. The server does not respond within roughly 30 seconds, so it gets evicted.
Oracle has also given us some sample monitoring scripts, called OS Watcher, which basically run vmstat, top, mpstat, etc. every 10 seconds.
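For anyone unfamiliar with the OS Watcher approach, it is essentially a loop that stamps and appends OS-tool output to a log file. A minimal sketch of the idea (an illustration only, not the real OSW scripts; the log path, pass count, and 1-second demo interval are assumptions):

```shell
#!/bin/sh
# Illustrative OS Watcher-style collector (NOT the actual OSW scripts):
# each pass writes a "zzz ***<timestamp>" header plus a few vmstat samples.
LOG=${1:-/tmp/osw_vmstat.log}   # hypothetical log path
: > "$LOG"                      # start fresh for this demo
i=0
while [ "$i" -lt 3 ]; do        # the real collector loops until stopped
    echo "zzz ***$(date)" >> "$LOG"
    # three 1-second samples per pass, like the output shown below
    vmstat 1 3 >> "$LOG" 2>/dev/null || echo "vmstat unavailable" >> "$LOG"
    i=$((i + 1))
    sleep 1                     # OSW itself uses a 10-second interval
done
grep -c '^zzz' "$LOG"           # one header per pass; prints 3
```

The timestamp headers are what make the gap analysis below possible: the last header written before an eviction shows when the box stopped scheduling the collector.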
From these log files, we can see the collection stop updating before the eviction happens. For example, if the node eviction happened at 03:41:00 AM, we see that logging stopped around 03:40:30 AM, which clearly tells us the server was not responding.
These scripts run locally on the server.
I am also running a simple ping across the servers to check status, and we see similar behavior: even the ping monitor stops writing output before an eviction happens.
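The cross-server check is essentially a timestamped ping loop; a minimal sketch of that idea (the script, its default target, and the demo pass count are assumptions, not our exact monitor):

```shell
#!/bin/sh
# Illustrative cross-server ping monitor (an assumption, not our exact script):
# one timestamped line per probe, so a gap in the log pinpoints when the
# peer stopped answering.
TARGET=${1:-127.0.0.1}          # hypothetical default; in practice we aim it at host2
n=0
while [ "$n" -lt 3 ]; do        # the real monitor loops until stopped
    reply=$(ping -c 1 -W 2 "$TARGET" 2>/dev/null | grep 'bytes from')
    printf '%s : %s\n' "$(date)" "${reply:-NO REPLY}"
    n=$((n + 1))
    sleep 1                     # we run ours every 3 seconds
done
```

Because each line carries its own wall-clock stamp, a missing interval in the output is evidence of the peer (or the network path) going silent, independent of the local OS Watcher logs.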
So from Oracle's perspective, it is very clear that this is either a server (hardware) or OS issue.
At the time the node eviction happens, the server is quite idle according to the OS Watcher collection.
My question is: how do we go from here? I am not a sysadmin, I am a DBA.
What other information could we gather that would tell us more about what is causing the server to hang? We don't see anything in the server log files (messages log).
What tools could we use to troubleshoot either a hardware issue or an OS issue?
Code:
Cluster Log:
2010-06-25 03:41:23.629
[cssd(11240)]CRS-1612:node host2 (2) at 50% heartbeat fatal, eviction in 14.008 seconds
2010-06-25 03:41:24.631
[cssd(11240)]CRS-1612:node host2 (2) at 50% heartbeat fatal, eviction in 13.008 seconds
2010-06-25 03:41:31.623
[cssd(11240)]CRS-1611:node host2 (2) at 75% heartbeat fatal, eviction in 6.008 seconds
2010-06-25 03:41:35.631
[cssd(11240)]CRS-1610:node host2 (2) at 90% heartbeat fatal, eviction in 2.254 seconds
2010-06-25 03:41:36.622
[cssd(11240)]CRS-1610:node host2 (2) at 90% heartbeat fatal, eviction in 1.008 seconds
2010-06-25 03:41:37.624
[cssd(11240)]CRS-1610:node host2 (2) at 90% heartbeat fatal, eviction in 0.008 seconds
2010-06-25 03:41:38.147
[cssd(11240)]CRS-1607:CSSD evicting node host2. Details in /u01/app/oracle/product/crs/11.1/log/host1/cssd/ocssd.log.
[cssd(11240)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host1 .
Based on the log, the eviction would have started after 03:41:38, and the other logs below show that host2 was not responding.
This is the ping log for the host2 server, which we run from host1 every 3 seconds. It missed the 03:41:06 ping and everything after it.
Fri, 25 Jun 2010 03:40:45 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35503 ttl=64 time=0.164 ms
Fri, 25 Jun 2010 03:40:48 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35504 ttl=64 time=0.157 ms
Fri, 25 Jun 2010 03:40:51 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35505 ttl=64 time=0.163 ms
Fri, 25 Jun 2010 03:40:54 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35506 ttl=64 time=0.164 ms
Fri, 25 Jun 2010 03:40:57 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35507 ttl=64 time=0.136 ms
Fri, 25 Jun 2010 03:41:00 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35508 ttl=64 time=0.134 ms
Fri, 25 Jun 2010 03:41:03 -0700 : 64 bytes from host1.local (199.99.202.179): icmp_seq=35509 ttl=64 time=0.142 ms
This is not a network issue; here is the proof. We run OS monitoring locally every 10 seconds.
This is the vmstat output, and you can see it stopped at 03:41:06 without even completing its output.
zzz ***Fri Jun 25 03:40:46 PDT 2010
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 385092 168152 78308 2595720 1 1 125 73 5 2 4 2 91 3 0
0 1 385092 169152 78308 2595764 0 0 119 158 1145 5154 1 3 96 1 0
1 0 385092 168904 78316 2595756 0 0 51 93 1076 4518 0 0 98 1 0
zzz ***Fri Jun 25 03:40:56 PDT 2010
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
10 0 385092 167224 78396 2595808 1 1 125 73 5 2 4 2 91 3 0
0 1 385092 168340 78396 2595844 0 0 83 136 1113 5163 4 3 92 1 0
0 0 385092 168588 78404 2595836 0 0 259 118 1123 4650 0 0 98 1 0
zzz ***Fri Jun 25 03:41:06 PDT 2010
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 385092 167836 78488 2595996 1 1 125 73 5 2 4 2 91 3 0
Output from the mpstat monitoring shows the same thing: at 03:41:06 only the header was written, with no samples after it.
zzz ***Fri Jun 25 03:40:46 PDT 2010
Linux 2.6.18-92.1.10.el5PAE (host2.local) 06/25/2010
03:40:46 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
03:40:47 AM all 1.00 0.00 2.74 0.50 0.00 0.25 0.00 95.51 1135.64
03:40:48 AM all 0.50 0.00 0.25 1.00 0.00 0.00 0.00 98.25 1074.00
03:40:49 AM all 1.25 0.00 1.75 4.26 0.00 0.00 0.00 92.73 1292.00
Average: all 0.92 0.00 1.58 1.92 0.00 0.08 0.00 95.50 1167.11
zzz ***Fri Jun 25 03:40:56 PDT 2010
Linux 2.6.18-92.1.10.el5PAE (host2.local) 06/25/2010
03:40:56 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
03:40:57 AM all 3.50 0.00 3.25 0.75 0.00 0.00 0.00 92.50 1116.00
03:40:58 AM all 0.50 0.00 0.25 1.49 0.00 0.00 0.00 97.76 1120.00
03:40:59 AM all 2.01 0.00 0.50 2.26 0.00 0.00 0.00 95.24 1109.00
Average: all 2.00 0.00 1.33 1.50 0.00 0.00 0.00 95.17 1115.00
zzz ***Fri Jun 25 03:41:06 PDT 2010
Since it was obviously working until recently, the question has to be about what has recently changed in the operating systems on your platforms. Have OS patches been applied? Are disks filling up? Is a NIC failing?
This seems like a network problem between host2 and the other nodes. Check host2's network settings, the network cables, the private cluster interconnect, and host2's RAC settings.
Quote:
Originally Posted by mardaff
Hi,
Can you please mention the version of Oracle RAC, the patchset, the number of nodes, and the number of instances running on each node?
If this doesn't fix your problem... then please reply with the info I asked for. I faced the same problem a few months ago and now it is ... gone.
If you have seen this issue before, then please tell us how to proceed (other than setting the node up again, or reinstalling OCFS2 or the RAC prerequisite RPMs and reconfiguring the node).
Sometimes the version information can be unimportant.
I have a three-node Oracle RAC cluster on RHEL4u7. A user is using sftp to move files, and the process works well on two of the three nodes. The remaining node takes forever. I have gone through the /etc/sysconfig/network-scripts/ifcfg files and everything seems to be configured correctly.