System rebooted itself but errpt did not catch


 
# 8  
Old 05-28-2016
Quote:
Originally Posted by filosophizer
Code:
D221BD55   0522083816 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
EC0BCCD4   0522083816 T H ent1           ETHERNET DOWN
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
54E8A127   0522083516 T H sissas0        DEVICE OR MEDIA ERROR

Actually, this is something you should investigate: there are disk errors noted prior to the boot (at 8:35) and two network interfaces going down (at 8:38). If the system is part of an HACMP cluster it might well be that HACMP did it (the so-called "dead-man switch") to preserve cluster integrity.
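
To line the entries up against the reboot, it helps to decode the errpt timestamp column, which is printed as MMDDhhmmYY. A minimal Python sketch, using the four lines quoted above as sample input; the function name parse_errpt_line is made up for illustration, and the two-digit year is assumed to mean 20YY:

```python
from datetime import datetime

def parse_errpt_line(line):
    """Split one errpt summary line into columns and decode its timestamp."""
    ident, stamp, etype, eclass, resource, description = line.split(None, 5)
    # errpt prints the timestamp as MMDDhhmmYY.
    month, day = int(stamp[0:2]), int(stamp[2:4])
    hour, minute = int(stamp[4:6]), int(stamp[6:8])
    year = 2000 + int(stamp[8:10])  # assumption: two-digit year is 20YY
    return {
        "id": ident,
        "when": datetime(year, month, day, hour, minute),
        "type": etype,
        "class": eclass,
        "resource": resource,
        "description": description.strip(),
    }

sample = """\
D221BD55   0522083816 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
EC0BCCD4   0522083816 T H ent1           ETHERNET DOWN
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
54E8A127   0522083516 T H sissas0        DEVICE OR MEDIA ERROR
"""

# errpt lists the newest entry first; sort into chronological order.
events = sorted((parse_errpt_line(l) for l in sample.splitlines()),
                key=lambda e: e["when"])
for e in events:
    print(e["when"], e["resource"], e["description"])
```

Sorted this way, the sissas0 error at 08:35 comes first and the entries stamped 08:38 follow it.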

For reasons similar to what has been said about RAC (which is a "cluster" on application level) HACMP doesn't take "no" for an answer when downing systems either. halt -q is perhaps the most gentle and longest drawn-out method it employs.

I hope this helps.

bakunin
# 9  
Old 05-29-2016
bakunin, thanks for the reply.

It is not part of an HACMP cluster, only RAC.

hdisk3 is a free (unused) disk which has some bad sectors, but that should not cause the server to reboot.

I suspect that a user did the reboot and removed the entry from the error report, so I am searching the logs.

Last edited by filosophizer; 05-29-2016 at 07:04 AM..
# 10  
Old 05-29-2016
So your node is called "prd"? How many nodes are in the RAC cluster? Or is it a one-node cluster? Did you check the logs for this node on the other nodes? I doubt you will find any information on the node itself; you must check the logs on the other nodes, the nodes that did not go down...
# 11  
Old 05-30-2016
Scrutinizer,

This is a 2-node RAC (version 10). You mentioned that I should check the logs on the other node, which is called DR.

Which logs should I check, or what else can I look into?


Below is what I thought of posting; any suggestions would be appreciated.

Code:
root@DR /etc/oracle/oprocd> ls -ltra
-rw-r--r--    1 root     system          175 Mar 17 00:09 dr.oprocd.log.2016-03-17-00:19:23
-rw-r--r--    1 root     system          175 Mar 17 00:19 dr.oprocd.log.2016-03-21-23:36:51
-rw-r--r--    1 root     system          175 Mar 21 23:36 dr.oprocd.log.2016-03-22-00:32:19
-rw-r--r--    1 root     system          863 Apr 03 09:45 dr.oprocd.log.2016-04-03-09:46:36
-rw-r--r--    1 root     system          304 Apr 24 14:50 dr.oprocd.log.2016-04-24-15:21:08
-rw-r--r--    1 root     system          175 Apr 24 15:21 dr.oprocd.log.2016-04-24-16:15:52
-rw-r--r--    1 root     system          304 Apr 25 11:55 dr.oprocd.log.2016-04-25-11:56:46
drwxrwx---    2 root     system          256 Apr 25 11:56 stop
-rw-r--r--    1 root     system          175 Apr 25 11:56 dr.oprocd.log
-rwxr--r--    1 root     system          512 Apr 25 11:56 dr.oprocd.lgl
drwxrwx---    2 root     system          256 Apr 25 11:56 fatal
drwxrwx---    2 root     system          256 Apr 25 11:56 check
drwxrwxr-x    5 root     system        12288 Apr 25 11:56 .


root@DR /etc/oracle/oprocd> errpt | more

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
B6267342   0529235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0529150016 P O dumpcheck      The largest dump device is too small.
B6267342   0528235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0528150016 P O dumpcheck      The largest dump device is too small.
B6267342   0527235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0527150016 P O dumpcheck      The largest dump device is too small.
B6267342   0526235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0526150016 P O dumpcheck      The largest dump device is too small.
B6267342   0525235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0525150016 P O dumpcheck      The largest dump device is too small.
B6267342   0524235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0524150016 P O dumpcheck      The largest dump device is too small.
B6267342   0523235516 P H hdisk32        DISK OPERATION ERROR
E87EF1BE   0523150016 P O dumpcheck      The largest dump device is too small.
B6267342   0523132516 P H hdisk32        DISK OPERATION ERROR
B6267342   0523114216 P H hdisk32        DISK OPERATION ERROR
B6267342   0523113916 P H hdisk32        DISK OPERATION ERROR
26623394   0523113616 T H fscsi2         COMMUNICATION PROTOCOL ERROR
26623394   0523113616 T H fscsi2         COMMUNICATION PROTOCOL ERROR
F3931284   0523112616 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112616 T H ent2           ETHERNET DOWN
F3931284   0523112316 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112316 T H ent2           ETHERNET DOWN
F3931284   0523112016 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112016 T H ent2           ETHERNET DOWN
F3931284   0523112016 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0523112016 T H ent2           ETHERNET DOWN
54E8A127   0523101216 T H sissas0        DEVICE OR MEDIA ERROR
E87EF1BE   0522150016 P O dumpcheck      The largest dump device is too small.
F3931284   0522083816 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
8650BE3F   0522083816 I H ent9           ETHERCHANNEL RECOVERY
F3931284   0522083816 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0522083816 P H ent9           ETHERCHANNEL FAILOVER
EC0BCCD4   0522083816 T H ent0           ETHERNET DOWN
F3931284   0522083416 I H ent2           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0522083416 T H ent2           ETHERNET DOWN
F3931284   0522083416 I H ent0           ETHERNET NETWORK RECOVERY MODE
EC0BCCD4   0522083416 T H ent0           ETHERNET DOWN
F3931284   0522083316 I H ent0           ETHERNET NETWORK RECOVERY MODE
F3931284   0522083316 I H ent2           ETHERNET NETWORK RECOVERY MODE
E87EF1BE   0521150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0520150016 P O dumpcheck      The largest dump device is too small.
B50A3F81   0520104416 P H ent9           TOTAL ETHERCHANNEL FAILURE
EC0BCCD4   0520104416 T H ent2           ETHERNET DOWN
EC0BCCD4   0520104416 T H ent0           ETHERNET DOWN
F3931284   0520101316 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0520101316 P H ent9           ETHERCHANNEL FAILOVER
F3931284   0520101316 I H ent2           ETHERNET NETWORK RECOVERY MODE
1788894A   0520101316 P H ent9           ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4   0520101316 T H ent0           ETHERNET DOWN
8650BE3F   0520101316 I H ent9           ETHERCHANNEL RECOVERY
EC0BCCD4   0520101316 T H ent2           ETHERNET DOWN
F3931284   0520101316 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0520101316 P H ent9           ETHERCHANNEL FAILOVER
F3931284   0520101316 I H ent2           ETHERNET NETWORK RECOVERY MODE
1788894A   0520101316 P H ent9           ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4   0520101316 T H ent0           ETHERNET DOWN
8650BE3F   0520101316 I H ent9           ETHERCHANNEL RECOVERY
EC0BCCD4   0520101316 T H ent2           ETHERNET DOWN
F3931284   0520101016 I H ent0           ETHERNET NETWORK RECOVERY MODE
59224136   0520101016 P H ent9           ETHERCHANNEL FAILOVER
F3931284   0520101016 I H ent2           ETHERNET NETWORK RECOVERY MODE
1788894A   0520100516 P H ent9           ETHERCHANNEL CANNOT FAIL OVER
EC0BCCD4   0520100516 T H ent0           ETHERNET DOWN
EC0BCCD4   0520100516 T H ent2           ETHERNET DOWN

root@DR /etc/oracle/oprocd> uptime
  08:48AM   up 35 days,  16:33,  3 users,  load average: 3.73, 3.59, 3.58

root@DR /etc/oracle/oprocd> last reboot
reboot    ~                                   Apr 24 16:14
reboot    ~                                   Apr 24 15:20


Last edited by filosophizer; 05-30-2016 at 02:40 PM..
# 12  
Old 05-30-2016
Read the first answer from this thread:

Changing host Time backwards - two node RAC | Oracle Community

According to what you've pasted from the Oracle logs, your clock was 2 hours ahead and was then changed somehow, or at least Oracle thought it was set back by 2 hours. Such a situation, even if the clock goes back by only 1 second, leads to a server reboot by Oracle Clusterware. Even on AIX.
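
To illustrate the idea (this is not Oracle's code, just a sketch of how a watchdog of this kind could work): the process sleeps for a fixed interval and then checks, using the wall clock, that it woke up when it expected to. The function name, the interval and the margin below are made-up values for illustration.

```python
def classify_wakeup(wall_before, wall_after, interval, margin):
    """Compare wall-clock time across a sleep of `interval` seconds.

    Returns "ok" if the elapsed wall-clock time matches the interval
    within `margin`, "late" if the process apparently overslept, and
    "early" if the wall clock advanced less than expected (for example
    because it was set backwards during the sleep).
    """
    drift = (wall_after - wall_before) - interval
    if drift > margin:
        return "late"
    if drift < -margin:
        return "early"
    return "ok"

INTERVAL = 1.0  # hypothetical sleep interval in seconds
MARGIN = 0.5    # hypothetical tolerance in seconds

print(classify_wakeup(1000.0, 1001.02, INTERVAL, MARGIN))        # normal wake-up
print(classify_wakeup(1000.0, 1003.0, INTERVAL, MARGIN))         # overslept
print(classify_wakeup(10000.0, 10001.0 - 7200, INTERVAL, MARGIN))  # clock set back 2 hours
```

A watchdog built like this cannot tell a changed clock from a real scheduling problem, because both show up as a mismatch between expected and observed time. A check based on a monotonic clock (time.monotonic() in Python) would not be affected by someone changing the date.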
# 13  
Old 05-30-2016
Thank you, agent.kgb.
Very interesting post; I checked the link which you posted.

Quote:
if you set the clock backwards while clusterware is running, the monitoring processes of clusterware processes fail, since they "think" they detected a scheduling problem of the root processes.
This will lead to a node reboot.

Hence stop clusterware before changing the clock.
It appears, in my case, that someone played with the time, intentionally, otherwise how would this happen? RAC has been running for 3 years and this never happened before; now it suddenly happens, and with a 2-hour difference at that.

Now I will have to find out whether the time was changed on AIX by a user.
# 14  
Old 05-30-2016
Quote:
Originally Posted by filosophizer
It appears, in my case, that someone played with the time, intentionally, otherwise how would this happen?
I suppose the right term for it is unintentionally - duh! Well, joking aside, it might well be that someone not aware of that (actually I wouldn't have known that either, kudos to agent.kgb!) has changed the system time and caused the reboot.

Usually, if you use NTP, the time adjustment is set to "slew", which means that if the time is off it is adjusted slowly, either by stretching out the ticking of the clock or by compressing it, so that the right time is eventually reached but no "jumps" occur. If your NTP setting is different, it might have corrected an off time immediately and this way caused the reboot. Look at /etc/ntp.conf to see what your NTP client is set to.
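
To get a feeling for how slow slewing is, here is a rough calculation in Python. It assumes a maximum slew rate of 500 ppm (0.5 ms of correction per second of real time), which is the limit documented for the reference ntpd; the rate on your AIX system may differ, and the function name is made up for this example.

```python
MAX_SLEW_PPM = 500  # assumption: limit documented for the reference ntpd

def slew_duration_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Seconds of real time needed to absorb an offset by slewing only."""
    return abs(offset_seconds) * 1_000_000 / slew_ppm

one_second = slew_duration_seconds(1)
two_hours = slew_duration_seconds(2 * 3600)

print(one_second)                   # 2000.0 seconds, about 33 minutes
print(round(two_hours / 86400, 1))  # 166.7 days
```

At that rate a 2-hour offset could not be slewed away in any reasonable time, which suggests that a 2-hour change would have been a step (or a manual date change) rather than a slew.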

Here is some additional information about configuring NTP.

I hope this helps.

bakunin