System rebooted itself but errpt did not catch


 
# 1  
Old 05-24-2016

Hi,

I would like to know how to find out why a system rebooted: whether it rebooted by itself, or whether a user rebooted it and then removed the entries from errpt and the history.

We will dedicate this thread to finding out HOW and WHO.

Last edited by filosophizer; 05-27-2016 at 06:41 PM..
# 2  
Old 05-24-2016
Post output of the commands:

Code:
uptime
last reboot
who -b
errpt | head -20

# 3  
Old 05-24-2016
Could a power outage not explain the lack of logging in the matter?
# 4  
Old 05-24-2016
No, Scott.
The system should have enough power in reserve to log an EPOW (Emergency Power Off Warning) before it goes down, even in a total power outage.
If the disk containing the error log file is already inaccessible, the system should still have enough power in reserve to log the error to the firmware NVRAM, and it will be written out to the AIX error log when power is restored.
You'll see the usual "error logging turned on" message at boot time, and then an error with a slightly earlier timestamp, written out after the error daemon is up and running.
AIX systems have been designed this way for many years.

---------- Post updated at 12:10 AM ---------- Previous update was at 12:06 AM ----------

filosophizer,
If the errpt entry was removed, there will be a jump in the sequence numbers.
That said, AIX often drops one even on a reboot; I never understood, or looked into, why.
What does the error report show?
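
A minimal sketch of the sequence-number check, assuming the detailed report (errpt -a) prints one line of the form "Sequence Number: <n>" per entry. The function name seq_gaps is made up for illustration. Since AIX may skip a number at reboot anyway, a gap of one is not proof of tampering; larger gaps deserve a closer look.

```shell
#!/bin/sh
# Report gaps in error-log sequence numbers.
# Reads `errpt -a` style text on stdin. Assumption: every entry carries
# a line of the form "Sequence Number: <n>".
seq_gaps() {
    # Pull out the numbers, sort them ascending, then compare neighbours.
    awk -F': *' '/^Sequence Number:/ { print $2 + 0 }' |
    sort -n |
    awk 'NR > 1 && $1 > prev + 1 {
             printf "gap: %d..%d missing\n", prev + 1, $1 - 1
         }
         { prev = $1 }'
}

# On the AIX host (not executed here):
#   errpt -a | seq_gaps
```

No output means the sequence numbers found in the log are contiguous.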
# 5  
Old 05-27-2016
Thanks for the reply.

Here is the output of the commands

Code:
root@PRD /> uptime
  10:53PM   up 5 days,  14:15,  2 users,  load average: 4.93, 4.46, 3.76

root@PRD /> errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E87EF1BE   0527150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0526150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0525150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0524150016 P O dumpcheck      The largest dump device is too small.
E87EF1BE   0523150016 P O dumpcheck      The largest dump device is too small.
26623394   0523104416 T H fscsi2         COMMUNICATION PROTOCOL ERROR
26623394   0523104316 T H fscsi2         COMMUNICATION PROTOCOL ERROR
26623394   0523104316 T H fscsi2         COMMUNICATION PROTOCOL ERROR
B6267342   0522235516 P H hdisk3         DISK OPERATION ERROR
E87EF1BE   0522150016 P O dumpcheck      The largest dump device is too small.
A6DF45AA   0522083916 I O RMCdaemon      The daemon is started.
D221BD55   0522083816 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
EC0BCCD4   0522083816 T H ent1           ETHERNET DOWN
EC0BCCD4   0522083816 T H ent2           ETHERNET DOWN
54E8A127   0522083516 T H sissas0        DEVICE OR MEDIA ERROR
B6267342   0522083516 P H hdisk3         DISK OPERATION ERROR
B6267342   0519220616 P H hdisk3         DISK OPERATION ERROR
root@PRD />

root@PRD /> last reboot
reboot    ~                                   May 22 08:38
reboot    ~                                   May 01 09:37
reboot    ~                                   Apr 21 17:01
reboot    ~                                   Apr 18 19:02
reboot    ~                                   Apr 18 16:50
reboot    ~                                   Apr 18 16:16
reboot    ~                                   Mar 16 06:25

root@PRD /> who -b
   .        system boot May 22 08:38




IBM How to Investigate a System Reboot - United States
IBM Server running AIX with Oracle RAC reboots itself - United States
The /var/adm/wtmp account file
This binary file is used to store various types of login information. One type of information stored in this file is user login records. These records document the user name and time of login. Pseudo user names are used for shutdown and reboot. So when a system is shut down using one of the shut down commands, a record with the user name shutdown will be logged into the wtmp file. Similarly when a system is booted, a record with the user name reboot will be written into the wtmp file. Some shut down commands have flags that can be used to suppress login records in the wtmp file.

Note: Technically a reboot is a warm boot but the pseudo user name reboot is written into the wtmp file for both warm boots and cold boots.
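
Based on the description above, a rough way to use these pseudo-user records is to check whether each reboot record is directly preceded by a shutdown record. The sketch below assumes the input is plain output of the last command (newest record first) and that shutdown records show up with the user name shutdown in the first column. The function name classify_reboots is made up for illustration. A missing shutdown record is only a hint, not proof of a crash, because some shutdown commands can suppress the record.

```shell
#!/bin/sh
# Flag reboot records that have no shutdown record directly before them.
# Reads `last` output on stdin, newest record first (as `last` prints it).
classify_reboots() {
    awk '
        pending != "" {
            # The line after a reboot record is the next-older wtmp record.
            status = ($1 == "shutdown") ? "preceded by shutdown record" \
                                        : "no shutdown record"
            print pending ": " status
            pending = ""
        }
        # Remember the date of a reboot record, e.g. "May 22 08:38".
        $1 == "reboot" { pending = $3 " " $4 " " $5 }
        END { if (pending != "") print pending ": no older record available" }
    '
}

# On the AIX host (not executed here):
#   last | classify_reboots
```

Feed it the full last output, not last reboot, otherwise the shutdown records are filtered out before the check can see them.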

Here is the output of the last command, which reads that file:
Code:

---------------------------

root      pts/2        opmanager       May 22 08:54 - 08:54  (00:00)
root      pts/1        dbapc                  May 22 08:49 - 08:55  (00:05)
root      pts/1        opmanager       May 22 08:49 - 08:49  (00:00)
oracle    ftp          dbapc                  May 22 08:45 - 08:55  (00:10)
oracle    ftp          dbapc                  May 22 08:45 - 08:45  (00:00)
oracle    pts/0        dbapc                  May 22 08:39 - 10:11  (01:32)
reboot    ~                                   May 22 08:38 
root      pts/0        opmanager       May 20 10:00 - 10:00  (00:00)
root      pts/0        opmanager       May 20 09:54 - 09:55  (00:00)
root      pts/0        opmanager       May 20 09:49 - 09:49  (00:00)

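I see that right after the reboot, user oracle logged in from the PC dbapc. It is possible that Mr. DBAPC was smart enough to do a reboot and then remove the entry from the error log with errclear (errpt itself only displays entries; its -J flag filters by error label):
Code:
 errclear -l SEQUENCE_NUMBER 0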

The task is to find out:
1- Who did the reboot (if it was user initiated), or whether it was an abnormal shutdown (I don't think so, since there was no power outage)

2- Whether RAC did the reboot. RAC could have done it, but then again errpt should have captured this

Last edited by filosophizer; 05-27-2016 at 06:42 PM..
# 6  
Old 05-28-2016
If your system is part of a RAC cluster, then it may be that it was the cluster's decision to evict that particular node. Evicting a node is done as quickly as possible to preserve the integrity of the database. This means: immediate shutdown of the host/node, without logging a shutdown record in /var/adm/wtmp, not calling sync to flush file buffers and not sending processes a SIGTERM.

If your node has been evicted, there should be information about this in the RAC cluster logs on the remaining nodes of the cluster...

Last edited by Scrutinizer; 05-28-2016 at 05:14 AM..
# 7  
Old 05-28-2016
Scrutinizer,

I am checking the RAC logs:

Code:
root@PRD /etc/oracle/oprocd> ls -ltra

-rw-r--r--    1 root     system          304 May 22 10:05 prd.oprocd.log.2016-05-22-10:10:59
drwxrwx---    2 root     system          256 May 22 10:11 stop
-rw-r--r--    1 root     system          175 May 22 10:11 prd.oprocd.log
-rwxr--r--    1 root     system          512 May 22 10:11 prd.oprocd.lgl
drwxrwx---    2 root     system          256 May 22 10:11 fatal
drwxrwx---    2 root     system          256 May 22 10:11 check


root@PRD /etc/oracle/oprocd> cat prd.oprocd.log
May 22 10:11:00.148 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
May 22 10:11:00.194 | INF | fatal mode startup, setting process to fatal mode

root@PRD /etc/oracle/oprocd>
root@PRD /etc/oracle/oprocd>
root@PRD /etc/oracle/oprocd> cat prd.oprocd.log.2016-05-22-10:10:59
May 22 08:39:39.546 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
May 22 08:39:39.628 | INF | fatal mode startup, setting process to fatal mode
May 22 10:05:26.457 | INF | shutting down from client request
May 22 10:05:26.457 | INF | exiting current process in NORMAL mode


But shouldn't the timestamps in the RAC logs be from before the reboot? They are all from after it.

According to the AIX logs, the reboot happened at 08:38, whereas the current RAC log starts at 10:11.

That is a gap of roughly an hour and a half.

Am I reading this correctly?