Vmstat fault section all values are 0


 
# 1  
Old 02-15-2013

Hi all,

Recently I have been facing a problem with my AIX server: we are experiencing slow performance. Several applications are installed on this server, such as an Oracle 10g database, the Control-M client agent, and some monitoring tools.

When the problem occurs, we notice that the vmstat values look a bit strange.
Below is the output:

Code:
$ vmstat 5

System configuration: lcpu=12 mem=53248MB

kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
1  0 5370602 22908   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370600 22910   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370594 22916   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370598 22911   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370596 22912   0   0   0   0    0   0   0    0   0  3  0 97  0
0  0 5370596 22912   0   0   0   0    0   0   0    0   0  0  0 99  0
0  0 5370593 22915   0   0   0   0    0   0   0    0   0  0  0 99  0
0  1 5370582 22926   0   0   0   0    0   0   0    0   0  3  1 95  1
0  0 5370589 22919   0   0   0   0    0   0   0    0   0  1  0 98  0
1  0 5370587 22921   0   0   0   0    0   0   0    0   0  2  0 98  0
1  0 5370586 22922   0   0   0   0    0   0   0    0   0  3  0 97  0
0  1 5370578 22298   0   0   0   0    0   0   0    0   0  1  0 97  2
0  0 5370579 22297   0   0   0   0    0   0   0    0   0  0  0 99  0
2  0 5370576 22299   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370573 22288   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370569 22150   0   0   0   0    0   0   0    0   0  1  0 98  1
1  0 5370570 22149   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370568 22151   0   0   0   0    0   0   0    0   0  2  0 98  0
0  0 5370536 22183   0   0   0   0    0   0   0    0   0  0  0 99  0

All values in the faults section (in, sy, cs) are 0.

My application then starts to generate <defunct> processes until the maximum number of processes per user is reached, which forces us to reboot the server to resolve the issue.

Code:
oracle 6426974       1   5                  0:00 <defunct>
  oracle 6431072       1   5                  0:00 <defunct>
  oracle 6435170       1   4                  0:00 <defunct>
  oracle 6439268       1   4                  0:00 <defunct>
  oracle 6443366       1   4                  0:00 <defunct>
  oracle 6447464       1   4                  0:00 <defunct>
  oracle 6451562       1   4                  0:00 <defunct>
  oracle 6455660       1   4                  0:00 <defunct>
  oracle 6459758       1   4                  0:00 <defunct>
  oracle 6463856       1   4                  0:00 <defunct>
  oracle 6467954       1   5                  0:00 <defunct>
  oracle 6472052       1   5                  0:00 <defunct>
  oracle 6476150       1   4                  0:00 <defunct>
  oracle 6480248       1   4                  0:00 <defunct>
  oracle 6484346       1   4                  0:00 <defunct>
  oracle 6488444       1   4                  0:00 <defunct>
  oracle 6492542       1   4                  0:00 <defunct>
  oracle 6496640       1   5                  0:00 <defunct>

My questions are:
1. Is it normal for all values in the faults section of vmstat to be 0?
2. What could have caused this issue?
3. Is there any log I could check?

I would appreciate any help, because I'm a newbie in AIX.
Thanks in advance.

Moderator's Comments:
Mod Comment Use code tags, thanks.

Last edited by zaxxon; 02-15-2013 at 07:59 AM.. Reason: code tags, see PM
# 2  
Old 02-15-2013
The many zeros in vmstat are okay.
I assume your second sample is from ps command.
<defunct> with PPID=1 is bad; it looks like a fault in the kernel.
Watch out for a kernel patch!
I am a Unix expert, not an AIX expert. I wonder why the 2nd column has PIDs > 99999; this is quite high. Maybe too high?
# 3  
Old 02-15-2013
High PIDs on up-to-date AIX systems are ok. I remember it was a problem (or could have been) on AIX 4.3.3? 5.2?, but it's too long ago to say for sure, sorry.

I have not yet seen so many zeros in vmstat output. It may not mean much, but it looks strange to me.

The <defunct> processes are really not a good sign. They are zombies, failures of the program.
You say that you have so many processes so that the maximum number of processes per user is hit. How do you know this?
Do you have your box tuned according to the recommendations for Oracle, such as setting at least maxuproc=4096? If it is still at the default value, it will most probably be too low. That could be related to the Oracle error messages you get.
Here is a discussion about it:
https://forums.oracle.com/forums/thr...sageID=3445541

But you will also find it in the setup/tuning recommendations for Oracle on AIX.

Besides all that, you should also have a look at the entries in the AIX Error Report (errpt).
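To check how close a user is to maxuproc, a rough sketch like the one below could help. The function name `procs_near_limit` and the 80% default threshold are made up for illustration, and it assumes your ps accepts `-o user=` to print one user name per process without a header.

```shell
# Hypothetical helper: read one user name per line on stdin and print the
# users whose process count is at or above PERCENT of LIMIT.
# Usage: ps -e -o user= | procs_near_limit LIMIT [PERCENT]
procs_near_limit() {
    pnl_limit=$1
    pnl_pct=${2:-80}          # assumed default threshold, not an AIX value
    sort | uniq -c | awk -v limit="$pnl_limit" -v pct="$pnl_pct" '
        # uniq -c prints "<count> <user>"
        $1 * 100 >= limit * pct { printf "%s %d/%d\n", $2, $1, limit }
    '
}
```

The current limit can be read with `lsattr -El sys0 -a maxuproc` and passed in by hand, for example `ps -e -o user= | procs_near_limit 128`.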

Last edited by zaxxon; 02-15-2013 at 09:06 AM.. Reason: phrasing
# 4  
Old 02-15-2013
IMHO, this is not normal behavior. My first guess would be that a program has been restored, or a patch applied, and the libC and/or other shared library is not correct.

If I were on site and could look at other things, I would recommend many other things. But for now, to remove many variables in a short amount of time, and to find out whether the problem is spurious or continuous, I would look at performing a reboot.

BUT!!! The other common cause of libraries going bad while they are still cached in memory is either a disk gone bad (e.g. rootvg), so programs "run" but are inaccurate because they cannot read from or write to disk (e.g. a partition can run for hours even though its rootvg is missing because the VIOS is offline by accident), or that someone has done "rm -rf /..." by accident. In that case files are removed but still open (shared libraries), so programs can still run "some".

Program to check: errpt

Code:
errpt | head

re: PID values. The long PID values imply that the 64-bit kernel is active so larger PID and TID values are normal

Code:
errpt -a | more

Code:
errpt -c

If you think the system will survive a reboot, and you can get a window to perform it - it is a serious option. But be careful - if your disk is bad and you cannot (re)boot you must decide beforehand what is worse: no availability or degraded integrity.

---------- Post updated at 04:25 PM ---------- Previous update was at 04:22 PM ----------

re: PID values. The 7-digit values imply that a 64-bit kernel is active.
# 5  
Old 02-16-2013
OK, let us go over your provided outputs.

Quote:
Originally Posted by Arief Winanto
Code:
$ vmstat 5

System configuration: lcpu=12 mem=53248MB

kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
1  0 5370602 22908   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370600 22910   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370594 22916   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370598 22911   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370596 22912   0   0   0   0    0   0   0    0   0  3  0 97  0
0  0 5370596 22912   0   0   0   0    0   0   0    0   0  0  0 99  0
0  0 5370593 22915   0   0   0   0    0   0   0    0   0  0  0 99  0
0  1 5370582 22926   0   0   0   0    0   0   0    0   0  3  1 95  1
0  0 5370589 22919   0   0   0   0    0   0   0    0   0  1  0 98  0
1  0 5370587 22921   0   0   0   0    0   0   0    0   0  2  0 98  0
1  0 5370586 22922   0   0   0   0    0   0   0    0   0  3  0 97  0
0  1 5370578 22298   0   0   0   0    0   0   0    0   0  1  0 97  2
0  0 5370579 22297   0   0   0   0    0   0   0    0   0  0  0 99  0
2  0 5370576 22299   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370573 22288   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370569 22150   0   0   0   0    0   0   0    0   0  1  0 98  1
1  0 5370570 22149   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370568 22151   0   0   0   0    0   0   0    0   0  2  0 98  0
0  0 5370536 22183   0   0   0   0    0   0   0    0   0  0  0 99  0

First: all the paging-related columns (re, pi, po, fr, sr, cy, in, sy) being 0 means that the machine has plenty of memory compared to what it needs. Its kernel doesn't even bother to look for pages it could steal, so the machine really must have plenty. Not even the file cache seems to reach its saturation. Post the output of svmon -G and we could perhaps show you how much the machine needs and how much it really has in comparison.

Second: if you look at the columns with the run- and blocked-queue (leftmost, "r" and "b") you see occasional 1s in the blocked-column. This is not a problem in and of itself, but one starts to wonder where it comes from. Nonzero entries in "b" mean that there is a process ready to run, which can't because of some outside factor prohibiting it. Usually this is a side effect of paging (the process waits until its memory is paged in again), but this is not the case here.

Third: now we inspect the rightmost part of the output, which shows how the processor(s) is used. "us" (process spends time in user space) and "sy" (process spends time in system space) are near 0, so the system does next to nothing. But "wa" (wait) is non-zero and this corresponds to the blocked-entries. It means that a process, otherwise ready to run, is waiting for I/O. So it looks like the machine is slightly I/O-bound. This could come from:

- disks (or SAN, whatever) pose a bottleneck
- network over which data are transferred is slow
- another I/O-path - serial line, whatever - is the culprit
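The reading of the "b" and "wa" columns above can be checked over a longer capture with a small awk sketch. The function name `blocked_summary` is made up, and it assumes the default vmstat layout in which the header row starts with "r b" and contains a "wa" column.

```shell
# Hypothetical helper: summarise saved vmstat output read from stdin.
# Counts data rows, rows with a non-zero blocked queue (b), and rows
# with non-zero I/O wait (wa).
blocked_summary() {
    awk '
        $1 == "r" && $2 == "b" {              # header row: find "wa" by name
            wa = 0
            for (i = 1; i <= NF; i++) if ($i == "wa") wa = i
            ncol = NF
            next
        }
        wa && NF == ncol && $1 ~ /^[0-9]+$/ { # data row with the same width
            n++
            if ($2 > 0)  blocked++
            if ($wa > 0) waiting++
        }
        END { printf "samples=%d blocked=%d iowait=%d\n", n, blocked, waiting }
    '
}
```

Something like `vmstat 5 120 | blocked_summary` would then show how often the blocked queue and I/O wait were non-zero during ten minutes.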

Now to the Zombie-problem: when a process ends, it sets an exit code. If you run a system command at the shell level and query the errorlevel you query in fact the exit code of the program. When a program now calls another program (a "fork") it usually does so in a way that it gets this exit code upon termination of the child process. As long as the exit code is not queried by the parent process the entry in the process table remains.

Now it happens sometimes that a parent process terminates (voluntarily or involuntarily) before it can reap its children. These child processes are then inherited by init (PID 1), which is supposed to reap them. If that does not happen they stay zombies, which is why yours show PPID 1. The programs themselves are long gone from memory, but the entries in the process table still exist, sometimes until the next reboot. They cannot be killed, because there is no process left to kill.
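To see which parent the zombies belong to, a sketch like this groups them by owner and PPID. The function name `defunct_by_parent` is made up, and it assumes `ps -ef` style input where the user is the first column, the PPID the third, and zombies are marked `<defunct>`, as in the listing posted above.

```shell
# Hypothetical helper: read "ps -ef" output on stdin and count <defunct>
# entries per owner and parent PID.
defunct_by_parent() {
    awk '
        /<defunct>/ { count[$1 " ppid=" $3]++ }
        END { for (k in count) printf "%s %d\n", k, count[k] }
    ' | sort
}
```

Run as `ps -ef | defunct_by_parent`. If every line shows the same PPID, that parent (or init, for PPID 1) is the one not reaping its children.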

If your program creates such zombies on a regular basis then this is a case of very sloppy programming. I suggest beating your programmer with the print version of the AIX Programmers Reference on the head until he understands basic UNIX programming concepts.

I hope this helps.

bakunin
# 6  
Old 02-18-2013
Quote:
the paging-related columns (re, pi, po, fr, sr, cy, in, sy)
Actually, vmstat data is in 5 sections:
kthr - threads
  • r - running
  • b - blocked (by something).
memory
  • avm - active virtual memory
  • fre - free frames in system memory
page
  • fi/fo (pages in/out of file system space - to/from file memory)
  • pi/po (pages in/out of paging space - to/from working memory)
  • fr/sr: frames freed/scanned (searched)
  • cy (not in -w output) - clock cycles used by the page stealer
faults
  • in - hardware interrupts
  • sy - system calls
  • cs - context switches
cpu
  • us/sy - user/system time BUSY
  • id/wa - IDLE nothing to do/waiting for io to finish before switch to busy
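Since "sy" appears twice in the header (system calls under faults, system time under cpu), picking columns by position is error-prone. The sketch below locates the faults section by the first "in" in the header row and prints the three columns from there. The function name `faults_columns` is made up, and it assumes the header row starts with "r b" as in the outputs in this thread.

```shell
# Hypothetical helper: print the faults columns (in, sy, cs) of each data
# row of vmstat output read from stdin, locating them via the header row.
faults_columns() {
    awk '
        $1 == "r" && $2 == "b" {               # header row
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "in") { col = i; break }
            ncol = NF
            next
        }
        col && NF == ncol && $1 ~ /^[0-9]+$/ { # data row with the same width
            printf "in=%s sy=%s cs=%s\n", $col, $(col + 1), $(col + 2)
        }
    '
}
```

It should work for both the default layout and the -I -w layout, for example `vmstat -I -w 5 3 | faults_columns`.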


Code:
   kthr            memory                         page                       faults           cpu    
----------- --------------------- ------------------------------------ ------------------ -----------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa
  1   0   0     728783    1538788     0     0     0     0     0      0    15    202   163  0  0 99  0

What I suggest you use is one of the two following commands - perhaps 5 second intervals as a start, and move up as you get more insight.
Code:
# vmstat -I -w -P ALL 5 2

System configuration: mem=672MB

pgsz            memory                           page                 
----- -------------------------- ------------------------------------ 
           siz      avm      fre    fi    fo    pi    po    fr     sr 
   4K   109408   146232     2199     0     0     0     0     0      0 
  64K     3914     3855      100     0     0     0     0     0      0 

   4K   109408   146232     2199     0     0     0     0     0      0 
  64K     3914     3855      100     0     0     0     0     0      0

Code:
# vmstat -I -w -p ALL 5 2

System configuration: lcpu=4 mem=672MB ent=0.20

   kthr            memory                         page                       faults                 cpu          
----------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec
  0   0   0     207914       3782 22832  3704  3189 15404 31644 119281     2     84   188  1  1 98  0  0.01   2.5

        psz        avm        fre    fi    fo    pi    po    fr     sr     siz
         4K     146234       2182     0     0     0     0     0      0  109408 
        64K       3855        100     0     0     0     0     0      0    3914 

   kthr            memory                         page                       faults                 cpu          
----------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec
  0   0   0     207914       3782 22742  3690  3176 15343 31518 118806     3     43   193  0  1 99  0  0.00   1.7

        psz        avm        fre    fi    fo    pi    po    fr     sr     siz
         4K     146234       2182     0     0     0     0     0      0  109408 
        64K       3855        100     0     0     0     0     0      0    3914

Note the argument -w for wide output, and -I for file I/O activity (the fi/fo columns).

Last edited by MichaelFelt; 02-18-2013 at 11:29 AM..
# 7  
Old 02-18-2013
As I recall, vmstat needs some help to see all your disks and such. The default set may be a subset.

Zombies are more specifically caused when the parent is not honoring SIGCHLD, so the notification at the bitter end of child life cannot be passed. The rcp/rsh family was famous for this. I guess paranoid programmers block signals rather than accept one of the default handlers. Interactive shells can have a sort of zombie when background processes stop for terminal I/O or termination notification. Check the PPID and any shared tty processes of the zombies to see if there is a pattern to them. They take up a process slot but do not have a lot of overhead, so do not get OCD about them when you have bigger fish to fry to fix your slow system.

I have seen systems crawl for desperate lack of swap space, but with all those zeros, swap seems out of the picture. Check, though!

Is this Oracle slowness or shell slowness?
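For the swap check, `lsps -s` gives a one-line summary of paging space on AIX. The sketch below pulls out the percentage so it can be logged or compared. The function name `paging_pct` is made up, and it assumes the usual two-line output with a header and a "size percent%" row.

```shell
# Hypothetical helper: read "lsps -s" output on stdin and print the
# percent-used figure without the trailing % sign.
paging_pct() {
    awk 'NR == 2 { sub(/%$/, "", $2); print $2 }'
}
```

Used as `lsps -s | paging_pct`. A value that keeps climbing between samples would put swap back in the picture.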

Last edited by DGPickett; 02-20-2013 at 01:40 PM..