Vmstat fault section all values are 0 | Unix Linux Forums | AIX

Tags
aix 5.3, defunct processes, faults, vmstat

#1  02-15-2013  Arief Winanto

Hi all,

Recently I have been facing a problem with my AIX server: performance is slow. Several applications are installed on this server, including an Oracle 10g database, the Control-M client agent, and some monitoring tools.

While the problem was occurring, we noticed that the vmstat values looked a bit strange.
Below is the output:


Code:
$ vmstat 5

System configuration: lcpu=12 mem=53248MB

kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
1  0 5370602 22908   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370600 22910   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370594 22916   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370598 22911   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370596 22912   0   0   0   0    0   0   0    0   0  3  0 97  0
0  0 5370596 22912   0   0   0   0    0   0   0    0   0  0  0 99  0
0  0 5370593 22915   0   0   0   0    0   0   0    0   0  0  0 99  0
0  1 5370582 22926   0   0   0   0    0   0   0    0   0  3  1 95  1
0  0 5370589 22919   0   0   0   0    0   0   0    0   0  1  0 98  0
1  0 5370587 22921   0   0   0   0    0   0   0    0   0  2  0 98  0
1  0 5370586 22922   0   0   0   0    0   0   0    0   0  3  0 97  0
0  1 5370578 22298   0   0   0   0    0   0   0    0   0  1  0 97  2
0  0 5370579 22297   0   0   0   0    0   0   0    0   0  0  0 99  0
2  0 5370576 22299   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370573 22288   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370569 22150   0   0   0   0    0   0   0    0   0  1  0 98  1
1  0 5370570 22149   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370568 22151   0   0   0   0    0   0   0    0   0  2  0 98  0
0  0 5370536 22183   0   0   0   0    0   0   0    0   0  0  0 99  0

All values in the faults section (in, sy, cs) are 0.

My application also starts to generate <defunct> processes until the maximum number of processes for the user is reached, which forces us to reboot the server to resolve the issue.


Code:
  oracle 6426974       1   5                  0:00 <defunct>
  oracle 6431072       1   5                  0:00 <defunct>
  oracle 6435170       1   4                  0:00 <defunct>
  oracle 6439268       1   4                  0:00 <defunct>
  oracle 6443366       1   4                  0:00 <defunct>
  oracle 6447464       1   4                  0:00 <defunct>
  oracle 6451562       1   4                  0:00 <defunct>
  oracle 6455660       1   4                  0:00 <defunct>
  oracle 6459758       1   4                  0:00 <defunct>
  oracle 6463856       1   4                  0:00 <defunct>
  oracle 6467954       1   5                  0:00 <defunct>
  oracle 6472052       1   5                  0:00 <defunct>
  oracle 6476150       1   4                  0:00 <defunct>
  oracle 6480248       1   4                  0:00 <defunct>
  oracle 6484346       1   4                  0:00 <defunct>
  oracle 6488444       1   4                  0:00 <defunct>
  oracle 6492542       1   4                  0:00 <defunct>
  oracle 6496640       1   5                  0:00 <defunct>

My questions are:
1. Is it normal for the faults section in vmstat to show 0 for all values?
2. What could have caused this issue?
3. Is there any log I could check?

I would appreciate any help, because I'm a newbie in AIX.
Thanks in advance.

#2  02-15-2013  MadeInGermany (Forum Advisor)
The many zeros in vmstat are okay.
I assume your second sample is from the ps command.
<defunct> with PPID=1 is bad; it looks like a fault in the kernel.
Watch out for a kernel patch!
I am a Unix expert, not an AIX expert. I notice that the 2nd column shows PIDs >99999, which is quite high. Maybe too high?
#3  02-15-2013  zaxxon (Forum Staff)
High PIDs on up-to-date AIX systems are OK. I remember it was a problem (or could have been) on AIX 4.3.3? 5.2?, but it was too long ago to say for sure, sorry.

I have not seen so many zeros in vmstat output before. It need not mean much, but it looks strange to me.

The <defunct> processes are really not a good sign. They are zombies, a sign that a program is failing to clean up after its child processes.
You say that you have so many processes that the maximum number of processes per user is hit. How do you know this?
Have you tuned your box according to the recommendations for Oracle, such as setting at least maxuproc=4096? If it is still at the default value, it is most probably too low. That could be related to the Oracle error messages you get.
Here is a discussion about it:
https://forums.oracle.com/forums/thr...sageID=3445541

You will also find it in the setup/tuning recommendations for Oracle on AIX.
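
A rough way to see how close a user is to the limit is sketched below. The lsattr and chdev lines in the comments are the usual AIX commands for the sys0 maxuproc attribute; the helper name uproc_headroom and the 90% warning threshold are made up for illustration only.

```shell
# On AIX, show and (as root) raise the per-user process limit:
#   lsattr -E -l sys0 -a maxuproc
#   chdev -l sys0 -a maxuproc=4096

# Hypothetical helper: compare a user's process count with a limit.
# Usage: uproc_headroom COUNT LIMIT
uproc_headroom() {
  count=$1 limit=$2
  # Warn once the user is within 10% of the limit (arbitrary threshold).
  if [ "$((count * 100))" -ge "$((limit * 90))" ]; then
    echo "WARN: $count of $limit processes in use"
  else
    echo "OK: $count of $limit processes in use"
  fi
}
```

COUNT would be the number of processes owned by the user (for example counted from ps -u oracle) and LIMIT the maxuproc value reported by lsattr.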

Besides all that, you should also have a look at the entries in the AIX Error Report (errpt).

#4  02-15-2013  MichaelFelt
IMHO, this is not normal behavior. My first guess would be that a program has been restored, or a patch applied, and libC and/or another shared library is not correct.

If I were on site and could look at other things, I would recommend many other things. For now, to remove a lot of variables in a short amount of time, AND to find out whether the problem is spurious or continuous, I would look at performing a reboot.

BUT!!! The other common cause of library problems, which can stay hidden because the libraries are cached in memory, is one of these:

- a disk has gone bad (e.g. rootvg), so programs "run" but are inaccurate because they cannot read from or write to disk (e.g., a partition can run for hours even though its rootvg is missing because VIOS went offline by accident)
- someone has done "rm -rf /..." by accident, so files are removed but still open (shared libraries) and programs can still run "some"

Program to check: errpt


Code:
errpt | head

re: PID values. The long PID values imply that the 64-bit kernel is active so larger PID and TID values are normal


Code:
errpt -a | more


Code:
 
errpt -c

If you think the system will survive a reboot, and you can get a window to perform it, it is a serious option. But be careful: if your disk is bad and you cannot (re)boot, you must decide beforehand which is worse: no availability or degraded integrity.

---------- Post updated at 04:25 PM ---------- Previous update was at 04:22 PM ----------

re: PID values. The 7-digit values imply that a 64-bit kernel is active.
#5  02-16-2013  bakunin (Forum Staff)
OK, let us go over your provided outputs.

Quote:
Originally Posted by Arief Winanto

Code:
$ vmstat 5

System configuration: lcpu=12 mem=53248MB

kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
1  0 5370602 22908   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370600 22910   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370594 22916   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370598 22911   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370596 22912   0   0   0   0    0   0   0    0   0  3  0 97  0
0  0 5370596 22912   0   0   0   0    0   0   0    0   0  0  0 99  0
0  0 5370593 22915   0   0   0   0    0   0   0    0   0  0  0 99  0
0  1 5370582 22926   0   0   0   0    0   0   0    0   0  3  1 95  1
0  0 5370589 22919   0   0   0   0    0   0   0    0   0  1  0 98  0
1  0 5370587 22921   0   0   0   0    0   0   0    0   0  2  0 98  0
1  0 5370586 22922   0   0   0   0    0   0   0    0   0  3  0 97  0
0  1 5370578 22298   0   0   0   0    0   0   0    0   0  1  0 97  2
0  0 5370579 22297   0   0   0   0    0   0   0    0   0  0  0 99  0
2  0 5370576 22299   0   0   0   0    0   0   0    0   0  3  0 97  0
1  0 5370573 22288   0   0   0   0    0   0   0    0   0  0  0 99  0
1  0 5370569 22150   0   0   0   0    0   0   0    0   0  1  0 98  1
1  0 5370570 22149   0   0   0   0    0   0   0    0   0  1  0 99  0
1  0 5370568 22151   0   0   0   0    0   0   0    0   0  2  0 98  0
0  0 5370536 22183   0   0   0   0    0   0   0    0   0  0  0 99  0

First: all the paging-related columns (re, pi, po, fr, sr, cy, in, sy) being 0 means that the machine has a great deal of memory compared to what it needs. Its kernel doesn't even bother to look for pages it could steal, so the machine really must have plenty. Not even the file cache seems to reach its saturation. Post the output of svmon -G and we could perhaps show you how much the machine needs and how much it really has in comparison.

Second: if you look at the columns with the run queue and blocked queue (leftmost, "r" and "b") you see occasional 1s in the blocked column. This is not a problem in and of itself, but one starts to wonder where it comes from. Nonzero entries in "b" mean that there is a process ready to run, which can't because some outside factor prevents it. Usually this is a side effect of paging (the process waits until its memory is paged in again), but this is not the case here.

Third: now we inspect the rightmost part of the output, which shows how the processor(s) are used. "us" (time spent in user space) and "sy" (time spent in system space) are near 0, so the system does next to nothing. But "wa" (wait) is non-zero, and this corresponds to the entries in the blocked column. It means that a process, otherwise ready to run, is waiting for I/O. So it looks like the machine is slightly I/O-bound. This could come from:

- disks (or SAN, whatever) pose a bottleneck
- network over which data are transferred is slow
- another I/O-path - serial line, whatever - is the culprit
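
To put numbers on the "b" and "wa" observations, the samples can be summarized with a few lines of awk. This is only a sketch: io_wait_summary is a made-up name, and it assumes the header row starts with "r  b" and contains a "wa" column, as in the output quoted above.

```shell
# Reads `vmstat` output on stdin and prints how many samples had blocked
# threads (b > 0) and the average I/O wait percentage (wa).
io_wait_summary() {
  awk '
    # Header row: remember which column holds "wa".
    $1 == "r" && $2 == "b" {
      for (i = 1; i <= NF; i++) if ($i == "wa") wa = i
      next
    }
    # Data rows: lines starting with a number, seen after the header.
    wa && $1 ~ /^[0-9]+$/ {
      n++
      if ($2 > 0) blocked++
      sum += $wa
    }
    END {
      if (n) printf "samples=%d blocked=%d avg_wa=%.1f\n", n, blocked, sum / n
    }'
}
```

It would be fed with something like vmstat 5 12 | io_wait_summary. A steadily rising blocked count together with a rising avg_wa would point at one of the I/O paths listed above.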

Now to the zombie problem: when a process ends, it sets an exit code. If you run a system command at the shell level and query the errorlevel, you are in fact querying the exit code of the program. When a program starts another program (via a "fork"), it usually does so in a way that lets it get this exit code when the child process terminates. As long as the exit code has not been queried by the parent process, the entry in the process table remains.

Now it sometimes happens that a parent process terminates (voluntarily or involuntarily) before it can reap its children. The orphaned children are then handed over to init (PID 1), which is supposed to query their exit codes. As long as nobody does, they stay zombies: the programs themselves are long gone from memory, but the entries in the process table still exist, sometimes until the next reboot. They are difficult to remove.
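
A quick way to see who owns the zombies is to group them by parent PID. The following is only a sketch: zombies_by_parent is a made-up name, and it assumes ps -ef output in which the command shows as <defunct> and the PPID is the third column, as in the listing from post #1.

```shell
# Reads `ps -ef` output on stdin and prints "PPID COUNT" for every parent
# that owns <defunct> entries, highest count first.
zombies_by_parent() {
  awk '$NF == "<defunct>" { n[$3]++ }
       END { for (p in n) print p, n[p] }' | sort -k2,2nr
}
```

Run as ps -ef | zombies_by_parent. If everything is listed under PPID 1, the original parents are already gone; if another PID leads the list, that process is the one not collecting its children.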

If your program creates such zombies on a regular basis, then this is a case of very sloppy programming. I suggest beating your programmer on the head with the print version of the AIX Programmers Reference until he understands basic UNIX programming concepts.

I hope this helps.

bakunin
#6  02-18-2013  MichaelFelt
Quote:
the paging-related columns (re, pi, po, fr, sr, cy, in, sy)
Actually, vmstat data is in 5 sections:
kthr - kernel threads
  • r - runnable (running or waiting for a CPU)
  • b - blocked (by something)
memory
  • avm - active virtual memory pages
  • fre - free frames in system memory
page
  • fi/fo (pages in/out of file system space - to/from file memory)
  • pi/po (pages in/out of paging space - to/from working memory)
  • fr/sr: frames freed/scanned (searched)
  • cy (not in -w output) clock cycles used by the page stealer
faults
  • in - hardware interrupts
  • sy - system calls
  • cs - context switches
cpu
  • us/sy - user/system time BUSY
  • id/wa - IDLE: nothing to do / waiting for I/O to finish before switching to busy
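
To check the faults section mechanically rather than by eye, something like the following could be used. It is a sketch under assumptions: faults_all_zero is a made-up name, the header row is expected to start with "r  b", and sy and cs are expected to sit directly to the right of "in", as in both layouts shown in this thread.

```shell
# Reads `vmstat` output on stdin and reports whether the faults columns
# (in, sy, cs) are zero in every sample.
faults_all_zero() {
  awk '
    # Header row: locate the "in" column; sy and cs follow it directly.
    $1 == "r" && $2 == "b" {
      for (i = 1; i <= NF; i++) if ($i == "in") col = i
      next
    }
    # Data rows: lines starting with a number, seen after the header.
    col && $1 ~ /^[0-9]+$/ {
      rows++
      if ($col + $(col + 1) + $(col + 2) > 0) busy++
    }
    END {
      if (rows == 0)      print "no samples"
      else if (busy == 0) print "all zero"
      else                print "non-zero"
    }'
}
```

For example, vmstat 5 12 | faults_all_zero. On a system that is doing any work at all, interrupts, system calls and context switches should not all stay at zero for a whole minute.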



Code:
   kthr            memory                         page                       faults           cpu    
----------- --------------------- ------------------------------------ ------------------ -----------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa
  1   0   0     728783    1538788     0     0     0     0     0      0    15    202   163  0  0 99  0

What I suggest you use is one of the following two commands, with perhaps 5-second intervals as a start, and move up as you get more insight.

Code:
# vmstat -I -w -P ALL 5 2

System configuration: mem=672MB

pgsz            memory                           page                 
----- -------------------------- ------------------------------------ 
           siz      avm      fre    fi    fo    pi    po    fr     sr 
   4K   109408   146232     2199     0     0     0     0     0      0 
  64K     3914     3855      100     0     0     0     0     0      0 

   4K   109408   146232     2199     0     0     0     0     0      0 
  64K     3914     3855      100     0     0     0     0     0      0


Code:
# vmstat -I -w -p ALL 5 2

System configuration: lcpu=4 mem=672MB ent=0.20

   kthr            memory                         page                       faults                 cpu          
----------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec
  0   0   0     207914       3782 22832  3704  3189 15404 31644 119281     2     84   188  1  1 98  0  0.01   2.5

        psz        avm        fre    fi    fo    pi    po    fr     sr     siz
         4K     146234       2182     0     0     0     0     0      0  109408 
        64K       3855        100     0     0     0     0     0      0    3914 

   kthr            memory                         page                       faults                 cpu          
----------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec
  0   0   0     207914       3782 22742  3690  3176 15343 31518 118806     3     43   193  0  1 99  0  0.00   1.7

        psz        avm        fre    fi    fo    pi    po    fr     sr     siz
         4K     146234       2182     0     0     0     0     0      0  109408 
        64K       3855        100     0     0     0     0     0      0    3914

Note the argument -w for wide, and -I for file activity

#7  02-18-2013  DGPickett (Forum Advisor)
As I recall, vmstat needs some help to see all your disks and such. The default set may be a subset.

Zombies are more specifically caused when the parent is not honoring SIGCHLD, so the notification at the bitter end of the child's life cannot be passed on. The rcp/rsh family was famous for this. I guess paranoid programmers block signals rather than accept one of the default handlers. Interactive shells can have a sort of zombie when background processes stop for terminal I/O or termination notification. Check the PPID and any shared tty of the zombies to see if there is a pattern to them. They take up a process slot but do not have a lot of overhead, so do not get OCD about them when you have bigger fish to fry in fixing your slow system.

I have seen systems crawl for desperate lack of swap space, but with all those zeros, swap seems out of the picture. Check, though!

Is this Oracle slowness or shell slowness?
