#1
Vmstat fault section all values are 0
Hi all,

Recently I have been facing a problem with my AIX server: we are experiencing slow performance. Several applications are installed on this server, such as an Oracle 10g database, a Control-M client agent, and some monitoring tools. While the problem was occurring we noticed that the vmstat values looked a bit strange. Below is the output:

Code:
$ vmstat 5

System configuration: lcpu=12 mem=53248MB

kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b     avm   fre  re  pi  po  fr  sr  cy  in  sy  cs us sy id wa
 1  0 5370602 22908   0   0   0   0   0   0   0   0   0  0  0 99  0
 1  0 5370600 22910   0   0   0   0   0   0   0   0   0  3  0 97  0
 1  0 5370594 22916   0   0   0   0   0   0   0   0   0  1  0 99  0
 1  0 5370598 22911   0   0   0   0   0   0   0   0   0  0  0 99  0
 1  0 5370596 22912   0   0   0   0   0   0   0   0   0  3  0 97  0
 0  0 5370596 22912   0   0   0   0   0   0   0   0   0  0  0 99  0
 0  0 5370593 22915   0   0   0   0   0   0   0   0   0  0  0 99  0
 0  1 5370582 22926   0   0   0   0   0   0   0   0   0  3  1 95  1
 0  0 5370589 22919   0   0   0   0   0   0   0   0   0  1  0 98  0
 1  0 5370587 22921   0   0   0   0   0   0   0   0   0  2  0 98  0
 1  0 5370586 22922   0   0   0   0   0   0   0   0   0  3  0 97  0
 0  1 5370578 22298   0   0   0   0   0   0   0   0   0  1  0 97  2
 0  0 5370579 22297   0   0   0   0   0   0   0   0   0  0  0 99  0
 2  0 5370576 22299   0   0   0   0   0   0   0   0   0  3  0 97  0
 1  0 5370573 22288   0   0   0   0   0   0   0   0   0  0  0 99  0
 1  0 5370569 22150   0   0   0   0   0   0   0   0   0  1  0 98  1
 1  0 5370570 22149   0   0   0   0   0   0   0   0   0  1  0 99  0
 1  0 5370568 22151   0   0   0   0   0   0   0   0   0  2  0 98  0
 0  0 5370536 22183   0   0   0   0   0   0   0   0   0  0  0 99  0

All values in the faults section (in, sy, cs) are 0, and my application starts to generate <defunct> processes until the maximum number of processes per user is reached. This forces us to reboot the server to resolve the issue.

Code:
oracle 6426974 1 5 0:00 <defunct>
oracle 6431072 1 5 0:00 <defunct>
oracle 6435170 1 4 0:00 <defunct>
oracle 6439268 1 4 0:00 <defunct>
oracle 6443366 1 4 0:00 <defunct>
oracle 6447464 1 4 0:00 <defunct>
oracle 6451562 1 4 0:00 <defunct>
oracle 6455660 1 4 0:00 <defunct>
oracle 6459758 1 4 0:00 <defunct>
oracle 6463856 1 4 0:00 <defunct>
oracle 6467954 1 5 0:00 <defunct>
oracle 6472052 1 5 0:00 <defunct>
oracle 6476150 1 4 0:00 <defunct>
oracle 6480248 1 4 0:00 <defunct>
oracle 6484346 1 4 0:00 <defunct>
oracle 6488444 1 4 0:00 <defunct>
oracle 6492542 1 4 0:00 <defunct>
oracle 6496640 1 5 0:00 <defunct>

My questions are:
1. Is it normal for all values in the faults section of vmstat to be 0?
2. What could possibly have caused this issue?
3. Is there any log I could check?

I would appreciate any help, because I'm a newbie in AIX. Thanks in advance.

Last edited by zaxxon; 02-15-2013 at 06:59 AM. Reason: code tags, see PM
#2
The many zeros in vmstat are okay.
I assume your second sample is from the ps command. <defunct> with PPID=1 is bad; it looks like a fault in the kernel. Watch out for a kernel patch! I am a Unix expert, not an AIX expert. I notice that the 2nd column has PIDs > 99999, which is quite high. Maybe too high?
#3
High PIDs on up-to-date AIX systems are OK. I remember it was (or could have been) a problem on AIX 4.3.3? 5.2?, but it is too long ago to say for sure, sorry.
I have not yet seen so many zeros in vmstat output. It does not have to mean much, but it looks strange to me. The <defunct> processes are really not a good sign. They are zombies, failures of the program. You say that you have so many processes that the maximum number of processes per user is hit. How do you know this? Do you have your box tuned according to the recommendations for Oracle, such as setting at least maxuproc=4096? If it is at the default value, it will most probably be too low. That could be related to the Oracle error messages you get. Here is a discussion about it: https://forums.oracle.com/forums/thr...sageID=3445541
You will also find it in the setup/tuning recommendations for Oracle on AIX.
Besides all that, you should also have a look at the entries in the AIX Error Report (errpt).
Last edited by zaxxon; 02-15-2013 at 08:06 AM. Reason: phrasing
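To answer the "how do you know this?" question with numbers, one could count processes per user and compare the total with maxuproc. The following is only a minimal sketch: the function name count_by_user is made up for illustration, and it assumes ps -ef style input with the user name in column 1 and the PID in column 2.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): summarise ps -ef style text.
# Reads process lines on stdin and prints "user total defunct" per user.
count_by_user() {
    awk '
        $2 ~ /^[0-9]+$/ {                 # skip the header: PID column must be numeric
            total[$1]++
            if ($0 ~ /<defunct>/) defunct[$1]++
        }
        END {
            for (u in total) print u, total[u], defunct[u] + 0
        }
    ' | sort
}
```

On the box itself this would be fed live data with "ps -ef | count_by_user", and the total for the oracle user compared with the limit reported by "lsattr -El sys0 -a maxuproc".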
#4
IMHO, this is not normal behavior. My first guess would be that a program has been restored, or a patch applied, and libC and/or another shared library is not correct. If I were on site and could look at other things I would recommend many other things - but for now, to remove many, many variables in a short amount of time, AND to know whether it is spurious or continuous, I would look at performing a reboot.

BUT!!! The other common causes of libraries going bad while programs keep running, because the libraries are cached in memory, are either a disk gone bad (e.g. rootvg), so programs "run" but are inaccurate because they cannot read from or write to disk (e.g., a partition can run for hours even though its rootvg is missing because a VIOS is offline by accident), or someone having done "rm -rf /..." by accident, so files are removed but still open (shared libraries) and programs can still run "some".

Program to check: errpt

Code:
errpt | head

re: PID values. The long PID values imply that the 64-bit kernel is active, so larger PID and TID values are normal.

Code:
errpt -a | more

Code:
errpt -c

If you think the system will survive a reboot, and you can get a window to perform it, it is a serious option. But be careful: if your disk is bad and you cannot (re)boot, you must decide beforehand what is worse, no availability or degraded integrity.

---------- Post updated at 04:25 PM ---------- Previous update was at 04:22 PM ----------

re: PID values. The 7-digit values imply that a 64-bit kernel is active.
#5
OK, let us go over your provided outputs.
Quote:
Second: if you look at the columns with the run queue and the blocked queue (leftmost, "r" and "b") you see occasional 1s in the blocked column. This is not a problem in and of itself, but one starts to wonder where it comes from. Nonzero entries in "b" mean that there is a process ready to run which cannot, because some outside factor prevents it. Usually this is a side effect of paging (the process waits until its memory is paged in again), but this is not the case here.

Third: now we inspect the rightmost part of the output, which shows how the processors are used. "us" (time spent in user space) and "sy" (time spent in system space) are near 0, so the system does next to nothing. But "wa" (wait) is occasionally nonzero, and this corresponds to the blocked entries. It means that a process, otherwise ready to run, is waiting for I/O. So it looks like the machine is slightly I/O-bound. This could come from:
- disks (or SAN, whatever) posing a bottleneck
- a slow network over which data are transferred
- another I/O path (serial line, whatever) being the culprit

Now to the zombie problem: when a process ends, it sets an exit code. If you run a system command at the shell level and query its return status ($?), you are in fact querying the exit code of the program. When a program calls another program (a "fork"), it usually does so in a way that lets it get this exit code when the child process terminates. As long as the exit code has not been queried by the parent process, the entry in the process table remains.

Now it sometimes happens that a parent process terminates (voluntarily or involuntarily) before it can reap its children. Such orphans are handed over to init (PID 1), which is supposed to reap them; if that does not happen, as the PPID of 1 in your listing suggests, they remain zombies because nobody queries their exit code. The programs themselves are long gone from memory, but the entries in the process table still exist and will sometimes do so until the next reboot. It is difficult to remove them.
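The step of a parent querying its child's exit code can be tried from any POSIX shell. This is only an illustration of the principle, not what Oracle or any other program on the box actually does: the shell's wait built-in plays the role that wait() or waitpid() plays in a C program.

```shell
#!/bin/sh
# Start a child in the background that exits with status 3.
sh -c 'exit 3' &
child=$!

# wait hands the exit status of the child to the parent shell.
# A C program would call wait() or waitpid() here; a parent that
# never does so leaves its exited child behind as <defunct>.
wait "$child"
status=$?

echo "child exited with status $status"
```

A parent that forks many children and never performs this step accumulates one process-table entry per exited child, which is how a per-user process limit can eventually be hit.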
If your program creates such zombies on a regular basis, then this is a case of very sloppy programming. I suggest beating your programmer over the head with the print version of the AIX Programmers Reference until he understands basic UNIX programming concepts.

I hope this helps.

bakunin
#6
Quote:
kthr - threads
Code:
kthr            memory                         page                       faults           cpu
----------- --------------------- ------------------------------------ ------------------ -----------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa
  1   0   0     728783    1538788     0     0     0     0     0      0    15    202   163  0  0 99  0

What I suggest you use is one of the two following commands - perhaps 5 second intervals as a start, and move up as you get more insight.

Code:
# vmstat -I -w -P ALL 5 2
System configuration: mem=672MB
pgsz memory page
----- -------------------------- ------------------------------------
siz avm fre fi fo pi po fr sr
4K 109408 146232 2199 0 0 0 0 0 0
64K 3914 3855 100 0 0 0 0 0 0
4K 109408 146232 2199 0 0 0 0 0 0
64K 3914 3855 100 0 0 0 0 0 0

Code:
# vmstat -I -w -p ALL 5 2
System configuration: lcpu=4 mem=672MB ent=0.20
kthr memory page faults cpu
----------- --------------------- ------------------------------------ ------------------ -----------------------
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa pc ec
0 0 0 207914 3782 22832 3704 3189 15404 31644 119281 2 84 188 1 1 98 0 0.01 2.5
psz avm fre fi fo pi po fr sr siz
4K 146234 2182 0 0 0 0 0 0 109408
64K 3855 100 0 0 0 0 0 0 3914
kthr memory page faults cpu
----------- --------------------- ------------------------------------ ------------------ -----------------------
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa pc ec
0 0 0 207914 3782 22742 3690 3176 15343 31518 118806 3 43 193 0 1 99 0 0.00 1.7
psz avm fre fi fo pi po fr sr siz
4K 146234 2182 0 0 0 0 0 0 109408
64K 3855 100 0 0 0 0 0 0 3914

Note the arguments -w for wide output and -I for file activity (the fi and fo columns).

Last edited by MichaelFelt; 02-18-2013 at 10:29 AM.
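When collecting longer vmstat runs it can help to count how many samples show the all-zero faults columns from the first post. A minimal sketch, assuming the header line starts with "r b" as in the outputs in this thread; the function name zero_fault_samples is made up for illustration. It looks up the position of the "in" column from the header, so it should cope with both the default layout and the -I layout.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): count vmstat samples whose
# faults columns (in, sy, cs) are all zero. Reads vmstat output on stdin.
zero_fault_samples() {
    awk '
        # Header line: locate the "in" column for the layout in use.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) if ($i == "in") { col = i; break }
            next
        }
        # Data line: first field is numeric and a header has been seen.
        col && $1 ~ /^[0-9]+$/ {
            samples++
            if ($col == 0 && $(col + 1) == 0 && $(col + 2) == 0) zero++
        }
        END { printf "%d of %d samples have in/sy/cs all zero\n", zero, samples }
    '
}
```

It could be used as "vmstat -I -w 5 12 | zero_fault_samples". The per-page-size lines printed by -p ALL start with 4K or 64K and are skipped, because their first field is not a plain number.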
#7
As I recall, vmstat needs some help to see all your disks and such. The default set may be a subset.
Zombies are more specifically caused when the parent is not honoring SIGCHLD, so the notification at the bitter end of the child's life cannot be passed. The rcp/rsh family was famous for this. I guess paranoid programmers block signals rather than accept one of the default handlers. Interactive shells can have a sort of zombie when background processes stop for terminal I/O or termination notification. Check out the PPID and any shared tty of the zombies to see if there is a pattern to them. They take up a process slot but do not have a lot of overhead, so do not get OCD about them when you have bigger fish to fry to fix your slow system.

I have seen systems crawl for desperate lack of swap space, but with all those zeros, swap seems out of the picture. Check, though! Is this Oracle slowness or shell slowness?

Last edited by DGPickett; 02-20-2013 at 12:40 PM.
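Looking for a pattern in the PPIDs can be done by grouping the <defunct> lines by their third column. A minimal sketch, assuming ps -ef style lines as posted at the top of the thread (user, PID, PPID, ...); the function name defunct_by_ppid is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): group <defunct> entries by PPID.
# Reads ps -ef style lines on stdin and prints "ppid count", sorted by PPID.
defunct_by_ppid() {
    awk '
        /<defunct>/ { n[$3]++ }               # column 3 is the parent PID
        END { for (p in n) print p, n[p] }
    ' | sort -n
}
```

Used as "ps -ef | defunct_by_ppid". If every zombie reports PPID 1, as in the listing in the first post, the original parents are already gone; a single other PPID with a large count would point at one parent that is not collecting its children.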
Tags: aix 5.3, defunct processes, faults, vmstat