I have a two-node POWER7 cluster - 16 CPUs, 32GB RAM, AIX 6.1. The production node has 10 physical CPUs (40 cores) and 24GB of RAM.
The cluster has been live for 6 weeks and I'm seeing some things on the production node that I think could be of concern but wanted to get some opinions.
The application running on the server is an interface engine that is very I/O-write intensive. The application does not have built-in I/O caching, so we are letting UNIX do the disk caching.
First, vmstat, topas, and nmon show the OS using far more CPU than the user processes - about five times more. Also, the harmad process has been growing in size: it has gone from 100MB to over 250MB in less than a month and is now the largest process on the server in terms of memory usage (see output below).
Any help/suggestions/opinions are appreciated!
Troy Morton
I have no idea why harmad has grown to 250MB. On the other hand, 250MB is not that big compared to the memory size of your machine. I'd watch it closely but wouldn't be too concerned for the moment.
Your layout is somewhat unusual insofar as the system has relatively many CPUs compared to the size of its main memory. As a rule of thumb, a modern processor core can efficiently serve programs fitting into roughly 1-1.5GB of memory. This of course says nothing definitive about your system; rules of thumb don't necessarily cover a specific case.
Looking at your vmstat output, your system is close to being memory-bound. The avm and fre columns are in 4KB memory pages, and fre shows roughly 160MB of memory to be free - not much, considering the overall size of the system. For further investigation, issue
(only as root) and look at the results (compare with this thread) - maybe there is a memory shortage on your system.
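For example, assuming the command meant is svmon -G:

    # Assumed command - global memory statistics, root only. Compare
    # the "size" and "inuse" page counts; a tiny free count together
    # with nonzero paging-space use points to a memory shortage.
    svmon -G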
Another thing to investigate would be the tuning of the system: analyse (and/or post) the output of
to find out which tuning parameters are in effect (see the sketch a few lines below). The values of the tunables are also stored in the files /etc/tunables/lastboot and /etc/tunables/nextboot. Another place to investigate is
which might tell you about I/O problems (see the number of "filesystem I/Os blocked with no fsbuf").
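For example, a minimal sketch assuming the commands meant are vmo -a (plus ioo -a) and vmstat -v:

    # Assumed commands. List all VMM tuning parameters currently in
    # effect; ioo -a does the same for the I/O tunables:
    vmo -a
    ioo -a

    # Cumulative VMM counters, including "filesystem I/Os blocked
    # with no fsbuf":
    vmstat -v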
Something else to worry about is the blocked queue (column "b"), which is non-zero. This correlates with some (light) paging activity and some wait (rightmost column, "wa"). Generally, a non-zero value in this column means some process could run but has to wait for a resource, usually for memory to become free.
As your "id" column is most times quite high you seem to have no CPU problems at all and your "iostat" output shows relatively low I/O bandwidth. I'd not aggregate over adapters but look at the disk statistics instead, you might want to look at the output of
to identify possible hotspots. On the other hand, the data you presented don't suggest any I/O problems at all.
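For example, assuming iostat's per-disk modes are what is meant:

    # Extended per-disk statistics in 5-second samples:
    iostat -D 5
    # or the classic per-disk report for selected disks:
    iostat -d hdisk0 hdisk1 5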
Another point is shared memory: "ps -o vsz" will tell you only about memory belonging to a single process, neglecting shared memory. You might want to issue a
and investigate possible shared memory pools which might consume large amounts of memory.
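For example, assuming ipcs is the tool meant here:

    # List shared memory segments with their sizes (SEGSZ column);
    # large segments here are memory that "ps -o vsz" does not show.
    ipcs -bm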
Well, the server is not REALLY memory-bound according to the output of the top command below. My application just does so much I/O that 8GB of the 24GB are allocated to I/O buffers by the OS. Are there parameters to limit the amount of memory used for I/O buffers?
The "lm" processes are database lock managers. The "hciengine" processes are the interface engines running which process transactions and route/tranform/send them to their destinations. The interface engines write a copy of the transaction to a Raima database about 15 times during the process of getting them where they are going, thus the large amount of write activity large amount memory used for I/O Buffers.
Here is the output from the commands suggested. I'm not familiar with most of these statistics/settings, so hopefully someone will be nice enough to explain. :-)
To me it looks like the number of I/Os blocked with no pbuf is not too high. This server has been running since May 16th. The one thing I'm not too sure about is the min/max tunables for pin, perm and client; I think we left these at the default settings when AIX was installed.
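For reference, the current values can be checked like this (if I understand correctly, maxperm% and maxclient% are also the knobs that cap how much memory AIX uses for file caching, which bears on my earlier question about limiting I/O buffers):

    # Show the pin/perm/client tunables; vmo -L adds the default
    # (DEF) column, so you can see whether they were ever changed:
    vmo -a | egrep "minperm%|maxperm%|maxclient%|maxpin%"
    vmo -L maxclient%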
The disk activity is not all that high MOST of the time. We do have some archiving that happens four times a day; however, even then I don't see much I/O wait going on. It does seem like hdisk0 and hdisk1 are the busiest. This server is attached to a SAN, so there are many virtual hdisks.
Again, I do not know how to read the output of this command, but here it is in case anyone can help.
Thanks for the help and suggestions!
I will need a bit of time to analyze the data since right now I'm facing some time constraints job-wise. In the meantime I will transfer the thread to the AIX board of the forum, because I think there are more AIX knowledgeables there than here and the problem (if there is one) is not really HPC-related.
Slightly off topic but I am missing something like
in the output of vmo. That could be one reason why you sometimes have I/O to paging space, which is not good and will most probably be noticeable to users.
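If the tunable I mean is lru_file_repage (it comes up again below), setting it persistently would look like this:

    # Assumed tunable. With lru_file_repage=0 the page replacement
    # algorithm steals file cache pages first instead of paging out
    # working storage; -p makes the change persist across reboots.
    vmo -p -o lru_file_repage=0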
Your vmstat output:
and vmstat -v:
Show a
instead please, ty.
For the I/O on the disks: I guess your rootvg is on hdisk0 and hdisk1 (lspv | grep rootvg)? And all the other hdisks are the SAN disks? If yes, the SAN disks do not seem very busy judging by the iostat output. Traffic on the rootvg disks should be avoided as it slows down the system. Maybe a filemon run can tell more (if it is not just the paging).
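A minimal filemon sketch, assuming a representative 60-second window (the window length and output path are arbitrary):

    # Trace logical and physical volume activity for about a minute:
    filemon -o /tmp/filemon.out -O lv,pv
    sleep 60
    trcstop                # stops the trace and writes the report
    # then check /tmp/filemon.out for the busiest LVs/PVs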
I don't have AIX 6 at hand, but I read that it now ships with basic tuning in the default installation, so I guess your lru_file_repage is already 0. The other reason I can guess for the system using paging space is simply that it has too little memory.
Maybe you have spare memory on this managed system, or can use CoD (Capacity on Demand), and could check for a few days how it behaves with 48GB instead of 24GB? I think that would be worth a try.
If this were my cluster, I would try to switch on async I/O (not sure why it is off on your box, as the default is on).
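Checking and enabling it differs by AIX level; roughly:

    # On AIX 6.1 the AIO extensions are loaded on demand and tuned
    # through ioo:
    ioo -a | grep aio
    # On AIX 5.3 and earlier the aio0 device had to be enabled:
    #   lsattr -El aio0
    #   chdev -l aio0 -a autoconfig=available -P   # effective after reboot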
I would like to see your output of vmstat -s
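The paging space counters are the lines I'd check first there:

    # Cumulative counters since boot; sustained growth of these two
    # confirms real memory pressure rather than a one-off event:
    vmstat -s | egrep "paging space page ins|paging space page outs"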
I completely agree with Bakunin that your memory is overcommitted. Your avm value is way too high and your free list way too low for a DB box. IBM recommends an avm value between 70 and 80% of real memory for a reason: DBs do their work in memory, not on disk. The smaller the free list, the more work the system has to do, as reads and writes between memory and disk become far more frequent than necessary. In an ideal world the memory holds the database, or at least the parts that are frequently accessed. That this is not the case on your system is shown by the very high system CPU usage levels.
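As a rough sanity check of that 70-80% guideline (a sketch assuming 24GB of real memory, i.e. 6291456 4KB pages, with avm in the third column of vmstat's sample line):

    # Print avm as a percentage of real memory (adjust the page
    # total to the machine's actual RAM):
    vmstat 1 1 | tail -1 | awk '{printf("avm: %d pages = %.1f%% of 24GB\n", $3, $3*100/6291456)}'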
... The sy column details the percentage of time the CPU was executing a process in system mode. ... reading or writing of a file requires kernel resources to open the file, seek a specific location, and read or write data, unless memory mapped files are used...
To me it looks like your system is artificially write-intensive, and that it would be much calmer if it had the memory it wants for smooth operation. If it were my box, I would add memory until your free list reaches at least the six-digit range (in pages); you will most likely see a significant drop in system CPU usage.
Was the vmstat output you showed us taken at a rather busy or a rather idle time?
I cannot see enough processes in your run queue to justify even 20 cores - why do you have 40? And how many of them are folded all day / how many are ever really used?
Kind regards
zxmaus