AIX 6.1 Power6 - Sys CPU utilization twice that of User

01-25-2010

Hello,

We just purchased two new 4-way, 5GHz Power6 servers (one active, one failover) with 64GB RAM (32GB per node) running AIX 6.1, with two LPARs per node, connected to our SAN with two 4Gb HBAs. The PROD LPAR has 2 dedicated CPUs (4 virtual) and the TEST LPAR has 2 dedicated CPUs.

When we started parallel testing to move our production application to this server, I noticed that it didn't seem to be performing as fast as I thought it should compared to our existing server.

Our existing server is an 8-way, 1.6GHz Power5 with 32GB RAM (16GB per node) connected to our SAN with two 2Gb HBAs. We have 5 physical CPUs dedicated to the PROD LPAR and two dedicated to the TEST LPAR.

I started by running the common performance monitoring tools during our parallel testing, such as vmstat and mpstat. For some reason, the System/OS is using about twice as much CPU as the User Processes. Everything I've ever seen or been told about UNIX administration says that the System should not use more CPU than the User Processes. If it does, the OS needs to be better tuned for the application it's running, or there is some kind of bottleneck somewhere (CPU, I/O, Network).

So, the vendor (Not IBM) that installed the servers for us has not been able to explain or correct this after numerous changes to the filesystem, kernel settings, I/O buffers, etc.

vmstat does not show any obvious bottlenecks, other than that the OS seems to be using way too much CPU compared to the User Processes. r and b are less than the number of CPUs for the most part. wa is very low. pi/po are zero.

Here is a sample of the VMSTAT output during a test which represented about 20% of our production transaction volume going through the new server.

Code:

/>vmstat -w 5
 
System configuration: lcpu=8 mem=24576MB
 
kthr     memory             page            faults         cpu
r  b     avm     fre re pi po fr sr cy  in    sy   cs us sy id wa
1  0 3340754 1940960  0  0  0  0  0  0  65 21903 1207  2  3 95  0
2  0 3340770 1940918  0  0  0  0  0  0 260 43677 1654  3  6 90  1
2  0 3340885 1940771  0  0  0  0  0  0 125 37038 1601  3  8 89  0
1  0 3340742 1940897  0  0  0  0  0  0  75 24788 1290  2  5 93  0
1  0 3340699 1940913  0  0  0  0  0  0  99 38021 1375  2  6 92  0
1  0 3340685 1940898  0  0  0  0  0  0  97 34672 1424  2  5 93  0
1  0 3340673 1940881  0  0  0  0  0  0 137 23928 1640  3  8 89  0
1  0 3340634 1940881  0  0  0  0  0  0 135 39418 1615  3  6 91  0
1  0 3341393 1940054  0  0  0  0  0  0 166 26856 1749  4  7 88  0
1  0 3341378 1940035  0  0  0  0  0  0 106 35104 1301  2  5 93  0
1  0 3341381 1940008  0  0  0  0  0  0  73 36011 1171  2  3 95  0
1  0 3341407 1939948  0  0  0  0  0  0 101 23827 1330  2  5 93  0
1  0 3341377 1939933  0  0  0  0  0  0 143 33983 1638  3  7 90  0
0  0 3341394 1939876  0  0  0  0  0  0 249 38386 1634  3  6 90  0

As we put more load on the machine, I thought that this might even out, but it didn't. Below is a VMSTAT from a test that represented about 200% of our production volume being processed by the new server.

Code:

System configuration: lcpu=8 mem=24576MB
 
kthr     memory             page            faults         cpu
r  b     avm     fre re pi po fr sr cy  in    sy   cs us sy id wa
1  1 2323028 3038088  0  0  0  0  0  0 731 53814 7149 18 45 37  1
3  0 2324120 3036759  0  0  0  0  0  0 825 54887 7107 20 43 36  1
3  0 2324346 3036422  0  0  0  0  0  0 758 45717 5610 16 41 42  2
2  1 2324357 3036295  0  0  0  0  0  0 932 52869 7709 17 46 36  1
2  0 2324395 3036165  0  0  0  0  0  0 774 46603 5759 16 42 42  1
2  0 2323100 3037244  0  0  0  0  0  0 893 52706 7509 17 45 37  2
4  0 2324297 3035931  0  0  0  0  0  0 737 45806 5381 15 38 46  1
3  0 2324751 3035377  0  0  0  0  0  0 773 53345 7091 18 46 35  1
3  0 2324801 3035185  0  0  0  0  0  0 773 52399 7071 17 43 39  1
2  0 2325211 3034652  0  0  0  0  0  0 615 46806 5469 17 41 42  1
2  1 2325890 3033848  0  0  0  0  0  0 757 50556 6565 21 43 35  1
2  0 2324992 3034627  0  0  0  0  0  0 712 51243 7530 13 41 45  1
3  0 2325939 3033444  0  0  0  0  0  0 655 46586 5832 17 39 42  1
3  1 2325297 3033969  0  0  8  0  0  0 659 52255 6002 19 42 38  1
3  0 2325296 3033879  0  0  0  0  0  0 705 51447 6256 18 45 36  1
4  0 2326345 3032446  0  0  0  0  0  0 566 58858 9930 13 43 44  1
4  0 2326502 3032220  0  0  0  0  0  0 371 39132 3743 10 37 53  0
3  1 2329518 3029111  0  0  0  0  0  0 595 55473 6341 22 45 33  1

Is this normal? Am I just wrong about what normal CPU utilization should be in an AIX LPAR environment?
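
To put a number on the sy-versus-us imbalance rather than eyeballing it, the vmstat rows can be run through a short script. This is only a minimal sketch: the function names are made up for illustration, it assumes the 17-column layout shown in the output above, and the sample holds just the first three rows of the 200% test.

```python
# Hypothetical helper names; assumes the 17-column vmstat layout shown above.

def cpu_columns(vmstat_text):
    """Yield (us, sy, id, wa) for every all-numeric 17-field row."""
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Header lines contain non-digits and are skipped here.
        if len(fields) == 17 and all(f.isdigit() for f in fields):
            us, sy, idle, wa = (int(f) for f in fields[-4:])
            yield us, sy, idle, wa


def sys_to_user_ratio(vmstat_text):
    """Return total sy divided by total us across all data rows."""
    rows = list(cpu_columns(vmstat_text))
    total_us = sum(row[0] for row in rows)
    total_sy = sum(row[1] for row in rows)
    if total_us == 0:
        raise ValueError("no user CPU time found in the input")
    return total_sy / total_us


# First three data rows of the 200% test, copied from the output above.
SAMPLE = """\
kthr memory page faults cpu
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 1 2323028 3038088 0 0 0 0 0 0 731 53814 7149 18 45 37 1
3 0 2324120 3036759 0 0 0 0 0 0 825 54887 7107 20 43 36 1
3 0 2324346 3036422 0 0 0 0 0 0 758 45717 5610 16 41 42 2
"""

if __name__ == "__main__":
    print(f"sy/us ratio: {sys_to_user_ratio(SAMPLE):.2f}")
```

Feeding it the full capture (for example, vmstat output saved to a file) gives a single ratio that can be compared before and after each tuning change.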

Thanks so much!
Troy

troym72

01-25-2010

What type of application are you using? If this is, e.g., a Sybase DB, I'd say cut the number of engines in half, since they are too idle and are spinning CPU because they're entirely bored... In addition, CPUs more than twice as fast doesn't mean that your apps run twice as fast. Rather, it means you can run twice as many apps in the same time.

Show us the output of vmstat -v too, please.

In addition, why don't you put all your CPUs into a pool and run your LPARs uncapped? This would make much better use of the resources you have and gives the system the chance to fold CPUs it's not using, which would give you a much clearer picture than this.

Kind regards
zxmaus

zxmaus

01-26-2010

The application is an interfacing application (Healthvision Cloverleaf) that receives healthcare HL7 transactions via TCP/IP from various applications and routes them to the appropriate destination application(s). From the time the transaction is received until it is sent out of the interface engine, it could be translated (via Tcl programs) several times. Translation consists of transaction reformatting, field reformatting, table maps, transaction filtering logic and other types of data massaging.

While being routed and translated, the transactions are stored temporarily in a Raima database (Healthvision's 3rd-party Db agreement) for disaster recovery purposes. If something dies or is stopped, the undelivered messages are read from the database and the engine continues where it left off. There are 15 points at which the transactions are saved to the database during their journey from the source to the destination.

So, the application is pretty I/O intensive. Each transaction is between 1 KB and 2 KB and it's written to the Raima Db at least 15 times. Our production environment processes about 1.2 million of these transactions per day on average. We are expecting our volume of transactions to roughly double in the next four years (hence the new server).
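
As a rough back-of-envelope check of what those figures mean in database writes, here is a small sketch. The names are illustrative only, and it assumes 1 KB = 1024 bytes, exactly 15 writes per transaction (the stated minimum), and load spread evenly over 24 hours, which real traffic will not be.

```python
# Figures taken from the post; everything else is an assumption for illustration.
TRANSACTIONS_PER_DAY = 1_200_000
WRITES_PER_TRANSACTION = 15          # "at least 15", so this is a lower bound
MIN_BYTES = 1 * 1024                 # assumes 1 KB = 1024 bytes
MAX_BYTES = 2 * 1024
SECONDS_PER_DAY = 24 * 60 * 60


def write_load(transactions_per_day):
    """Return daily write count, average writes/sec, and daily byte range."""
    writes_per_day = transactions_per_day * WRITES_PER_TRANSACTION
    return {
        "writes_per_day": writes_per_day,
        # Average only; peak hours will be higher than this.
        "avg_writes_per_sec": writes_per_day / SECONDS_PER_DAY,
        "min_bytes_per_day": writes_per_day * MIN_BYTES,
        "max_bytes_per_day": writes_per_day * MAX_BYTES,
    }


if __name__ == "__main__":
    for label, volume in (("today", TRANSACTIONS_PER_DAY),
                          ("doubled", 2 * TRANSACTIONS_PER_DAY)):
        load = write_load(volume)
        print(label,
              f"{load['writes_per_day']:,} writes/day,",
              f"{load['avg_writes_per_sec']:.1f} writes/sec on average")
```

The point of the estimate is that the workload is many small synchronous-looking writes rather than large sequential I/O, which is the kind of pattern that tends to show up as system time.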

We changed our min and max to the values suggested by the vendor, Healthvision.

Thanks!!

Here is the output of the vmstat -v command:

Code:

/>uptime
  9:03am  up 12 days,  13:38,  3 users,  load average:  1.16, 1.07, 0.93
/>vmstat -v
              6291456 memory pages
              6080336 lruable pages
               225542 free pages
                    1 memory pools
              1007176 pinned pages
                 80.0 maxpin percentage
                  3.0 minperm percentage
                 90.0 maxperm percentage
                 48.1 numperm percentage
              2930676 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 48.1 numclient percentage
                 90.0 maxclient percentage
              2930676 client pages
                    0 remote pageouts scheduled
                   13 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2484 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                28690 external pager filesystem I/Os blocked with no fsbuf
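
The lines worth watching in that output are the "blocked with no ..." counters. A minimal sketch for pulling out the non-zero ones follows; the function names are hypothetical and the parsing assumes the line format shown above. Since these counters accumulate from boot (12 days of uptime here), comparing two snapshots taken during a test run says more than the absolute values.

```python
import re

# Matches lines such as "2484 filesystem I/Os blocked with no fsbuf".
BLOCKED_LINE = re.compile(r"^\s*(\d+)\s+(.*blocked with no \w+)\s*$")


def blocked_io_counters(vmstat_v_text):
    """Map each 'blocked with no ...' description to its counter value."""
    counters = {}
    for line in vmstat_v_text.splitlines():
        match = BLOCKED_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def nonzero_blocked(vmstat_v_text):
    """Keep only the counters that have recorded at least one blocked I/O."""
    return {name: count
            for name, count in blocked_io_counters(vmstat_v_text).items()
            if count > 0}


# Excerpt copied from the vmstat -v output above.
SAMPLE = """\
                 80.0 maxpin percentage
                    0 remote pageouts scheduled
                   13 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2484 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                28690 external pager filesystem I/Os blocked with no fsbuf
"""

if __name__ == "__main__":
    for name, count in nonzero_blocked(SAMPLE).items():
        print(f"{count:>8}  {name}")
```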

---------- Post updated at 09:09 AM ---------- Previous update was at 09:05 AM ----------

I'm not sure if our Admins will allow us to run our CPU in a pool, with LPARs uncapped.

I think the concern is that resource-intensive processing on the TEST node might steal too many resources from the PROD node. We do, on occasion, perform large production transaction re-sends from our TEST node. We have to do this when a destination application didn't process the transactions correctly for an extended period of time, or when the application was down for an extended period and we couldn't allow the transactions to queue in our engine that long.

However, I will discuss this option with admins and let you know their feedback.

Thanks again.

troym72

01-26-2010

That sy is higher than us just means the kernel has much more to do. My guess is that this is simply how the software is written. To check in detail what is going on CPU-wise, maybe have a look at tprof:

AIX 5.2 performance tools update, Part 3

I have no experience with tprof myself, but maybe you can get something worthwhile out of analysing its output.

You could also try again with SMT enabled and disabled (smtctl -m on|off) and check for different behaviour, depending on whether the application has lots of processes or is written multithreaded (check with svmon -P | grep -p Pid). Whether SMT is working fine, i.e. dispatching works smoothly, can be checked with "mpstat -s 1" (see System p education).

Checking how the work is distributed on the different (logical/virtual) CPUs can be done with sar -P ALL 1 9999 for example.

This one might be interesting for you too:
http://www.ibm.com/developerworks/wi...len+CPU+cycles

zaxxon
