How to fix the CPU bound issues on AIX?

10-02-2013

Hi All,

Can you please answer my question?
I see a lot of CPU utilization on AIX LPARs. I am able to find the cause of the problem, but I do not know how to mitigate or fix it.

For instance:

I found the process which is consuming most of the CPU and informed the responsible team.
How exactly does this need to be fixed?

Issue:
A Java (WebSphere) JVM is consuming 96% CPU,
and on another server a database process is consuming 98% CPU.

thanks,

System Admin 77

10-02-2013

You need to collect more data and answer some basic questions. Is this really a problem? Is the system slow? How many processes/threads are in the run queue? How long has the process been running, how many threads does it have (dbx), and what are their states? What does vmstat reveal? How long has the system been up?

blackrageous

10-02-2013

Yes, this is a real problem. Many application users reported the slowness. It is slow.
There are 9 threads in the ABC database; only one is executing and the remaining threads are just waiting.

"How long has the process been running?"
I'm not sure about this. Please tell me how to check it.

I do not see any issues on the WebSphere servers now, but on the database server I see high utilization.
vmstat output:

Code:

System configuration: lcpu=10 mem=24576MB ent=1.00

 kthr          memory                         page                       faults                 cpu
------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa    pc    ec
  2   0    2462914      41549     0     0     0     0      0     0    74   7787  1433 95  3  2  0  0.99  99.3
  1   0    2462914      41547     0     0     0     0      0     0   115   9376  1860 94  3  3  0  0.98  98.3
  2   0    2462930      41530     0     0     0     0      0     0    98  11155  1549 95  3  1  0  1.00  99.5
  2   0    2462930      41530     0     0     0     0      0     0   111   7664  1804 89  3  8  0  0.93  93.4
  2   0    2462930      41528     0     0     0     0      0     0    79  10896  1944 96  3  1  0  1.00  99.5

System Admin 77

10-05-2013

OK, this clearly looks CPU-bound:

First, have a look at the "us", "sy", "id" and "wa" columns of the "cpu" part: these are percentages, denoting the time the processors spent (on average) in the "user", "system", "idle" and "wait" parts of processing. "User" is roughly your programs, "system" is kernel activity and other system services, "idle" is when no process is running and "wait" is like idle, but with I/O operations outstanding. If you had high "wait" percentages it would hint at an I/O-bound system, but this isn't the case here. In fact your system is busy to saturation running your application, which is as it should be. If it is too slow, the only thing that helps is more processing power.
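
The reading of the four columns can be sketched in a few lines of Python. This is only an illustration: the function names and the two thresholds (90% busy, 20% wait) are my own assumptions, not limits defined by AIX, and the parser assumes the 19-column shared-LPAR layout shown in this thread.

```python
# Rough sketch: classify one vmstat sample by its us/sy/id/wa columns.
# The thresholds are illustrative assumptions, not AIX-defined limits.

def parse_cpu_columns(line):
    """Return the us, sy, id, wa percentages from one vmstat data line.

    Assumes the layout shown in this thread, where the last six
    fields of a data line are: us sy id wa pc ec.
    """
    fields = line.split()
    us, sy, idle, wa = (int(x) for x in fields[-6:-2])
    return {"us": us, "sy": sy, "id": idle, "wa": wa}


def classify(sample, busy_threshold=90, wait_threshold=20):
    """Very coarse verdict: high wait hints at I/O, high us+sy at CPU."""
    if sample["wa"] >= wait_threshold:
        return "I/O-bound"
    if sample["us"] + sample["sy"] >= busy_threshold:
        return "CPU-bound"
    return "not saturated"


# First data line of the vmstat output posted above.
line = ("  2   0    2462914      41549     0     0     0     0      0     0"
        "    74   7787  1433 95  3  2  0  0.99  99.3")
sample = parse_cpu_columns(line)
print(sample, classify(sample))
```

A single sample proves little, of course; the verdict only means something when most samples over a longer interval agree.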

Alas, the system cannot get more processing power right now. The last column, "ec", is the percentage of the "entitled capacity" consumed, and it is near 100% too. LPARs get some share of the system's processors by default, but can be entitled to a bigger amount should the necessity arise. These additional resources are dynamically added should the system get near saturation and are dynamically relinquished once the situation gets less demanding. This system has already allocated as much as it will ever get and this still isn't enough.
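
As a small worked example of how the last two columns relate: "ec" is roughly "pc" (physical processors consumed) divided by "ent" (the entitlement from the header line), times 100. The function name below is made up for illustration, and the results differ slightly from the reported "ec" values because vmstat rounds "pc" to two decimals.

```python
# Sketch: relate physical CPU consumed (pc) to the entitlement (ent).
# The pc values are copied from the first vmstat output in this thread.

def entitlement_used_percent(pc, ent):
    """Percentage of the entitled capacity currently consumed."""
    return pc / ent * 100.0


ent = 1.00  # from "System configuration: lcpu=10 mem=24576MB ent=1.00"
for pc in (0.99, 0.98, 1.00, 0.93, 1.00):
    used = entitlement_used_percent(pc, ent)
    print(f"pc={pc:.2f} -> about {used:.0f}% of the entitlement")
```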

Now, let's look at the top line of the output: you have 10 logical CPUs. What a "logical CPU" comprises (some fraction of a physical CPU) depends on the physical CPU backing it and ultimately on the hardware you run: POWER5? POWER6? POWER7? It might be that 10 lCPUs are a poor layout for your underlying hardware and overtax the physical CPUs with too many context switches.

Anyway, you definitely have to add CPUs to this LPAR: at the HMC, modify the LPAR profile to add more (physical) CPUs as "desired" and also increase the "maximum" processors to a new sensible value. To know what a "sensible value" for "maximum" is you will probably have to monitor the system for a while, so go with a good estimate and change it after a few days. After you have changed the profile you will have to reboot (cold reboot/power cycle; a simple "shutdown -r" won't help) for the new profile to be used.

I hope this helps.

bakunin

10-07-2013

@bakunin

Thank you very much for your analysis. I appreciate your time. This really helps many people like me.

In my case,
I understand that I need to increase the physical processors ("desired") from the HMC. But I see that the CPU usage suddenly went down; today it is:

Code:

System configuration: lcpu=10 mem=24576MB ent=1.00

 kthr          memory                         page                       faults                 cpu
------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa    pc    ec
  9   0    2568885      36174     0     0     0     0      0     0   121   1184   703  1  1 97  0  0.03   3.5
 13   0    2568885      36172     0     0     0     0      0     0    42    906   522  1  1 98  0  0.03   2.7
  5   0    2568885      36172     0     0     0     0      0     0    10    814   485  0  1 99  0  0.02   1.7
  5   0    2568885      36172     0     0     0     0      0     0     8    815   492  0  1 99  0  0.02   1.7
  5   0    2568885      36169     0     0     0     0      0     0    11   2153   482  1  2 97  0  0.03   3.4

I know that a particular JVM or DB process consumed a lot of CPU (by running topas),
but I am not sure how to tune it. (*Not sure why it went down.)

Another question:

How can we set/decide the number of virtual CPUs in any LPAR? I mean, on what basis?
In my case:
1 physical (ent)
5 virtual ==>> 10 logical CPUs

Please give your ideas.

System Admin 77

10-08-2013

Quote:

Originally Posted by System Admin 77

I understand that, i need to increase physical processors (Desired) from HMC.

Yes and no: what you first need to do is to understand your system. This means (among other things) to understand the patterns of resource consumption there are. You might have a relatively stable demand for CPU or a widely varying one. You may have predictable ups and downs (for instance: day=high, night=low, etc.) or event-triggered ones. If your consumption is varying it might be by a small factor or a big one. All these things you can only find out through careful, long-term study of the system. I know these things even less than you, because i know even less about your system. So, please, bear with me for being somewhat general in my suggestions.

Set up and run sar (or nmon or whatever else you like) to monitor consumed resources (memory, CPU, I/O, net, ...) over some time to get a good impression of these usage patterns. The tool you use doesn't matter as long as it provides the data you are interested in.

Run a ps (or top or something alike) to learn about the most demanding processes in terms of memory and CPU. Maybe they run all day, maybe they run only during a certain time of the day. Maybe they run all day but only need very much memory/CPU power during a short time. Maybe ... You see, there are a lot of things not known about your system.

Performance tuning is a very simple task once you have understood where the bottleneck is. Finding out the bottleneck, though, can be extremely difficult. I suggest you read the little tutorial i wrote to get some pointers.

Quote:

Originally Posted by System Admin 77

But i see suddenly the CPU usage went down, today it is

Code:

System configuration: lcpu=10 mem=24576MB ent=1.00

 kthr          memory                         page                       faults                 cpu
------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa    pc    ec
  9   0    2568885      36174     0     0     0     0      0     0   121   1184   703  1  1 97  0  0.03   3.5
 13   0    2568885      36172     0     0     0     0      0     0    42    906   522  1  1 98  0  0.03   2.7
  5   0    2568885      36172     0     0     0     0      0     0    10    814   485  0  1 99  0  0.02   1.7
  5   0    2568885      36172     0     0     0     0      0     0     8    815   492  0  1 99  0  0.02   1.7
  5   0    2568885      36169     0     0     0     0      0     0    11   2153   482  1  2 97  0  0.03   3.4

I know that, a particular JVM or DB process consumed lot of CPU (by running topas)
But am not sure, how to tune it. (*Not sure why it went down)

How to tune Java processes or databases is beyond my area of expertise. I take them as they are and leave the tuning to the DBAs and application engineers.

However, we have now seen two situations of your system: one in which it choked under the load and one where it is (almost) idle. Again: what you need is to find out the pattern behind it.

In general there are three values to every resource you can define in the HMC profile: "minimum", "desired" and "maximum".

"Minimum" is the minimum amount the LPAR needs to allocate, otherwise it won't start.

"Desired" is how much the LPAR grabs if that much is available. This is the normal amount an LPAR has when it starts.

"Maximum" is the upper limit up to which the LPAR's allocation can be raised should it be necessary. These additional resources (the difference between "desired" and "maximum") will be allocated only during runtime.

The reason why this is done that way is that you can "overcommit" the system's resources. If you have 100 GB of memory installed you can create LPAR profiles worth 150 GB in total. You leave some of them unstarted and/or the last one will only start with something between "minimum" and "desired" in this case.
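
To make the overcommit idea concrete, here is a deliberately simplified model in Python. The LPAR names and the profile values are invented for illustration (only the 100 GB installed / 150 GB of profiles comes from the paragraph above), and the real hypervisor does more bookkeeping than this.

```python
# Simplified model of starting LPARs against a fixed memory pool.
# LPAR names and profile values are hypothetical.

def start_lpars(installed_gb, profiles):
    """Start LPARs in the given order.

    Each LPAR gets "desired" if that much is free, otherwise whatever
    is left as long as it is at least "minimum". If even "minimum" is
    not free, the LPAR does not start (granted amount 0).
    """
    free = installed_gb
    granted = {}
    for name, minimum, desired in profiles:
        if free >= desired:
            amount = desired
        elif free >= minimum:
            amount = free
        else:
            amount = 0
        granted[name] = amount
        free -= amount
    return granted


# (name, minimum GB, desired GB): "desired" adds up to 150 GB.
profiles = [
    ("lpar_a", 20, 40),
    ("lpar_b", 20, 40),
    ("lpar_c", 10, 70),
]
print(start_lpars(100, profiles))
```

In this made-up case the first two LPARs get their "desired" 40 GB each, and the last one starts with the remaining 20 GB, which lies between its "minimum" and its "desired".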

What you have to do now is to find sensible values for "desired" and "maximum". This, again, can only be done by monitoring the system for some time.

Quote:

Originally Posted by System Admin 77

How can we set/decide the number of Virtual CPUs in any LPAR. I mean on what basis ?

Basically, a "physical CPU" is what you know as a CPU: a processor you can touch. From one such physical CPU one or several "virtual CPUs" are created. The more virtual CPUs are created from one physical CPU the "smaller" the virtual CPUs become. You allocate a number of physical CPUs to an LPAR and state in the LPAR profile how many virtual CPUs to create from these. If you change the allocated number of processors (physical CPUs) this number of virtual CPUs will not change, they will just get more (or less) powerful.
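
The arithmetic behind "smaller" virtual CPUs can be sketched as follows. The function name is hypothetical, and the figure of 2 hardware threads per virtual CPU is only inferred from the numbers in this thread (5 virtual CPUs showing up as 10 logical CPUs); it depends on the SMT mode of the machine.

```python
# Sketch: how entitlement, virtual CPUs and logical CPUs relate.
# smt_threads=2 is an assumption inferred from "5 virtual ==> 10 logical".

def lpar_cpu_layout(entitlement, virtual_cpus, smt_threads):
    """Return (physical CPU share per virtual CPU, number of logical CPUs)."""
    per_virtual_cpu = entitlement / virtual_cpus
    logical_cpus = virtual_cpus * smt_threads
    return per_virtual_cpu, logical_cpus


# The LPAR from this thread: 1.00 entitlement spread over 5 virtual CPUs.
share, logical = lpar_cpu_layout(1.00, 5, 2)
print(f"5 virtual CPUs: {share:.2f} physical CPU each, {logical} logical CPUs")

# Same entitlement with fewer, "bigger" virtual CPUs for comparison.
share, logical = lpar_cpu_layout(1.00, 2, 2)
print(f"2 virtual CPUs: {share:.2f} physical CPU each, {logical} logical CPUs")
```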

You need one CPU to run a thread (or, which is the same, a single-threaded process). Still, these threads may have different demands on processing power. Choose as many virtual CPUs as needed to satisfy all threads and keep them as small as possible, yet as big as necessary: that is the basic idea. What exactly "necessary", "possible", etc., means: see above, monitor and find out.

About threads/processes: in the vmstat output you see "r" and "b" on the left side. If you regularly see big numbers in "r" the system might profit from a raised number of virtual CPUs, even if they are smaller than now. If there are only low numbers you might be able to reduce the number of lCPUs. Again: not enough data right now to suggest either.

As an afterthought: when you compare the first and second vmstat output you can notice that the numbers in the run queue ("r") were low in the first but are high in the second. That basically means: there were few but "CPU-heavy" processes running when the first snapshot was taken, but many (very lightweight) processes ran during the second. It would be interesting to know which processes these were/are and if there are dependencies. If (for the last time: this is NOT a suggestion, but it might become one if the data back it up) only a few heavy processes run during times of heavy taxation, the machine might profit from fewer (but more potent) lCPUs.
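
The comparison can be put into numbers with a few lines of Python. The "r" and "us" values are copied from the two vmstat outputs in this thread; the helper names are made up, and five samples per snapshot are far too few to base a decision on.

```python
# Sketch: compare run queue and user CPU between the two vmstat snapshots.
# The r and us columns are copied from the outputs posted in this thread.

first_snapshot = {"r": [2, 1, 2, 2, 2], "us": [95, 94, 95, 89, 96]}
second_snapshot = {"r": [9, 13, 5, 5, 5], "us": [1, 1, 0, 0, 1]}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarize(label, snapshot):
    """One-line summary of average run queue and average user CPU."""
    return (f"{label}: avg run queue {mean(snapshot['r']):.1f}, "
            f"avg user CPU {mean(snapshot['us']):.1f}%")


print(summarize("first snapshot", first_snapshot))
print(summarize("second snapshot", second_snapshot))
```

The first snapshot averages a run queue of 1.8 at about 94% user CPU, the second a run queue of 7.4 at under 1% user CPU: few heavy threads versus many light ones.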

I hope this helps.

bakunin

10-16-2013

@bakunin

Thanks much for your time and analysis. Currently we have 1 physical CPU and 24 GB of memory.

Desired/entitled physical CPUs --> 1
Number of processors: 5 (5 virtual CPUs ==> 10 logical CPUs)

Again I saw heavy CPU utilization. So in my case I feel that decreasing the number of vCPUs is a better idea. (I will give it a try; please correct me if I am wrong.)
vmstat output:

Code:

System configuration: lcpu=10 mem=24576MB ent=1.00

 kthr          memory                         page                       faults                 cpu
------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa    pc    ec
  2   0    2708260      51589     0     0     0     0      0     0    93   9670  1484 96  3  1  0  1.00  99.9
  3   0    2708260      51588     0     0     0     0      0     0    77  10082  2037 96  3  1  0  1.00  99.7
  3   0    2708260      51589     0     0     0     0      0     0    44   8894  1459 97  2  1  0  1.00 100.1
  2   0    2708260      51587     0     0     0     0      0     0    72   8744  1553 96  3  1  0  1.00  99.7
  1   0    2708260      51587     0     0     0     0      0     0    79   7693  1899 91  3  6  0  0.96  95.6
  2   0    2708260      51586     0     0     0     0      0     0    47  10890  1915 96  3  1  0  1.00  99.6
  2   0    2708260      51587     0     0     0     0      0     0    60   9296  1407 96  3  1  0  1.00  99.6
  2   0    2708260      51587     0     0     0     0      0     0    44   5306  1233 94  2  3  0  0.97  97.5
  4   0    2708260      51587     0     0     0     0      0     0    49  10082  1802 95  3  1  0  1.00  99.8

And the sar command output:

Code:

#sar 2 15
System configuration: lcpu=10 ent=1.00 mode=Capped

10:36:31    %usr    %sys    %wio   %idle   physc   %entc
10:36:33      87       3       0      10    0.98    97.8
10:36:35      95       4       0       1    1.00    99.6
10:36:37      97       2       0       1    1.00    99.5
10:36:39      90       4       0       6    0.95    95.3
10:36:41      96       3       0       1    1.00    99.7
10:36:43      97       2       0       1    1.00    99.8
10:36:45      96       3       0       1    1.00    99.8
10:36:47      89       2       0       8    0.99    99.4
10:36:49      96       3       0       1    1.00    99.8
10:36:51      97       2       0       1    1.00    99.7
10:36:53      87       3       0      10    0.98    97.7
10:36:55      58       3       0      39    0.67    67.0
10:36:57      96       3       0       1    1.01   100.7
10:36:59      97       3       0       1    1.00    99.7
10:37:01      88       2       0      10    0.98    97.6

Average       91       3       0       6    0.97    96.8

Thank you, really appreciate your time and ideas.

System Admin 77
