Runaway Processes (I think)

09-13-2011

Folks, I suck at a lot of things, and performance issues are one of them.

After upgrading from 5300-06-03 to 5300-12-04 we started seeing an issue with some runaway processes. It varies: some of these processes have a TTY associated with them and some do not. If you could give me any idea of what to look for, it would be most appreciated. I did contact IBM support and provided a perfpmr, but all I have so far is "your machine is CPU bound, here are the top processes, contact your applications people." Well, I kind of already knew that. I was just expecting some guidance on what might have changed to cause it.

"ps aux" yields:

Code:

USER         PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
jufackle 1022780  7.7  0.0 2452 2624 pts/118 A    13:52:51 672:41 /4js/runtime2.0
jufackle   26936  7.3  0.0 2760 2932 pts/118 A    12:38:20 682:48 /4js/runtime2.0
jufackle 1289150  7.3  0.0 2760 2932 pts/118 A    12:38:17 679:46 /4js/runtime2.0
root        4128  3.6  0.0   12   12      - A      Sep 11 743:06 wait
root        3354  3.6  0.0   12   12      - A      Sep 11 734:55 wait
root        3096  3.5  0.0   12   12      - A      Sep 11 718:31 wait
root        4386  3.5  0.0   12   12      - A      Sep 11 714:56 wait
root        3612  3.3  0.0   12   12      - A      Sep 11 681:29 wait
root        3870  3.3  0.0   12   12      - A      Sep 11 672:54 wait
root        2838  2.9  0.0   12   12      - A      Sep 11 603:28 wait
root         516  2.9  0.0   12   12      - A      Sep 11 601:12 wait

Looking at the specifics of the processes being run by this user:

Code:

root@foobar:/ $ ps -fu jufackle
     UID     PID    PPID   C    STIME    TTY  TIME CMD
jufackle   26936       1  98 12:38:20 pts/118 686:26 /4js/runtime2.02.10/lib/fglrun-bin initmenu.42r
jufackle  191378  194832   0   Sep 12 pts/118  0:00 /bin/ksh /usr/local/bin/spi_startup.ksh -c localhost 42639
jufackle  194832  192060   0   Sep 12      -  0:01 sshd: jufackle@pts/118
jufackle  195414  191378   0   Sep 12 pts/118  0:00 /usr/local/bin/apscnmenu.4ge
jufackle  268184 1283180   0 13:52:51 pts/118  0:00 /bin/sh /spi/spishare/men4.3/bin/use_132.sh /usr/users/jufackle/schedu
jufackle 1022780  268184  91 13:52:51 pts/118 676:19 /4js/runtime2.02.10/lib/fglrun-bin view_lib.42r /usr/users/jufackle/sc
jufackle 1283180       1   0 13:39:37 pts/118  0:00 /4js/runtime2.02.10/lib/fglrun-bin schd24 runmode=schedule
jufackle 1289150       1  93 12:38:17 pts/118 683:28 /4js/runtime2.02.10/lib/fglrun-bin initmenu.42r
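The pattern in the listing above (a parent PID of 1 combined with a high C value) can be picked out mechanically. Below is a minimal sketch, assuming the `ps -f` column order shown in this post (UID PID PPID C ...). The function name `find_orphan_spinners` and the default threshold of 50 are hypothetical choices for illustration, not anything AIX or IBM prescribes.

```shell
#!/bin/sh
# Sketch: list processes whose parent is init (PPID 1) and whose C column
# is at or above a threshold. Reads `ps -ef` / `ps -fu <user>` style
# output on stdin and prints "PID USER C" for each match.
# Assumption: columns are UID PID PPID C ... as in the listing above.
# Assumption: the default threshold of 50 is arbitrary.
find_orphan_spinners() {
    threshold="${1:-50}"
    awk -v min="$threshold" '
        NR == 1 { next }              # skip the header line
        $3 == 1 && $4 + 0 >= min {    # PPID is field 3, C is field 4
            print $2, $1, $4
        }'
}

# Example (hypothetical invocation):
#   ps -fu jufackle | find_orphan_spinners 80
```

A filter like this only catches processes reparented directly to init. A busy process such as PID 1022780 above, whose parent shell is still alive, would need a separate check on C alone.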

There have been no new processes spun off by the main initmenu.42r process since yesterday at 13:39. I am thinking the user still has the session open on his/her computer, but what is it doing to use that much CPU?

The second scenario, with no TTY associated with the process, looks like this:

Code:

USER        PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
willic   256790  4.0  0.0 1460  648      - A    12:05:32 192:42 /peifas/apscn/si
kathyg   305536  3.9  0.0 1560 1204      - A    13:58:03 171:55 /peifas/apscn/si
dfrankli  40808  3.7  0.0  640  572      - A    13:09:34 168:59 /usr/local/bin/a
jquast   326370  3.4  0.0  640  536      - A    11:43:10 167:41 /usr/local/bin/a
batesj   470148  2.9  0.0  640  236      - A    08:57:05 160:06 /usr/local/bin/a
crenuart 383470  2.8  0.0  640  316      - A    08:21:54 160:07 /usr/local/bin/a
montiem  404560  2.8  0.0  640  532      - A    08:26:16 159:15 /usr/local/bin/a

Code:

root@foobar:/dump/perfpmr $ ps -fu willic
     UID    PID   PPID   C    STIME    TTY  TIME CMD
  willic 256790      1 120 12:05:32      - 192:53 /peifas/apscn/sis/../sped/bin/sped.4ge

Thanks for any guidance you could provide, or should I say will provide.

If I am missing some data you might need, please let me know.

Justin

juredd1

09-13-2011

IBM can't tell you more than what they see in the snaps and other OS data gathered by perfpmr. Since they didn't see anything strange there, they point you to the application people, as they can't know how the application works. That's fair.

Are you sure it didn't look the same before the upgrade? The question might sound stupid, but just to make sure.
A high C value shows the processes are currently active, but %CPU is an average since process start. Can you talk with any of the users (at least jufackle?) and ask what they are doing, or whether they are doing anything different from usual? Sometimes a periodic run of other tasks for business reasons, like gathering end-of-the-month statistics, produces a peak. You or your users would know that better than I do.
Is the box being pressed against the wall vmstat-wise?

If you kill or stop the application for one user and start it anew, does C rise that high again immediately?

You could check (awful work) which enhancements or fixes came in between 5300-06-03 and 5300-12-04.

Sorry to have no better idea at the moment to help you.

Do you maybe have nmon monitoring in place, so you can compare pre-update data with the current CPU and process figures? If not, it could be helpful in the future.

zaxxon

09-13-2011

As you have no reliable baseline performance records from the previous AIX TL, it will be no easy task to determine whether the application or the OS is the cause.
For now, just collect performance statistics at various intervals and compare the statistics gathered throughout the day.
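
One way to collect such statistics is to save each command's output to a timestamped file, for example from cron. This is only a sketch: the function name `snapshot` and the directory in the usage comment are hypothetical, and it assumes at most one snapshot per second per directory (a second one in the same second would overwrite the first).

```shell
#!/bin/sh
# Sketch: run a command and store its output in a timestamped file.
# Usage: snapshot <output-dir> <command> [args...]
# Assumption: one snapshot per second per directory is enough.
snapshot() {
    outdir="$1"
    shift
    mkdir -p "$outdir" || return 1
    stamp=$(date +%Y%m%d-%H%M%S)
    # Capture both stdout and stderr so failures are recorded too.
    "$@" > "$outdir/$stamp.out" 2>&1
}

# Example (hypothetical path), e.g. from an hourly cron job:
#   snapshot /var/perf/vmstat vmstat 5 12
#   snapshot /var/perf/ps ps aux
```

Files named by timestamp sort chronologically, which makes it easy to compare samples taken at different times of the day.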

h@foorsa.biz

09-14-2011

I apologize for the delay in responding and want to thank you both for your replies. I am positive it did not look the same before the upgrade. I actually rolled a server back to AIX 5.3 TL06 so I would have something to compare against. All is well on that server, and it's a full-time job keeping the runaway processes killed on the other 7 servers so they do not crash or become unresponsive.

Unless I am reading the vmstat output wrong, yes, the box is being pressed pretty hard. I played with the headers a bit, trying to line up the columns for easier reading.

Code:

root@foobar:/ $ vmstat 5

System configuration: lcpu=8 mem=32000MB

kthr      memory              page                faults             cpu
----- -------------- ----------------------- ----------------- ---------------
 r  b    avm     fre  re  pi  po  fr  sr  cy   in      sy   cs  us  sy  id  wa
 6  0 479587 7652047   0   0   0   0   0   0   39 1953234  382  48  52   0   0
 6  0 479587 7652047   0   0   0   0   0   0   34 1951368  354  48  52   0   0
 6  0 479587 7652047   0   0   0   0   0   0   39 1947601  385  48  52   0   0
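
The samples above can be condensed into a few averages with a small awk filter. This is a rough sketch, assuming the 17-column layout shown here (r b avm fre re pi po fr sr cy in sy cs us sy id wa). Systems that print extra columns, such as pc and ec on shared-processor LPARs, would need different field numbers. The function name `vmstat_summary` is hypothetical.

```shell
#!/bin/sh
# Sketch: summarise AIX-style `vmstat` samples read from stdin.
# Assumption: 17 columns per data line, in this order:
#   r b avm fre re pi po fr sr cy in sy cs us sy id wa
vmstat_summary() {
    awk '
        # Data lines have 17 fields and start with a number; this skips
        # the configuration line and the three header lines.
        NF == 17 && $1 ~ /^[0-9]+$/ {
            n++
            runq += $1; sysc += $12; csw += $13
            us += $14; sy += $15; idle += $16
        }
        END {
            if (n == 0) { print "no samples"; exit }
            printf "samples=%d avg_r=%.1f avg_us=%.0f avg_sy=%.0f avg_id=%.0f syscalls_per_cs=%.0f\n",
                   n, runq / n, us / n, sy / n, idle / n,
                   (csw > 0 ? sysc / csw : 0)
        }'
}

# Example (hypothetical invocation):
#   vmstat 5 12 | vmstat_summary
```

In these samples the idle figure is 0 and roughly half of the CPU time is system time. The number of system calls is also very large compared with the number of context switches, which is what one might expect if a few processes were calling into the kernel in a tight loop.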

I provided IBM support with perfpmr data. It took them a bit, but they came back with a possible bug. After they got a core dump of the process, it was confirmed that there is an APAR in the works from a previous PMR. Below is the APAR description. It matches the report they sent me from the perfpmr data.

A SIGHUP'D PROCESS HANGS, REPEATEDLY CALLING PTHREAD_YIELD

An ifix is currently in the works. I just hope and pray this is the issue.

Not sure this applies to anyone else, but I can post an update once the ifix is applied, if that is preferred.

juredd1

09-15-2011

Thanks for the feedback - indeed, the vmstat output looks bad CPU-wise, with lots of unused memory.
Btw., you can use vmstat's -w switch to have the columns aligned. If you also add -t, you'll get a time stamp (sometimes helpful).

We are still on 5300-11-04-1015, so I can't report any bad experience with your update level.

Glad to hear they found something. Usually they are fast with hotfixes once you have... persuaded them to have a look again ^^ (at least that was my experience way back with some other IBM software).

zaxxon
