Runaway Processes (I think)


 
# 1  
Old 09-13-2011

Folks, I suck at a lot of things, and performance issues are one of them.

After upgrading from 5300-06-03 to 5300-12-04 we started seeing an issue with runaway processes. The pattern varies: some of these processes have a TTY associated with them and some do not. Any idea of what to look for would be most appreciated. I did contact IBM support and provided a perfpmr, but all I have so far is "your machine is CPU bound, here are the top processes, contact your applications people." Well, I kind of already knew that. I was just expecting some guidance on what might have changed to cause it.

"ps aux" yields:

Code:
USER         PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
jufackle 1022780  7.7  0.0 2452 2624 pts/118 A    13:52:51 672:41 /4js/runtime2.0
jufackle   26936  7.3  0.0 2760 2932 pts/118 A    12:38:20 682:48 /4js/runtime2.0
jufackle 1289150  7.3  0.0 2760 2932 pts/118 A    12:38:17 679:46 /4js/runtime2.0
root        4128  3.6  0.0   12   12      - A      Sep 11 743:06 wait
root        3354  3.6  0.0   12   12      - A      Sep 11 734:55 wait
root        3096  3.5  0.0   12   12      - A      Sep 11 718:31 wait
root        4386  3.5  0.0   12   12      - A      Sep 11 714:56 wait
root        3612  3.3  0.0   12   12      - A      Sep 11 681:29 wait
root        3870  3.3  0.0   12   12      - A      Sep 11 672:54 wait
root        2838  2.9  0.0   12   12      - A      Sep 11 603:28 wait
root         516  2.9  0.0   12   12      - A      Sep 11 601:12 wait

Looking at specifics about the processes being run by this user.
Code:
root@foobar:/ $ ps -fu jufackle
     UID     PID    PPID   C    STIME    TTY  TIME CMD
jufackle   26936       1  98 12:38:20 pts/118 686:26 /4js/runtime2.02.10/lib/fglrun-bin initmenu.42r
jufackle  191378  194832   0   Sep 12 pts/118  0:00 /bin/ksh /usr/local/bin/spi_startup.ksh -c localhost 42639
jufackle  194832  192060   0   Sep 12      -  0:01 sshd: jufackle@pts/118
jufackle  195414  191378   0   Sep 12 pts/118  0:00 /usr/local/bin/apscnmenu.4ge
jufackle  268184 1283180   0 13:52:51 pts/118  0:00 /bin/sh /spi/spishare/men4.3/bin/use_132.sh /usr/users/jufackle/schedu
jufackle 1022780  268184  91 13:52:51 pts/118 676:19 /4js/runtime2.02.10/lib/fglrun-bin view_lib.42r /usr/users/jufackle/sc
jufackle 1283180       1   0 13:39:37 pts/118  0:00 /4js/runtime2.02.10/lib/fglrun-bin schd24 runmode=schedule
jufackle 1289150       1  93 12:38:17 pts/118 683:28 /4js/runtime2.02.10/lib/fglrun-bin initmenu.42r

There have been no new processes spun off by the main initmenu.42r process since yesterday at 13:39. I am thinking the user still has the session open on his/her computer, but what is it doing to use that much CPU?

The second scenario, with no TTY associated with the process, looks like this.
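
For anyone who wants to pull the suspects out of a `ps -f` listing like the one above, here is a rough Python sketch. It only relies on the first four columns (UID, PID, PPID, C), because STIME can be one token or two ("Sep 12"). The function name `busy_processes` and the cut-off of 50 are my own assumptions, not anything from AIX or IBM, and the sample text is just a few lines copied from the listing.

```python
# A few lines copied from the "ps -fu jufackle" listing above.
PS_F_OUTPUT = """\
     UID     PID    PPID   C    STIME    TTY  TIME CMD
jufackle   26936       1  98 12:38:20 pts/118 686:26 /4js/runtime2.02.10/lib/fglrun-bin initmenu.42r
jufackle  191378  194832   0   Sep 12 pts/118  0:00 /bin/ksh /usr/local/bin/spi_startup.ksh -c localhost 42639
jufackle  194832  192060   0   Sep 12      -  0:01 sshd: jufackle@pts/118
jufackle 1022780  268184  91 13:52:51 pts/118 676:19 /4js/runtime2.02.10/lib/fglrun-bin view_lib.42r /usr/users/jufackle/sc
jufackle 1289150       1  93 12:38:17 pts/118 683:28 /4js/runtime2.02.10/lib/fglrun-bin initmenu.42r
"""


def busy_processes(ps_text, min_c=50):
    """Return (pid, ppid, c) for rows whose C column is >= min_c.

    min_c=50 is an arbitrary threshold chosen for illustration.
    """
    hits = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)       # UID, PID, PPID, C, rest of line
        if len(fields) < 5:
            continue
        _uid, pid, ppid, c, _rest = fields
        if int(c) >= min_c:
            hits.append((int(pid), int(ppid), int(c)))
    return hits


for pid, ppid, c in busy_processes(PS_F_OUTPUT):
    note = "parent is PID 1" if ppid == 1 else f"parent is PID {ppid}"
    print(f"PID {pid}: C={c}, {note}")
```

Printing the PPID next to each hit makes it easy to see which of the busy processes no longer have their original parent.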

Code:
USER        PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
willic   256790  4.0  0.0 1460  648      - A    12:05:32 192:42 /peifas/apscn/si
kathyg   305536  3.9  0.0 1560 1204      - A    13:58:03 171:55 /peifas/apscn/si
dfrankli  40808  3.7  0.0  640  572      - A    13:09:34 168:59 /usr/local/bin/a
jquast   326370  3.4  0.0  640  536      - A    11:43:10 167:41 /usr/local/bin/a
batesj   470148  2.9  0.0  640  236      - A    08:57:05 160:06 /usr/local/bin/a
crenuart 383470  2.8  0.0  640  316      - A    08:21:54 160:07 /usr/local/bin/a
montiem  404560  2.8  0.0  640  532      - A    08:26:16 159:15 /usr/local/bin/a

Code:
root@foobar:/dump/perfpmr $ ps -fu willic
     UID    PID   PPID   C    STIME    TTY  TIME CMD
  willic 256790      1 120 12:05:32      - 192:53 /peifas/apscn/sis/../sped/bin/sped.4ge

Thanks for any guidance you could provide, or should I say will provide. :)

If I am missing some data you might need please let me know.

Justin
# 2  
Old 09-13-2011
IBM can't tell more than what they see in the snaps etc. gathered from the OS by perfpmr. Since they didn't see anything strange there, they point you to the application people, as they can't know how the application works. That's fair enough.

Are you sure it didn't look the same before the upgrade? Question might sound stupid, but just to make sure.
A high C shows the processes are currently active, but the %CPU figure is an average since the process started. Can you talk with any of the users (at least jufackle?) and ask what they are doing, or whether they are doing different things than usual? Sometimes there is a periodic run of other tasks for business reasons, like gathering end-of-the-month statistics, that could produce a peak. You or your users would know that better than I.
Is the box being pressed against the wall vmstat-wise?

If you kill or stop the application for one user and start it anew, does C rise that high again immediately?

You could check (awful work) what enhancements or fixes were introduced between 5300-06-03 and 5300-12-04.

Sorry to have no better idea at the moment to help you.

Do you maybe have nmon monitoring running, so you can compare pre-update data with the current CPU and process figures? If not, it could be helpful in the future.
# 3  
Old 09-13-2011
As you have no assured baseline performance records from the previous AIX TL, it will be no easy task to determine whether the application or the OS is the cause.
For now, just collect performance statistics at various intervals and compare the statistics gathered throughout the day.
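
One cheap comparison of that kind is the CPU time each PID has accumulated between two `ps` snapshots. The sketch below is only an illustration: the helper names `cpu_seconds` and `cpu_deltas` are made up, it assumes the TIME column is in minutes:seconds (or hours:minutes:seconds) and does not handle a day prefix, and the values are the TIME fields for the same three PIDs in the two listings of the first post. The wall-clock gap between those two listings is not stated in the thread, so the result is a delta, not a rate.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME field such as '672:41' to seconds.

    Assumes colon-separated fields (mm:ss or hh:mm:ss); a 'd-hh:mm:ss'
    style value would need extra handling.
    """
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def cpu_deltas(before, after):
    """CPU seconds consumed per PID between two snapshots (PIDs in both)."""
    return {
        pid: cpu_seconds(after[pid]) - cpu_seconds(before[pid])
        for pid in before
        if pid in after
    }


# TIME values copied from the first post: "ps aux" first, "ps -fu" second.
first = {1022780: "672:41", 26936: "682:48", 1289150: "679:46"}
second = {1022780: "676:19", 26936: "686:26", 1289150: "683:28"}

for pid, delta in sorted(cpu_deltas(first, second).items()):
    print(f"PID {pid}: +{delta} s of CPU between snapshots")
```

If the snapshots are taken with a known interval, dividing each delta by that interval gives a per-process CPU share that can be compared between the rolled-back server and the upgraded ones.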
# 4  
Old 09-15-2011
I apologize for the delay in responding and want to thank you both for responding. I am positive it did not look the same before the upgrade. I actually rolled a server back to AIX 5.3 TL06 so I would have something to compare against. All is well on that server, and it's a full-time job keeping the runaway processes killed on the other 7 servers so they do not crash or become unresponsive.

Unless I am reading the vmstat output wrong, yes, the box is being pressed pretty hard. I played with the headers a bit trying to line up the columns for easier reading.

Code:
root@foobar:/ $ vmstat 5

System configuration: lcpu=8 mem=32000MB

kthr      memory              page               faults          cpu
----- -------------- ----------------------- --------------- -----------
 r  b    avm     fre  re  pi  po  fr  sr  cy  in      sy  cs us sy id wa
 6  0 479587 7652047   0   0   0   0   0   0  39 1953234 382 48 52  0  0
 6  0 479587 7652047   0   0   0   0   0   0  34 1951368 354 48 52  0  0
 6  0 479587 7652047   0   0   0   0   0   0  39 1947601 385 48 52  0  0
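
To put numbers on "pressed", here is a small Python sketch that averages the three samples above. The column names are mine (the second `sy` is renamed so it does not clash with the syscall column), `parse_vmstat` is a made-up helper, and the free-memory figure assumes `fre` is counted in 4 KB pages.

```python
# Column order of the vmstat rows above; "syscalls" is the faults "sy" column.
COLUMNS = ["r", "b", "avm", "fre", "re", "pi", "po", "fr", "sr", "cy",
           "in", "syscalls", "cs", "us", "sy", "id", "wa"]

LCPU = 8  # from the "System configuration" line

SAMPLES = """\
 6  0 479587 7652047   0   0   0   0    0   0  39 1953234 382 48 52  0  0
 6  0 479587 7652047   0   0   0   0    0   0  34 1951368 354 48 52  0  0
 6  0 479587 7652047   0   0   0   0    0   0  39 1947601 385 48 52  0  0
"""


def parse_vmstat(text):
    """Turn purely numeric vmstat rows into dicts keyed by COLUMNS."""
    rows = []
    for line in text.splitlines():
        values = line.split()
        if len(values) == len(COLUMNS) and all(v.isdigit() for v in values):
            rows.append(dict(zip(COLUMNS, map(int, values))))
    return rows


rows = parse_vmstat(SAMPLES)
n = len(rows)
mean = {name: sum(row[name] for row in rows) / n for name in COLUMNS}
busy = mean["us"] + mean["sy"]
free_gib = mean["fre"] * 4096 / 2**30  # assumes 4 KB pages

print(f"samples: {n}, run queue: {mean['r']:.0f} on {LCPU} logical CPUs")
print(f"CPU busy: {busy:.0f}% (us {mean['us']:.0f}, sy {mean['sy']:.0f}), idle {mean['id']:.0f}%")
print(f"syscalls: {mean['syscalls']:.0f}, context switches: {mean['cs']:.0f}")
print(f"free memory: about {free_gib:.1f} GiB")
```

The same parser could be pointed at output captured on the rolled-back server to compare the two side by side.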

I provided IBM support with perfpmr data, and it took them a bit, but they came back with a possible bug. After getting a core dump of the process, it was confirmed that there is an APAR in the works from a previous PMR. Below is the APAR description. It matches up with the report they sent me from the perfpmr data.

A SIGHUP'D PROCESS HANGS, REPEATEDLY CALLING PTHREAD_YIELD

An ifix is currently in the works. I just hope and pray this is the issue.
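
For what it's worth, a loop that does nothing but yield never blocks, so it stays runnable and keeps consuming CPU. The toy Python sketch below shows that effect. It is not the AIX bug and has nothing to do with the 4js runtime; it assumes a POSIX system where Python exposes `os.sched_yield`, and `spin_on_yield` is just a name I picked.

```python
import os
import time


def spin_on_yield(duration=0.2):
    """Call sched_yield() in a tight loop for `duration` seconds of wall time.

    Returns the number of yields made and the CPU seconds this process
    consumed while spinning.
    """
    calls = 0
    cpu_start = time.process_time()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        os.sched_yield()  # give up the CPU voluntarily, then get rescheduled
        calls += 1
    return calls, time.process_time() - cpu_start


calls, cpu_used = spin_on_yield()
print(f"{calls} yields, {cpu_used:.3f} s of CPU in 0.2 s of wall time")
```

On an otherwise idle machine I would expect the CPU time to come out close to the wall time, which is the same kind of picture as a process pegged near the top of the C column.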

Not sure whether this applies to anyone else, but I can update the thread once the ifix is applied if that is preferred.
# 5  
Old 09-15-2011
Thanks for the feedback. Indeed, the vmstat output looks bad CPU-wise, and there is lots of unused memory.
Btw., you can use vmstat's -w switch to have the columns aligned. If you also add -t you'll get a time stamp (sometimes helpful).

We are still on 5300-11-04-1015, so I can't report any bad experience with your level of updates.

Glad to hear they found something. Usually they are fast with hotfixes once you have... persuaded them to have a look again ^^ (at least that was my experience too, way back, with some other IBM software).