Quote:
I was more interested in finding some online tutorials on how to trace into a hanging job effectively...
It's not a matter of "how to" -- it's detective work.
select() is hanging on file descriptors 3, 6, 8, 11, and 13. OK, so, what are those...? If you trace the program early enough you can look for the
open calls which return these file descriptor numbers and see.
ls -l /proc/####/fd (where #### is the process ID) may help if you didn't trace the program in time, as might
lsof -p ####.
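As a rough sketch of that lookup, assuming Linux with /proc mounted: the function name show_fds is made up for illustration, and the descriptor numbers are just the ones from your select() call. It only reads the symlinks under /proc/PID/fd, so it needs permission to inspect the process (same user, or root).

```shell
#!/bin/sh
# Hypothetical helper: show what each open file descriptor of a process points at.
# Usage: show_fds PID [FD ...]
show_fds() {
    pid=$1
    shift
    # With no FD arguments, show every descriptor the process has open.
    if [ "$#" -eq 0 ]; then
        set -- $(ls "/proc/$pid/fd")
    fi
    for fd in "$@"; do
        # Each entry in /proc/PID/fd is a symlink to the file, pipe, or socket.
        target=$(readlink "/proc/$pid/fd/$fd") || target="(not open)"
        printf '%s -> %s\n' "$fd" "$target"
    done
}
```

Something like show_fds 12345 3 6 8 11 13 (12345 being a placeholder PID) would then tell you whether each descriptor is a regular file, a pipe, or a socket. For a pipe or socket, lsof can help find the process on the other end.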
Knowing what it's hanging on is halfway to knowing why it's hanging.
You should also trace it from the other end. What things are happening which this application is
not seeing? If you can pinpoint the exact thing which fails -- "app x does y via z, but select() in app q does not see this change" --
then you know enough to actually begin asking questions!
Quote:
But as you ask:
The script is running jobs over ssh on remote servers.
A file based messaging service is used for communication between the remote server and the master.
The script seems to hang at a point where it is waiting for a msg containing the word COMPLETED.
This might be as easy as checking for a corrupt msg file the next time it hangs. But I am really hoping for some more generic tips-and-tricks type answers.
The script you posted is so incomplete as to be useless: it's full of variables and functions which aren't explained, only used. You also haven't explained what application you traced.
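As for a generic trick, one is to never wait forever: put a timeout on the wait and have it report what it saw when it gives up. Below is a minimal sketch, not your script's actual logic. The function name wait_for_completed, the once-per-second polling, and the idea that a single message file holds the word COMPLETED are all assumptions on my part.

```shell
#!/bin/sh
# Hypothetical helper: wait until FILE contains the word COMPLETED,
# but give up after TIMEOUT seconds instead of blocking forever.
wait_for_completed() {
    file=$1
    timeout=$2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if grep -q 'COMPLETED' "$file" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Timed out: report what is actually there, so the next hang leaves evidence.
    echo "timed out after ${timeout}s waiting for COMPLETED in $file" >&2
    ls -l "$file" >&2 2>/dev/null || echo "$file does not exist" >&2
    return 1
}
```

When it times out, you at least know whether the message file was missing, empty, or present without the expected word, which narrows down which end to investigate.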