Trace / Debug Howto?


 
# 1  
Old 06-25-2014
Trace / Debug Howto?

Can anyone recommend any good guides on how to investigate what a hanging process is doing?

In fact I would be interested in any online guides that would improve my forensic skills on the Linux platform.

I have a script that occasionally hangs. Strace shows:

Code:
[root@cfg01o ~]# strace -p 32370
Process 32370 attached - interrupt to quit
select(14, [3 6 8 11 13], [], NULL, NULL) = 1 (in [3])
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(3, "\344u\245%\6U\216\307\276\355\213p\376\203}\2617\251\320\301}\5\376Y(\235]K\222\301\304\370"..., 16384) = 64
write(3, "K\t\26O\323\344\214\341\247W\346\\*e\330\304\372\323O\356q\34\360\327\350\345*\274\35(Q'", 32) = 32
select(14, [3 6 8 11 13], [], NULL, NULL^C <unfinished ...>

I think the select is referring to a case statement in a while loop that is reading from a file. It's looking for an exit message at the point where it seems most likely to be hanging.

I don't know how to drill down further than this, so any suggestions or pointers to good guides online would be appreciated.

Thanks
# 2  
Old 06-25-2014
Well -- looking at the shell script's code might be a good start!

A select call is rather strange to find in a shell. The contents of the read and write calls look odd too -- pure binary data. This sort of stuff almost never passes through the shell (and arguably can't, since the shell can't handle things like binary nulls). I suspect you're tracing a subprocess, not the shell. Either that, or you are just seeing it load its libraries.

So please tell us, in detail, what your script actually is; show us its contents; and tell us what process you're tracing, please.

Last edited by Corona688; 06-25-2014 at 01:19 PM..
# 3  
Old 06-25-2014
Tricky

I was more interested in finding some online tutorials on how to trace into a hanging job effectively...

But as you ask:

The script is running jobs over ssh on remote servers.

A file based messaging service is used for communication between the remote server and the master.

The script seems to hang at a point where it is waiting for a msg containing the word COMPLETED.

This might be as easy as checking for a corrupt msg file the next time it hangs. But I am really hoping for some more generic tips-and-tricks type answers.

Code:
m_process_msg_queue()
{
    local TMPFILE1="$(m_get_tmp_file ${FUNCNAME})" LINE CTL_FILE STATUS PIPE PID JOB_FILE TO=${C_MSG_PIPE_TO}
    [[ -f ${TMPFILE1} ]] || m_fail 1 "Error: Failed to create tmp file (${FUNCNAME})" 

    m_check_file -frw "${M_MSG_QUEUE}" s || m_fail 1 "Error: Msg queue not found (${FUNCNAME})" 

    while read LINE
    do
        #======================================
        # Split the line and get the control file
        #   and the status field
        #======================================
        CTL_FILE="$(echo "${LINE}" | cut -d"|" -f1)"
        STATUS="$(echo "${LINE}" | cut -d"|" -f2 | awk '{print $NF}' FS="Status=" )"
        
        [[ (-n ${STATUS}) && (-n ${CTL_FILE}) ]] || 
            m_fail 1 "Error: Failed to parse msg ctl file (${FUNCNAME})" 
        PIPE=${CTL_FILE##*/}

        JOB_FILE="$(sed -n '/^JobFile:/p' "${CTL_FILE}" | cut -d":" -f 2)"
        [[ -n ${JOB_FILE} ]] || m_fail 1 "Error: Failed to retrieve job ctl file (${FUNCNAME})" 
        m_check_file -frw "${JOB_FILE}" s || m_fail 1 "Error: job ctl validation failure (${FUNCNAME})" 

        PID="$(sed -n '/^Pid:/p' "${CTL_FILE}" | cut -d":" -f2)"
        [[ ${PID} =~ ^[[:digit:]]+$ ]] || m_fail 1 "Error: PID validation (${FUNCNAME})" 
        m_write_job_field ${C_JOB_PID} "${PID}" "${JOB_FILE}"

        case ${STATUS} in
            "COMPLETED")
                #======================================
                # Completed. Nothing to do.
                #======================================
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
            "FAILED")
                #======================================
                # Log the error in the run log
                #======================================
                m_log_msg "Non FATAL error in (${CTL_FILE})"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
                
            "FATAL")
                #======================================
                # Remote job flags a fatal error
                # Don't launch any more jobs
                # Wait for all other jobs to complete
                # Only then throw fatal error in master
                #======================================
                m_log_msg "FATAL error in (${CTL_FILE})"
                M_HALT_ON_ERROR="true"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
            "MANUAL")
                #======================================
                # Manual intervention requested
                # Inform the user
                #======================================
                m_log_msg "Manual request in (${CTL_FILE})"
                M_HALT_ON_ERROR="true"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
            *)
                m_log_msg "Unrecognised request (${STATUS}) in (${CTL_FILE})"
                M_HALT_ON_ERROR="true"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
        esac

        m_write_job_field ${C_JOB_FINISH} "$(date)" "${JOB_FILE}"

    done < "${M_MSG_QUEUE}"

}
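
A rough sketch of the "check for a corrupt msg file" idea mentioned above. The helper name check_queue, the demo paths, and the line format (control file, a "|", then a field ending in Status=WORD) are assumptions inferred from the cut/awk parsing in m_process_msg_queue, not something the real application is known to provide.

```shell
#!/bin/bash
# check_queue is a hypothetical helper, not part of the posted application.
# It reports queue lines that the parsing in m_process_msg_queue would choke on.
check_queue() {
    local queue=$1 line rest field2 ctl status n=0 bad=0
    while IFS= read -r line <&3; do
        n=$((n + 1))
        ctl=${line%%|*}             # field 1: control file path
        rest=${line#*|}
        field2=${rest%%|*}          # field 2: expected to carry "Status=..."
        status=${field2##*Status=}
        if [[ ${line} != *"|"* || -z ${ctl} ]]; then
            echo "line ${n}: missing field: ${line}"
            bad=1
            continue
        fi
        case ${status} in
            COMPLETED|FAILED|FATAL|MANUAL) ;;
            *) echo "line ${n}: unexpected status '${status}'"; bad=1 ;;
        esac
    done 3< "${queue}"
    return "${bad}"
}

# Demo with a throwaway file: one good line and one truncated line.
demo_queue=$(mktemp)
printf '%s\n' '/tmp/ctl/pipe1|Status=COMPLETED' '/tmp/ctl/pipe2|Status=COMPL' > "${demo_queue}"
check_queue "${demo_queue}" || echo "queue needs a closer look"
rm -f "${demo_queue}"
```

Running it against a copy of the real queue the next time the job hangs would show quickly whether a truncated or malformed message is involved.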

# 4  
Old 06-25-2014
Quote:
Originally Posted by bbq
I was more interested in finding some online tutorials on how to trace into a hanging job effectively...
It's not a matter of "how to" -- it's detective work.

select() is hanging on file descriptors 3, 6, 8, 11, and 13. OK, so, what are those...? If you trace the program early enough you can look for the open calls which return these file descriptor numbers and see. If you didn't trace the program in time, ls -l /proc/####/fd (with the PID in place of ####) may help, as might lsof.

Knowing what it's hanging on is halfway to knowing why it's hanging.
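
The descriptor lookup can be sketched as below. This assumes Linux with /proc mounted; list_fds is a made-up helper name, and reading another user's process normally needs root.

```shell
#!/bin/bash
# list_fds is a hypothetical helper: print "fd -> target" for every
# descriptor a process has open, using the symlinks under /proc/PID/fd.
list_fds() {
    local pid=$1 fd
    for fd in /proc/"${pid}"/fd/*; do
        # Skip entries that vanished (or an unmatched glob) between listing and reading.
        [ -L "${fd}" ] || continue
        printf '%s -> %s\n' "${fd##*/}" "$(readlink "${fd}")"
    done
}

# Demo on the current shell; for a hung job pass its PID instead, e.g. list_fds 32370
list_fds "$$"
```

Targets that look like socket:[12345] or pipe:[12345] are sockets and pipes, which is what a process sitting in select() is usually waiting on.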

You should also trace it from the other end. What things are happening which this application is not seeing? If you can pinpoint the exact thing which fails -- "app x does y via z, but select() in app q does not see this change" -- then you know enough to actually begin asking questions!

Quote:
But as you ask:

The script is running jobs over ssh on remote servers.

A file based messaging service is used for communication between the remote server and the master.

The script seems to hang at a point where it is waiting for a msg containing the word COMPLETED.

This might be as easy as checking for a corrupt msg file the next time it hangs. But I am really hoping for some more generic tips-and-tricks type answers.
The script you posted is so incomplete as to be useless: it is full of variables and functions which are only used, never explained. You also haven't explained what application you traced.
# 5  
Old 06-27-2014
Thanks

Thanks Corona

I'm running it under strace now and of course it has stopped failing....

Yes, the reason I didn't post any code was because it is a distributed application with multiple libraries included. Several thousand lines of code.

That's why my OP was a request for pointers to good tutorials, not a request to help debug this application.

But thanks for taking an interest anyway.
# 6  
Old 06-30-2014
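A while loop that reads from stdin can be disrupted by commands or functions in the loop body that also read from stdin, because they consume input meant for the loop.
A hardened loop reads from a separate file descriptor: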
Code:
while read LINE <&3
do
  ...
done 3< "${M_MSG_QUEUE}"
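
A small demonstration of the difference, as a sketch only: the temp file stands in for the message queue, and cat stands in for any command in the loop body that reads stdin (ssh is a common one).

```shell
#!/bin/bash
# Hypothetical demo data standing in for the message queue file.
queue=$(mktemp)
printf '%s\n' one two three > "${queue}"

# Fragile: loop and body share stdin, so cat swallows the remaining lines.
fragile=0
while read -r line; do
    cat > /dev/null
    fragile=$((fragile + 1))
done < "${queue}"

# Hardened: the loop reads from descriptor 3, so the body cannot steal its input.
# Stdin is pointed at /dev/null only so the demo never waits on a terminal.
hardened=0
while read -r line <&3; do
    cat > /dev/null
    hardened=$((hardened + 1))
done 3< "${queue}" < /dev/null

echo "fragile loop iterations:  ${fragile}"
echo "hardened loop iterations: ${hardened}"
rm -f "${queue}"
```

The fragile loop stops after its first iteration because nothing is left to read, while the hardened loop sees all three lines.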

--
A well-known shell debugging technique is to put
Code:
set -x

in the script, or run it with
Code:
/bin/bash -x scriptname
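
In a script of several thousand lines the plain trace is hard to follow. One possible refinement, sketched here for bash, is to set PS4 so each traced command shows its file, line number and function, and to trace only the suspect region. The function demo_step and the log file are made up for the example.

```shell
#!/bin/bash
# Trace prefix: source file, line number and current function for each command.
PS4='+ ${BASH_SOURCE##*/}:${LINENO}:${FUNCNAME[0]:-main}: '

# demo_step is a hypothetical function standing in for the suspect code.
demo_step() {
    local status="COMPLETED"
    [[ -n ${status} ]]
}

# Trace only the region of interest and keep the trace in a file.
trace_log=$(mktemp)
{
    set -x
    demo_step
    set +x
} 2> "${trace_log}"

cat "${trace_log}"
```

If the job hangs, the last line in the trace file shows which command in which function was running.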


Last edited by MadeInGermany; 06-30-2014 at 02:20 PM..
# 7  
Old 07-01-2014
I don't understand

Hi Mig

I don't get your comments about the while loop.

The example I posted is reading from a file that contains messages from remote servers.

Thanks for the -x tip but I was looking for something a little deeper than that.

Like I said in the OP, I'm really looking for online tutorials. Especially anything that can give me more insight into the output from strace for example. I still don't know what that 'select' is about.

Cheers