The UNIX and Linux Forums  

  #1  
Old 11-07-2005
Registered User
 

Join Date: Jul 2005
Posts: 37
Can this be solved with awk and sed?

Hi Masters,

Code:
___________________________________________________________________________________
Group of orthologs #1. Best score 3010 bits
Score difference with first non-orthologous sequence - yeast:3010   human:2754
YHR165C             	100.00%		PRP8_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #2. Best score 2100 bits
Score difference with first non-orthologous sequence - yeast:2033   human:1978
YLR106C             	100.00%		MDN1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #3. Best score 2082 bits
Score difference with first non-orthologous sequence - yeast:997   human:593
YJL130C             	100.00%		PYR1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #4. Best score 1959 bits
Score difference with first non-orthologous sequence - yeast:1959   human:1007
YKR054C             	100.00%		DYHC_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #5. Best score 1855 bits
Score difference with first non-orthologous sequence - yeast:1855   human:1022
YNR016C             	100.00%		Q6KE87_HUMAN        	100.00%
YMR207C             	19.86%		COA2_HUMAN          	90.52%
                    	       		COA1_HUMAN          	53.30%
___________________________________________________________________________________
Group of orthologs #6. Best score 1838 bits
Score difference with first non-orthologous sequence - yeast:1748   human:1767
YDL140C             	100.00%		RPB1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #7. Best score 1768 bits
Score difference with first non-orthologous sequence - yeast:1768   human:1636
YJR066W             	100.00%		Q4LE76_HUMAN        	100.00%
YKL203C             	49.22%
The records above are part of a file. I need to extract the information from this file and put it into a spreadsheet format, like this (examples from #5 and #7 above):

Group_number; Best_Score; S_one; P_one; S_two; P_two
5;1855;YNR016C;100.00%;Q6KE87_HUMAN;100.00%
5;1855;YMR207C;19.86%;COA2_HUMAN;90.52%
5;1855;;;COA1_HUMAN;53.30%
7;1768;YJR066W;100.00%;Q4LE76_HUMAN;100.00%
7;1768;YKL203C;49%;;

Thanks in Advance!

Last edited by Perderabo; 11-08-2005 at 08:41 AM. Reason: Add code tags and disable smilies for readability
  #2  
Old 11-07-2005
Registered User
 

Join Date: Sep 2005
Location: Chennai
Posts: 80
Look at the example given:
If the last line of #5 is displayed as "5;1855;;;COA1_HUMAN;53.30%",
shouldn't the last line of #7 be displayed as "7;1768;;;YKL203C;49%" instead of "7;1768;YKL203C;49%;;"?
  #3  
Old 11-08-2005
Registered User
 

Join Date: Jul 2005
Posts: 37
thx

No. The original file was:

empty empty record record for #5
record record empty empty for #7.

When I posted the records, the blank spaces were lost, but each one should be extracted as an empty field. Thanks again.
  #4  
Old 11-08-2005
Perderabo
Unix Daemon
 

Join Date: Aug 2001
Location: Washington DC Area
Posts: 8,656
This is harder than it looks because fields are defined both by syntax and position. Here is a ksh script that works with your sample data. But any surprises in your real data could break it.
Code:
#! /usr/bin/ksh

IFS=""    # empty IFS: read keeps each line whole, including leading blanks
while read line ; do
    line=${line##+(_)}
    ((${#line})) ||  continue
    if [[ "$line" != "Group of orthologs"* ]] ; then
        echo error looking for start of record 1>&2
        echo $line  1>&2
        exit 1
    fi
    line=${line#"Group of orthologs #"}
    Group_number=${line%%\.*}
    line=${line#*"Best score "}
    Best_Score=${line%" "*}
    read line
    if [[ $line != "Score difference with "* ]] ; then
        echo "error stepping over 2nd line of group $Group_number" 1>&2
        echo $line  1>&2
        exit 1
    fi
    ProteinLines=1
    while ((ProteinLines)) ; do
        if read line ; then
            line=${line##+(_)}
            if ((!${#line})) ; then
                ProteinLines=0
            else
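                # IFS is empty, so $line is not split here; eval re-parses it
                # and the shell's own tokenizing yields $1..$4.
                eval set $line
                # first character of the line (blank when the yeast columns are empty)
                firstchar="${line%${line#?}}"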
                if [[ $# -eq 4 ]] ; then
                    S_one=$1
                    P_one=$2
                    S_two=$3
                    P_two=$4
                else
                    if [[ $firstchar = [a-zA-Z0-9] ]] ; then
                        S_one=$1
                        P_one=$2
                        S_two=""
                        P_two=""
                    else
                        S_one=""
                        P_one=""
                        S_two=$1
                        P_two=$2
                    fi
                fi
                echo "${Group_number};${Best_Score};${S_one};${P_one};${S_two};${P_two};"
            fi
        else
            ProteinLines=0
        fi
    done
done
exit 0
Code:
$
$ ./pro < data
1;3010;YHR165C;100.00%;PRP8_HUMAN;100.00%;
2;2100;YLR106C;100.00%;MDN1_HUMAN;100.00%;
3;2082;YJL130C;100.00%;PYR1_HUMAN;100.00%;
4;1959;YKR054C;100.00%;DYHC_HUMAN;100.00%;
5;1855;YNR016C;100.00%;Q6KE87_HUMAN;100.00%;
5;1855;YMR207C;19.86%;COA2_HUMAN;90.52%;
5;1855;;;COA1_HUMAN;53.30%;
6;1838;YDL140C;100.00%;RPB1_HUMAN;100.00%;
7;1768;YJR066W;100.00%;Q4LE76_HUMAN;100.00%;
7;1768;YKL203C;49.22%;;;
$
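Since the thread title asks about awk, the same parse can be sketched in awk as well. This is only a rough sketch under the same assumptions as the ksh script: underscore separator lines, a "Group of orthologs" header, a "Score difference" line, and protein lines where a leading blank means the yeast columns are empty. The function name orthologs_to_csv is made up for illustration.

```shell
#!/bin/sh
# Sketch: convert the ortholog report to semicolon-separated rows with awk.
# Assumes the layout of the posted sample; real data may need more checks.
orthologs_to_csv() {
    awk '
    /^_+$/ || /^[ \t]*$/ { next }            # skip separators and blank lines
    /^Group of orthologs #/ {
        group = $4; gsub(/[#.]/, "", group)  # "#5." becomes "5"
        score = $7
        next
    }
    /^Score difference/ { next }
    NF == 4 { print group ";" score ";" $1 ";" $2 ";" $3 ";" $4; next }
    NF == 2 {
        if ($0 ~ /^[ \t]/)                   # leading blank: yeast columns empty
            print group ";" score ";;;" $1 ";" $2
        else                                 # otherwise: human columns empty
            print group ";" score ";" $1 ";" $2 ";;"
        next
    }
    { print "unexpected line " NR ": " $0 > "/dev/stderr"; exit 1 }
    ' "$@"
}
```

It would be called as `orthologs_to_csv data` or `orthologs_to_csv < data`. Like the ksh version, it decides which pair is missing purely from the first character of the line, so it is just as sensitive to surprises in the real data.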
  #5  
Old 11-09-2005
Registered User
 

Join Date: Sep 2005
Location: Chennai
Posts: 80
Here's a command-line Perl version:

$ perl -ne 'chomp; my @f = split;
> if (@f && $f[0] eq "Group")
> { ($group = $f[3]) =~ s/^#|\.$//g; $score = $f[6]; }
> elsif (/\S/ && !/^_+$/ && $f[0] ne "Score")
> { if (@f == 2) { @f = /^\s/ ? ("", "", @f) : (@f, "", ""); }
> if (@f == 3) { unshift(@f, ""); }
> print join(";", $group, $score, @f), "\n"; }' file_name


Assumption(s):
Your records can have at most 4 elements.
That is,
record/blank record/blank record/blank record/blank
If you can tell me whether these are tab separated, I can help with more robust code.
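If the columns do turn out to be tab separated, the position of each value can be read directly instead of guessed from the field count. The sketch below assumes (unconfirmed by the original poster) the layout visible in the posted sample: yeast id, TAB, yeast percentage, two TABs, human id, TAB, human percentage, with space padding inside the columns. The name orthologs_by_tab is hypothetical.

```shell
#!/bin/sh
# Sketch: same conversion, but locating columns by tab position.
# Assumed layout: yeast id <TAB> yeast % <TAB><TAB> human id <TAB> human %
orthologs_by_tab() {
    awk -F '\t' '
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    /^_+$/ || /^[ \t]*$/ || /^Score difference/ { next }
    /^Group of orthologs #/ {
        split($0, w, " ")                    # header line is space separated
        group = w[4]; gsub(/[#.]/, "", group)
        score = w[7]
        next
    }
    {
        # Columns 1,2 are yeast; 4,5 are human; missing ones come out empty.
        print group ";" score ";" trim($1) ";" trim($2) ";" trim($4) ";" trim($5)
    }
    ' "$@"
}
```

With this layout a line holding only the yeast pair simply has no fourth and fifth field, and a line holding only the human pair has blank first and second fields, so no special cases are needed.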
  #6  
Old 11-09-2005
Registered User
 

Join Date: Sep 2005
Location: Chennai
Posts: 80
And as Perderabo says, any real surprises in the data could break it!
(Note that Perderabo's code generates trailing semicolons, which you probably do not need.)
  #7  
Old 11-09-2005
Perderabo
Unix Daemon
 

Join Date: Aug 2001
Location: Washington DC Area
Posts: 8,656
Quote:
Originally Posted by Abhishek Ghose
(Note that Perderabo's code generates trailing semicolons, which you probably do not need.)
Oops... to lose that trailing semicolon, change the echo statement to:
echo "${Group_number};${Best_Score};${S_one};${P_one};${S_two};${P_two}"
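If output has already been generated with the old echo, the extra semicolon could also be stripped afterwards with sed, which the thread title asks about. A minimal sketch (the function name is made up for illustration):

```shell
#!/bin/sh
# Sketch: remove exactly one trailing semicolon from each line.
# Empty trailing fields (";;") are kept; only the final extra ";" is dropped.
strip_trailing_semicolon() {
    sed 's/;$//' "$@"
}
```

For example, `./pro < data | strip_trailing_semicolon` would turn "7;1768;YKL203C;49.22%;;;" into "7;1768;YKL203C;49.22%;;".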