Can this be solved with awk and sed?


 
# 1  
Old 11-07-2005
Can this be solved with awk and sed?

Hi Masters,

Code:
___________________________________________________________________________________
Group of orthologs #1. Best score 3010 bits
Score difference with first non-orthologous sequence - yeast:3010   human:2754
YHR165C             	100.00%		PRP8_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #2. Best score 2100 bits
Score difference with first non-orthologous sequence - yeast:2033   human:1978
YLR106C             	100.00%		MDN1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #3. Best score 2082 bits
Score difference with first non-orthologous sequence - yeast:997   human:593
YJL130C             	100.00%		PYR1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #4. Best score 1959 bits
Score difference with first non-orthologous sequence - yeast:1959   human:1007
YKR054C             	100.00%		DYHC_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #5. Best score 1855 bits
Score difference with first non-orthologous sequence - yeast:1855   human:1022
YNR016C             	100.00%		Q6KE87_HUMAN        	100.00%
YMR207C             	19.86%		COA2_HUMAN          	90.52%
                    	       		COA1_HUMAN          	53.30%
___________________________________________________________________________________
Group of orthologs #6. Best score 1838 bits
Score difference with first non-orthologous sequence - yeast:1748   human:1767
YDL140C             	100.00%		RPB1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #7. Best score 1768 bits
Score difference with first non-orthologous sequence - yeast:1768   human:1636
YJR066W             	100.00%		Q4LE76_HUMAN        	100.00%
YKL203C             	49.22%

The records above are part of a file. I need to extract the information from this file and put it into a spreadsheet format, like this (examples taken from #5 and #7 above):

Group_number; Best_Score; S_one; P_one; S_two; P_two
5;1855;YNR016C;100.00%;Q6KE87_HUMAN;100.00%
5;1855;YMR207C;19.86%;COA2_HUMAN;90.52%
5;1855;;;COA1_HUMAN;53.30%
7;1768;YJR066W;100.00%;Q4LE76_HUMAN;100.00%
7;1768;YKL203C;49%;;

Thanks in Advance!

Last edited by Perderabo; 11-08-2005 at 11:41 AM.. Reason: Add code tags and disable smilies for readability
# 2  
Old 11-08-2005
Look at the example given:
If the last line of #5 is displayed as "5;1855;;;COA1_HUMAN;53.30%",
shouldn't the last line of #7 be displayed as "7;1768;;;YKL203C;49%" instead of "7;1768;YKL203C;49%;;"?
# 3  
Old 11-08-2005
thx

No. In the original file the layout was:

empty empty record record for #5
record record empty empty for #7.

When I posted the records, the leading blank columns were lost. An empty column in the input should still be extracted as an empty field. Thanks again.
# 4  
Old 11-09-2005
This is harder than it looks because fields are defined both by syntax and position. Here is a ksh script that works with your sample data. But any surprises in your real data could break it.
Code:
#! /usr/bin/ksh

IFS=""
while read line ; do
    line=${line##+(_)}
    ((${#line})) ||  continue
    if [[ "$line" != "Group of orthologs"* ]] ; then
        echo error looking for start of record 1>&2
        echo $line  1>&2
        exit 1
    fi
    line=${line#"Group of orthologs #"}
    Group_number=${line%%\.*}
    line=${line#*"Best score "}
    Best_Score=${line%" "*}
    read line
    if [[ $line != "Score difference with "* ]] ; then
        echo "error stepping over 2nd line of group $Group_number" 1>&2
        echo $line  1>&2
        exit 1
    fi
    ProteinLines=1
    while ((ProteinLines)) ; do
        if read line ; then
            line=${line##+(_)}
            if ((!${#line})) ; then
                ProteinLines=0
            else
                eval set $line
                firstchar="${line%${line#?}}"
                if [[ $# -eq 4 ]] ; then
                    S_one=$1
                    P_one=$2
                    S_two=$3
                    P_two=$4
                else
                    if [[ $firstchar = [a-zA-Z0-9] ]] ; then
                        S_one=$1
                        P_one=$2
                        S_two=""
                        P_two=""
                    else
                        S_one=""
                        P_one=""
                        S_two=$1
                        P_two=$2
                    fi
                fi
                echo "${Group_number};${Best_Score};${S_one};${P_one};${S_two};${P_two};"
            fi
        else
            ProteinLines=0
        fi
    done
done
exit 0

Code:
$
$ ./pro < data
1;3010;YHR165C;100.00%;PRP8_HUMAN;100.00%;
2;2100;YLR106C;100.00%;MDN1_HUMAN;100.00%;
3;2082;YJL130C;100.00%;PYR1_HUMAN;100.00%;
4;1959;YKR054C;100.00%;DYHC_HUMAN;100.00%;
5;1855;YNR016C;100.00%;Q6KE87_HUMAN;100.00%;
5;1855;YMR207C;19.86%;COA2_HUMAN;90.52%;
5;1855;;;COA1_HUMAN;53.30%;
6;1838;YDL140C;100.00%;RPB1_HUMAN;100.00%;
7;1768;YJR066W;100.00%;Q4LE76_HUMAN;100.00%;
7;1768;YKL203C;49.22%;;;
$
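Since the original question asked about awk, here is a minimal awk sketch of the same idea, not a tested replacement for the ksh script. The function name orthologs_to_csv is hypothetical. It assumes that a two-field protein line belongs to the second (human) column pair exactly when the line starts with whitespace, which is the "position" part of the problem described above. It omits the trailing semicolon and the header row.

```shell
#!/bin/sh
# orthologs_to_csv: hypothetical helper name, not from the thread.
# Reads the report on stdin, writes one semicolon-separated row per protein line.
orthologs_to_csv() {
    awk '
    /^_+[ \t]*$/ || /^[ \t]*$/ { next }      # skip separator and blank lines
    /^Group of orthologs #/ {
        group = $4                           # e.g. "#5."
        gsub(/[#.]/, "", group)              # becomes "5"
        score = $7                           # the number after "Best score"
        next
    }
    /^Score difference/ { next }             # second header line is not needed
    NF == 4 { print group ";" score ";" $1 ";" $2 ";" $3 ";" $4; next }
    # Assumption: leading whitespace means the first column pair is empty.
    NF == 2 && /^[ \t]/ { print group ";" score ";;;" $1 ";" $2; next }
    NF == 2 { print group ";" score ";" $1 ";" $2 ";;"; next }
    { print "unexpected line " NR ": " $0 > "/dev/stderr" }
    '
}
```

It would be called as orthologs_to_csv < data. As with the ksh script, any line shape not seen in the sample (for example three fields) is only reported on stderr, not handled.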

# 5  
Old 11-09-2005
Here's a version using command-line Perl:

$ perl -ne 'chomp; @F = split;
> if (@F && $F[0] eq "Group")
> { $group = substr($F[3], 1, length($F[3]) - 2); $score = $F[6]; }
> elsif (!/^[\s_]*$/ && $F[0] ne "Score")
> { if (@F == 2) { @F = /^\s/ ? ("", "", @F) : (@F, "", ""); }
> if (@F == 3) { unshift(@F, ""); }
> print join(";", $group, $score, @F), "\n"; }' file_name


Assumptions:
Each record line can have at most 4 elements, that is:
record/blank record/blank record/blank record/blank
If you can tell me whether these are tab-separated, I can help with more robust code.
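If the columns do turn out to be tab-separated, the position of each value can be read directly from the field number instead of being guessed. The following is only a sketch under that assumption: it supposes each protein line is laid out as name, TAB, percent, TAB, TAB, name, TAB, percent, with space padding inside the fields, as the posted sample appears to be. The function name orthologs_by_tab is hypothetical.

```shell
#!/bin/sh
# orthologs_by_tab: hypothetical helper name, not from the thread.
# Assumes protein lines look like: name<TAB>percent<TAB><TAB>name<TAB>percent
orthologs_by_tab() {
    awk -F'\t' '
    # Strip the space padding around a field.
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    /^Group of orthologs #/ {
        split($0, w, " ")                    # split the header on whitespace
        group = w[4]; gsub(/[#.]/, "", group)
        score = w[7]
        next
    }
    /^_+ *$/ || /^[ \t]*$/ || /^Score difference/ { next }
    # Fields 1,2 are the first pair; field 3 is the empty gap; 4,5 the second pair.
    { print group ";" score ";" trim($1) ";" trim($2) ";" trim($4) ";" trim($5) }
    '
}
```

With this layout a missing pair simply shows up as empty fields, so no special case is needed for lines like the last ones of #5 and #7.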
# 6  
Old 11-09-2005
And as Perderabo says, any real surprises in the data could break it!
(Note that Perderabo's code generates trailing semicolons, which you probably do not need.)
# 7  
Old 11-09-2005
Quote:
Originally Posted by Abhishek Ghose
(Note that Perderabo's code generates trailing semicolons, which you probably do not need.)
Oops... to lose that trailing semicolon, change the echo statement to:
echo "${Group_number};${Best_Score};${S_one};${P_one};${S_two};${P_two}"
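If output with the trailing semicolon has already been generated, sed can remove it afterwards instead of rerunning the script. This is a small sketch; the function name strip_trailing_semicolon is hypothetical.

```shell
#!/bin/sh
# strip_trailing_semicolon: hypothetical helper name, not from the thread.
# Removes exactly one semicolon at the end of each line read from stdin.
strip_trailing_semicolon() {
    sed 's/;$//'
}
```

For example, ./pro < data | strip_trailing_semicolon. Only the final semicolon is dropped, so the empty fields in a line such as the last one of group 7 are kept.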