Can this be solved with awk and sed?

11-07-2005

Hi Masters,

Code:

___________________________________________________________________________________
Group of orthologs #1. Best score 3010 bits
Score difference with first non-orthologous sequence - yeast:3010   human:2754
YHR165C             	100.00%		PRP8_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #2. Best score 2100 bits
Score difference with first non-orthologous sequence - yeast:2033   human:1978
YLR106C             	100.00%		MDN1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #3. Best score 2082 bits
Score difference with first non-orthologous sequence - yeast:997   human:593
YJL130C             	100.00%		PYR1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #4. Best score 1959 bits
Score difference with first non-orthologous sequence - yeast:1959   human:1007
YKR054C             	100.00%		DYHC_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #5. Best score 1855 bits
Score difference with first non-orthologous sequence - yeast:1855   human:1022
YNR016C             	100.00%		Q6KE87_HUMAN        	100.00%
YMR207C             	19.86%		COA2_HUMAN          	90.52%
                    	       		COA1_HUMAN          	53.30%
___________________________________________________________________________________
Group of orthologs #6. Best score 1838 bits
Score difference with first non-orthologous sequence - yeast:1748   human:1767
YDL140C             	100.00%		RPB1_HUMAN          	100.00%
___________________________________________________________________________________
Group of orthologs #7. Best score 1768 bits
Score difference with first non-orthologous sequence - yeast:1768   human:1636
YJR066W             	100.00%		Q4LE76_HUMAN        	100.00%
YKL203C             	49.22%

The records above are part of a file. I need to extract the information from this file and put it into a spreadsheet format, like this (examples from #5 and #7 above):

Group_number; Best_Score; S_one; P_one; S_two; P_two
5;1855;YNR016C;100.00%;Q6KE87_HUMAN;100.00%
5;1855;YMR207C;19.86%;COA2_HUMAN;90.52%
5;1855;;;COA1_HUMAN;53.30%
7;1768;YJR066W;100.00%;Q4LE76_HUMAN;100.00%
7;1768;YKL203C;49%;;

Thanks in Advance!

Last edited by Perderabo; 11-08-2005 at 11:41 AM.. Reason: Add code tags and disable smilies for readability

mskcc

11-08-2005

Look at the example given:
if the last line of 5 is displayed as "5;1855;;;COA1_HUMAN;53.30%",
shouldn't the last line of 7 be displayed as "7;1768;;;YKL203C;49%" instead of "7;1768;YKL203C;49%;;"?

Abhishek Ghose

11-08-2005

thx

No. In the original file the layout was:

empty empty record record for #5
record record empty empty for #7.

When I posted the records, the empty spaces were lost, but each one should be extracted as an empty field. Thanks again.

mskcc

11-09-2005

This is harder than it looks because fields are defined both by syntax and position. Here is a ksh script that works with your sample data. But any surprises in your real data could break it.

Code:

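#! /usr/bin/ksh

# Empty IFS so that read keeps leading whitespace; position matters below.
IFS=""
while read -r line ; do
    # Strip the underscore separator; skip the line if nothing is left.
    line=${line##+(_)}
    ((${#line})) ||  continue
    if [[ "$line" != "Group of orthologs"* ]] ; then
        echo error looking for start of record 1>&2
        echo "$line"  1>&2
        exit 1
    fi
    # Pull the group number and the best score out of the header line.
    line=${line#"Group of orthologs #"}
    Group_number=${line%%\.*}
    line=${line#*"Best score "}
    Best_Score=${line%" "*}
    # The second line of each group carries nothing we need.
    read -r line
    if [[ $line != "Score difference with "* ]] ; then
        echo "error stepping over 2nd line of group $Group_number" 1>&2
        echo "$line"  1>&2
        exit 1
    fi
    # Read protein lines until the next separator or end of file.
    ProteinLines=1
    while ((ProteinLines)) ; do
        if read -r line ; then
            line=${line##+(_)}
            if ((!${#line})) ; then
                ProteinLines=0
            else
                # Split the line into words; eval is needed because IFS is empty.
                eval set $line
                firstchar="${line%${line#?}}"
                if [[ $# -eq 4 ]] ; then
                    S_one=$1
                    P_one=$2
                    S_two=$3
                    P_two=$4
                else
                    # Fewer fields: a leading non-blank character means the
                    # first pair is present, otherwise the second pair is.
                    if [[ $firstchar = [a-zA-Z0-9] ]] ; then
                        S_one=$1
                        P_one=$2
                        S_two=""
                        P_two=""
                    else
                        S_one=""
                        P_one=""
                        S_two=$1
                        P_two=$2
                    fi
                fi
                echo "${Group_number};${Best_Score};${S_one};${P_one};${S_two};${P_two};"
            fi
        else
            ProteinLines=0
        fi
    done
done
exit 0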

Code:

$
$ ./pro < data
1;3010;YHR165C;100.00%;PRP8_HUMAN;100.00%;
2;2100;YLR106C;100.00%;MDN1_HUMAN;100.00%;
3;2082;YJL130C;100.00%;PYR1_HUMAN;100.00%;
4;1959;YKR054C;100.00%;DYHC_HUMAN;100.00%;
5;1855;YNR016C;100.00%;Q6KE87_HUMAN;100.00%;
5;1855;YMR207C;19.86%;COA2_HUMAN;90.52%;
5;1855;;;COA1_HUMAN;53.30%;
6;1838;YDL140C;100.00%;RPB1_HUMAN;100.00%;
7;1768;YJR066W;100.00%;Q4LE76_HUMAN;100.00%;
7;1768;YKL203C;49.22%;;;
$
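
Since the thread title asks about awk, the position rule described above (a line that starts with a blank has an empty first pair) could be sketched in awk roughly as follows. This is only a sketch under assumptions: the function name `orthologs_to_csv` is made up for illustration, data lines are assumed to hold either 2 or 4 whitespace-separated fields, and it has only been reasoned through against the sample in this thread, so real data may need more checks.

```shell
#!/bin/sh
# Sketch only: orthologs_to_csv is a hypothetical helper name.
# Assumes every data line has 2 or 4 whitespace-separated fields and that
# a line starting with a blank has an empty yeast (left) column pair.
orthologs_to_csv() {
    awk '
        /^_+$/ || /^[ \t]*$/ || /^Score difference/ { next }   # nothing to extract
        /^Group of orthologs #/ {
            group = $4                  # e.g. "#5."
            gsub(/[#.]/, "", group)     # keep only the number
            score = $7                  # value after "Best score"
            next
        }
        {
            if (NF == 4)
                row = $1 ";" $2 ";" $3 ";" $4
            else if ($0 ~ /^[ \t]/)     # left pair missing
                row = ";;" $1 ";" $2
            else                        # right pair missing
                row = $1 ";" $2 ";;"
            print group ";" score ";" row
        }
    ' "$@"
}

# Usage: orthologs_to_csv data
```

Unlike the ksh script, this sketch does not stop with an error when a record is malformed; it simply prints whatever fields it finds.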

Perderabo

11-09-2005

Here is a command-line Perl version:

$ perl -ne 'chomp; @f = split;
next if !@f || $f[0] eq "Score" || /^_+$/;
if ($f[0] eq "Group")
{ $group = substr($f[3], 1, length($f[3]) - 2); $score = $f[6]; next; }
if (@f == 2) { @f = /^\s/ ? ("", "", @f) : (@f, "", ""); }
if (@f == 3) { unshift(@f, ""); }
print join(";", $group, $score, @f), "\n";' file_name

Assumption(s):
Your records can have at most 4 elements, that is:
record/blank record/blank record/blank record/blank
If you can tell me whether these are tab-separated, I can help with more robust code.
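
If the file really is tab-separated in the way the posted sample appears to be (name, tab, percentage, two tabs, name, tab, percentage), the columns could be picked by tab position instead of by counting words. The awk sketch below assumes exactly that layout, which the thread never confirms; `orthologs_by_tab` is a hypothetical name chosen for illustration.

```shell
#!/bin/sh
# Sketch only: orthologs_by_tab is a hypothetical helper name.
# Assumes the layout of the posted sample: name TAB percent TAB TAB name TAB percent,
# so splitting on tabs puts the values in fields 1, 2, 4 and 5.
orthologs_by_tab() {
    awk -F '\t' '
        # Remove the space padding around a field.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        /^_+$/ || /^[ \t]*$/ || /^Score difference/ { next }
        /^Group of orthologs #/ {
            split($0, w, " ")           # " " splits on runs of whitespace
            group = w[4]; gsub(/[#.]/, "", group)
            score = w[7]
            next
        }
        { print group ";" score ";" trim($1) ";" trim($2) ";" trim($4) ";" trim($5) }
    ' "$@"
}

# Usage: orthologs_by_tab file_name
```

With this approach an empty column stays empty because of where it sits between the tabs, so no guess based on the first character of the line is needed. If the real file pads with spaces only, the fields will not line up and this sketch will not apply.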

Abhishek Ghose

11-09-2005

And as Perderabo says, any surprises in the real data could break it!
(Note that Perderabo's code generates trailing semicolons, which you probably do not need.)

Abhishek Ghose

11-09-2005

Quote:

Originally Posted by Abhishek Ghose

(Note that Perderabo's code generates trailing semicolons, which you probably do not need.)

Oops... to lose that trailing semicolon, change the echo statement to:
echo "${Group_number};${Best_Score};${S_one};${P_one};${S_two};${P_two}"

Perderabo
