awk to combine lines if fields match in lines

Tags
awk, solved

    #8  
Old Unix and Linux 05-14-2017   -   Original Discussion by cmccabe
RavinderSingh13 RavinderSingh13 is offline Forum Advisor  
Registered User
 
Join Date: May 2013
Last Activity: 22 November 2017, 12:40 PM EST
Location: Chennai
Posts: 2,670
Thanks: 588
Thanked 1,272 Times in 1,145 Posts
Hello cmccabe,

When I run my script I get the following output, in which all 4 lines containing the string Fusion appear.

Code:
chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr15:88483984-chr12:12006495 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr15:88483984-chr12:12022903 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr17    7577108    COSM10749;COSM43737    C    A,T    149.594    PASS    AF=0.0830415,0.0;AO=372,2;DP=4420;FAO=166,0;FDP=1999;FR=.,.,REALIGNEDx0.0865;FRO=1833;FSAF=82,0;FSAR=84,0;FSRF=952;FSRR=881;FWDB=0.0072184,-0.0207142;FXX=4.99998E-4;HRUN=1,1;LEN=1,1;MLLD=293.795,80.5366;OALT=A,T;OID=COSM10749,COSM43737;OMAPALT=A,T;OPOS=7577108,7577108;OREF=C,C;PB=.,.;PBP=.,.;QD=0.299338;RBI=0.00721997,0.02565;REFB=1.40155E-4,-7.81395E-4;REVB=1.50579E-4,0.0151276;RO=4043;SAF=187,1;SAR=185,1;SRF=2118;SRR=1925;SSEN=0,0;SSEP=0,0;SSSB=-0.0251826,-5.12306E-4;STB=0.52327,0.5;STBP=0.541,1.0;TYPE=snp,snp;VARB=-0.00153404,0.0;HS;FUNC=[{'origPos':'7577108','origRef':'C','normalizedRef':'C','gene':'TP53','normalizedPos':'7577108','normalizedAlt':'A','polyphen':'1.0','gt':'pos','codon':'TTT','coding':'c.830G>T','sift':'0.0','grantham':'205.0','transcript':'NM_000546.5','function':'missense','protein':'p.Cys277Phe','location':'exonic','origAlt':'A','exon':'8','oncomineGeneClass':'Loss-of-Function','oncomineVariantClass':'Hotspot'}]    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT    0/1:149:4420:1999:4043:1833:372,2:166,0:0.0830415,0.0:185,1:187,1:2118:1925:84,0:82,0:952:881:1
chr10    89624278    .    G    T    62.8836    PASS    AF=0.0785393;AO=297;DP=4155;FAO=157;FDP=1999;FR=.;FRO=1842;FSAF=77;FSAR=80;FSRF=908;FSRR=934;FWDB=0.0113997;FXX=4.99998E-4;HRUN=1;LEN=1;MLLD=117.237;OALT=T;OID=.;OMAPALT=T;OPOS=89624278;OREF=G;PB=.;PBP=.;QD=0.12583;RBI=0.040843;REFB=5.39678E-4;REVB=-0.0392199;RO=3844;SAF=150;SAR=147;SRF=1936;SRR=1908;SSEN=0;SSEP=0;SSSB=0.00159791;STB=0.502301;STBP=0.96;TYPE=snp;VARB=-0.00676678;FUNC=[{'origPos':'89624278','origRef':'G','normalizedRef':'G','gene':'PTEN','normalizedPos':'89624278','normalizedAlt':'T','gt':'pos','codon':'TAG','coding':'c.52G>T','transcript':'NM_000314.4','function':'nonsense','protein':'p.Glu18Ter','location':'exonic','origAlt':'T','exon':'1'}]    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR:QT    0/1:62:4155:1999:3844:1842:297:157:0.0785393:147:150:1936:1908:80:77:908:934:1

Thanks,
R. Singh
    #9  
Old Unix and Linux 05-14-2017   -   Original Discussion by cmccabe
cmccabe cmccabe is offline
Registered User
 
Join Date: Nov 2013
Last Activity: 17 November 2017, 8:12 AM EST
Location: Chicago
Posts: 1,188
Thanks: 713
Thanked 14 Times in 13 Posts
Yes, I see now... it is the awk -F'\t' '!seen[$4]++' that is removing the 4th line.


Code:
chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr15:88483984-chr12:12006495 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr15:88483984-chr12:12022903 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833

Since the first two lines have the same $4 value, only one of them needs to be printed.
The last two lines also share a $4 value, so the same applies to them. This should only apply to lines where SVTYPE=Fusion.

Would adding the a[$4]++ line to the code below remove the duplicates only in the lines where SVTYPE=Fusion? Thank you very much.

Code:
   };
                        match($0,/oncomineGeneClass.*,/);
                        print "Locus\tType\tFunction\tGene\tReads"
                        print $1":"$2 "-" VAL OFS svtype OFS substr($0,RSTART+20,RLENGTH-22) OFS $3 OFS read_count;
                                   '{ if (a[$4]++ == 0) print $0; }' "$@"
            next
                    }
    1

desired output

Code:
chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr17 entire line... (this line does not have the keyword in it, so it is printed in the output)
chr10 entire line... (this line does not have the keyword in it, so it is printed in the output)
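The behaviour asked for above can be sketched on its own, separately from the full script. This is only a minimal illustration, assuming whitespace-separated fields and already-formatted lines: the function name `dedupe_fusion` is made up for this example, and the two non-Fusion lines are shortened stand-ins for the long chr17 and chr10 records.

```shell
# Hypothetical helper: drop repeated $4 values, but only on lines containing "Fusion".
dedupe_fusion() {
    awk '
        /Fusion/ {                 # only Fusion lines are candidates for de-duplication
            if (!seen[$4]++)       # true only the first time this $4 value appears
                print
            next                   # never fall through to the catch-all rule below
        }
        1                          # every other line is printed unchanged
    '
}

# The four Fusion lines from the thread plus two shortened non-Fusion lines.
printf '%s\n' \
    'chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868' \
    'chr15:88483984-chr12:12006495 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868' \
    'chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833' \
    'chr15:88483984-chr12:12022903 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833' \
    'chr17 7577108 COSM10749;COSM43737 C A,T' \
    'chr10 89624278 . G T' |
    dedupe_fusion
```

Putting the `seen` test inside the `/Fusion/` block is what limits the de-duplication to those lines; a bare `!seen[$4]++` rule would also drop non-Fusion lines that happen to share a 4th field.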

    #10  
Old Unix and Linux 05-15-2017   -   Original Discussion by cmccabe
RavinderSingh13 RavinderSingh13 is offline Forum Advisor  
Hello cmccabe,

Based on the sample output you showed, could you please try the following and let me know if it helps.

Code:
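awk '/SVTYPE=Fusion/{
        # Locate the span from the first "]" to the last "]" in field 5.
        match($5,/].*]/);
        # Clean up field 3: drop everything from the character before "COS" onward,
        # remove the first "-", then turn the first "." into "-".
        sub(/.COS.*/,"",$3);
        sub(/-/,"",$3);
        sub(/\./,"-",$3);
        # Keep only the text between the two "]" characters.
        VAL=substr($5,RSTART+1,RLENGTH-2);
        # Pull the SVTYPE and READ_COUNT values out of the ";"-separated field 8.
        num=split($8, array,";");
        for(i=1;i<=num;i++){
                if(array[i] ~ /SVTYPE/){
                        sub(/.*=/,"",array[i]);
                        svtype=array[i]
                };
                if(array[i] ~ /READ_COUNT/){
                        sub(/.*=/,"",array[i]);
                        read_count=array[i]
                }
        };
        match($0,/oncomineGeneClass.*,/);
        # Print a Fusion line only the first time its cleaned-up field 3 is seen.
        if(++A[$3]==1){
                print $1":"$2 "-" VAL OFS svtype OFS substr($0,RSTART+20,RLENGTH-22) OFS $3 OFS read_count;
        }
        next
}
1
'    Input_file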

Thanks,
R. Singh
    #11  
Old Unix and Linux 05-15-2017   -   Original Discussion by cmccabe
cmccabe cmccabe is offline
Thank you as always for the great help and explanations.

---------- Post updated at 07:30 AM ---------- Previous update was at 07:00 AM ----------

I am still a bit confused on this concept:

Quote:
#### Store in VAL the substring of the 5th field that starts at position RSTART+1 and is RLENGTH-2 characters long. Note that RSTART and RLENGTH are built-in awk variables that are set when match() finds a match; here the match was done on the 5th field above.
I guess my confusion is what defines RSTART+1 and RLENGTH-2 in this line but RSTART+20 and RLENGTH-22 in the print?

In the first match, $5 was the string being searched, so chr15:88483984. Is RSTART+1 then the chr15 part and RLENGTH-2 the :88483984 part? Thank you.
    #12  
Old Unix and Linux 05-15-2017   -   Original Discussion by cmccabe
RavinderSingh13 RavinderSingh13 is offline Forum Advisor  
Quote:
Originally Posted by cmccabe
Thank you as always for the great help and explanations.
I am still a bit confused on this concept:
I guess my confusion is what defines RSTART+1 and RLENGTH-2 in this line but RSTART+20 and RLENGTH-22 in the print?
In the first match, $5 was the string being searched, so chr15:88483984. Is RSTART+1 then the chr15 part and RLENGTH-2 the :88483984 part? Thank you.
Hello cmccabe,

No need to be confused.

I will give some guidance here that you can experiment with. After a successful match(), RSTART holds the index at which the regex match starts and RLENGTH holds the length of the matched text (not the index of its last character). If you print the substring using just RSTART and RLENGTH, you will see some extra characters that are not part of your requirement. The + and - adjustments move the starting point forward and shorten the length so that only the required text is printed. I hope that is clear; if you still have doubts, let me know and I will try to explain further.
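To see the adjustment concretely, here is a small sketch that can be run by itself. The two sample strings and the function names `show_match` and `gene_class` are made up for illustration, since the real Fusion input lines are not shown in the thread; only the regexes and offsets are the ones used in the script.

```shell
# Hypothetical sample strings; the real Fusion input lines are not shown in the thread.
alt='G]chr15:88483984]'
func="'oncomineGeneClass':'Gain-of-Function',"

# What match() records, and what substr() returns with and without the adjustment.
show_match() {
    awk '{
        match($0, /].*]/)
        print "RSTART=" RSTART " RLENGTH=" RLENGTH
        print substr($0, RSTART, RLENGTH)            # the whole match, brackets included
        print substr($0, RSTART + 1, RLENGTH - 2)    # start 1 later, 2 shorter: brackets dropped
    }'
}

# Same idea with bigger offsets: +20 skips the 20-character label before the value,
# and -22 removes that label plus the closing quote and comma from the length.
gene_class() {
    awk '{
        match($0, /oncomineGeneClass.*,/)
        print substr($0, RSTART + 20, RLENGTH - 22)
    }'
}

echo "$alt"  | show_match
echo "$func" | gene_class
```

With these sample strings the first function reports RSTART=2 and RLENGTH=16, then prints `]chr15:88483984]` and `chr15:88483984`; the second prints `Gain-of-Function`. The offsets are simply counts of the unwanted characters on each side of the match, so they depend on the regex used.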

Thanks,
R. Singh
    #13  
Old Unix and Linux 05-17-2017   -   Original Discussion by cmccabe
cmccabe cmccabe is offline
Thank you very much, that helps a lot.