awk to remove field and match strings to add text


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to remove field and match strings to add text
# 1  
Old 05-28-2016
awk to remove field and match strings to add text

In file1 field $18 is removed.... column header is "Otherinfo", then each line in file1 is used to search file2 for a match. When a match is found the last four strings in file2 are copied to file1.

Maybe:
cut -f1-17 file1 and then match each line to file2

file1
Code:
Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    common    Otherinfo
chr1    949654    949654    A    G    exonic    ISG15    .    synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96    .    .    .    .    .    .    1    3825.28    624    chr1    949654    .    A    G    3825.28    PASS    AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads
chr1    977330    977330    T    C    intronic    AGRN    .    .    .    1.    Benign    not_specified    RCV000116254.1    MedGen    CN169374    Common    1    1555.72    163    chr1    977330    .    T    C    1555.72    PASS    AF=1;AO=163;DP=163;FAO=163;FDP=163;FR=.;FRO=0;FSAF=65;FSAR=98;FSRF=0;FSRR=0;FWDB=0.0181869;FXX=0;HRUN=1;LEN=1;MLLD=117.006;OALT=C;OID=.;OMAPALT=C;OPOS=977330;OREF=T;PB=0.5;PBP=1;QD=38.1771;RBI=0.0262515;REFB=0;REVB=-0.0189308;RO=0;SAF=65;SAR=98;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=6.62803e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-1.32905e-06;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:74:163:163:0:0:163:163:1:98:65:0:0:98:65:0:0 GOOD 163 reads
chr1    981931    981931    A    G    exonic    AGRN    .    synonymous SNV    AGRN:NM_198576:exon18:c.3066A>G:p.S1022S    0.96    Benign    not_specified    RCV000116259.1    MedGen    CN169374    Common    1    915.076    96    chr1    981931    .    A    G    915.076    PASS    AF=1;AO=96;DP=96;FAO=96;FDP=96;FR=.;FRO=0;FSAF=34;FSAR=62;FSRF=0;FSRR=0;FWDB=0.0308666;FXX=0;HRUN=1;LEN=1;MLLD=62.5559;OALT=G;OID=.;OMAPALT=G;OPOS=981931;OREF=A;PB=0.5;PBP=1;QD=38.1282;RBI=0.0605651;REFB=0;REVB=-0.0521094;RO=0;SAF=34;SAR=62;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-4.95992e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-0.000155929;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:43:96:96:0:0:96:96:1:62:34:0:0:62:34:0:0 GOOD 96 reads

file2
Code:
##.....
##.....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NS12911_BC1
chr1    949654    .    A    G    3825.28    PASS    AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads hom
chr1    977330    .    T    C    1555.72    PASS    AF=1;AO=163;DP=163;FAO=163;FDP=163;FR=.;FRO=0;FSAF=65;FSAR=98;FSRF=0;FSRR=0;FWDB=0.0181869;FXX=0;HRUN=1;LEN=1;MLLD=117.006;OALT=C;OID=.;OMAPALT=C;OPOS=977330;OREF=T;PB=0.5;PBP=1;QD=38.1771;RBI=0.0262515;REFB=0;REVB=-0.0189308;RO=0;SAF=65;SAR=98;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=6.62803e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-1.32905e-06;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:74:163:163:0:0:163:163:1:98:65:0:0:98:65:0:0 GOOD 163 reads hom
chr1    981931    .    A    G    915.076    PASS    AF=1;AO=96;DP=96;FAO=96;FDP=96;FR=.;FRO=0;FSAF=34;FSAR=62;FSRF=0;FSRR=0;FWDB=0.0308666;FXX=0;HRUN=1;LEN=1;MLLD=62.5559;OALT=G;OID=.;OMAPALT=G;OPOS=981931;OREF=A;PB=0.5;PBP=1;QD=38.1282;RBI=0.0605651;REFB=0;REVB=-0.0521094;RO=0;SAF=34;SAR=62;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-4.95992e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-0.000155929;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:43:96:96:0:0:96:96:1:62:34:0:0:62:34:0:0 GOOD 96 reads hom

desired output
Code:
Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    common    Otherinfo
chr1    949654    949654    A    G    exonic    ISG15    .    synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96    .    .    .    .    .    .     GOOD 399 reads hom
chr1    977330    977330    T    C    intronic    AGRN    .    .    .    1.    Benign    not_specified    RCV000116254.1    MedGen    CN169374    Common    GOOD 163 reads hom
chr1    981931    981931    A    G    exonic    AGRN    .    synonymous SNV    AGRN:NM_198576:exon18:c.3066A>G:p.S1022S    0.96    Benign    not_specified    RCV000116259.1    MedGen    CN169374    Common    GOOD 96 reads hom


Last edited by cmccabe; 05-28-2016 at 01:35 PM.. Reason: fixed format, added cut
# 2  
Old 05-28-2016
While it is true that there are 18 fields in the heading line in file1, there are a lot more fields on subsequent lines in that file. The cut command you suggested would copy the input file to its output and add 16 trailing <tab> characters to each input line. Using awk to delete field $18 (assuming we set the awk field delimiter to four <space> characters would seem to be a request to delete the header line completely ($otherinfo expanding to $0 unless otherinfo is defined to be some numeric value) and to remove the first field on your other sample lines (since field 18 on the other lines is in file1 is 1).

And you haven't said how you "match" 17 fields from file1 (after removing the last 13 fields) with lines from file2 containing 10 fields. Do you match lines from file1 (after skipping line1) with lines from file2 (after skipping lines that start with a "#") sequentially? Do you match lines with identical field 2 values in both files? Or, is there some other criteria used to determine a match?

From the sample input and output you provided, I would guess that you are trying to:
  1. Keep the header from file1 unchanged.
  2. Keep the 1st 17 fields from other lines in file1 unchanged; delete all other fields at the end of those lines; and if and only if there is a line in file2 with a field 2 that matches field 2 in this line, add the last four single <space> separated fields from the matched line in file2.
Is that an accurate statement of your requirements?

Last edited by Don Cragun; 05-28-2016 at 05:28 PM.. Reason: Add details about field counts in the two files.
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 05-28-2016
Yes, what you have stated above is correct. Thank you Smilie.
# 4  
Old 05-28-2016
The following should work as long as there aren't any spaces in fields 1 and 2 in file2:
Code:
awk '
BEGIN {	OFS = "    "
}
FNR == NR {
	if($0 !~ /^#/ )
		d[$2] = $(NF-3) " " $(NF-2) " " $(NF-1) " " $NF
	next
}
FNR > 1 {
	NF = 17
	$18 = d[$2]
}
1' file2 FS='    ' file1

If there could be spaces in those two fields, extracting the desired strings from file2 will be a little more complicated.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 05-28-2016
The awk is close but hopefully the below explanation helps. Both file1 and file2 are tab-delineated, since $1 and $2 of file1 match $1 and $2 in file2, the last 4 strings in file2 are copied to the end of the matching file1 line. In thiscase the last 4 strings are GOOD 399 reads hom and are in bold in file2. Thank you very much Smilie.

file1
Code:
Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    common
chr1    949654    949654    A    G    exonic    ISG15    .    synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96    .    .    .    .    .    .

file2
Code:
##.....
##.....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NS12911_BC1
chr1    949654    .    A    G    3825.28    PASS    AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads hom

desired output
Code:
Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene     GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene     PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID     common
chr1    949654    949654    A    G    exonic    ISG15    .     synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96     .    .    .    .    .    . GOOD 399 reads hom

# 6  
Old 05-28-2016
You say fields are <tab> delimited, but there are absolutely no <tab> characters in any of your sample inputs and outputs. In your sample inputs and outputs, the field delimiter is four adjacent <space> characters and some fields in addition contain some single space characters. Using the exact contents of what you provided in post #1 in this thread for file1 and file2 contents, the script I suggested produced the exact output that you said you wanted for the 1st, 2nd, and 4th output lines. The output my script put out exactly four <space> characters before the four single-<space> separated strings added to the end of the 3rd line where the output you said you wanted had five spaces instead of four at that location.

If your real data is <tab> separated instead of 4 adjacent <space> character separated as in the data you showed us, change the following two lines in the script I suggested:
Code:
BEGIN {	OFS = "    "
	and
1' file2 FS='    ' file1

to:
Code:
BEGIN {	OFS = "\t"
	and
1' file2 FS='\t' file1

respectively.
Note that this will put out a <tab> (not a <space>) before the four <space> separated strings are used to replace the "OtherInfo" data. And, as requested in post #1 in this thread, the "OtherInfo" heading field will be retained; not discarded.

If this is not what you want, please post sample inputs and outputs that match the description of the real data you are processing.
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 05-31-2016
That fixed it... thank you very much Smilie.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to print text in field if match and range is met

In the awk below I am trying to match the value in $4 of file1 with the split value from $4 in file2. I store the value of $4 in file1 in A and the split value (using the _ for the split) in array. I then strore the value in $2 as min, the value in $3 as max, and the value in $1 as chr. If A is... (6 Replies)
Discussion started by: cmccabe
6 Replies

2. Shell Programming and Scripting

awk to add text to matching pattern in field

In the awk I am trying to add :p.=? to the end of each $9 that matches the pattern NM_. The below executes andis close but I can not seem to figure out why the :p.=? repeats in the split as in the green in the current output. I have added comments as well. Thank you :). file ... (4 Replies)
Discussion started by: cmccabe
4 Replies

3. UNIX for Beginners Questions & Answers

Use strings from nth field from one file to match strings in entire line in another file, awk

I cannot seem to get what should be a simple awk one-liner to work correctly and cannot figure out why. I would like to use patterns from a specific field in one file as regex to search for matching strings in the entire line ($0) of another file. I would like to output the lines of File2 which... (1 Reply)
Discussion started by: jvoot
1 Replies

4. Shell Programming and Scripting

awk to match field between two files and use conditions on match

I am trying to look for $2 of file1 (skipping the header) in $2 of file2 (skipping the header) and if they match and the value in $10 is > 30 and $11 is > 49, then print the line from file1 to a output file. If no match is foung the line is not printed. Both the input and output are tab-delimited.... (3 Replies)
Discussion started by: cmccabe
3 Replies

5. Shell Programming and Scripting

Using awk to remove lines from file that match text

I am trying to remove each line in which $2 is FP or RFP. I believe the below will remove one instance but not both. Thank you :). file 12 123 FP 11 10 RFP awk awk -F'\t' ' $2 != "FP"' file desired output 12 11 (6 Replies)
Discussion started by: cmccabe
6 Replies

6. Shell Programming and Scripting

awk to parse field and include the text of 1 pipe in field 4

I am trying to parse the input in awk to include the |gc= in $4 but am not able to. The below is close: awk so far: awk '{sub(/\|]+]++/, ""); print }' input.txt Input chr1 955543 955763 AGRN-6|pr=2|gc=75 0 + chr1 957571 957852 AGRN-7|pr=3|gc=61.2 0 + chr1 970621 ... (7 Replies)
Discussion started by: cmccabe
7 Replies

7. Shell Programming and Scripting

AWK: Pattern match between 2 files, then compare a field in file1 as > or < field in file2

First, thanks for the help in previous posts... couldn't have gotten where I am now without it! So here is what I have, I use AWK to match $1 and $2 as 1 string in file1 to $1 and $2 as 1 string in file2. Now I'm wondering if I can extend this AWK command to incorporate the following: If $1... (4 Replies)
Discussion started by: right_coaster
4 Replies

8. Shell Programming and Scripting

awk, comma as field separator and text inside double quotes as a field.

Hi, all I need to get fields in a line that are separated by commas, some of the fields are enclosed with double quotes, and they are supposed to be treated as a single field even if there are commas inside the quotes. sample input: for this line, 5 fields are supposed to be extracted, they... (8 Replies)
Discussion started by: kevintse
8 Replies

9. Shell Programming and Scripting

awk or sed to add field in a text file

Hi there, I have a csv file with some columns comma sepated like this : 4502-17,PETER,ITA2,LEGUE,92,ME - HALF,23/05/10 15:00 4502-18,CARL,ITA2,LEGUE,96,ME - HALF,20/01/09 14:00 4502-19,OTTO,ITA2,LEGUE,97,ME - MARY,23/05/10 15:00 As you can see the column n. 7 is a timestamp column, I need... (23 Replies)
Discussion started by: capnino
23 Replies

10. Shell Programming and Scripting

Remove Carriage returns between strings in a field

Is there any way to remove carriage retuns between the records? We have input records separated by TABS and have carriage returns as below: 123 456 789 ABC "1952.00" 678 "abcdef ghik lmno" Above we... (10 Replies)
Discussion started by: acheepi
10 Replies
Login or Register to Ask a Question