awk to format file with conditional split


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to format file with conditional split
# 1  
Old 06-26-2017
awk to format file with conditional split

In the awk below I am splitting $7 on the : (colon) then - (hyphen) as array a. The word chr is printed at the start of every $1 line.
Next, $4 is split on the > (greater then) as array b. I am not sure how to account for the two other possibilities in $4 so the correct output is printed. Every line in the input file with either be a del, an ins or have a > in it. The split of $4 captures the last condition, but I am not sure about the first two. I added comments to each line in the awk and hope it is close. Thank you Smilie.

The --- are not part of the file they are just there as examples.

Code:
1. if $4 has del then the using the del to the right (whatever letter(s), the format of that line is REF=C;OBS=      --- line 3 --- OBS is a null string
2. if $4 has ins then the using del the to the right (whatever letter(s), the format of that line is REF=;OBS=ACA      --- line 4 --- REF is a null string

file tab-delimited
Code:
Gene	Accession_number	COSMIC_id	CDS_mut_syntax	AA_mut_syntax	Strand	HG19_coordinates	Amplicon_id	Insert	Target_url
SMARCB1	NM_003073.2	1002	c.118C>T	p.R40*	+	22:24133967-24133967	CHP2_SMARCB1_1	CCTCCGTATGTTCCGAGGTTCTCTGTACAAGAGATACCCCTCACTCTGGAGGCGACTAGCCACTGTGGAAGAGAGGAAGAAAATAGTTGCATCGTCACATGGTAAAAAAAC	http://www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=mut_summary&id=1002
RB1	NM_000321	1042	c.2107-2A>G	p.?	+	13:49037865-49037865	CHP2_RB1_9	TACATCAATTTATTTACTAGATTATGATGTGTTCCATGTATGGCATATGCAAAGTGAAGAATATAGACCTTAAATTCAAAATCATT	http://www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=mut_summary&id=1042
SMARCB1	NM_003073.2	1057	c.1148delC	p.P383fs	+	22:24176357-24176357	CHP2_SMARCB1_4	CCTTGGGAAGGGCAGCGCCCAGGCTGGGAGCTGGCCCCGACTCATTGCCCTCCCCACTCCTCTTCCAGGCGGATGAGGCGTCTTGCCAACACGGCCCCGGCCTGGTAACCAGCCCATCAGCACACGGCTCCC	http://www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=mut_summary&id=1057
BRAF	NM_004333	144982	c.1797_1798insACA	p.T599_V600insT	-	7:140453137-140453138	CHP2_BRAF_2	TCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATCACCTATTTTTACTGTGAGGTCTTCATGAAGAAATATATCTGAGGTGTAGTAAGTAAAGGAAAACAG	http://www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=mut_summary&id=144982

desired output tab-delimited
Code:
chr22	24133967	24133967	c.118C>T	REF=C;OBS=T	SMARCB1
chr13	49037865	49037865	c.2107-2A>G	 REF=A;OBS=G	RB1
chr22	24176357	24176357	c.1148delC	REF=C;OBS=	SMARCB1
chr7	140453138	140453138	c.1797_1798insACA	REF;OBS=TACTACG	BRAF

awk
Code:
awk 'BEGIN{ FS=OFS="\t" }NR>1{ split($7,a,":|-"); k="chr";  # define FS and OFS as tab, skip header, split $7 on : then - reading values in array a, set k equal to chr
          { split($4,b,">");                                # split $4 on > reading values into array b
            printf(k a[1],a[2],a[3],$4,"REF="b[1],"";"OBS="b[2],$1) }' file  # print desired output


Last edited by cmccabe; 06-26-2017 at 04:20 PM.. Reason: added details
# 2  
Old 06-26-2017
Hello cmccabe,

Could you please try following and let me know if this helps you.
Code:
awk 'NR>1{
                if($4 !~ /del/ && $4 !~ /ins/){
                                                REF=substr($4,index($4,">")-1,1);
                                                OBS=substr($4,index($4,">")+1,1)
                                              }
                else                          {
                                                e=$4;
                                                if($4 ~ /ins/){
                                                                sub(/.*ins/,"",e);
                                                                REF="";
                                                                OBS=e
                                                              };
                                                if($4 ~ /del/){
                                                                sub(/.*del/,"",e);
                                                                REF=e;
                                                                OBS=""
                                                              }
                                              };
                CHR=substr($7,1,index($7,":")-1);
                CHR_1=substr($7,index($7,":")+1,index($7,"-")-4);
                CHR_2=substr($7,index($7,"-")+1);
                if($4 !~ /ins/)               {
                                                print "chr"CHR,CHR_1 CHR_2$4,"REF="REF,"OBS="OBS,$1
                                              }
                else                          {
                                                print "chr"CHR,CHR_1,CHR_2,$4,"REF="REF,"OBS="OBS,$1
                                              }
         }
    '    Input_file

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 3  
Old 06-27-2017
It works great.... thank you Smilie

I added comments to each line of the awk, if you have time and are able to could you please look at them to see if I understand?

Code:
awk -F'\t' -v OFS="\t" 'NR>1{                     # define FS and OFS as tab and skip header, and start block
                if($4 !~ /del/ && $4 !~ /ins/){   # if $4 not equal to del or ins then  (start processing)
                                                REF=substr($4,index($4,">")-1,1); # extract the value to the left of the > as REF (why -1,1)?
                                                OBS=substr($4,index($4,">")+1,1)  # extract the value to the right of the > as OBS (why +1,1)?
                                              }  # end block
                else                          {  # start block
                                                e=$4                              # set $4 to e
                                                if($4 ~ /ins/){                   # if pattern ins, start block
                                                                sub(/.*ins/,"",e);  # not quite sure, is this creting an pattern index for ins
                                                                REF="";             # where REF is null
                                                                OBS=e               # and OBS is whatever the value is
                                                              };                    # end block
                                                if($4 ~ /del/){                     # if pattern del and start block
                                                                sub(/.*del/,"",e);  # not quite sure, is this creting an pattern index for del
                                                                REF=e;              # where OBS is whatever the value is
                                                                OBS=""              # and OBS is null
                                                              }                     # end block
                                              };                                    # end block
                CHR=substr($7,1,index($7,":")-1);  # split $7 on : and store value to left as CHR
                CHR_1=substr($7,index($7,":")+1,index($7,"-")-4);  # split $7 on - and store value to left as CHR_1 (why are two index needed)?
                CHR_2=substr($7,index($7,"-")+1);                  # split $7 on - and store value to right as CHR_2 (why are two index needed)?
                if($4 !~ /ins/)               {  # print output for > ($4 not equal ins and start block
                                                print "chr"CHR,CHR_1,CHR_2,$4,"REF="REF";""OBS="OBS,$1
                                              } # close block
                else                          { # start block
                                                print "chr"CHR,CHR_1,CHR_2,$4,"REF="REF";""OBS="OBS,$1  # print output if $4 has del or ins in it (set by the pattern if statements) 
                                              }  # close block
         }  # end processing
    ' input   # define input

Thanks again Smilie.
# 4  
Old 06-27-2017
Hello cmccabe,

Nice explanation. I too added complete explanation here.
Code:
awk 'NR>1{ ###checking here if line number is greater than 1 here, if yes then perform following steps.
                if($4 !~ /del/ && $4 !~ /ins/){ ###Checking here if 4th field is NOT having del and ins strings in them.
                                                REF=substr($4,index($4,">")-1,1); ###creating variable REF which is a substring starting from index of char > -1 to 1 character.
                                                OBS=substr($4,index($4,">")+1,1) ###creating variable OBS which is having substring of 4th field whose starting index is the index number of > +1 to till 1 character.
                                              }
                else                          { ###If a line does not have ins and del strings within them then perform following actions.
                                                e=$4; ###creating a variable named e whose value is 4th field.
                                                if($4 ~ /ins/){ ###checking if 4th field has ins string in it.
                                                                sub(/.*ins/,"",e); ###substituting all characters from starting to till ins string with NULL in e variable.
                                                                REF=""; ###Making variable REF to NULL.
                                                                OBS=e ###Making variable named OBS to new value of variable e now.
                                                              };
                                                if($4 ~ /del/){ ###Checking here if any 4th field has string del in it.
                                                                sub(/.*del/,"",e); ###substituting all characters from starting to till string del with NULL in variable named e.
                                                                REF=e; ###creating variable REF to new value of variable e here.
                                                                OBS="" ###creating OBS variable with NULL value.
                                                              }
                                              };
                CHR=substr($7,1,index($7,":")-1); ###creating variable CHR with substring whose starting point is 1st position to till index of colons index -1.
                CHR_1=substr($7,index($7,":")+1,index($7,"-")-4);### creating variable named CHR_1 with a substring whose starting point is index of colon +1 till index of dash -4 value in 7th field.
                CHR_2=substr($7,index($7,"-")+1);### creating a variable named CHR_2 here whose value is substring whose starting point is index of dash value +1 till the end of 7th field value.
                if($4 !~ /ins/)               { ###Checking if 4th field does not have ins string in it, if yes then perform following.
                                                print "chr"CHR,CHR_1 CHR_2$4,"REF="REF,"OBS="OBS,$1  ###printing values of variables CHR,CHR_1,CHR_2 etc.
                                              }
                else                          { ###Checking if 4th field does not have ins value in it, then perform following.
                                                print "chr"CHR,CHR_1,CHR_2,$4,"REF="REF,"OBS="OBS,$1 ###print the value of variables CHR,CHR_1,CHR_2 etc.
                                              }
         }
    ' Input_file                                ###Mentioning the Input_file name here.

Thanks,
R. Singh
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk conditional operators- lookup value in 2nd file

I use conditional operators alot in AWK to print rows from large text files based on values in a certain column. For example: awk -F '\t' '{ if ($1 == "A" || $1 == "C" ) print $0}' OFS="\t" file1.txt > file2.txt In this case every row is printed from file1 to file2 for which the column 1... (5 Replies)
Discussion started by: Geneanalyst
5 Replies

2. UNIX for Beginners Questions & Answers

Conditional Split

Greetings, I need help in splitting the files in an efficient way while accommodating the below requirements . I am on AIX. Split condition Split the file based on the record type and the position of the data pattern that appears on the on the record type. Both record type and and the... (9 Replies)
Discussion started by: techedipro
9 Replies

3. Shell Programming and Scripting

awk - read from a file and write conditional output

I have a file, which has '|' as separator; I need to read each line from that file and produce output to another file. While reading, I have certain condition on few specific columns (like column3 ='good'); only those lines will be processed. (3 Replies)
Discussion started by: mady135
3 Replies

4. Shell Programming and Scripting

How to Split a source file in specified format?

Requirement: Need to split a source file say a1.txt which can be of size upto 150 MB into 25 target files each with a max size of 25 MB along with the header line in each target file. NOTE: Few target files can be empty also ,but 25 files must be generated for 1 source file( I can expect upto... (4 Replies)
Discussion started by: mad_man12
4 Replies

5. Shell Programming and Scripting

Split File by Pattern with File Names in Source File... Awk?

Hi all, I'm pretty new to Shell scripting and I need some help to split a source text file into multiple files. The source has a row with pattern where the file needs to be split, and the pattern row also contains the file name of the destination for that specific piece. Here is an example: ... (2 Replies)
Discussion started by: cul8er
2 Replies

6. Shell Programming and Scripting

Need to split a xml file in proper format

Hi, I have a file which has xml data but all in single line Ex - <?xml version="1.0"?><User><Name>Robert</Name><Location>California</Location><Occupation>Programmer</Occupation></User> I want to split the data in proper xml format Ex- <?xml version="1.0"?> <User> <Name>Robert</Name>... (6 Replies)
Discussion started by: avishek007
6 Replies

7. Shell Programming and Scripting

Split variable length and variable format CSV file

Dear all, I have basic knowledge of Unix script and her I am trying to process variable length and variable format CSV file. The file length will depend on the numbers of Earnings/Deductions/Direct Deposits. And The format will depend on whether it is Earnings/Deductions or Direct Deposits... (2 Replies)
Discussion started by: chechun
2 Replies

8. Shell Programming and Scripting

AWK CSV to TXT format, TXT file not in a correct column format

HI guys, I have created a script to read 1 column in a csv file and then place it in text file. However, when i checked out the text file, it is not in a column format... Example: CSV file contains name,age aa,11 bb,22 cc,33 After using awk to get first column TXT file... (1 Reply)
Discussion started by: mdap
1 Replies

9. UNIX for Dummies Questions & Answers

Split a file with no pattern -- Split, Csplit, Awk

I have gone through all the threads in the forum and tested out different things. I am trying to split a 3GB file into multiple files. Some files are even larger than this. For example: split -l 3000000 filename.txt This is very slow and it splits the file with 3 million records in each... (10 Replies)
Discussion started by: madhunk
10 Replies

10. Shell Programming and Scripting

conditional split

Hi, Can someone let me know how I can split a record when it contains a vairable length of fields. Line1 field1,field101,field102,field 103,field104 Line 2 field1,field101,field102,field 103,field104,field201,field202,field 203,field204 Line 3 field1,field101,field102,field... (5 Replies)
Discussion started by: braindrain
5 Replies
Login or Register to Ask a Question