Split column using awk in a text file


 
# 1  
Old 05-23-2013

Code:
chr1    412573  .       A       C       2758.77 .      AC=2;AF=1.00;AN=2;DP=71;Dels=0.00;FS=0.000;HaplotypeScore=2.8822;MLEAC=2;MLEAF=1.00;MQ=58.36;MQ0=0;QD=38.86;resource.EFF=INTERGENIC(MODIFIER||||||||) GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0       GATKSAM

chr1    602567  rs21953190      A       G       5481.77 .   AC=2;AF=1.00;AN=2;DB;DP=152;Dels=0.00;FS=0.000;HaplotypeScore=6.8385;MLEAC=2;MLEAF=1.00;MQ=59.09;MQ0=0;QD=36.06;resource.EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D1034|ADNP2|protein_coding|CODING|ENSCAFT00000000008|5) GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0     GATKSAM

I have a text file with lines like those shown here. Each row has 11 tab-separated columns. In each row, I want to split the 8th column so that the output looks like the example below, where the 9th column holds the DP value and the 10th column holds the MQ value, followed by the values after resource.EFF=.


Code:
chr1    412573  .       A       C       2758.77 .           71     58.36    INTERGENIC    MODIFIER GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0       GATKSAM

chr1    602567  rs21953190      A       G       5481.77 .         152       59.09  SYNONYMOUS_CODING    LOW    SILENT    gaT/gaC    D1034    ADNP2    protein_coding    CODING    ENSCAFT00000000008    5  GT:AD:DP:GQ:PL 1/1:0,151:151:99:5510,430,0     GATKSAM

In other words, the 8th column should be reduced to just the DP value, the MQ value, and the information after the effect annotation (SNPEFF_*= in the original input, resource.EFF= in the sample above), separated by tabs.

Could anyone help?

Last edited by mehar; 05-23-2013 at 07:24 AM..
# 2  
Old 05-23-2013
Try

Code:
awk -F "\t" '{n=split($8,P,";");for(i=1;i<=n;i++){if(P[i] ~ /^SNPEFF/){split(P[i],K,"=");S=S?S" "K[2]:K[2]}};$8=S;S=""}1' file

# 3  
Old 05-23-2013
Ubuntu

Hi,

Thanks, it works, but one thing is missing. In the desired output, the 9th column is the DP value and the 10th is the MQ value, followed by the values after SNPEFF_*=.

Could you modify the code to achieve this?
# 4  
Old 05-23-2013
Please use code tags, not quote tags.
Try this:
Code:
awk     '       {n  = split ($8,TMP,";")
                 $8 = ""
                 for (i=1; i<=n; i++)
                        if (match (TMP[i], /^DP=|^MQ=|^SNPEFF/)) {sub (/^.*=/,"",TMP[i]); $8 = $8 ($8?"\t":"") TMP[i]}
                }
         1
        ' FS="\t" file
chr1 403111 . G A 42 . 34    53    INTERGENIC    NONE    MODIFIER GT:GQ:PL 0/1:75:72,0,118 SAM
       
chr1 412573 . A C 2758.77 . 71    58.36    INTERGENIC    NONE    MODIFIER GT:AD:DP:GQ:PL 1/1:0,71:71:99:2787,214,0 GATKSAM
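
A note on the output above: the untouched columns come out separated by single spaces rather than tabs. Assigning to any field (here $8) makes awk rebuild the record using OFS, which defaults to a single space. If tab-separated output is wanted, as in the original request, setting OFS to a tab should preserve it. A minimal sketch on a hypothetical three-column record (the sample data is made up for illustration):

```shell
# Hypothetical tab-separated record: a<TAB>b<TAB>c
# Default OFS: modifying a field rebuilds the record with single spaces.
default_ofs=$(printf 'a\tb\tc\n' | awk 'BEGIN { FS = "\t" } { $2 = "X" } 1')

# OFS set to a tab: the rebuilt record keeps tab separators.
tab_ofs=$(printf 'a\tb\tc\n' | awk 'BEGIN { FS = OFS = "\t" } { $2 = "X" } 1')

printf '%s\n' "$default_ofs"
printf '%s\n' "$tab_ofs"
```

Applied to the command in this post, that would mean passing OFS="\t" alongside FS="\t".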

# 5  
Old 05-23-2013
Thanks, it works. Could you explain the code, if possible?
# 6  
Old 05-23-2013
Code:
awk     '       {n  = split ($8,TMP,";")                     # split the 8th field on ";" into the TMP array; n is the number of elements
                 $8 = ""                                     # clear the 8th field
                 for (i=1; i<=n; i++)                        # inspect every element of TMP
                   if (match (TMP[i], /^DP=|^MQ=|^SNPEFF/))  # if the element starts with "DP=", "MQ=", or "SNPEFF"
                      {sub (/^.*=/,"",TMP[i])                # remove everything up to and including the last "="
                       $8 = $8 ($8?"\t":"") TMP[i]}          # append the remaining value to field 8, tab-separated
                } 
         1                                                   # print the modified line
        ' FS="\t" file                                       # use TAB as the input field separator


Last edited by RudiC; 05-23-2013 at 02:06 PM..
# 7  
Old 05-23-2013
I tried to rewrite the code as a one-liner, as shown below:
Code:
awk '{n  = split ($8,TMP,";") $8 = "" for (i=1; i<=n; i++) if (match (TMP[i], /^DP=|^MQ=|^SNPEFF/)) {sub (/^.*=/,"",TMP[i]); $8 = $8 ($8?"\t":"") TMP[i]} }1' FS="\t" file

But it reports a syntax error at the comma in {sub (/^.*=/,"",TMP[i]). What could be wrong here?
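
The likely cause: in awk, a newline ends a statement, so when the multi-line script is joined onto one line, the statements that used to sit on separate lines need a ";" between them. Without it, awk tries to read `split(...) $8 = "" for ...` as a single expression, and where exactly the parser gives up varies between awk implementations. A sketch of the one-liner with the separators restored, run against a hypothetical one-record sample (the file name and the SNPEFF_EFFECT field are assumptions, since the original SNPEFF-style input is not shown in the thread):

```shell
# Hypothetical sample record in the SNPEFF-style layout the replies assume.
printf 'chr1\t403111\t.\tG\tA\t42\t.\tDP=34;MQ=53;SNPEFF_EFFECT=INTERGENIC\tGT:GQ:PL\t0/1:75:72,0,118\tSAM\n' > oneliner_sample.txt

# Same logic as the multi-line script, with ";" added after split(...) and after $8 = "".
oneliner_out=$(awk '{n = split($8,TMP,";"); $8 = ""; for (i=1; i<=n; i++) if (match(TMP[i], /^DP=|^MQ=|^SNPEFF/)) {sub(/^.*=/,"",TMP[i]); $8 = $8 ($8?"\t":"") TMP[i]}} 1' FS="\t" oneliner_sample.txt)

printf '%s\n' "$oneliner_out"
```

Only two semicolons were added; the body of the for loop is a single if statement, so it needs no extra braces.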

---------- Post updated at 05:25 AM ---------- Previous update was at 03:52 AM ----------

Hi,

I have slightly modified the original question: the input in the 8th column is now slightly different. Could you help? Thanks in advance.
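
For the modified input shown in the first post, where the effect annotation appears as resource.EFF=NAME(a|b|...), one possible approach is sketched below. It is not taken from the replies in this thread. It assumes that empty sub-fields between "|" characters should be dropped (as the desired output suggests), that the output should stay tab-separated, and that DP, MQ and resource.EFF appear in that order in the 8th column, since values are emitted in input order. The file name sample.vcf and the array names are hypothetical.

```shell
# One record from the question, written with real tabs (hypothetical file name).
printf 'chr1\t412573\t.\tA\tC\t2758.77\t.\tAC=2;AF=1.00;AN=2;DP=71;Dels=0.00;FS=0.000;HaplotypeScore=2.8822;MLEAC=2;MLEAF=1.00;MQ=58.36;MQ0=0;QD=38.86;resource.EFF=INTERGENIC(MODIFIER||||||||)\tGT:AD:DP:GQ:PL\t1/1:0,71:71:99:2787,214,0\tGATKSAM\n' > sample.vcf

result=$(awk 'BEGIN { FS = OFS = "\t" }          # tabs for both input and output
{
    n = split($8, info, ";")                      # break the 8th column into key=value items
    out = ""
    for (i = 1; i <= n; i++) {
        if (info[i] ~ /^(DP|MQ)=/) {              # matches DP= and MQ= but not MQ0=
            sub(/^[^=]*=/, "", info[i])           # strip the key and the "="
            out = out (out == "" ? "" : OFS) info[i]
        } else if (info[i] ~ /^resource\.EFF=/) {
            sub(/^[^=]*=/, "", info[i])
            m = split(info[i], eff, /[(|)]/)      # split on "(", "|" and ")"
            for (j = 1; j <= m; j++)
                if (eff[j] != "")                 # skip empty sub-fields
                    out = out (out == "" ? "" : OFS) eff[j]
        }
    }
    $8 = out                                      # replace column 8 with the cleaned values
    print
}' sample.vcf)

printf '%s\n' "$result"
```

Building the replacement in a separate variable (out) rather than in $8 itself avoids testing a field for truthiness, which could misfire if a value such as DP=0 came first.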