Challenging Awk array problem


 
# 1  
Old 05-21-2010

Hi,

I have a rather complicated awk problem here, at least for me. I have two files.

File 1:

Code:
607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    917    214    114    45    chr1    3000237    3000733    -194194699    -    L1_Mur2    LINE    L1    -4488    1389    913    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
607    845    233    76    114    chr1    3000766    3000792    -194194640    -    L1_Mur2    LINE    L1    -6816    912    887    1
607    621    250    65    37    chr1    3001287    3001583    -194193849    -    Lx9    LINE    L1    -1596    6048    5742    3
607    1320    197    332    7    chr1    3001722    3002005    -194193427    -    RLTR25A    LTR    ERVK    0    1028    625    4

File 2:
Code:
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr7  86595415 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr18  43464724 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA

If field 6 of file 1 is the same as field 4 of file 2, then check whether field 5 of file 2 lies within the range given by fields 7 and 8 of file 1. If it does, write that line from file 2, followed by fields 11, 12 and 13 of file 1, to a separate file. Whew!

For example, field 4 of file 2 (chr1) is the same as field 6 of file 1. Then check whether field 5 of file 2, i.e. 3000072 (which is always a number), lies in the range of fields 7 and 8 of file 1 (3000001 to 3000156). So I need the output (the line from file 2 plus fields 11, 12 and 13 of file 1) in a separate file as

Code:
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074  chr1  3000072 TTTATCGTCATCGTC L1_Mur2    LINE    L1

Thank you very much in advance

Last edited by Scott; 05-21-2010 at 06:44 PM.. Reason: Please use code tags
# 2  
Old 05-21-2010
Quote:
Originally Posted by polsum
File 2:
Code:
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr7  86595415 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr18  43464724 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA

...<snip>...

Ok for example - field 4 of file 2 i.e. chr1 is same as field 6 of file 1. Then see if field 5 of file 2 i.e.3000072 (which is always a number) lies in the range of fields 7 and 8 (3000001 3000156) of file 1. So, I need the output (the line from file 2 plus fields 11,12 and 13 of file 1) in a separate file as
What's the delimiter in file 2? The pipe, "|" ? If so, taking the first line in file 2, isn't field 4 "ref"? And field 5 is "NC_000074.5"?

Regards and welcome to the forum,
Alister

---------- Post updated at 05:52 PM ---------- Previous update was at 05:50 PM ----------

Never mind. The pipes threw me. I see that it's whitespace delimited. Duh! Great first impression, huh?
# 3  
Old 05-21-2010
cracks me up when folks in the UK say: "Duh!"

Quick question for the OP: are the files sorted, and are the records guaranteed to be in the same order? Otherwise, what's the key to tie the records together? I ask since your initial evaluation seems to focus on flags like chr1...
# 4  
Old 05-21-2010
Assuming that the nth line in file1 corresponds to the nth line in file2:
Code:
paste -d\\n file1 file2 |
awk '{
    chr=$6; min=$7; max=$8; s=$11" "$12" "$13;
    getline;
    if (chr==$4 && $5>=min && $5<=max)
        print $0, s;
}'

Regards,
Alister
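
If the nth-line assumption above does not hold, one possible alternative is to load every range from file1 into awk arrays and test each line of file2 against all of them. This is only a sketch, not something posted in the thread: the array names (chr, min, max, tag) and the temporary sample files are made up for illustration, and the sample rows are copied from the data shown earlier.

```shell
# Hypothetical sample data; the real file1/file2 would be used instead.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1
607 1320 197 332 7 chr1 3001722 3002005 -194193427 - RLTR25A LTR ERVK 0 1028 625 4
EOF
cat > "$tmp/file2" <<'EOF'
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069 chr3 154935392 GAGTTTTACAGTCCA
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC
EOF

result=$(awk '
NR == FNR {                 # first file: remember every range
    n++
    chr[n] = $6; min[n] = $7 + 0; max[n] = $8 + 0
    tag[n] = $11 " " $12 " " $13
    next
}
{                           # second file: test the line against all ranges
    for (i = 1; i <= n; i++)
        if ($4 == chr[i] && $5 + 0 >= min[i] && $5 + 0 <= max[i])
            print $0, tag[i]
}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Every file2 line is compared with every stored range, so the work grows with the product of the two file sizes. That is fine for small inputs; for large ones, grouping the ranges by chromosome first would probably be worth it.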
# 5  
Old 05-21-2010
Alister, another quick and dirty masterpiece! Works great, for me...

But I must ask: what if they wanted to iterate over the lines in the file? Or would you just wrap it into a while read...?
Code:
$ tr -s ' ' <Edit1
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1
607 917 214 114 45 chr1 3000237 3000733 -194194699 - L1_Mur2 LINE L1 -4488 1389 913 1
607 215 31 0 30 chr1 3000733 3000766 -194194666 + (TTTG)n Simple_repeat Simple_repeat 2 33 0 2
607 845 233 76 114 chr1 3000766 3000792 -194194640 - L1_Mur2 LINE L1 -6816 912 887 1
607 621 250 65 37 chr1 3001287 3001583 -194193849 - Lx9 LINE L1 -1596 6048 5742 3
607 1320 197 332 7 chr1 3001722 3002005 -194193427 - RLTR25A LTR ERVK 0 1028 625 4

$ tr -s ' ' <Edit2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069 chr3 154935392 GAGTTTTACAGTCCA
28|3721 + gi|149288852|ref|NC_000067.5|NC_000067 chr1 152633707 GAGTTTTACAGTCCA
28|3721 + gi|149361432|ref|NC_000073.5|NC_000073 chr7 86595415 GAGTTTTACAGTCCA
34|3145 - gi|149321426|ref|NC_000084.5|NC_000084 chr18 43464724 ACGGCTTACGA
34|3145 - gi|149354224|ref|NC_000071.5|NC_000071 chr5 37676290 ACGGCTTACGA

$ paste -d\\n Edit1 Edit2 |
> awk '{
>     chr=$6; min=$7; max=$8; s=$11" "$12" "$13;
>     getline;
>     if (chr==$4 && $5>=min && $5<=max)
>         print $0, s;
> }'
 L1_Mur2 LINE L1361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC


Last edited by curleb; 05-21-2010 at 10:43 PM.. Reason: spoke too soon...
# 6  
Old 05-21-2010
Hi, curleb:

I'm not sure I understand what you mean by iterate over the lines of the file. Every line in both files is read by that solution; the paste merges them:

file1 line1
file2 line1
file1 line2
file2 line2
file1 line3
file2 line3
...etc

There just happens to be only one line that meets the requirements.
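
The interleaving described above can be seen on two tiny files. This is a minimal sketch with made-up file names and contents (a, b, A1, B1 and so on), not data from the thread:

```shell
tmp=$(mktemp -d)
printf 'A1\nA2\n' > "$tmp/a"
printf 'B1\nB2\n' > "$tmp/b"

# With newline as the delimiter, paste alternates lines from the two files.
interleaved=$(paste -d '\n' "$tmp/a" "$tmp/b")
printf '%s\n' "$interleaved"

rm -rf "$tmp"
```

Each awk cycle then sees a line from the first file, and the getline call pulls in its partner from the second file.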

If I misunderstood your question, please elaborate.

Cheers,
Alister

P.S. The last line in your output seems garbled. Perhaps a terminal hiccup? Or a carriage return somewhere? Here's a sample run from my system:
Code:
$ cat file1
607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    917    214    114    45    chr1    3000237    3000733    -194194699    -    L1_Mur2    LINE    L1    -4488    1389    913    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
607    845    233    76    114    chr1    3000766    3000792    -194194640    -    L1_Mur2    LINE    L1    -6816    912    887    1
607    621    250    65    37    chr1    3001287    3001583    -194193849    -    Lx9    LINE    L1    -1596    6048    5742    3
607    1320    197    332    7    chr1    3001722    3002005    -194193427    -    RLTR25A    LTR    ERVK    0    1028    625    4
$ cat file2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr7  86595415 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr18  43464724 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA
$ paste -d\\n file1 file2 |
> awk '{
>     chr=$6; min=$7; max=$8; s=$11" "$12" "$13;
>     getline;
>     if (chr==$4 && $5>=min && $5<=max)
>         print $0, s;
> }'
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC L1_Mur2 LINE L1
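
If the carriage-return guess above is right (the thread does not confirm it), stripping the carriage returns before pasting should restore the output. A sketch under that assumption, using a hypothetical file2 row written with a DOS line ending and made-up .clean file names:

```shell
tmp=$(mktemp -d)
printf '607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1\n' > "$tmp/file1"
# Assumed cause of the garbling: file2 lines end in \r\n.
printf '4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC\r\n' > "$tmp/file2"

# Remove every carriage return from both inputs.
tr -d '\r' < "$tmp/file1" > "$tmp/file1.clean"
tr -d '\r' < "$tmp/file2" > "$tmp/file2.clean"

cleaned=$(paste -d '\n' "$tmp/file1.clean" "$tmp/file2.clean" |
awk '{
    chr = $6; min = $7; max = $8; s = $11 " " $12 " " $13
    getline
    if (chr == $4 && $5 >= min && $5 <= max)
        print $0, s
}')

printf '%s\n' "$cleaned"
rm -rf "$tmp"
```

Without the tr step, the carriage return stays at the end of the file2 line, so a terminal moves the cursor back to column one before the three appended fields are printed and they overwrite the start of the line.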

# 7  
Old 05-21-2010
hmmm...I'm not seeing it do that, but I'm only playing on ksh93 (U/Win)...

Mocked it up as follows, so that the first and last lines match the scenario, but I only get the one:
Code:
$ tr -s ' ' <Edit1
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1
607 917 214 114 45 chr1 3000237 3000733 -194194699 - L1_Mur2 LINE L1 -4488 1389 913 1
607 215 31 0 30 chr1 3000733 3000766 -194194666 + (TTTG)n Simple_repeat Simple_repeat 2 33 0 2
607 845 233 76 114 chr1 3000766 3000792 -194194640 - L1_Mur2 LINE L1 -6816 912 887 1
607 621 250 65 37 chr1 3001287 3001583 -194193849 - Lx9 LINE L1 -1596 6048 5742 3
607 1320 197 332 7 chr1 37600000 37676290 -194193427 - RLTR25A LTR ERVK 0 1028 625 4

$ tr -s ' ' <Edit2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069 chr3 154935392 GAGTTTTACAGTCCA
28|3721 + gi|149288852|ref|NC_000067.5|NC_000067 chr1 152633707 GAGTTTTACAGTCCA
28|3721 + gi|149361432|ref|NC_000073.5|NC_000073 chr7 86595415 GAGTTTTACAGTCCA
34|3145 - gi|149321426|ref|NC_000084.5|NC_000084 chr18 43464724 ACGGCTTACGA
34|3145 - gi|149354224|ref|NC_000071.5|NC_000071 chr5 37676290 ACGGCTTACGA

$ paste -d\\n Edit1 Edit2 |awk '{chr=$6; min=$7; max=$8; s=$11" "$12" "$13; getline; if (chr==$4 && $5>=min && $5<=max) print $0;}'
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
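
In the mocked-up data above, the last Edit1 row is on chr1 while the last Edit2 row is on chr5, so that pair fails the chromosome test even though 37676290 falls inside the edited range. A small diagnostic along these lines can show which condition rejects a pair. It is only a sketch: the variable names (why, report) and the output wording are made up for illustration.

```shell
tmp=$(mktemp -d)
# Last row of each mocked-up file from the post above.
printf '607 1320 197 332 7 chr1 37600000 37676290 -194193427 - RLTR25A LTR ERVK 0 1028 625 4\n' > "$tmp/Edit1"
printf '34|3145 - gi|149354224|ref|NC_000071.5|NC_000071 chr5 37676290 ACGGCTTACGA\n' > "$tmp/Edit2"

report=$(paste -d '\n' "$tmp/Edit1" "$tmp/Edit2" |
awk '{
    chr = $6; min = $7; max = $8
    if ((getline) <= 0) exit        # no partner line left
    why = ""
    if (chr != $4)
        why = why " chr:" chr "!=" $4
    if ($5 < min || $5 > max)
        why = why " pos:" $5 " outside " min "-" max
    print "pair " NR / 2 ":" (why == "" ? " match" : why)
}')

printf '%s\n' "$report"
rm -rf "$tmp"
```

Running it on the full files would print one line per pair, which makes it easier to see whether the chromosome or the position is the reason a row is dropped.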
