Quote: Originally Posted by Diya123
Hi,
Sorry for not being clear and specific.
1) My file 1 has 10M rows and file 2 has 30,000 rows.
2) Gene A, B, C, and D are our desired categories, based on where the 4th position of file 1 falls in file 2.
3) Interval 1 holds my exon starts and exon stops. There can be one, two, or many; that's the reason they are separated by commas. The same goes for interval 4, as they are intron starts and stops. The in-between columns have just one start and one stop (columns 15, 16, 17, 18).
So my intervals should be: columns 13 and 14 taken pairwise (1st value in column 13 with 1st value in column 14) for intervals 1 and 4;
interval 2: columns 15-16; interval 3: columns 17-18.
4) The number of ranges in interval 1 is always 1 or more, whereas for interval 4 it can be 0 or more. For intervals 2 and 3 it's always one value.
There may be zero matches or more than one match. If there is no match, we assign "none" at the end of that row in file 1. If there is more than one match, we assign "ambiguous" at the end of the row. If it falls in the intervals, then interval 1 is assigned geneA, interval 2 geneB, interval 3 geneC, and interval 4 geneD.
Let me know if I am still not clear.
Thanks
Your description of Interval 4 is:
Same as Interval 1 but different category. So, I repeat: if there is a match on one of the ranges specified by columns 13 and 14, what determines whether the result is geneA or geneD?
If there are multiple matches (but all of the matches are in the same interval), is the result "ambiguous" or is it "geneX", where X corresponds to the interval that was matched on all matching lines?
The way awk works, field separators separate fields rather than terminate them. So it looks like a set of ranges specified by 11873,12594,13402, and 12227,12721,14409, in columns 13 and 14 is specifying 4 ranges. I repeat: why is there a trailing "," on these columns but not on columns 15, 16, 17, and 18?
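To make the trailing-comma point concrete, here is a small sketch. The function name count_ranges is made up for this illustration; it only reports how many elements awk's split() produces for a comma-separated list like the ones in columns 13 and 14.

```shell
# count_ranges: hypothetical helper, not part of anyone's script.
# Prints the number of elements awk gets when splitting the list on ",".
count_ranges() {
    printf '%s\n' "$1" | awk '{ print split($0, parts, ",") }'
}

count_ranges '11873,12594,13402,'   # with the trailing comma
count_ranges '11873,12594,13402'    # without the trailing comma
```

With the trailing comma awk reports 4 elements (the last one is an empty string); without it, 3. Any script that walks these lists has to decide what to do with that empty final element.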
Should all ranges in an interval be in increasing numeric order with no overlaps between ranges in different intervals? Note that every value selected by Interval 3 in your sample input line is also in the last range in Interval 1 (and Interval 4). So, is a match against this line for a line in the Input (file 1) file with field 4 set to 14400 supposed to be labeled "ambiguous", or is it supposed to be labeled with some combination of geneA, geneC, and geneD?
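As a rough sketch of the pairwise reading of columns 13 and 14, the following checks which start/stop pairs contain a given value. The name matching_ranges is hypothetical, and two things are assumptions on my part, not something you have confirmed: that range endpoints are inclusive, and that the empty element left by the trailing comma should be skipped.

```shell
# matching_ranges: hypothetical helper for illustration only.
# Usage: matching_ranges VALUE STARTS STOPS
# Pairs the Nth start with the Nth stop and prints the 1-based index
# of every range that contains VALUE (endpoints assumed inclusive).
matching_ranges() {
    awk -v val="$1" -v starts="$2" -v stops="$3" 'BEGIN {
        n = split(starts, s, ",")
        split(stops, e, ",")
        for (i = 1; i <= n; i++) {
            # Skip the empty element produced by a trailing comma.
            if (s[i] == "" || e[i] == "") continue
            if (val + 0 >= s[i] + 0 && val + 0 <= e[i] + 0) print i
        }
    }'
}

matching_ranges 14400 '11873,12594,13402,' '12227,12721,14409,'
```

For 14400 this prints 3, the 13402-14409 pair. It says nothing about whether that pair counts as Interval 1 or Interval 4, which is exactly the rule that still needs to be spelled out.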
Please give us a more extensive example that includes at least one line with one match, at least one line that has multiple matches, and something that shows us how you determine whether a range specified by columns 13 and 14 is counted as being in Interval 1 or Interval 4 and show us the EXACT output that you want in all of these cases!