Appending information from 2nd file into 1st based on intervals

04-04-2013

Registered User

139, 1

Join Date: May 2011

Last Activity: 21 May 2014, 3:07 PM EDT

Posts: 139

Thanks Given: 7

Thanked 1 Time in 1 Post

Hi,

I am trying to gather information from the second file and append it to the first file.

First file (input):

Code:

HWUSI-EAS000_29:1:100:10000:11479#0/1   +       chr5    14458050        ATTGGCTGAGGTCCTACTAGTTGTGATGTGTAAGTGT   HHHHHHGDGGEDGGGDGCGEDDEFFFAGE 0

second file:

Code:

chr1	11873	14409	uc010nxq.1	uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	11874	12188	13640	14408	12593,13401	12228,12722,

For every value in column 4 of the first file, I want to check where it falls in file 2; for example, which of these intervals the value "14458050" falls into:

Interval 1: the values in column 13 (11873,12594,13402,) and column 14 (12227,12721,14409,) are pairs, i.e. 11873-12227, 12594-12721, 13402-14409
Interval 2: between column 15 and column 16
Interval 3: between column 17 and column 18
Interval 4: same as Interval 1 but a different category

If 14458050 falls in Interval 1, then append geneA to the corresponding row of file 1; for Interval 2 append geneB, for Interval 3 append geneC, and for Interval 4 append geneD. If it is not in any category, then append NONE.

output:

Code:

chr1	11873	14409	uc010nxq.1	uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	11874	12188	13640	14408	12593,13401	12228,12722,  NONE

I am not sure if awk is an easy way to do this. I have 10 million rows.

Thanks,

Diya123

04-04-2013

Registered User

1,910, 488

Join Date: Sep 2008

Last Activity: 22 December 2019, 2:31 AM EST

Location: San Jose, CA

Posts: 1,910

Thanks Given: 54

Thanked 488 Times in 481 Posts

Is your output correct? I mean, you mentioned that it should be appended to the first file, right?
And does file2 have only one line?

Also, your intervals are overlapping.

--ahamed

Last edited by ahamed101; 04-04-2013 at 09:26 PM..

ahamed101

04-04-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

I agree with ahamed that this is way underspecified:

Do you have 10 million rows in input and 10 million rows in second file?

I can guess that geneA and geneB are fields 5 and 6 in input. Where are geneC and geneD?

How do you tell the difference between Interval 1 and Interval 4; i.e., how do we determine what category is being processed? What is a category?

Why do some ranges in second file have no trailing comma and others have a trailing comma as in interval 3 as specified by column 17 (12593,13401) and column 18 (12228,12722,)?

Are there always three ranges in Interval 1, one range in Interval 2, and two ranges in Interval 3?

Will there always be zero or one match in second file for each line in input, or could there be multiple matches in second file for some lines in input? If there are multiple matches what is the output supposed to be?

Don Cragun

04-04-2013

Hi,

Sorry for not being clear and specific.

1) My file 1 has 10M rows and file 2 has 30,000 rows.
2) Gene A, B, C and D are our desired categories, based on where the 4th column (the position) of file 1 falls in file 2.
3) Interval 1 holds my exon starts and exon stops. There can be one, two or many of them, which is why they are separated by commas. The same goes for Interval 4, which holds intron starts and stops. The in-between columns (15, 16, 17, 18) have just one start and one stop.
So my intervals should be: columns 13 and 14 taken pairwise (1st value in column 13 with 1st value in column 14, and so on) for Interval 1 and 4.
Interval 2: between columns 15 and 16; Interval 3: between columns 17 and 18.
4) The number of ranges in Interval 1 is always one or more, whereas for Interval 4 it can be zero or more. For Intervals 2 and 3 it's always one range.
There may be zero matches or more than one match. If there is no match, we assign "none" at the end of that row in file 1. If there is more than one, we assign "ambiguous" at the end of the row. Otherwise, Interval 1 is assigned geneA, Interval 2 geneB, Interval 3 geneC and Interval 4 geneD.
Let me know if I am still not clear.
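
The labelling rule in point 4 can be sketched as a small function. This is written in Python rather than awk, and it assumes that any position matching more than once counts as "ambiguous", which the post does not fully settle. The names LABELS and final_label are made up for illustration.

```python
# Interval number -> label, as described in the post.
LABELS = {1: "geneA", 2: "geneB", 3: "geneC", 4: "geneD"}

def final_label(matched_intervals):
    """Map the interval numbers (1-4) that contain a position to one label.

    Assumption: any count above one is "ambiguous", even if the
    matches all come from the same interval.
    """
    if not matched_intervals:
        return "none"
    if len(matched_intervals) > 1:
        return "ambiguous"
    return LABELS[matched_intervals[0]]

print(final_label([]))      # no interval matched
print(final_label([2]))     # exactly one match
print(final_label([1, 3]))  # more than one match
```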

Thanks

Diya123

04-04-2013

Your description of Interval 4 is: Same as Interval 1 but different category. So, I repeat: if there is a match on one of the ranges specified by columns 13 and 14, what determines whether the result is geneA or geneD?

If there are multiple matches (but all of the matches are in the same interval) is the result "ambiguous" or is it "geneX" where X corresponds to the interval that was matched on all matching lines?

The way awk works, field separators separate fields rather than terminate them. So a set of ranges specified by 11873,12594,13402, and 12227,12721,14409, in columns 13 and 14 looks like it specifies 4 ranges, the last one empty. I repeat, why is there a trailing comma on these columns but not on columns 15, 16, 17, and 18?
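
To illustrate the separator point outside awk: Python's str.split behaves the same way, treating a trailing comma as the start of one more (empty) field. The sketch below uses the two columns from the sample line; parse_coords is a hypothetical helper that simply drops empty fields.

```python
def parse_coords(field):
    """Split a comma-separated coordinate list, ignoring empty fields
    such as the one produced by a trailing comma."""
    return [int(tok) for tok in field.split(",") if tok]

exon_starts = "11873,12594,13402,"   # column 13 of the sample line
exon_stops = "12227,12721,14409,"    # column 14 of the sample line

# A naive split yields four fields; the last one is empty.
naive = exon_starts.split(",")
print(len(naive), repr(naive[-1]))

# Dropping empty fields leaves the three intended start/stop pairs.
pairs = list(zip(parse_coords(exon_starts), parse_coords(exon_stops)))
print(pairs)
```

With the empty field filtered out, columns with and without a trailing comma can be parsed by the same helper.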

Should all ranges in an interval be in increasing numeric order with no overlaps between ranges in different intervals? (Note that every value selected by Interval 3 in your sample input line is also in the last range in Interval 1 (and Interval 4).) So, is a match against this line for a line in file 1 (the input) with field 4 set to 14400 supposed to be labeled ambiguous, or is it supposed to be labeled with some combination of geneA, geneC, and geneD?
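
The overlap can be checked directly against the sample line. This is a small Python sketch (not awk) that assumes range ends are inclusive, which the thread does not state. The interval names follow the numbering in the first post, and the function name is made up for illustration.

```python
# Ranges taken from the sample line of the second file.
# Interval 1: columns 13/14 paired; Interval 2: columns 15-16;
# Interval 3: columns 17-18. Inclusive ends are an assumption.
intervals = {
    "Interval 1": [(11873, 12227), (12594, 12721), (13402, 14409)],
    "Interval 2": [(11874, 12188)],
    "Interval 3": [(13640, 14408)],
}

def matching_intervals(pos):
    """Return the names of all intervals with a range containing pos."""
    return [name for name, ranges in intervals.items()
            if any(lo <= pos <= hi for lo, hi in ranges)]

print(matching_intervals(14400))     # falls in Interval 1 and Interval 3
print(matching_intervals(14458050))  # falls in no interval
```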

Please give us a more extensive example that includes at least one line with one match, at least one line that has multiple matches, and something that shows us how you determine whether a range specified by columns 13 and 14 is counted as being in Interval 1 or Interval 4 and show us the EXACT output that you want in all of these cases!

Don Cragun

04-08-2013

To describe it in detail: in file 2, each row is a gene. Most genes have one 5UTR start, one 5UTR stop, one 3UTR start and one 3UTR stop. A gene can have n exons, each with a start and a stop, and n-1 introns, each with an intron start and an intron stop.

Attached is the image

So one gene can have trailing commas for exons and introns but not for UTRs.

If the position (4th column) of file 1 falls between a 5UTR start and 5UTR stop, then add an additional column to that row of file 1 containing 5UTR_intron
If the position (4th column) of file 1 falls between a 5UTR start and 5UTR stop and also between an exon start and stop, then append 5UTR_exon to that row
If the position falls between an exon start and stop, then label the row Exon
If the position falls between an intron start and stop, then label it Intron
If the position (4th column) of file 1 falls between a 3UTR start and 3UTR stop, then add an additional column to that row of file 1 containing 3UTR_intron
If the position (4th column) of file 1 falls between a 3UTR start and 3UTR stop and also between an exon start and stop, then append 3UTR_exon to that row
If the position falls in the exons of 2 genes, then label it ambiguous
If the position does not fall in any of the above categories, then label it intergenic

Column 13: Exon starts
Column 14: Exon stops
Column 15: 5UTR start
Column 16: 5UTR stop
Column 17: 3UTR start
Column 18: 3UTR stop
Column 19: Intron start
Column 20: Intron stop

Note: Some genes do not have UTRs and some do not have any introns
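
The rules above, applied to a single gene row, could look roughly like the following. It is a Python sketch rather than awk and rests on several assumptions not confirmed in the thread: file 2 is tab-separated, range ends are inclusive, and each start/stop pair is normalised with min/max because the intron columns of the sample row appear in the opposite order. The function names are hypothetical. The cross-gene rules (ambiguous, intergenic) are left to the caller, so classify returns None when no rule matches within the gene.

```python
def parse_coords(field):
    """Split a comma-separated list of coordinates, skipping empty fields."""
    return [int(tok) for tok in field.split(",") if tok.strip()]

def parse_gene(line):
    """Turn one tab-separated row of file 2 into a dict of ranges."""
    f = line.rstrip("\n").split("\t")
    f += [""] * (20 - len(f))  # tolerate rows with missing trailing columns

    def ranges(a, b):
        # Pair the i-th value of column a with the i-th value of column b.
        pairs = zip(parse_coords(f[a - 1]), parse_coords(f[b - 1]))
        return [(min(p), max(p)) for p in pairs]

    return {"exon": ranges(13, 14), "utr5": ranges(15, 16),
            "utr3": ranges(17, 18), "intron": ranges(19, 20)}

def inside(pos, ranges):
    """True if pos lies in any (lo, hi) range, ends included (assumption)."""
    return any(lo <= pos <= hi for lo, hi in ranges)

def classify(pos, gene):
    """Label pos relative to one gene, or return None if no rule matches."""
    in_exon = inside(pos, gene["exon"])
    for key, label in (("utr5", "5UTR"), ("utr3", "3UTR")):
        if inside(pos, gene[key]):
            return label + ("_exon" if in_exon else "_intron")
    if in_exon:
        return "Exon"
    if inside(pos, gene["intron"]):
        return "Intron"
    return None

# The sample row of file 2 from the first post.
sample = "\t".join([
    "chr1", "11873", "14409", "uc010nxq.1", "uc010nxq.1", "chr1", "+",
    "11873", "14409", "12189", "13639", "3",
    "11873,12594,13402,", "12227,12721,14409,",
    "11874", "12188", "13640", "14408", "12593,13401", "12228,12722,",
])
gene = parse_gene(sample)
for pos in (12000, 12400, 12600, 14400, 14458050):
    print(pos, classify(pos, gene))
```

Checking the UTR rules before the plain Exon and Intron rules mirrors the order in which the categories are listed; a different priority would need only a reordering of the tests in classify.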

[Attached image: img_0360jpg]

Diya123

04-08-2013

You still need to give us a reasonable example (with more than 1 line in each input file) and with the exact output that you want produced given those input files!

How do we determine which of the 30,000 lines in file 2 is supposed to be matched against each of the 10,000,000 lines in file 1? Do you want 30,000 lines of output for each line in file 1?
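
One possible answer, offered only as a sketch under assumptions the thread has not confirmed: match on chromosome (column 3 of file 1 against column 1 of file 2) and on the overall gene span (columns 2 and 3 of file 2), and index the 30,000 gene rows by fixed-width bins so that each of the 10 million positions is compared with only a few candidates. This is Python rather than awk. BIN, build_index and candidates are hypothetical names, and the second gene in the usage lines is invented for illustration.

```python
from collections import defaultdict

BIN = 100_000  # bin width in bases; a hypothetical tuning value

def build_index(genes):
    """Index (chrom, start, end, name) tuples by every bin they overlap."""
    index = defaultdict(list)
    for gene in genes:
        chrom, start, end, _name = gene
        for b in range(start // BIN, end // BIN + 1):
            index[(chrom, b)].append(gene)
    return index

def candidates(index, chrom, pos):
    """Return the genes on chrom whose [start, end] span contains pos."""
    return [g for g in index.get((chrom, pos // BIN), ())
            if g[1] <= pos <= g[2]]

genes = [
    ("chr1", 11873, 14409, "uc010nxq.1"),      # span from the sample row
    ("chr1", 14000, 250000, "made_up_gene"),   # invented, for illustration
]
index = build_index(genes)
print([g[3] for g in candidates(index, "chr1", 14400)])
print([g[3] for g in candidates(index, "chr5", 14458050)])
```

Only the candidate rows returned for a position would then need the detailed exon, intron and UTR checks, and an empty list would correspond to the no-match case.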

Don Cragun
