Appending information from 2nd file into 1st based on intervals


 
# 1  
Old 04-04-2013

Hi,

I am trying to gather information from the second file and append it to the first file.

first file (input):
Code:
HWUSI-EAS000_29:1:100:10000:11479#0/1   +       chr5    14458050        ATTGGCTGAGGTCCTACTAGTTGTGATGTGTAAGTGT   HHHHHHGDGGEDGGGDGCGEDDEFFFAGE 0

second file:
Code:
chr1	11873	14409	uc010nxq.1	uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	11874	12188	13640	14408	12593,13401	12228,12722,

For every value in column 4 of the first file, I want to check which interval in the second file it falls into; for example, which of the following intervals contains the value "14458050":

Interval 1: the values in column 13 (11873,12594,13402,) and column 14 (12227,12721,14409,) are paired position by position, i.e. 11873-12227, 12594-12721 and 13402-14409
Interval 2: between column 15 and column 16
Interval 3: between column 17 and column 18
Interval 4: same as Interval 1 but a different category

If 14458050 falls in Interval 1, append geneA to the corresponding row of file 1; for Interval 2 append geneB, for Interval 3 append geneC, and for Interval 4 append geneD. If it falls in none of these categories, append NONE.
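The pairing described for Interval 1 could be sketched as follows. This is only an illustration in Python rather than awk; the helper names `parse_list` and `paired_ranges` are made up for this example, and treating both ends of a range as inclusive is an assumption the thread does not confirm.

```python
def parse_list(field):
    """Split a comma-separated field into integers, skipping the empty
    piece that a trailing comma leaves behind."""
    return [int(x) for x in field.split(",") if x]


def paired_ranges(starts_field, stops_field):
    """Pair the i-th start with the i-th stop."""
    return list(zip(parse_list(starts_field), parse_list(stops_field)))


# Columns 13 and 14 of the sample row in the second file.
ranges = paired_ranges("11873,12594,13402,", "12227,12721,14409,")
print(ranges)  # [(11873, 12227), (12594, 12721), (13402, 14409)]

# Assumption: a position matches when start <= position <= stop.
pos = 14458050
in_interval1 = any(lo <= pos <= hi for lo, hi in ranges)
print(in_interval1)  # False
```

With the sample values, 14458050 lies far outside all three pairs, which is consistent with a NONE result for that position.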

output:
Code:
chr1	11873	14409	uc010nxq.1	uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	11874	12188	13640	14408	12593,13401	12228,12722,  NONE

I am not sure whether awk is an easy way to do this. I have 10 million rows.

Thanks,
# 2  
Old 04-04-2013
Is your output correct? I mean you mentioned it should be appended to the first file right?
And file2 has only one line?

Also, your intervals are overlapping.

--ahamed

Last edited by ahamed101; 04-04-2013 at 09:26 PM..
# 3  
Old 04-04-2013
I agree with ahamed that this is way underspecified:

Do you have 10 million rows in input and 10 million rows in second file?

I can guess that geneA and geneB are fields 5 and 6 in input. Where are geneC and geneD?

How do you tell the difference between Interval 1 and Interval 4; i.e., how do we determine what category is being processed? What is a category?

Why do some ranges in second file have no trailing comma and others have a trailing comma as in interval 3 as specified by column 17 (12593,13401) and column 18 (12228,12722,)?

Are there always three ranges in Interval 1, one range in Interval 2, and two ranges in Interval 3?

Will there always be zero or one match in second file for each line in input, or could there be multiple matches in second file for some lines in input? If there are multiple matches what is the output supposed to be?
# 4  
Old 04-04-2013
Hi,

Sorry for not being clear and specific

1) My file 1 has 10M rows and file 2 has 30,000 rows.
2) geneA, geneB, geneC and geneD are our desired categories, based on where the position in column 4 of file 1 falls in file 2.
3) Interval 1 holds my exon starts and exon stops. There can be one, two or many of them, which is why they are separated by commas. The same goes for Interval 4, which holds the intron starts and stops. The columns in between (15, 16, 17 and 18) each hold just one start or one stop.
So my intervals should be columns 13 and 14 taken pairwise (1st value in column 13 with 1st value in column 14, and so on) for Interval 1 and Interval 4.
Interval 2: between columns 15 and 16; Interval 3: between columns 17 and 18.
4) The number of ranges in Interval 1 is always 1 or more, whereas for Interval 4 it can be 0 or more. For Intervals 2 and 3 there is always exactly one range.
There may be zero matches or more than one match. For zero matches we assign "none" at the end of that row in file 1. For more than one match we assign "ambiguous" at the end of the row. Otherwise, a match in Interval 1 is assigned geneA, Interval 2 geneB, Interval 3 geneC and Interval 4 geneD.
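As a rough sketch of the none / ambiguous / geneX rule, something like the following could work. The function name `classify` and the dictionary layout are hypothetical, bounds are assumed to be inclusive, and geneD is left out because the columns holding Interval 4 are not stated at this point. The sketch also counts two different labels matching within one row as "ambiguous", which is only one possible reading of the rule.

```python
def classify(pos, intervals):
    """Return the single matching label, 'none' when nothing matches,
    or 'ambiguous' when more than one label matches.

    `intervals` maps a label to a list of (start, stop) pairs.
    """
    hits = [label for label, ranges in intervals.items()
            if any(lo <= pos <= hi for lo, hi in ranges)]
    if not hits:
        return "none"
    if len(hits) > 1:
        return "ambiguous"
    return hits[0]


# Ranges taken from the sample row of the second file.
sample = {
    "geneA": [(11873, 12227), (12594, 12721), (13402, 14409)],  # columns 13/14
    "geneB": [(11874, 12188)],                                  # columns 15/16
    "geneC": [(13640, 14408)],                                  # columns 17/18
}

print(classify(14458050, sample))  # none
print(classify(12600, sample))     # geneA
print(classify(12000, sample))     # ambiguous
```

Position 12000 comes out as ambiguous because, in the sample row, the range in columns 15 and 16 lies inside the first range of columns 13 and 14.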
Let me know if this is still not clear.

Thanks
# 5  
Old 04-04-2013
Your description of Interval 4 is: "Same as Interval 1 but different category". So, I repeat: if there is a match on one of the ranges specified by columns 13 and 14, what determines whether the result is geneA or geneD?

If there are multiple matches (but all of the matches are in the same interval) is the result "ambiguous" or is it "geneX" where X corresponds to the interval that was matched on all matching lines?

The way awk works, field separators separate fields rather than terminate them. So it looks like a set of ranges specified by 11873,12594,13402, and 12227,12721,14409, in columns 13 and 14 is specifying 4 ranges. I repeat, why is there a trailing , on these columns but not on columns 15, 16, 17, and 18?

Should all ranges in an interval be in increasing numeric order, with no overlaps between ranges in different intervals? Note that every value selected by Interval 3 in your sample input line is also in the last range in Interval 1 (and Interval 4). So, for a line in the input (file 1) with field 4 set to 14400, is a match against this line supposed to be labeled ambiguous, or is it supposed to be labeled with some combination of geneA, geneC, and geneD?
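The overlap is easy to confirm with the sample values. The snippet below is only a check in Python, with inclusive bounds assumed and a hypothetical helper `contains`:

```python
exons = [(11873, 12227), (12594, 12721), (13402, 14409)]  # columns 13/14
interval3 = (13640, 14408)                                # columns 17/18


def contains(outer, inner):
    """True when `inner` lies entirely within `outer` (inclusive bounds)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Which Interval 1 ranges fully enclose Interval 3?
enclosing = [r for r in exons if contains(r, interval3)]
print(enclosing)  # [(13402, 14409)]

# A position of 14400 therefore matches both intervals.
pos = 14400
matches = []
if any(lo <= pos <= hi for lo, hi in exons):
    matches.append("Interval 1")
if interval3[0] <= pos <= interval3[1]:
    matches.append("Interval 3")
print(matches)  # ['Interval 1', 'Interval 3']
```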

Please give us a more extensive example that includes at least one line with one match, at least one line that has multiple matches, and something that shows us how you determine whether a range specified by columns 13 and 14 is counted as being in Interval 1 or Interval 4 and show us the EXACT output that you want in all of these cases!
# 6  
Old 04-08-2013
To describe it in detail: in file 2, each row is a gene. Most genes have one 5UTR start, one 5UTR stop, one 3UTR start and one 3UTR stop. A gene can have n exons, each with a start and a stop, and n-1 introns, each with an intron start and an intron stop.

Attached is the image

So one gene can have trailing commas for exons and introns but not for UTRs.

If the position (4th column of file 1) falls between a 5UTR start and 5UTR stop, add an additional column to that row of file 1 containing 5UTR_intron
If the position falls between a 5UTR start and 5UTR stop and also between an exon start and stop, append 5UTR_exon to that row
If the position falls between an exon start and stop, label the row Exon
If the position falls between an intron start and stop, label it Intron
If the position falls between a 3UTR start and 3UTR stop, add an additional column to that row of file 1 containing 3UTR_intron
If the position falls between a 3UTR start and 3UTR stop and also between an exon start and stop, append 3UTR_exon to that row
If the position falls in the exons of 2 genes, label it ambiguous
If the position does not fall in any of the above categories, label it intergenic

Column 13: Exon starts
Column 14: Exon stops
Column 15: 5UTR start
Column 16: 5UTR stop
Column 17: 3UTR start
Column 18: 3UTR stop
Column 19: Intron start
Column 20: Intron stop

Note: Some genes do not have UTRs and some do not have any introns
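One way these rules might be put together is sketched below in Python. It is not a tested solution for the real files, and several things in it are assumptions: fields are tab-separated, bounds are inclusive, the chromosome in column 3 of file 1 must equal column 1 of file 2, an empty field means the gene has no such feature, and each start/stop pair is put into (low, high) order because the sample row has the larger values in column 19. All function names are invented for the example.

```python
def parse_list(field):
    """Comma-separated integers; a trailing comma or empty field adds nothing."""
    return [int(x) for x in field.split(",") if x.strip()]


def pairs(a_field, b_field):
    """Pair the i-th values of two fields, ordering each pair as (low, high)."""
    return [(min(a, b), max(a, b))
            for a, b in zip(parse_list(a_field), parse_list(b_field))]


def parse_gene(line):
    """Turn one tab-separated row of the second file into a dict of ranges."""
    f = line.rstrip("\n").split("\t")
    return {
        "chrom": f[0],
        "exons": pairs(f[12], f[13]),    # columns 13 and 14
        "utr5": pairs(f[14], f[15]),     # columns 15 and 16
        "utr3": pairs(f[16], f[17]),     # columns 17 and 18
        "introns": pairs(f[18], f[19]),  # columns 19 and 20
    }


def inside(pos, ranges):
    return any(lo <= pos <= hi for lo, hi in ranges)


def label_in_gene(pos, gene):
    """Label for one gene, or None when the position is outside all its ranges."""
    in_exon = inside(pos, gene["exons"])
    if inside(pos, gene["utr5"]):
        return "5UTR_exon" if in_exon else "5UTR_intron"
    if inside(pos, gene["utr3"]):
        return "3UTR_exon" if in_exon else "3UTR_intron"
    if in_exon:
        return "Exon"
    if inside(pos, gene["introns"]):
        return "Intron"
    return None


def classify(chrom, pos, genes):
    """Apply the rules across all genes on the same chromosome."""
    same_chrom = [g for g in genes if g["chrom"] == chrom]
    if sum(inside(pos, g["exons"]) for g in same_chrom) >= 2:
        return "ambiguous"
    for g in same_chrom:
        label = label_in_gene(pos, g)
        if label:
            return label  # assumption: the first gene that matches wins
    return "intergenic"


row = ("chr1\t11873\t14409\tuc010nxq.1\tuc010nxq.1\tchr1\t+\t11873\t14409\t"
       "12189\t13639\t3\t11873,12594,13402,\t12227,12721,14409,\t"
       "11874\t12188\t13640\t14408\t12593,13401\t12228,12722,")
genes = [parse_gene(row)]
print(classify("chr5", 14458050, genes))  # intergenic
print(classify("chr1", 12000, genes))     # 5UTR_exon
print(classify("chr1", 12600, genes))     # Exon
print(classify("chr1", 12300, genes))     # Intron
print(classify("chr1", 14000, genes))     # 3UTR_exon
```

This scans every gene for every position, which is fine for a demonstration but probably too slow for 10M rows against 30,000 genes; grouping the genes by chromosome and sorting them by start would be the obvious next step.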

Last edited by Diya123; 04-09-2013 at 03:02 AM..
# 7  
Old 04-08-2013
You still need to give us a reasonable example (with more than 1 line in each input file) and with the exact output that you want produced given those input files!

How do we determine which of the 30,000 lines in file 2 is supposed to be matched against each of the 10,000,000 lines in file 1? Do you want 30,000 lines of output for each line in file 1?