Appending information from 2nd file into 1st based on intervals


 
# 8  
Old 04-10-2013
File 2 (10 million rows)

Code:
HWUSI-EAS000_29:2:112:15026:1079#0/1    +    chr21    9827004
HWUSI-EAS000_29:2:112:1096:1083#0/1    +    chr21    46529599
HWUSI-EAS000_29:2:112:6116:1092#0/1    +    chr21    9827328
HWUSI-EAS000_29:2:112:7436:1103#0/1    -    chr21    38597405
HWUSI-EAS000_29:2:112:3168:1114#0/1    -    chr21    44836222
HWUSI-EAS000_29:2:112:12481:1110#0/1    +    chr21    45089410
HWUSI-EAS000_29:2:112:16829:1109#0/1    -    chr21    11087783
HWUSI-EAS000_29:2:112:6005:1121#0/1    +    chr21    11180428
HWUSI-EAS000_29:2:112:12016:1128#0/1    -    chr21    38187834
HWUSI-EAS000_29:2:112:4252:1140#0/1    +    chr21    46534847
HWUSI-EAS000_29:2:112:14645:1133#0/1    +    chr21    46493472
HWUSI-EAS000_29:2:112:16002:1130#0/1    -    chr21    47700601
HWUSI-EAS000_29:2:112:13823:1144#0/1    -    chr21    46189143
HWUSI-EAS000_29:2:112:16154:1152#0/1    +    chr21    9827328
HWUSI-EAS000_29:2:112:9792:1159#0/1    +    chr21    9827404
HWUSI-EAS000_29:2:112:1333:1168#0/1    -    chr21    46269533
HWUSI-EAS000_29:2:112:6703:1175#0/1    +    chr21    46517134


File 1 (gene position file)
Code:
hg19.knownCanonical.chrom    Condition_testing    hg19.knownCanonical.chromStart    hg19.knownCanonical.chromEnd    hg19.knownCanonical.transcript    hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.strand    hg19.knownGene.txStart    hg19.knownGene.txEnd    hg19.knownGene.cdsStart    hg19.knownGene.cdsEnd    hg19.knownGene.exonCount    hg19.knownGene.exonStarts    hg19.knownGene.exonEnds    5'UTR_start    5'UTR_stop    3'UTR_start    3'UTR_stop    intron_stop    intron_start
chr1    1    367658    368597    uc010nxu.2    uc010nxu.2    chr1    +    367658    368597    367658    368597    1    367658,    368597,    NA    NA    NA    NA        
chr1    1    1266725    1269844    uc010nyk.2    uc010nyk.2    chr1    +    1266725    1269844    1266725    1269844    6    1266725,1267017,1267403,1268300,1268638,1268885,    1266916,1267318,1268186,1268504,1268759,1269844,    NA    NA    NA    NA    1267016,1267402,1268299,1268637,1268884    
chr1    0    229761980    229795946    uc001hts.1    uc001hts.1    chr1    +    229761980    229795946    229763380    229795044    10    229761980,229763367,229768015,229770663,229779279,229781605,229783256,229786981,229789995,229794846,    229762103,229763506,229768192,229773994,229779440,229781716,229783499,229787069,229790135,229795946,    229761981    229763379    229795045    229795945    229763366,229768014,229770662,229779278,229781604,229783255,229786980,229789994,229794845    
chr1    0    206940947    206945839    uc001hen.1    uc001hen.1    chr1    -    206940947    206945839    206941980    206945780    5    206940947,206943173,206944251,206944700,206945615,    206942073,206943239,206944404,206944760,206945839,    206945838    206945781    206941979    206940948    206943172,206944250,206944699,206945614    
chr21    0    43731776    43735706    uc002zav.3    uc002zav.3    chr21    -    43731776    43735706    43732365    43735526    3    43731776,43733594,43735402,    43732379,43733741,43735706,    43735705    43735527    43732364    43731777    43733593,43735401    
chr21    0    43766466    43771208    uc002zaw.3    uc002zaw.3    chr21    -    43766466    43771208    43766641    43771066    4    43766466,43767594,43769989,43770987,    43766655,43767741,43770139,43771208,    43771207    43771067    43766640    43766467    43767593,43769988,43770986

output:
Code:
HWUSI-EAS000_29:2:112:1096:1083#0/1    +    chr21    46529599 Exon
HWUSI-EAS000_29:2:112:6116:1092#0/1    +    chr21    9827328 3'UTR
HWUSI-EAS000_29:2:112:7436:1103#0/1    -    chr21    38597405 5'UTR_Intron
HWUSI-EAS000_29:2:112:3168:1114#0/1    -    chr21    44836222 intergenic

The appended values in my output are just for reference.

Note: The chr names should also match between the two files. For example, if the 1st row in file2 corresponds to chr21, then we need to look only at the positions for chr21 in file 1.

Thanks,

Last edited by Diya123; 04-11-2013 at 08:36 PM..
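The chromosome-matching note above could be sketched as follows. This is only an illustration under assumptions: the thread's language is awk, but Python is used here; the function name `group_by_chrom` is made up; the file 1 rows are cut down to their first five columns; and column 1 of file 1 is taken as the chromosome key (whether column 1 or column 7 is meant is not settled in the thread, though the two agree in the sample).

```python
from collections import defaultdict

def group_by_chrom(rows):
    """Map chromosome name -> list of (start, end, transcript) from file 1 rows."""
    by_chrom = defaultdict(list)
    for row in rows:
        f = row.split()
        if not f[2].isdigit():  # skip the header line
            continue
        by_chrom[f[0]].append((int(f[2]), int(f[3]), f[4]))
    return dict(by_chrom)

# First five columns of three sample rows from file 1 (abbreviated on purpose).
file1_rows = [
    "chr1 1 367658 368597 uc010nxu.2",
    "chr21 0 43731776 43735706 uc002zav.3",
    "chr21 0 43766466 43771208 uc002zaw.3",
]
genes = group_by_chrom(file1_rows)

# First sample row of file 2; its chromosome is in column 3.
read = "HWUSI-EAS000_29:2:112:15026:1079#0/1 + chr21 9827004".split()
candidates = genes.get(read[2], [])
print([g[2] for g in candidates])
```

With this grouping, a chr21 read is compared only against the two chr21 rows and never against the chr1 row.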
# 9  
Old 04-11-2013
Hello awk group,

Do you still need any more information?

Thanks,
# 10  
Old 04-11-2013
Before we worry about appending the output anywhere for any reason, we need to know what your expected output is for a simple example of file 2 as input. If there are keys and storage required to match up the information, we need to know what the keys are and what states we are detecting on each of those keys, like stock symbols and each successive difference in sales price as the trade day goes on.

I am getting a general drift, but file 1 is both input and output for some reason. Line 1 of file 2 is discarded or moved? The last field of file 1 line one is Exon but the closest in input is 'hg19.knownGene.exonCount hg19.knownGene.exonStarts hg19.knownGene.exonEnds' -- gratuitous capitalization? We have start, Starts, End, ends, 'start stop' then 'stop start'. Somewhere there is a key matrix you are not sharing?

Last edited by DGPickett; 04-11-2013 at 08:53 PM..
# 11  
Old 04-11-2013
Hello DGPickett,

I have pasted the sample output in the previous post.

Also, as I mentioned, the chr names in both files should match (for example, if we are looking at positions on chr21 in file2, then we need to check only the genes on chr21).

Thanks,
# 12  
Old 04-11-2013
Does the chr1 info belong in a different file1? How did file 1 get started with the right keys so we could extend it? Is there a simpler way to relate files with names and parameters?
# 13  
Old 04-11-2013
Hello,

The chr information is already in file 1; it is the first column of file1. For each line (row) in file 2, we need to take its position (column 4) and its chr name, match the chr name to the chr in column 1 of file1, then look for the interval that the position falls into and append the corresponding label to the end of the row in file2.

I hope that is clear. If not, let me know.

Thanks,
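The lookup described in this post might be sketched as below. It is a rough illustration, not a solution the thread agrees on: Python is used instead of awk, the names `parse_gene` and `classify` are hypothetical, intervals are assumed to be inclusive at both ends, and only three labels are produced. The rules for labels such as 3'UTR or 5'UTR_Intron are never stated in the thread, so they are left out. The two positions inside the gene are invented to exercise each branch; 9827004 comes from the file 2 sample.

```python
def parse_gene(row):
    """Pull txStart/txEnd and the exon intervals out of one file 1 row."""
    f = row.split()
    starts = [int(x) for x in f[13].split(",") if x]  # exonStarts column
    ends = [int(x) for x in f[14].split(",") if x]    # exonEnds column
    return {"tx": (int(f[8]), int(f[9])), "exons": list(zip(starts, ends))}

def classify(pos, genes):
    """Label pos against genes that are already known to share its chromosome.

    Assumption: start and end are both inclusive; change the comparisons
    if the coordinates are half-open.
    """
    label = "intergenic"
    for gene in genes:
        tx_start, tx_end = gene["tx"]
        if not (tx_start <= pos <= tx_end):
            continue
        if any(s <= pos <= e for s, e in gene["exons"]):
            return "Exon"
        label = "Intron"  # inside the transcript but outside every exon
    return label

# First 15 columns of the uc002zav.3 row from the file 1 sample.
row = ("chr21 0 43731776 43735706 uc002zav.3 uc002zav.3 chr21 - "
       "43731776 43735706 43732365 43735526 3 "
       "43731776,43733594,43735402, 43732379,43733741,43735706,")
chr21_genes = [parse_gene(row)]

read = "HWUSI-EAS000_29:2:112:15026:1079#0/1 + chr21 9827004"
print(read + " " + classify(int(read.split()[3]), chr21_genes))
print(classify(43733600, chr21_genes))  # made-up position inside the 2nd exon
print(classify(43733000, chr21_genes))  # made-up position between exons 1 and 2
```

A linear scan like this is fine for a handful of genes. For 10 million rows in file 2, the genes on each chromosome would likely need to be sorted and searched with something faster, such as a binary search or an interval tree.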
# 14  
Old 04-11-2013
Quote:
Originally Posted by Diya123
Hello awk group,

Do you still need any kind of information.

Thanks,
I don't think I follow DGPickett's response in message #10 in this thread (we're talking about gene sequences rather than stocks).

But, looking at Diya123's response in message #8 in this thread, I'm still completely lost. There are 17 lines in the file 2 sample. There are 2 lines in the file 1 sample where the 3rd column in file 2 matches the 1st column and the 7th column in file 1. And, there are 4 lines in the specified output file.

I repeat: How do you determine which lines in files 1 and 2 are to be matched together? How do 2 lines in file 1 yield 4 lines of output OR how do 17 lines in file 2 yield only 4 lines of output?

Are we supposed to match file 2 column 3 to file 1 column 1 or column 7 or both?

And now that we have another sample of file 1 data (this time with a header line), why does the column with heading "intron_stop" come before the column with heading "intron_start" when the other "Start" fields come before the corresponding "End" or "Stop" fields? And why are there "intron_stop" fields in every line of the new sample file 1, but no "intron_start" data on any of those lines? (Or, in other words, what ranges are supposed to be inferred when the "intron_stop" field is "43767593,43769988,43770986" and the "intron_start" field is ""?)

Until I see a clear statement that answers these questions, there is nothing I can do to help you!

PLEASE show us the exact output that should be produced given the 21 lines in the latest file 1 sample and the 7 lines in the latest file 2 sample (or a corrected version of it) AND EXPLAIN HOW YOU ARRIVED AT THAT OUTPUT!

Last edited by Don Cragun; 04-11-2013 at 09:59 PM..
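One possible reading of the intron columns, not confirmed by the original poster: in every sample row, each "intron_stop" value equals the next exon start minus 1, so the intron ranges may simply be the gaps between consecutive exons. The sketch below (Python, with a hypothetical function name `intron_ranges`) rebuilds those ranges from the exonStarts and exonEnds columns. Taking the intron start as the previous exon end plus 1 is an assumption tied to inclusive coordinates; with half-open coordinates it would be the exon end itself.

```python
def intron_ranges(exon_starts, exon_ends):
    """Return (start, stop) gaps between consecutive exons.

    Both arguments are the comma-terminated strings found in file 1.
    """
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    # Intron i lies between exon i and exon i + 1.
    return [(ends[i] + 1, starts[i + 1] - 1) for i in range(len(starts) - 1)]

# exonStarts / exonEnds of uc002zaw.3 from the file 1 sample.
introns = intron_ranges("43766466,43767594,43769989,43770987,",
                        "43766655,43767741,43770139,43771208,")
print([stop for _, stop in introns])  # [43767593, 43769988, 43770986]
```

The printed stops match the "intron_stop" field quoted in the question, and a single-exon row such as uc010nxu.2 yields no introns, which agrees with its empty field in the sample.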