Sponsored Content
Full Discussion: extract data from 2 files
Top Forums Shell Programming and Scripting extract data from 2 files Post 302597502 by Diya123 on Friday 10th of February 2012 01:19:52 PM
Old 02-10-2012
extract data from 2 files

file 1
Code:
WASH7P        17232,18267,18500,20564            17368,18362,18554,21139

file 2


Code:
chr1	14969	15038	Exon	WASH7P
chr1	17232	17368	Exon	WASH7P
chr1	17258	17368	Exon	WASH7P
chr1	17605	17742	Exon	WASH7P
chr1	18267	18362	Exon	WASH7P
chr1	18267	18366	Exon	WASH7P
chr1	18267	18369	Exon	WASH7P
chr1	18267	18379	Exon	WASH7P
chr1	18496	18554	Exon	WASH7P
chr1	18500	18554	Exon	WASH7P
chr1	18912	19139	Exon	WASH7P

What I need to check is for each gene if column 2 of file 2 is present in column 2 of file 1 and column 3 of file 2 is present in column 3 of file 1 . If so then these two are supposed to be placed in column 4 and 5 of file 1. I just explained one example here. I have 15000 entries with different genes(column5 of file 2). so a for loop should be there to iterate..

o/p

Code:
 WASH7P        17232,18267,18500,20564            17368,18362,18554,21139  17232,18267,18500   17232,18267,18500

Thanks,

Last edited by Franklin52; 02-11-2012 at 08:59 AM.. Reason: Adding code tags
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Perl script for extract data from xml files

Hi All, Prepare a perl script for extracting data from xml file. The xml data look like as AC StartTime="1227858839" ID="88" ETime="1227858837" DSTFlag="false" Type="2" Duration="303" /> <AS StartTime="1227858849" SigPairs="119 40 98 15 100 32 128 18 131 23 70 39 123 20 120 27 100 17 136 12... (3 Replies)
Discussion started by: allways4u21
3 Replies

2. Shell Programming and Scripting

extract the relevant data files for a quarter

CTB_KT_OllyotLvos_20081204_164352_200811.txt CTB_KT_LN_utahfwd_20081204_164352_200811.txt CTB_KT_LN_utahfwd_Summ_20081204_164352_200811.txt CTB_KT_PML_astdt_prFr_20081204_210153_200811.txt CTB_KT_PML_astdt_prOt_20081204_210153_200811.txt CTB_KT_PML_astdt_Nopr_20081204_210153_200811.txt... (7 Replies)
Discussion started by: w020637
7 Replies

3. UNIX for Dummies Questions & Answers

AWK, extract data from multiple files

Hi, I'm using AWK to try to extract data from multiple files (*.txt). The script should look for a flag that occurs at a specific position in each file and it should return the data to the right of that flag. I should end up with one line for each file, each containing 3 columns:... (8 Replies)
Discussion started by: Liverpaul09
8 Replies

4. Shell Programming and Scripting

How to extract data from indexed files (ISAM files) maintained in an unix server.

Hi, Could someone please assist on a quick way of How to extract data from indexed files (ISAM files) maintained in an UNIX(AIX) server.The file data needs to be extracted in flat text file or CSV or excel format . Usually we have programs in microfocus COBOL to extract data, but would like... (2 Replies)
Discussion started by: devina
2 Replies

5. Shell Programming and Scripting

extract data with awk from html files

Hello everyone, I'm new to this forum and i am new as a shell scripter. my problem is to have html files in a directory and I would like to extract from these some data that lies between two different lines Here's my situation <td align="default"> oxidizability (mg / l): data_to_extract... (6 Replies)
Discussion started by: sbobotex
6 Replies

6. Shell Programming and Scripting

Extract data with awk and write to several files

Hi! I have one file with data that looks like this: 1 data data data data 2 data data data data 3 data data data data . . . 1 data data data data 2 data data data data 3 data data data data . . . I would like to have awk to write each block to a separate file, like this: 1... (3 Replies)
Discussion started by: LinWin
3 Replies

7. Shell Programming and Scripting

How to extract information from two files with data range

Hi, I want to make a query about extracting data from two files that both have data ranges. the data that i want to extract; when there is matching between file1 column 2 is equal to file2 column2 , and file1 column 3 and column 4 is within the range of file2 columns 3 and 4. I would like rows... (1 Reply)
Discussion started by: houkto
1 Replies

8. UNIX for Dummies Questions & Answers

Extract common data out of multiple files

I am trying to extract common list of Organisms from different files For example I took 3 files and showed expected result. In real I have more than 1000 files. I am aware about the useful use of awk and grep but unaware in depth so need guidance regarding it. I want to use awk/ grep/ cut/... (7 Replies)
Discussion started by: macmath
7 Replies

9. Shell Programming and Scripting

Extract data in tabular format from multiple files

Hi, I have directory with multiple files from which i need to extract portion of specif lines and insert it in a new file, the new file will contain a separate columns for each file data. Example: I need to extract Value_1 & Value_3 from all files and insert in output file as below: ... (2 Replies)
Discussion started by: belalr
2 Replies

10. Shell Programming and Scripting

Match and extract data using two files

Hello, Using the information in file 1, I would like to extract from file2 all rows which matchs in column 3. file 1 1233 1230 1231 1232 file2 65733.00 19775.00 1220 65733.00 19793.00 1220 65733.00 19801.00 1220 65733.00 19809.00 1231 65733.00 19817.00 ... (2 Replies)
Discussion started by: jiam912
2 Replies
tabix(1)						       Bioinformatics tools							  tabix(1)

NAME
bgzip - Block compression/decompression utility tabix - Generic indexer for TAB-delimited genome position files SYNOPSIS
bgzip [-cdhB] [-b virtualOffset] [-s size] [file] tabix [-0lf] [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol] [-S lineSkip] [-c metaChar] in.tab.bgz [region1 [region2 [...]]] DESCRIPTION
Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the com- mand-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface. After indexing, tabix is able to quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval also works over network if URI is given as a file name and in this case the index file will be downloaded if it is not present locally. OPTIONS OF TABIX
-p STR Input format for indexing. Valid values are: gff, bed, sam, vcf and psltab. This option should not be applied together with any of -s, -b, -e, -c and -0; it is not used for data retrieval because this setting is stored in the index file. [gff] -s INT Column of sequence name. Option -s, -b, -e, -S, -c and -0 are all stored in the index file and thus not used in data retrieval. [1] -b INT Column of start chromosomal position. [4] -e INT Column of end chromosomal position. The end column can be the same as the start column. [5] -S INT Skip first INT lines in the data file. [0] -c CHAR Skip lines started with character CHAR. [#] -0 Specify that the position in the data file is 0-based (e.g. UCSC files) rather than 1-based. -h Print the header/meta lines. -B The second argument is a BED file. When this option is in use, the input file may not be sorted or indexed. The entire input will be read sequentially. Nonetheless, with this option, the format of the input must be specificed correctly on the command line. -f Force to overwrite the index file if it is present. -l List the sequence names stored in the index file. EXAMPLE
(grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz; tabix -p gff sorted.gff.gz; tabix sorted.gff.gz chr1:10,000,000-20,000,000; NOTES
It is straightforward to achieve overlap queries using the standard B-tree index (with or without binning) implemented in all SQL data- bases, or the R-tree index in PostgreSQL and Oracle. But there are still many reasons to use tabix. Firstly, tabix directly works with a lot of widely used TAB-delimited formats such as GFF/GTF and BED. We do not need to design database schema or specialized binary formats. Data do not need to be duplicated in different formats, either. Secondly, tabix works on compressed data files while most SQL databases do not. The GenCode annotation GTF can be compressed down to 4%. Thirdly, tabix is fast. The same indexing algorithm is known to work effi- ciently for an alignment with a few billion short reads. SQL databases probably cannot easily handle data at this scale. Last but not the least, tabix supports remote data retrieval. One can put the data file and the index at an FTP or HTTP server, and other users or even web services will be able to get a slice without downloading the entire file. AUTHOR
Tabix was written by Heng Li. The BGZF library was originally implemented by Bob Handsaker and modified by Heng Li for remote file access and in-memory caching. SEE ALSO
samtools(1) tabix-0.2.0 11 May 2010 tabix(1)
All times are GMT -4. The time now is 04:26 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy