Match ids and print original file Post: 302777517

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

print remaining part from the first-match within a file

Hi, I was looking for Unix command(s) to find the first occurrence of a given pattern within a file and print the remaining part. Below is an example of what I am looking for: let's say there is a file named myfile.txt. Now, the command I am looking for will do the following

2. Shell Programming and Scripting

Strings from one file which exactly match to the 1st column of other file and then print lines.

Hi, I have two files. 1st file has 1 column (huge file containing ~19200000 lines) and 2nd file has 2 columns (small file containing ~6000 lines). ################################# huge_file.txt a a ab b ################################## small_file.txt a 1.5 b 2.5 ab ...

3. Shell Programming and Scripting

print when column match with other file

Hello all, please help. There are two file like this: file1: 1197510.0 294777.7 9666973.0 21.6 1839.8 1197510.0 294777.7 9666973.0 413.2 2075.9 1197510.0 294777.7 9666973.0 689.3 2260.0 ...

4. UNIX for Dummies Questions & Answers

Match values/IDs from column and text files

Hello, I am trying to modify 2 files, to yield results in a 3rd file. File-1 is an 8-column file, separated by tabs. 1234:1 xyz1234 blah blah blah blah blah blah 1234:1 xyz1233 blah blah blah blah blah blah 1234:1 abc1234 blah blah blah blah blah blah n/a RRR0000 blah blah blah...

5. Shell Programming and Scripting

AWK print and retain original format

I have a file with very specific column spacing formatting, I wish to do the following: awk '{print $1, $2, $3, $4, $5, $6, $19-$7, $20-$8, $21-$9, $10, $11, $12}' merge.pdb > vector.pdb but the format gets ruined. I have tried with print -f but to no avail....

6. Shell Programming and Scripting

Match and print columns in second file

Hi All, I have to match each row in file 1 with 1st row in file 2 and print the corresponding column from file2. I am trying to use an awk script to do this. For example cat File1 X1 X3 X4 cat File2 ID X1 X2 X3 X4 A 1 6 2 1 B 2 7 3 3 C 3 8 4 1 D 4 9 1 1

7. Shell Programming and Scripting

Match ids

Hello, I have two files File 1 with 10 columns rsid position ........ xx 1:10000 File 2 position 1:10000 2:2000 .... I need to extract the IDs given in file 2(column1) from file 1 (column2) and print all columns from file1. I am trying this command

8. UNIX for Beginners Questions & Answers

Count multiple columns and print original file

Hello, I have two tab files with headers File1: with 4 columns header1 header2 header3 header4 44 a bb 1 57 c ab 4 64 d d 5 File2: with 26 columns header1.. header5 header6 header7 ... header 22...header26 id1 44 a bb id2 57 ...

9. UNIX for Beginners Questions & Answers

Match duplicate ids in two files

I have two text files. File 1 has 150 ids but all the ids exist in duplicate, so it has 300 ids in total. File 2 has 1500 ids but all exist in duplicate, so file 2 has 300 ids in total. I want to match the first occurrence of every id in file 1 with the first occurrence of that id in file 2 and 2nd...

10. Shell Programming and Scripting

awk to print match or non-match and select fields/patterns for non-matches

In the awk below I am trying to output those lines that Match between file1 and file2, those Missing in file1, and those missing in file2. Using each $1,$2,$4,$5 value as a key to match on, that is, if those 4 fields are found in both files they match, but if those 4 fields are not found then missing...

LEARN ABOUT DEBIAN

tabix

tabix(1)						       Bioinformatics tools							  tabix(1)

NAME

       bgzip - Block compression/decompression utility

       tabix - Generic indexer for TAB-delimited genome position files

SYNOPSIS

       bgzip [-cdhB] [-b virtualOffset] [-s size] [file]

       tabix [-0lf] [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol] [-S lineSkip] [-c metaChar] in.tab.bgz [region1 [region2 [...]]]

DESCRIPTION

       Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when no region is given on the
       command line. The input data file must be sorted by position and compressed with bgzip, which has a gzip(1)-like interface. After
       indexing, tabix can quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval
       also works over a network if a URI is given as the file name; in this case the index file is downloaded if it is not present locally.
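
The region format "chr:beginPos-endPos" can be illustrated with a small shell sketch. The helper name parse_region is hypothetical and is not part of tabix; it only shows how such a string breaks down into a sequence name, a start and an end, and it assumes the region is given in its full form with both a colon and a dash. Thousands separators, as used in the EXAMPLE section, are dropped.

```shell
#!/bin/sh
# parse_region is a hypothetical helper for illustration, not a tabix command.
# It splits "chr:beginPos-endPos" into "chr begin end" and removes commas.
# Assumption: the region contains exactly one ':' and one '-' separator.
parse_region() {
    region=$1
    chr=${region%%:*}                           # text before the first colon
    range=${region#*:}                          # text after the first colon
    range=$(printf '%s' "$range" | tr -d ',')   # 10,000,000 becomes 10000000
    beg=${range%%-*}                            # text before the dash
    end=${range#*-}                             # text after the dash
    printf '%s %s %s\n' "$chr" "$beg" "$end"
}

parse_region 'chr1:10,000,000-20,000,000'
```

The call at the end prints the three components separated by single spaces.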

OPTIONS OF TABIX

       -p STR	 Input format for indexing. Valid values are: gff, bed, sam, vcf and psltab. This option should not be applied together  with  any
		 of -s, -b, -e, -c and -0; it is not used for data retrieval because this setting is stored in the index file. [gff]

       -s INT	 Column of sequence name. Options -s, -b, -e, -S, -c and -0 are all stored in the index file and thus not used in data retrieval.
		 [1]

       -b INT	 Column of start chromosomal position. [4]

       -e INT	 Column of end chromosomal position. The end column can be the same as the start column. [5]

       -S INT	 Skip first INT lines in the data file. [0]

       -c CHAR	 Skip lines starting with character CHAR. [#]

       -0	 Specify that the position in the data file is 0-based (e.g. UCSC files) rather than 1-based.

       -h	 Print the header/meta lines.

       -B	 The second argument is a BED file. When this option is in use, the input file need not be sorted or indexed. The entire input will
		 be read sequentially. Nonetheless, with this option, the format of the input must be specified correctly on the command line.

       -f	 Force overwriting of the index file if it is present.

       -l	 List the sequence names stored in the index file.
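
To make the roles of the -s, -b, -e and -c settings concrete, here is a rough sketch of what an overlap query means, written as a plain linear scan with awk over uncompressed input. This is not how tabix works internally (tabix uses its index to avoid scanning). The function name overlap_scan and its argument order are made up for illustration, and the sketch assumes 1-based, inclusive start and end columns. The column defaults 1, 4 and 5 mirror the defaults listed above.

```shell
#!/bin/sh
# overlap_scan is a hypothetical helper, not a tabix command.
# Usage: overlap_scan CHR BEG END [seqCol] [begCol] [endCol] < data.tab
# Reads TAB-delimited lines on stdin and prints those overlapping CHR:BEG-END.
# Assumption: coordinates are 1-based and inclusive on both ends.
overlap_scan() {
    awk -F '\t' -v chr="$1" -v qb="$2" -v qe="$3" \
        -v s="${4:-1}" -v b="${5:-4}" -v e="${6:-5}" '
        /^#/ { next }    # skip meta lines, like the default of -c
        # keep a line when the sequence name matches and the intervals intersect
        $s == chr && ($b + 0) <= (qe + 0) && ($e + 0) >= (qb + 0)
    '
}
```

For example, `overlap_scan chr1 150 320 < in.gff` would print every chr1 feature whose interval intersects positions 150 to 320. On large files this scan is far slower than an indexed lookup, which is the gap tabix is meant to close.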

EXAMPLE

       (grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz;

       tabix -p gff sorted.gff.gz;

       tabix sorted.gff.gz chr1:10,000,000-20,000,000;
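
The first command above keeps the "#" meta lines at the top and sorts the remaining lines by sequence name and then numerically by start position, which is the ordering tabix expects. The sketch below isolates just that sorting step so it can be tried without bgzip or tabix. The function name sort_gff is hypothetical, and the explicit TAB field separator and LC_ALL=C are additions not present in the original command (they are assumptions made here for predictable behaviour when fields contain spaces).

```shell
#!/bin/sh
# sort_gff is a hypothetical helper wrapping the sort step from the example.
# Usage: sort_gff in.gff
# Prints '#' meta lines first, then data lines sorted by column 1 (sequence
# name) and column 4 (start position, numeric).
sort_gff() {
    tab=$(printf '\t')
    grep '^#' "$1"                      # meta lines stay on top, in file order
    grep -v '^#' "$1" |
        LC_ALL=C sort -t "$tab" -k1,1 -k4,4n
}

# The output can then be piped on as in the example, e.g.:
#   sort_gff in.gff | bgzip > sorted.gff.gz
```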

NOTES

       It  is  straightforward	to  achieve overlap queries using the standard B-tree index (with or without binning) implemented in all SQL data-
       bases, or the R-tree index in PostgreSQL and Oracle. But there are still many reasons to use tabix. Firstly, tabix directly  works  with  a
       lot  of	widely used TAB-delimited formats such as GFF/GTF and BED. We do not need to design database schema or specialized binary formats.
       Data do not need to be duplicated in different formats, either. Secondly, tabix works on compressed data files while most SQL databases	do
       not. The GenCode annotation GTF can be compressed down to 4%. Thirdly, tabix is fast. The same indexing algorithm is known to work
       efficiently for an alignment with a few billion short reads. SQL databases probably cannot easily handle data at this scale. Last but
       not least, tabix supports remote data retrieval. One can put the data file and the index on an FTP or HTTP server, and other users or
       even web services will be able to get a slice without downloading the entire file.

AUTHOR

       Tabix was written by Heng Li. The BGZF library was originally implemented by Bob Handsaker and modified by Heng Li for remote  file  access
       and in-memory caching.

SEE ALSO

       samtools(1)

tabix-0.2.0							    11 May 2010 							  tabix(1)