10-21-2011
still have duplicates
Thanks.
It works, but I realized that some genes appear in the output file more than once because some of their isoforms happen to have the same length.
So now I would have to take that file and create a new one with only one gene per line.
I used the code below on the original file.
[Code]
nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt
Output file:
test.txt
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
So, starting from this output file, how can I make a new file that has only one line per gene?
I would want output.txt to look like this:
Desired final file:
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
I hope this is clearer.
Thanks!
Last edited by wolf_blue; 10-21-2011 at 04:27 PM..
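One possible way to collapse the file to one line per gene is sketched below. It assumes the columns are whitespace-separated, the gene name is in column 1, and keeping the first line listed for each gene is acceptable (that is what the desired file above shows). The file names test.txt and output.txt come from the post; the array name `seen` is only illustrative. The sample data is recreated here so the snippet runs on its own, and nawk should accept the same program as awk.

```shell
# Recreate the posted output file (header plus 12 data lines).
cat > test.txt <<'EOF'
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
EOF

# Print the header (NR == 1) unconditionally. For every other line, print it
# only the first time its gene name ($1) is seen: seen[$1]++ is 0 (false) on
# the first occurrence, so !seen[$1]++ is true exactly once per gene.
awk 'NR == 1 || !seen[$1]++' test.txt > output.txt

cat output.txt
```

Because the array is keyed on column 1 only, input order decides which isoform survives a tie. If a different isoform should win (for example the lowest accession number), the file would need to be sorted accordingly before this step.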