help! script to select line with greatest value 2 between columns

Post 302566973 by wolf_blue on Friday 21st of October 2011 03:12:42 PM

still have duplicates

Thanks.
It works, but I realized that some genes appear in the output file more than once because some of their isoforms happen to have the same length.
So now I need to take that file and create a new one with only one line per gene.
I used the code below on the original file.

[Code]
nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt

Output file:
test.txt

gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471

So from this final file, how can I make a file that has only one line per gene?
I would want output.txt to be modified as follows:

Desired final file
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
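
One possible way to get from the file with duplicates to the desired file, as a minimal sketch: keep only the first line seen for each value of column 1. This assumes that whichever isoform comes first is an acceptable pick when lengths tie, and that the fields are whitespace-separated. The file names sample.txt and one_per_gene.txt are made up for the illustration, and the sample holds only a few of the rows shown above. On systems where the post uses nawk, nawk can be substituted for awk.

```shell
# Hypothetical sample input: a few rows copied from the output shown above
cat > sample.txt <<'EOF'
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
EOF

# seen[$1]++ evaluates to 0 the first time a gene name appears, so
# !seen[$1]++ is true only for the first line of each gene.
# The header line passes through too, because "gene" occurs only once.
awk '!seen[$1]++' sample.txt > one_per_gene.txt

cat one_per_gene.txt
```

Because awk prints lines as it reads them, the genes stay in their original order, unlike a `for (i in A)` loop in an END block, whose order is unspecified.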

I hope this is clearer.
Thanks!

Last edited by wolf_blue; 10-21-2011 at 04:27 PM.
