10-21-2011
still have duplicates
Thanks.
It works, but I realized that some genes appear in the output file more than once because some isoforms happen to have the same length.
So now I would have to take that file and create a new one with only one line per gene.
I used the code below on the original file.
[Code]
nawk 'NR<2{next}{c=($NF-$(NF-1))}(!($1 in A))||(c>m[$1]&&($1 in A)){m[$1]=c;A[$1]=$0 FS m[$1]}END{for(i in A) print A[i]}' tst.txt
Output file:
test.txt
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148901 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
MIB2 NM_001170687 chr1 + 1550794 1565990 15196
MIB2 NM_001170686 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
CDK11A NM_033529 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
So from this final file, how can I make a file that has only one line per gene?
I would want output.txt to be modified as:
Desired final file
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
CDK11A NM_024011 chr1 - 1634169 1655791 21622
WASH7P NR_024540 chr1 - 14361 29370 15009
FAM138F NR_026820 chr1 - 34610 36081 1471
FAM138A NR_026818 chr1 - 34610 36081 1471
I hope this is clearer.
Thanks!
Last edited by wolf_blue; 10-21-2011 at 04:27 PM..
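One possible way to get from the file above to the desired one is to keep only the first line seen for each value in column 1. This is a minimal sketch, not a tested answer from the thread. It assumes the fields are whitespace-separated, that the gene name is always column 1, and that keeping the first listed isoform among ties is acceptable (which is what the desired file shows). The array name seen is arbitrary, and the sample rows are a subset of those in the post. The post uses nawk; plain awk is used here.

```shell
# Sample input: a subset of the rows from the post, header included.
cat > test.txt <<'EOF'
gene accession chr chr_st begin end length
TNFRSF18 NM_004195 chr1 - 1138887 1142089 3202
TNFRSF18 NM_148902 chr1 - 1138887 1142089 3202
MIB2 NM_080875 chr1 + 1550794 1565990 15196
MIB2 NM_001170688 chr1 + 1550794 1565990 15196
WASH7P NR_024540 chr1 - 14361 29370 15009
EOF

# seen[$1]++ evaluates to 0 the first time a gene name appears and to a
# positive number afterwards, so the negation is true only for the first
# line of each gene. The header survives because "gene" appears only once.
awk '!seen[$1]++' test.txt > output.txt

cat output.txt
```

Because lines are printed as they are read, the original order of the genes is preserved, unlike the for(i in A) loop in the earlier command, whose iteration order is unspecified.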