07-06-2011
Collapsing and counting by key column in a sorted file
Hi
I have a tab separated file with reads mappings of more than 2 million reads> the file is sorted by ID and looks like the following:
SeqID | Seq | FreqSeq | PosSeq |
HWI-EA332_0036:5:100:10131:16361#ATGC/1 | GACTTGAGGTCTCCCCCGCA | 1 | TZRTMR_40497:317:+ |
HWI-EA332_0036:5:100:10131:16361#ATGC/1 | GACTTGAGGTCTCCCCCGCA | 1 | YSXAZZ_40497:317:+ |
HWI-EA332_0036:5:100:10131:16361#ATGC/1 | GACTTGAGGTCTCCCCCGCA | 1 | AZZOL148119:523:+ |
HWI-EA332_0036:5:100:10131:16361#ATGC/1 | GACTTGAGGTCTCCCCCGCA | 1 | VCXT148119:523:+ |
HWI-EA332_0036:5:100:10554:9799#ATGC/1 | GACTCCTAAATTAACAACAA | 1 | YSXAZZ_35135:573:+ |
HWI-EA332_0036:5:100:10554:9799#ATGC/1 | GACTCCTAAATTAACAACAA | 1 | TZRTMR_35135:573:+ |
HWI-EA332_0036:5:100:10791:13901#ATGC/1 | GACTAGTGAGTGACCCGCTC | 1 | TZRTMR_7034:497:+ |
HWI-EA332_0036:5:100:10791:13901#ATGC/1 | GACTAGTGAGTGACCCGCTC | 1 | YSXAZZ_7034:497:+ |
HWI-EA332_0036:5:100:11825:11517#ATGC/1 | GACTAATATAAATAAGTCTC | 1 | YSXAZZ_3676:148:+ |
HWI-EA332_0036:5:100:11825:11517#ATGC/1 | GACTAATATAAATAAGTCTC | 1 | TZRTMR_3676:148:+ |
HWI-EA332_0036:5:100:11825:11517#ATGC/1 | GACTAATATAAATAAGTCTC | 1 | TZRTMR_2085:139:+ |
HWI-EA332_0036:5:100:11825:11517#ATGC/1 | GACTAATATAAATAAGTCTC | 1 | YSXAZZ_2085:139:+ |
HWI-EA332_0036:5:100:13509:3643#ATGC/1 | GACTACCCGCCAAACCCCAG | 2 | YTTSTZ_566255:526:- |
HWI-EA332_0036:5:100:13509:3643#ATGC/1 | GACTACCCGCCAAACCCCAG | 2 | YYTWQ_566255:526:- |
HWI-EA332_0036:5:100:13837:1118#ATGC/1 | GACTCCCGCATCCCGCAAAC | 1 | PZXVXZ_21909:999:+ |
HWI-EA332_0036:5:100:13837:1118#ATGC/1 | GACTCCCGCATCCCGCAAAC | 1 | TZRTMR_21909:999:+ |
And I would like to collapse the columns [3,4] having the same ID to get a table like this :
SeqID | Seq | FreqSeq | PosSeq | N_UpT |
HWI-EA332_0036:5:100:10131:16361#ATGC/1 | GACTTGAGGTCTCCCCCGCA | 1,1,1,1 | TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+ | 4 |
HWI-EA332_0036:5:100:10554:9799#ATGC/1 | GACTCCTAAATTAACAACAA | 1,1 | YSXAZZ_35135:573:+,TZRTMR_35135:573:+ | 2 |
HWI-EA332_0036:5:100:10791:13901#ATGC/1 | GACTAGTGAGTGACCCGCTC | 1,1 | TZRTMR_7034:497:+,YSXAZZ_7034:497:+ | 2 |
HWI-EA332_0036:5:100:11825:11517#ATGC/1 | GACTAATATAAATAAGTCTC | 1,1,1,1 | YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+ | 4 |
HWI-EA332_0036:5:100:13509:3643#ATGC/1 | GACTACCCGCCAAACCCCAG | 2,2 | YTTSTZ_566255:526:-,YYTWQ_566255:526:- | 2 |
HWI-EA332_0036:5:100:13837:1118#ATGC/1 | GACTCCCGCATCCCGCAAAC | 1,1 | PZXVXZ_21909:999:+,TZRTMR_21909:999:+ | 2 |
The column 2 is unique as well and the N_UpT is the count of number of lines matched by the id.
Any help idea about the best way of doing this (in AWK ? Perl ?) would be much appreciated.
Thanks in advance for the help and suggestions.
Best,
Ramzi
10 More Discussions You Might Find Interesting
1. Shell Programming and Scripting
I'm new of UNIX shell scripting. I'm recently generating a excel report in UNIX(file with delimiter is fine). How should I make a script to do it?
1 file to join comes from output of one UNIX command, the second from another UNIX command, and third from a database query. The key columes of all... (7 Replies)
Discussion started by: bigsmile
7 Replies
2. Shell Programming and Scripting
I have multiple files; each file contains a certain data in a column view
simply i want to combine all those files into one file in columns
example
file1:
a
b
c
d
file 2:
1
2
3
4
file 3:
G (4 Replies)
Discussion started by: ahmedamro
4 Replies
3. Shell Programming and Scripting
Given a file such as this I need to remove the duplicates.
00060011 PAUL BOWSTEIN ad_waq3_921_20100826_010517.txt
00060011 PAUL BOWSTEIN ad_waq3_921_20100827_010528.txt
0624-01 RUT CORPORATION ad_sade3_10_20100827_010528.txt
0624-01 RUT CORPORATION ... (13 Replies)
Discussion started by: script_op2a
13 Replies
4. Shell Programming and Scripting
Hi guys!
I have a problem writing script that would convert this input
into this output:
I have an awk script that counts occurences of a sign in a column, but don't know how to change it so that I would give me desired output.
script
awk '{count++}END{for(j in count)... (2 Replies)
Discussion started by: grincz
2 Replies
5. Shell Programming and Scripting
Hello,
I am using Awk in UBUNTU 12.04.
I have a file as follows with 2172 rows and 44707 columns. ABO and GPO are the names of my populations.
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2
GPO_1 1 1
GPO_1 2 2
GPO_2 1 0
GPO_2 2 0I want to count the number of 1s and 2s in... (7 Replies)
Discussion started by: Homa
7 Replies
6. Shell Programming and Scripting
Hi i have a file delimited with ","as below and i need to handle scenario like col1,col2 fields have special character '|', i need count of special character value column wise as given in col3 and col4. pls help me to reslove this.
Source file
name,col1,col2,col3,col4
one,2,3
two,2|3,2|3... (2 Replies)
Discussion started by: Ganesh L
2 Replies
7. Shell Programming and Scripting
Hello,
I have a file as follows:
ENSGALG00000000189
ENSGALG00000000189
ENSGALG00000000189
ENSGALG00000000215
ENSGALG00000000215
ENSGALG00000000218 ... (5 Replies)
Discussion started by: Homa
5 Replies
8. Shell Programming and Scripting
Hi,
I have a input file which contains following data
0 1
0 2
0 3
0 4
0 8
0 9
0 11
1 1
1 2
1 6
1 7
1 8
1 9
2 1
2 11
2 12 (12 Replies)
Discussion started by: Ryan Kim
12 Replies
9. Shell Programming and Scripting
Hi All ,
I got stuck on the below scenario.If anyone can help me ,that will be really helpful.
I have a target hdfs file layout.I need to know the no of column in that file.
Target_RECRD_layout
{
ABC_ID EN NOTNULLABLE,
ABC_COUNTRY CHARACTER ENCODING ASCII NOTNULLABLE,
... (5 Replies)
Discussion started by: STCET22
5 Replies
10. UNIX for Dummies Questions & Answers
I have a file that looks like this:
BC00001 GA 2 2 3 3 2 5 1 5 3 3 2 4 ... (6 Replies)
Discussion started by: Xterra
6 Replies