Collapsing and counting by key column in a sorted file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Collapsing and counting by key column in a sorted file
# 1  
Old 07-06-2011
Collapsing and counting by key column in a sorted file

Hi
I have a tab separated file with reads mappings of more than 2 million reads> the file is sorted by ID and looks like the following:

SeqIDSeqFreqSeqPosSeq
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1TZRTMR_40497:317:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1YSXAZZ_40497:317:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1AZZOL148119:523:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1VCXT148119:523:+
HWI-EA332_0036:5:100:10554:9799#ATGC/1GACTCCTAAATTAACAACAA1YSXAZZ_35135:573:+
HWI-EA332_0036:5:100:10554:9799#ATGC/1GACTCCTAAATTAACAACAA1TZRTMR_35135:573:+
HWI-EA332_0036:5:100:10791:13901#ATGC/1GACTAGTGAGTGACCCGCTC1TZRTMR_7034:497:+
HWI-EA332_0036:5:100:10791:13901#ATGC/1GACTAGTGAGTGACCCGCTC1YSXAZZ_7034:497:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1YSXAZZ_3676:148:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1TZRTMR_3676:148:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1TZRTMR_2085:139:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1YSXAZZ_2085:139:+
HWI-EA332_0036:5:100:13509:3643#ATGC/1GACTACCCGCCAAACCCCAG2YTTSTZ_566255:526:-
HWI-EA332_0036:5:100:13509:3643#ATGC/1GACTACCCGCCAAACCCCAG2YYTWQ_566255:526:-
HWI-EA332_0036:5:100:13837:1118#ATGC/1GACTCCCGCATCCCGCAAAC1PZXVXZ_21909:999:+
HWI-EA332_0036:5:100:13837:1118#ATGC/1GACTCCCGCATCCCGCAAAC1TZRTMR_21909:999:+

And I would like to collapse the columns [3,4] having the same ID to get a table like this :
SeqIDSeqFreqSeqPosSeqN_UpT
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1,1,1,1TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+4
HWI-EA332_0036:5:100:10554:9799#ATGC/1GACTCCTAAATTAACAACAA1,1YSXAZZ_35135:573:+,TZRTMR_35135:573:+2
HWI-EA332_0036:5:100:10791:13901#ATGC/1GACTAGTGAGTGACCCGCTC1,1TZRTMR_7034:497:+,YSXAZZ_7034:497:+2
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1,1,1,1YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+4
HWI-EA332_0036:5:100:13509:3643#ATGC/1GACTACCCGCCAAACCCCAG2,2YTTSTZ_566255:526:-,YYTWQ_566255:526:-2
HWI-EA332_0036:5:100:13837:1118#ATGC/1GACTCCCGCATCCCGCAAAC1,1PZXVXZ_21909:999:+,TZRTMR_21909:999:+2

The column 2 is unique as well and the N_UpT is the count of number of lines matched by the id.


Any help idea about the best way of doing this (in AWK ? Perl ?) would be much appreciated.
Thanks in advance for the help and suggestions.
Best,
Ramzi
# 2  
Old 07-06-2011
Well, thanks for the detailed explanation and the pretty tables!
Here, try this awk solution:
Code:
awk '{
  if($1 in ps){
    ps[$1]=ps[$1]","$4; 
    fs[$1]=fs[$1]","$3
  } else {
     i[cnt++]=$1; 
     f2[$1]=$2; 
     ps[$1]=$4; 
     fs[$1]=$3
  } 
  mult[$1]++
}
END{
   n=asort(i);
   for(j=1; j<=n; j++) 
      print i[j] " " f2[i[j]] " " fs[i[j]] " " ps[i[j]]" "mult[i[j]];
}' mappings.txt 
HWI-EA332_0036:5:100:10131:16361#ATGC/1 GACTTGAGGTCTCCCCCGCA 1,1,1,1 TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+ 4
HWI-EA332_0036:5:100:10554:9799#ATGC/1 GACTCCTAAATTAACAACAA 1,1 YSXAZZ_35135:573:+,TZRTMR_35135:573:+ 2
HWI-EA332_0036:5:100:10791:13901#ATGC/1 GACTAGTGAGTGACCCGCTC 1,1 TZRTMR_7034:497:+,YSXAZZ_7034:497:+ 2
HWI-EA332_0036:5:100:11825:11517#ATGC/1 GACTAATATAAATAAGTCTC 1,1,1,1 YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+ 4
HWI-EA332_0036:5:100:13509:3643#ATGC/1 GACTACCCGCCAAACCCCAG 2,2 YTTSTZ_566255:526:-,YYTWQ_566255:526:- 2
HWI-EA332_0036:5:100:13837:1118#ATGC/1 GACTCCCGCATCCCGCAAAC 1,1 PZXVXZ_21909:999:+,TZRTMR_21909:999:+ 2

Use nawk on Solaris

Last edited by mirni; 07-06-2011 at 04:59 PM.. Reason: nawk comment
# 3  
Old 07-07-2011
Dear Mirni,
Thanks for your quick answer and the nice solution
It work pretty well and very fast, I just had to add a line with column header and that's all.
I'm wondering if it's possible to tune the code to get rid of the for loop as the data is already sorted by seqID ?
once again thanks for your help.
Best,
Ramzi
# 4  
Old 07-07-2011
The deal with associative arrays in awk is that they come out in pretty much unpredictable order, not in the order they were being added to the structure. So, if you care to have the order of entries kept the same, you have to use something like you see there -- an auxilliary array (i) that stores the keys ($1):
Code:
i[cnt++]=$1;

Then, you can retrieve the original order, by sorting this aux array, and thus you'll get the original order of keys.
Hope this makes sense, if you want more, just look up associative arrays in awk.

Perhaps there would be a different way to approach the problem, without the use of as. arrays, and use the fact that they are sorted already....


Like this:
Code:
awk '
  $1==last{
    third=third","$3; 
    fourth=fourth","$4; 
    cnt++
  }
  $1!=last{
    if(last)
       print last" "second" "third" "fourth" "cnt; 
    last=$1; 
    second=$2; 
    third=$3; 
    fourth=$4;
    cnt=1;
}' mappings.txt
HWI-EA332_0036:5:100:10131:16361#ATGC/1 GACTTGAGGTCTCCCCCGCA 1,1,1,1 TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+ 4
HWI-EA332_0036:5:100:10554:9799#ATGC/1 GACTCCTAAATTAACAACAA 1,1 YSXAZZ_35135:573:+,TZRTMR_35135:573:+ 2
HWI-EA332_0036:5:100:10791:13901#ATGC/1 GACTAGTGAGTGACCCGCTC 1,1 TZRTMR_7034:497:+,YSXAZZ_7034:497:+ 2
HWI-EA332_0036:5:100:11825:11517#ATGC/1 GACTAATATAAATAAGTCTC 1,1,1,1 YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+ 4
HWI-EA332_0036:5:100:13509:3643#ATGC/1 GACTACCCGCCAAACCCCAG 2,2 YTTSTZ_566255:526:-,YYTWQ_566255:526:- 2


Last edited by mirni; 07-07-2011 at 09:24 AM..
# 5  
Old 07-07-2011
Thanks for the explanations, I'll try to find out some nice documentation to better understand associative arrays, my knowledge are not that broad in awk but as i'm happy with the results a should invest more time on it. Best, Ramzi
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Collapsing similar strings

I have a file that looks like this: BC00001 GA 2 2 3 3 2 5 1 5 3 3 2 4 ... (6 Replies)
Discussion started by: Xterra
6 Replies

2. Shell Programming and Scripting

Need help of counting no of column of a file

Hi All , I got stuck on the below scenario.If anyone can help me ,that will be really helpful. I have a target hdfs file layout.I need to know the no of column in that file. Target_RECRD_layout { ABC_ID EN NOTNULLABLE, ABC_COUNTRY CHARACTER ENCODING ASCII NOTNULLABLE, ... (5 Replies)
Discussion started by: STCET22
5 Replies

3. Shell Programming and Scripting

Counting a consecutive number in column 2

Hi, I have a input file which contains following data 0 1 0 2 0 3 0 4 0 8 0 9 0 11 1 1 1 2 1 6 1 7 1 8 1 9 2 1 2 11 2 12 (12 Replies)
Discussion started by: Ryan Kim
12 Replies

4. Shell Programming and Scripting

Counting the number of element in each column

Hello, I have a file as follows: ENSGALG00000000189 ENSGALG00000000189 ENSGALG00000000189 ENSGALG00000000215 ENSGALG00000000215 ENSGALG00000000218 ... (5 Replies)
Discussion started by: Homa
5 Replies

5. Shell Programming and Scripting

Counting no of spl character occurance column wise

Hi i have a file delimited with ","as below and i need to handle scenario like col1,col2 fields have special character '|', i need count of special character value column wise as given in col3 and col4. pls help me to reslove this. Source file name,col1,col2,col3,col4 one,2,3 two,2|3,2|3... (2 Replies)
Discussion started by: Ganesh L
2 Replies

6. Shell Programming and Scripting

Help with awk in counting characters based on a column

Hello, I am using Awk in UBUNTU 12.04. I have a file as follows with 2172 rows and 44707 columns. ABO and GPO are the names of my populations. ABO_1 1 2 ABO_1 1 2 ABO_2 1 1 ABO_2 1 2 GPO_1 1 1 GPO_1 2 2 GPO_2 1 0 GPO_2 2 0I want to count the number of 1s and 2s in... (7 Replies)
Discussion started by: Homa
7 Replies

7. Shell Programming and Scripting

Counting occurences in column

Hi guys! I have a problem writing script that would convert this input into this output: I have an awk script that counts occurences of a sign in a column, but don't know how to change it so that I would give me desired output. script awk '{count++}END{for(j in count)... (2 Replies)
Discussion started by: grincz
2 Replies

8. Shell Programming and Scripting

need to remove duplicates based on key in first column and pattern in last column

Given a file such as this I need to remove the duplicates. 00060011 PAUL BOWSTEIN ad_waq3_921_20100826_010517.txt 00060011 PAUL BOWSTEIN ad_waq3_921_20100827_010528.txt 0624-01 RUT CORPORATION ad_sade3_10_20100827_010528.txt 0624-01 RUT CORPORATION ... (13 Replies)
Discussion started by: script_op2a
13 Replies

9. Shell Programming and Scripting

combine multiple files by column into one files already sorted!

I have multiple files; each file contains a certain data in a column view simply i want to combine all those files into one file in columns example file1: a b c d file 2: 1 2 3 4 file 3: G (4 Replies)
Discussion started by: ahmedamro
4 Replies

10. Shell Programming and Scripting

Join 3 files using key column in a mapping file

I'm new of UNIX shell scripting. I'm recently generating a excel report in UNIX(file with delimiter is fine). How should I make a script to do it? 1 file to join comes from output of one UNIX command, the second from another UNIX command, and third from a database query. The key columes of all... (7 Replies)
Discussion started by: bigsmile
7 Replies
Login or Register to Ask a Question