Collapsing and counting by key column in a sorted file

07-06-2011

Hi
I have a tab-separated file of read mappings for more than 2 million reads. The file is sorted by ID and looks like this:

SeqID	Seq	FreqSeq	PosSeq
HWI-EA332_0036:5:100:10131:16361#ATGC/1	GACTTGAGGTCTCCCCCGCA	1	TZRTMR_40497:317:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1	GACTTGAGGTCTCCCCCGCA	1	YSXAZZ_40497:317:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1	GACTTGAGGTCTCCCCCGCA	1	AZZOL148119:523:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1	GACTTGAGGTCTCCCCCGCA	1	VCXT148119:523:+
HWI-EA332_0036:5:100:10554:9799#ATGC/1	GACTCCTAAATTAACAACAA	1	YSXAZZ_35135:573:+
HWI-EA332_0036:5:100:10554:9799#ATGC/1	GACTCCTAAATTAACAACAA	1	TZRTMR_35135:573:+
HWI-EA332_0036:5:100:10791:13901#ATGC/1	GACTAGTGAGTGACCCGCTC	1	TZRTMR_7034:497:+
HWI-EA332_0036:5:100:10791:13901#ATGC/1	GACTAGTGAGTGACCCGCTC	1	YSXAZZ_7034:497:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1	GACTAATATAAATAAGTCTC	1	YSXAZZ_3676:148:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1	GACTAATATAAATAAGTCTC	1	TZRTMR_3676:148:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1	GACTAATATAAATAAGTCTC	1	TZRTMR_2085:139:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1	GACTAATATAAATAAGTCTC	1	YSXAZZ_2085:139:+
HWI-EA332_0036:5:100:13509:3643#ATGC/1	GACTACCCGCCAAACCCCAG	2	YTTSTZ_566255:526:-
HWI-EA332_0036:5:100:13509:3643#ATGC/1	GACTACCCGCCAAACCCCAG	2	YYTWQ_566255:526:-
HWI-EA332_0036:5:100:13837:1118#ATGC/1	GACTCCCGCATCCCGCAAAC	1	PZXVXZ_21909:999:+
HWI-EA332_0036:5:100:13837:1118#ATGC/1	GACTCCCGCATCCCGCAAAC	1	TZRTMR_21909:999:+

I would like to collapse columns 3 and 4 for rows that share the same ID, to get a table like this:

SeqID	Seq	FreqSeq	PosSeq	N_UpT
HWI-EA332_0036:5:100:10131:16361#ATGC/1	GACTTGAGGTCTCCCCCGCA	1,1,1,1	TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+	4
HWI-EA332_0036:5:100:10554:9799#ATGC/1	GACTCCTAAATTAACAACAA	1,1	YSXAZZ_35135:573:+,TZRTMR_35135:573:+	2
HWI-EA332_0036:5:100:10791:13901#ATGC/1	GACTAGTGAGTGACCCGCTC	1,1	TZRTMR_7034:497:+,YSXAZZ_7034:497:+	2
HWI-EA332_0036:5:100:11825:11517#ATGC/1	GACTAATATAAATAAGTCTC	1,1,1,1	YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+	4
HWI-EA332_0036:5:100:13509:3643#ATGC/1	GACTACCCGCCAAACCCCAG	2,2	YTTSTZ_566255:526:-,YYTWQ_566255:526:-	2
HWI-EA332_0036:5:100:13837:1118#ATGC/1	GACTCCCGCATCCCGCAAAC	1,1	PZXVXZ_21909:999:+,TZRTMR_21909:999:+	2

Column 2 is the same for every row with a given ID, and N_UpT is the number of lines matching that ID.

Any help or ideas about the best way of doing this (in AWK? Perl?) would be much appreciated.
Thanks in advance for the help and suggestions.
Best,
Ramzi

ramouz87

07-06-2011

Well, thanks for the detailed explanation and the pretty tables!
Here, try this awk solution:

Code:

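awk '{
  if($1 in ps){
    # key seen before: append columns 4 and 3 to the lists
    ps[$1]=ps[$1]","$4;
    fs[$1]=fs[$1]","$3
  } else {
    # first occurrence: remember the key, column 2, and start the lists
    i[cnt++]=$1;
    f2[$1]=$2;
    ps[$1]=$4;
    fs[$1]=$3
  }
  mult[$1]++   # number of lines per key
}
END{
   n=asort(i);   # asort() is a gawk extension; it sorts the keys by value
   for(j=1; j<=n; j++)
      print i[j] " " f2[i[j]] " " fs[i[j]] " " ps[i[j]]" "mult[i[j]];
}' mappings.txt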
HWI-EA332_0036:5:100:10131:16361#ATGC/1 GACTTGAGGTCTCCCCCGCA 1,1,1,1 TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+ 4
HWI-EA332_0036:5:100:10554:9799#ATGC/1 GACTCCTAAATTAACAACAA 1,1 YSXAZZ_35135:573:+,TZRTMR_35135:573:+ 2
HWI-EA332_0036:5:100:10791:13901#ATGC/1 GACTAGTGAGTGACCCGCTC 1,1 TZRTMR_7034:497:+,YSXAZZ_7034:497:+ 2
HWI-EA332_0036:5:100:11825:11517#ATGC/1 GACTAATATAAATAAGTCTC 1,1,1,1 YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+ 4
HWI-EA332_0036:5:100:13509:3643#ATGC/1 GACTACCCGCCAAACCCCAG 2,2 YTTSTZ_566255:526:-,YYTWQ_566255:526:- 2
HWI-EA332_0036:5:100:13837:1118#ATGC/1 GACTCCCGCATCCCGCAAAC 1,1 PZXVXZ_21909:999:+,TZRTMR_21909:999:+ 2

Note that asort() is a GNU awk extension, so this needs gawk; nawk on Solaris does not provide asort().

Last edited by mirni; 07-06-2011 at 04:59 PM.. Reason: nawk comment

mirni

07-07-2011

Dear Mirni,
Thanks for your quick answer and the nice solution.
It works well and is very fast; I just had to add a line with the column headers and that's all.
I'm wondering whether it's possible to tune the code to get rid of the for loop, as the data is already sorted by SeqID?
Once again, thanks for your help.
Best,
Ramzi

ramouz87

07-07-2011

The deal with associative arrays in awk is that a for (key in array) loop returns the entries in a pretty much unpredictable order, not in the order they were added. So, if you want to keep the order of entries the same, you have to use something like what you see there: an auxiliary array (i) that stores the keys ($1):

Code:

i[cnt++]=$1;

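Then you can walk this aux array to get the keys back in a known order. Note that asort(i) sorts the keys by value, so it reproduces the original order here only because the file is already sorted by SeqID; to keep an arbitrary input order you would loop over i[0] ... i[cnt-1] directly instead of sorting.
Hope this makes sense; if you want more, just look up associative arrays in awk.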
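
To illustrate the point about ordering, here is a minimal sketch (not part of the solution above) that keeps keys in first-seen order with an index array and no asort(), so it should also run on awks other than gawk. The sample input and the array names order and vals are made up for the illustration, and the keys are deliberately not sorted:

```shell
# Hypothetical two-column input; keys appear in the order b, a, c.
result=$(printf 'b 1\na 2\nb 3\nc 4\na 5\n' | awk '
  # first time this key is seen: record its position and start its list
  !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }
  # later occurrences: append to the existing list
  { vals[$1] = vals[$1] "," $2 }
  # walk the index array 1..n, which preserves first-seen order
  END { for (j = 1; j <= n; j++) print order[j], vals[order[j]] }
')
printf '%s\n' "$result"
```

The keys come out as b, a, c, which is the order in which they first appeared rather than sorted order.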

Perhaps there is a different way to approach the problem, without associative arrays, that uses the fact that the lines are already sorted...

Like this:

Code:

awk '
  $1==last{
    # same key as the previous line: append and count
    third=third","$3;
    fourth=fourth","$4;
    cnt++
  }
  $1!=last{
    # new key: print the finished group, then start a new one
    if(last)
       print last" "second" "third" "fourth" "cnt;
    last=$1;
    second=$2;
    third=$3;
    fourth=$4;
    cnt=1;
  }
  END{
    # print the final group, which no later line flushes
    if(last)
       print last" "second" "third" "fourth" "cnt;
  }' mappings.txt
HWI-EA332_0036:5:100:10131:16361#ATGC/1 GACTTGAGGTCTCCCCCGCA 1,1,1,1 TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+ 4
HWI-EA332_0036:5:100:10554:9799#ATGC/1 GACTCCTAAATTAACAACAA 1,1 YSXAZZ_35135:573:+,TZRTMR_35135:573:+ 2
HWI-EA332_0036:5:100:10791:13901#ATGC/1 GACTAGTGAGTGACCCGCTC 1,1 TZRTMR_7034:497:+,YSXAZZ_7034:497:+ 2
HWI-EA332_0036:5:100:11825:11517#ATGC/1 GACTAATATAAATAAGTCTC 1,1,1,1 YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+ 4
HWI-EA332_0036:5:100:13509:3643#ATGC/1 GACTACCCGCCAAACCCCAG 2,2 YTTSTZ_566255:526:-,YYTWQ_566255:526:- 2
HWI-EA332_0036:5:100:13837:1118#ATGC/1 GACTCCCGCATCCCGCAAAC 1,1 PZXVXZ_21909:999:+,TZRTMR_21909:999:+ 2
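
Since the requested table is tab-separated and has a header with an extra N_UpT column, a variant of the streaming approach along these lines might be closer to the desired output. This is only a sketch: it assumes the input starts with a header line as in the first post, and the tiny sample data and the variable names seq, freq and pos are invented for the example.

```shell
# Small made-up sample in the thread's layout: tab-separated, header, sorted by SeqID.
result=$(printf 'SeqID\tSeq\tFreqSeq\tPosSeq\nr1\tACGT\t1\tp1:10:+\nr1\tACGT\t1\tp2:10:+\nr2\tGGCC\t2\tp3:7:-\n' | awk '
  BEGIN { FS = OFS = "\t" }                      # read and write tabs
  NR == 1 { print $0, "N_UpT"; next }            # pass the header through, adding the count column
  $1 != last {                                   # a new SeqID starts
    if (NR > 2) print last, seq, freq, pos, cnt  # flush the previous group
    last = $1; seq = $2; freq = $3; pos = $4; cnt = 1
    next
  }
  { freq = freq "," $3; pos = pos "," $4; cnt++ }      # same SeqID: append and count
  END { if (NR > 1) print last, seq, freq, pos, cnt }  # flush the final group
')
printf '%s\n' "$result"
```

With a real file, the printf pipeline would be dropped and the file name (mappings.txt in this thread) passed to awk after the program. Only one group is held in memory at a time, which suits a file of a couple of million lines.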

Last edited by mirni; 07-07-2011 at 09:24 AM..

mirni

07-07-2011

Thanks for the explanations. I'll try to find some good documentation to better understand associative arrays. My knowledge of awk is not that broad, but as I'm happy with the results I should invest more time in it.
Best,
Ramzi

ramouz87
