Sponsored Content
Top Forums Shell Programming and Scripting Collapsing and counting by key column in a sorted file Post 302536924 by ramouz87 on Wednesday 6th of July 2011 01:38:51 PM
Old 07-06-2011
Collapsing and counting by key column in a sorted file

Hi
I have a tab separated file with reads mappings of more than 2 million reads> the file is sorted by ID and looks like the following:

SeqIDSeqFreqSeqPosSeq
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1TZRTMR_40497:317:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1YSXAZZ_40497:317:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1AZZOL148119:523:+
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1VCXT148119:523:+
HWI-EA332_0036:5:100:10554:9799#ATGC/1GACTCCTAAATTAACAACAA1YSXAZZ_35135:573:+
HWI-EA332_0036:5:100:10554:9799#ATGC/1GACTCCTAAATTAACAACAA1TZRTMR_35135:573:+
HWI-EA332_0036:5:100:10791:13901#ATGC/1GACTAGTGAGTGACCCGCTC1TZRTMR_7034:497:+
HWI-EA332_0036:5:100:10791:13901#ATGC/1GACTAGTGAGTGACCCGCTC1YSXAZZ_7034:497:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1YSXAZZ_3676:148:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1TZRTMR_3676:148:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1TZRTMR_2085:139:+
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1YSXAZZ_2085:139:+
HWI-EA332_0036:5:100:13509:3643#ATGC/1GACTACCCGCCAAACCCCAG2YTTSTZ_566255:526:-
HWI-EA332_0036:5:100:13509:3643#ATGC/1GACTACCCGCCAAACCCCAG2YYTWQ_566255:526:-
HWI-EA332_0036:5:100:13837:1118#ATGC/1GACTCCCGCATCCCGCAAAC1PZXVXZ_21909:999:+
HWI-EA332_0036:5:100:13837:1118#ATGC/1GACTCCCGCATCCCGCAAAC1TZRTMR_21909:999:+

And I would like to collapse the columns [3,4] having the same ID to get a table like this :
SeqIDSeqFreqSeqPosSeqN_UpT
HWI-EA332_0036:5:100:10131:16361#ATGC/1GACTTGAGGTCTCCCCCGCA1,1,1,1TZRTMR_40497:317:+,YSXAZZ_40497:317:+,AZZOL148119:523:+,VCXT148119:523:+4
HWI-EA332_0036:5:100:10554:9799#ATGC/1GACTCCTAAATTAACAACAA1,1YSXAZZ_35135:573:+,TZRTMR_35135:573:+2
HWI-EA332_0036:5:100:10791:13901#ATGC/1GACTAGTGAGTGACCCGCTC1,1TZRTMR_7034:497:+,YSXAZZ_7034:497:+2
HWI-EA332_0036:5:100:11825:11517#ATGC/1GACTAATATAAATAAGTCTC1,1,1,1YSXAZZ_3676:148:+,TZRTMR_3676:148:+,TZRTMR_2085:139:+,YSXAZZ_2085:139:+4
HWI-EA332_0036:5:100:13509:3643#ATGC/1GACTACCCGCCAAACCCCAG2,2YTTSTZ_566255:526:-,YYTWQ_566255:526:-2
HWI-EA332_0036:5:100:13837:1118#ATGC/1GACTCCCGCATCCCGCAAAC1,1PZXVXZ_21909:999:+,TZRTMR_21909:999:+2

The column 2 is unique as well and the N_UpT is the count of number of lines matched by the id.


Any help idea about the best way of doing this (in AWK ? Perl ?) would be much appreciated.
Thanks in advance for the help and suggestions.
Best,
Ramzi
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Join 3 files using key column in a mapping file

I'm new of UNIX shell scripting. I'm recently generating a excel report in UNIX(file with delimiter is fine). How should I make a script to do it? 1 file to join comes from output of one UNIX command, the second from another UNIX command, and third from a database query. The key columes of all... (7 Replies)
Discussion started by: bigsmile
7 Replies

2. Shell Programming and Scripting

combine multiple files by column into one files already sorted!

I have multiple files; each file contains a certain data in a column view simply i want to combine all those files into one file in columns example file1: a b c d file 2: 1 2 3 4 file 3: G (4 Replies)
Discussion started by: ahmedamro
4 Replies

3. Shell Programming and Scripting

need to remove duplicates based on key in first column and pattern in last column

Given a file such as this I need to remove the duplicates. 00060011 PAUL BOWSTEIN ad_waq3_921_20100826_010517.txt 00060011 PAUL BOWSTEIN ad_waq3_921_20100827_010528.txt 0624-01 RUT CORPORATION ad_sade3_10_20100827_010528.txt 0624-01 RUT CORPORATION ... (13 Replies)
Discussion started by: script_op2a
13 Replies

4. Shell Programming and Scripting

Counting occurences in column

Hi guys! I have a problem writing script that would convert this input into this output: I have an awk script that counts occurences of a sign in a column, but don't know how to change it so that I would give me desired output. script awk '{count++}END{for(j in count)... (2 Replies)
Discussion started by: grincz
2 Replies

5. Shell Programming and Scripting

Help with awk in counting characters based on a column

Hello, I am using Awk in UBUNTU 12.04. I have a file as follows with 2172 rows and 44707 columns. ABO and GPO are the names of my populations. ABO_1 1 2 ABO_1 1 2 ABO_2 1 1 ABO_2 1 2 GPO_1 1 1 GPO_1 2 2 GPO_2 1 0 GPO_2 2 0I want to count the number of 1s and 2s in... (7 Replies)
Discussion started by: Homa
7 Replies

6. Shell Programming and Scripting

Counting no of spl character occurance column wise

Hi i have a file delimited with ","as below and i need to handle scenario like col1,col2 fields have special character '|', i need count of special character value column wise as given in col3 and col4. pls help me to reslove this. Source file name,col1,col2,col3,col4 one,2,3 two,2|3,2|3... (2 Replies)
Discussion started by: Ganesh L
2 Replies

7. Shell Programming and Scripting

Counting the number of element in each column

Hello, I have a file as follows: ENSGALG00000000189 ENSGALG00000000189 ENSGALG00000000189 ENSGALG00000000215 ENSGALG00000000215 ENSGALG00000000218 ... (5 Replies)
Discussion started by: Homa
5 Replies

8. Shell Programming and Scripting

Counting a consecutive number in column 2

Hi, I have a input file which contains following data 0 1 0 2 0 3 0 4 0 8 0 9 0 11 1 1 1 2 1 6 1 7 1 8 1 9 2 1 2 11 2 12 (12 Replies)
Discussion started by: Ryan Kim
12 Replies

9. Shell Programming and Scripting

Need help of counting no of column of a file

Hi All , I got stuck on the below scenario.If anyone can help me ,that will be really helpful. I have a target hdfs file layout.I need to know the no of column in that file. Target_RECRD_layout { ABC_ID EN NOTNULLABLE, ABC_COUNTRY CHARACTER ENCODING ASCII NOTNULLABLE, ... (5 Replies)
Discussion started by: STCET22
5 Replies

10. UNIX for Dummies Questions & Answers

Collapsing similar strings

I have a file that looks like this: BC00001 GA 2 2 3 3 2 5 1 5 3 3 2 4 ... (6 Replies)
Discussion started by: Xterra
6 Replies
All times are GMT -4. The time now is 09:13 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy