Merging Frequencies in a File


 
# 1  
Old 03-23-2011

Hello,
I have a file which has the following structure:
word <TAB> frequency
The same word can have multiple frequencies:
John <TAB> 60
John <TAB> 20
John <TAB> 30
Mary <TAB> 1000
Mary <TAB> 800
Mary <TAB> 20
What I need is a script which merges all the frequencies for a word into one single frequency. The output would be
John <TAB> 110
Mary <TAB> 1820
I have written a program in C which does this, but it is agonizingly slow, since the number of such instances is 100,000.
Could anybody help me with a perl or awk script which would do the job faster? I am a tyro in awk and perl, hence the request.
Many thanks in advance
# 2  
Old 03-23-2011
This seems pretty quick: tested on a 130,000-record file, it took about 2 seconds.

Code:
awk '{T[$1]+=$2} END { for(u in T) print u"\t"T[u] }' FS="\t" infile

BTW, C should be faster still; make sure you store your words (John, Mary, etc.) in a B+tree or some other kind of indexed structure.
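
For anyone new to awk, here is the same idea spelled out with comments. It is only a sketch: it assumes the input really is tab-separated as described in the first post, and the file name sample.tsv is made up for the example.

```shell
# Sample input in the word<TAB>frequency layout from the first post.
# The file name sample.tsv is hypothetical, used only for this sketch.
printf 'John\t60\nJohn\t20\nJohn\t30\nMary\t1000\nMary\t800\nMary\t20\n' > sample.tsv

# -F '\t' sets the field separator to a tab before any input is read.
# T is an associative array: T[word] holds the running total for that word.
merged=$(awk -F '\t' '
    { T[$1] += $2 }                          # add the frequency to the total for this word
    END { for (u in T) print u "\t" T[u] }   # one output line per distinct word
' sample.tsv | sort)

printf '%s\n' "$merged"
```

The loop order of `for (u in T)` is unspecified in awk, so the sketch pipes the result through sort to get a stable order. On the sample it should print John with 110 and Mary with 1820.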
This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 03-23-2011
Hello,
Many thanks, but the script did not give the correct result.
Here is a sample of the input:
aabha 21
aabha 69
aabha 90
aabid 115
aabid 14
aabid 215
aabid 86
aabida 121
aabida 42
aabida 79
aadesa 197
aadesh 52
aadil 31
aaditykumar 69
aaekar 253
aaekaranata 86
aaekaranath 66
aaemakar 315
aaemapal 204
aaemaprakas 145
aaemaprakasa 2500
aaemaprakash 1494
aaemavati 754
aaenkar 177
aaereelal 333
aaeri 358
aafat 66
aafatab 192
aaftab 35
aagani 229
aagani 278
aagani 49

The output was:

aabha 21
aabha 69
aabha 90
aabid 115
aabid 14
aabid 215
aabid 86
aabida 121
aabida 42
aabida 79
aadesa 197
aadesh 52
aadil 31
aaditykumar 69
aaekar 253
aaekaranata 86
aaekaranath 66
aaemakar 315
aaemapal 204
aaemaprakas 145
aaemaprakasa 2500
aaemaprakash 1494
aaemavati 754
aaenkar 177
aaereelal 333
aaeri 358
aafat 66
aafatab 192
aaftab 35
aagani 229
aagani 278
aagani 49
As you can see, the frequencies did not merge.
The program is blazingly fast but the merging does not take place.
Many thanks,
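
One likely explanation, assuming the real file is separated by spaces rather than tabs (the sample above appears to be): with FS set to a tab, awk sees each whole line as a single field, so every line becomes its own key and nothing is summed. A quick way to check is to count fields both ways. The file name spaces.txt is made up for this sketch.

```shell
# Two lines separated by a single space, as the sample above appears to be.
# The file name spaces.txt is hypothetical.
printf 'aabha 21\naabha 69\n' > spaces.txt

# With FS set to a tab, the whole line is field 1 and field 2 is empty.
tab_nf=$(awk -F '\t' '{ print NF }' spaces.txt | sort -u)

# With the default FS, awk splits on any run of spaces or tabs.
default_nf=$(awk '{ print NF }' spaces.txt | sort -u)

echo "fields with FS=tab: $tab_nf, with default FS: $default_nf"

# Default splitting therefore lets the two lines share one key and be summed.
awk '{ T[$1] += $2 } END { for (u in T) print u, T[u] }' spaces.txt
```

If the first count is 1 on the real file, dropping the FS="\t" assignment should be enough, since the default separator handles both spaces and tabs.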
# 4  
Old 03-23-2011
Hi gimley,

I hope this is what you need. Try:

Code:
echo "John 60
John 20
John 30
Mary 1000
Mary 800
Mary 20" | awk '{a[$1]+=$2}{b[$1]=$1" "a[$1]}END{for (c in b) print b[c]}' | sort
John 110
Mary 1820

With the other input it gives:
Code:
awk '{a[$1]+=$2}{b[$1]=$1" "a[$1]}END{for (c in b) print b[c]}' inputfile | sort
aabha 180
aabid 430
aabida 242
aadesa 197
aadesh 52
aadil 31
aaditykumar 69
aaekar 253
aaekaranata 86
aaekaranath 66
aaemakar 315
aaemapal 204
aaemaprakas 145
aaemaprakasa 2500
aaemaprakash 1494
aaemavati 754
aaenkar 177
aaereelal 333
aaeri 358
aafat 66
aafatab 192
aaftab 35
aagani 556

Best regards
This User Gave Thanks to cgkmal For This Post:
# 5  
Old 03-23-2011
Code:
awk '{arr[$1]+= $2}END{for (i in arr) print i" "arr[i]}' inputfile

This User Gave Thanks to tene For This Post:
# 6  
Old 03-23-2011
Many thanks. The solution worked. As I mentioned, I had written a C program to do the job, and in spite of bucketing the data it was very slow. I compared the output of the C program with the awk output and both were the same; the basic difference is that the awk program took only a few seconds to run through a file of 700,000 records.
The only difference is that my program sorts as per frequency: highest to lowest, but that is not a big issue.
How does one massage numeric data in awk so that the output is sorted by frequency?
Thanks once again

Gimley
# 7  
Old 03-23-2011
Quote:
Originally Posted by gimley
The only difference is that my program sorts as per frequency: highest to lowest, but that is not a big issue.
How does one massage numeric data in awk so that the output is sorted by frequency?
Thanks once again

Gimley
Gimley,

Try formatting the output with sort:
Code:
awk '{a[$1]+=$2}{b[$1]=$1" "a[$1]}END{for (c in b) print b[c]}' inputfile |  sort -r -k2 -n
aaemaprakasa 2500
aaemaprakash 1494
aaemavati 754
aagani 556
aabid 430
aaeri 358
aaereelal 333
aaemakar 315
aaekar 253
aabida 242
aaemapal 204
aadesa 197
aafatab 192
aabha 180
aaenkar 177
aaemaprakas 145
aaekaranata 86
aaditykumar 69
aafat 66
aaekaranath 66
aadesh 52
aaftab 35
aadil 31

Regards.
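
A variation on the sort step, in case the order of words with equal frequency matters. This is only a sketch: it restricts the numeric key to the second field and adds the word as a second key, so ties come out alphabetically. The file name totals.txt and its contents are made up from a few of the totals above.

```shell
# A few totals in the "word frequency" layout printed by the awk step.
# The file name totals.txt is hypothetical.
printf 'aafat 66\naabid 430\naaekaranath 66\naagani 556\n' > totals.txt

# -k2,2nr : sort on field 2 only, numerically, largest first
# -k1,1   : break ties on the word, in ascending order
ranked=$(sort -k2,2nr -k1,1 totals.txt)

printf '%s\n' "$ranked"
```

With this key order, aaekaranath is listed before aafat because both have 66 and the word decides. In practice the awk output can be piped straight into the same sort instead of going through a file.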
This User Gave Thanks to cgkmal For This Post: