Replicate merging and frequency calculation


 
# 1  
Old 02-22-2014

Hello, I have a two-column file: an ID column and a column containing a string.

Code:
ID     String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog

I would like to count the number of times each ID occurs in the "ID" column and record it in a new column. Similarly, I would like to count the number of times the substring "dog" appears across all lines with the same ID and record that in another new column. The output should look like this:

Code:
ID    dog_frequency  ID_frequency
EN03  2              1
EN09  0              2
EN08  2              1
EN10  5              2

Any ideas?

Last edited by Scrutinizer; 02-22-2014 at 03:36 PM.. Reason: code tags, not quote tags
# 2  
Old 02-22-2014
Code:
$ cat file
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog

Code:
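awk '
    # Header line of the first pass: print the new header once.
    NR == 1 {
        print "ID", "dog_frequency", "ID_frequency"
        next
    }
    # First pass over the file: count rows and "dog" matches per ID.
    FNR == NR {
        A[$1]++
        B[$1] += gsub(/dog/, x, $2)    # gsub returns the number of replacements; x is empty
        next
    }
    # Second pass: print each ID once, in order of first appearance.
    ($1 in A) {
        print $1, B[$1], A[$1]
        delete A[$1]
        delete B[$1]
    }
' OFS='\t' file{,}    # bash brace expansion: file{,} becomes "file file"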

This gives:
Code:
ID	dog_frequency	ID_frequency
EN03	2	1
EN09	0	2
EN08	2	1
EN10	5	2

# 3  
Old 02-22-2014
Try:
Code:
awk 'NR>1{D[$1]+=gsub(/dog/,x,$2); F[$1]++} END{for (i in D) print i,D[i],F[i]}' file
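
One caveat with the one-liner above: `for (i in D)` walks the array in an unspecified order, and no header line is printed. A minimal sketch that keeps the IDs in first-seen order in a single pass might look like this. The `order` array, the counter `n`, and the temporary input file are illustrative additions, not part of the original post.

```shell
# Sample input from post #1, written to a temporary file (the path is arbitrary).
input=$(mktemp)
cat > "$input" <<'EOF'
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog
EOF

result=$(awk '
    NR == 1    { print "ID", "dog_frequency", "ID_frequency"; next }
    !($1 in F) { order[++n] = $1 }    # remember each ID the first time it is seen
               { D[$1] += gsub(/dog/, "", $2); F[$1]++ }
    END        { for (j = 1; j <= n; j++) print order[j], D[order[j]], F[order[j]] }
' OFS='\t' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

The rows come out in the same order as in the two-pass solution from post #2, but the file is read only once.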

# 4  
Old 02-22-2014
Thanks. Is there any way to find, say, the 5 most abundant substrings of length 6 in column 2 without knowing them beforehand?
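
One possible approach to the length-6 question, sketched under assumptions the post does not settle: substrings are taken as overlapping windows (a sliding window of width `k`), counts are pooled over all rows regardless of ID, and ties are broken alphabetically. The variable `k` and the temporary file are hypothetical names chosen for the example.

```shell
# Sample input from post #1, written to a temporary file (the path is arbitrary).
input=$(mktemp)
cat > "$input" <<'EOF'
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog
EOF

top5=$(awk -v k=6 '
    NR > 1 {
        # slide a window of width k across column 2 and count every substring
        for (i = 1; i + k - 1 <= length($2); i++)
            c[substr($2, i, k)]++
    }
    END { for (s in c) print c[s], s }
' "$input" | LC_ALL=C sort -k1,1nr -k2,2 | head -n 5)

printf '%s\n' "$top5"
rm -f "$input"
```

On the sample data exactly five substrings occur 5 times each (`catcat`, `ehello`, `pehell`, `typehe`, `ypehel`), so they make up the whole top 5. Changing `k` or the `head -n` value adapts the sketch to other lengths or cut-offs.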

---------- Post updated at 06:20 PM ---------- Previous update was at 04:40 PM ----------

Also, any idea how to add up the lengths of the strings for each ID and print the total as well? For example, the output should look like this:

Code:
ID    dog_frequency  ID_frequency  string_length
EN03  2              1             24
EN09  0              2             27
EN08  2              1             12
EN10  5              2             51


Last edited by Scrutinizer; 02-22-2014 at 08:15 PM.. Reason: changed quote tags to code tags
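
For the string-length column, one way to extend the one-liner from post #3 is sketched below. The array name `L` is made up for the example, and the final `sort` is only there to make the row order predictable, since `for (i in D)` does not guarantee any order.

```shell
# Sample input from post #1, written to a temporary file (the path is arbitrary).
input=$(mktemp)
cat > "$input" <<'EOF'
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog
EOF

result=$(awk '
    NR > 1 {
        L[$1] += length($2)             # take the length first: gsub below strips "dog" from $2
        D[$1] += gsub(/dog/, "", $2)    # number of "dog" matches on this line
        F[$1]++                         # number of lines for this ID
    }
    END { for (i in D) print i, D[i], F[i], L[i] }
' OFS='\t' "$input" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -f "$input"
```

The order of the two statements matters: `gsub` edits `$2` in place, so calling `length($2)` afterwards would report the length with every "dog" already removed.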
# 5  
Old 02-23-2014
Okay, it looks like you continued this discussion in a new thread with a different title:

Finding most common substrings | Unix Linux Forums | Shell Programming and Scripting
# 6  
Old 02-24-2014
Sorry about that! I wasn't aware they were the same forum.
# 7  
Old 03-04-2014
Scrutinizer, I am trying to incorporate your code from #3 in a while loop, but I need awk to take each substring from file1, search for it in file2, and count the matches per ID in file2. However, awk only counts the frequency of each ID; it does not count the occurrences of the substring in file2. How can I get this to work? Here is the code:

Code:
while read A B
do
         mawk 'NR>1{D[$1]+=gsub(/${A}/,x,$2); F[$1]++}; END{for (i in D) print i,D[i],F[i]}' file2 > $A.out
done; < file1
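
Two things in the loop above are likely causes. Inside single quotes the shell never expands `${A}`, so awk receives the literal text `${A}` as its regex instead of the substring, which would explain why only the ID counts come out right. In addition, the `;` after `done` detaches `< file1` from the loop, so the loop reads standard input rather than file1. A minimal sketch of a fix follows, assuming file1 holds one substring per line in its first column and file2 has the layout from post #1. The variable name `pat`, the temporary directory, and the added `sort` are illustrative; `mawk` can be used in place of `awk`.

```shell
# Hypothetical sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'dog\ncat\n' > "$workdir/file1"
cat > "$workdir/file2" <<'EOF'
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog
EOF

while read -r A B
do
    # -v copies the shell value into the awk variable pat, which gsub uses as a dynamic regex
    awk -v pat="$A" '
        NR > 1 { D[$1] += gsub(pat, "", $2); F[$1]++ }
        END    { for (i in D) print i, D[i], F[i] }
    ' "$workdir/file2" | LC_ALL=C sort > "$workdir/$A.out"
done < "$workdir/file1"    # the redirection belongs directly after done

dog_counts=$(cat "$workdir/dog.out")
cat_counts=$(cat "$workdir/cat.out")
printf 'dog:\n%s\ncat:\n%s\n' "$dog_counts" "$cat_counts"
rm -rf "$workdir"
```

Because `pat` is treated as a regular expression, a substring containing characters such as `.` or `*` would need escaping before it is passed in.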
