Merging strings which have deviation in frequency


 
# 1  
Old 08-10-2014

Dear all,
I need a little help. I am working on a frequency-driven database whose structure is as follows:

Code:
headword=gloss<space>Frequency

The data I am working with has dupes, i.e. the headword is repeated more than once, each time with a different gloss variant on the right-hand side and a different frequency, as in the pseudo-example given below:
Code:
John=Jean 1200
John=Jehan 300
John=Jan 1100
John=Johann 22

I have written a Perl script which extracts all such instances. Assuming (I know that some may not like this approach) that the gloss with the highest frequency is probably the right candidate, I want to merge the glosses with lower frequencies into the one with the highest frequency, with the caveat that if the difference in frequency is lower than 10%, both are retained as they are. Thus, in the example provided, the output would be as follows:
Code:
John=Jean 1200
John=Jean 300
John=Jan 1100
John=Jean 22

Code:
John=Jean and John=Jan

are not merged since the deviation is less than 10% (1100 is within 10% of 1200). The others are merged since their deviation from the highest frequency is more than 10%.

Using a Perl script, I have been able to identify all such dupes, and I also have a script which merges frequencies when both headword and gloss are identical. In spite of all my attempts, however, I cannot solve the frequency problem, especially calculating the deviation range. Since I am still a newbie, I just cannot get the operation right.
Basically, the steps are as follows:
Code:
1. Identifying dupes [Already done]
2. Collecting all dupes i.e. the headword is identical in one list
3. Checking frequency
4. Identifying highest frequency in the list
5. Calculating frequency deviation range in the list. 
6. If more than 10%, merging the lower frequency into the higher frequency, i.e. replacing the lower-frequency gloss with the higher-frequency gloss
7. If the deviation range is less than 10%; do not modify.
8. Finally merge frequencies of all words generated out by step 6 i.e. headword and gloss are identical [Already done]

In case someone would like to tangle with live data, here is a small sample:
Code:
trimbak=ત્રિંબક 87
trimbak= ત્રીંબક 35
trimbak=ત્રીંબક 14
trimbakbhai= ત્રીંબકભાઈ 55
trimbakbhai=ત્રિંબકભાઈ 7
trimbakbhai=ત્રીબંકભાઇ 4
tripathi=ત્રિપાટી 7
tripathi=ત્રિપાઠી 369
tripathi=ત્રિપાથી 4
tripathi=ત્રીપાઠી 8
trivedi=ત્રિવેદિ 28
trivedi=ત્રિવેદી 78
trivedi=ત્રીવેદી 8
trupti=તૃપ્તિ 4
trupti=તૃપ્તી 13

I have checked on the forum but the scripts there do not answer what I need. Many thanks for all help.
# 2  
Old 08-10-2014
Hi, try:
Code:
 awk -F'= *| ' 'NR==FNR{if($3>M[$1]){M[$1]=$3; P[$1]=$2} next} $3<0.9*M[$1]{$2=P[$1]}{print $1"="$2,$3}' file file

The input file is specified twice.

The field separator is = *| because in your sample there are sometimes spurious spaces after the = signs. If that is not the case with your real data, you could use -F'[= ]'
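To make the difference between the two separators concrete, here is a small sketch. The helper name split_demo and the bracketed output format are made up for illustration only; the sample lines follow the John example from post #1.

```shell
# Show how each candidate field separator splits a line into
# headword, gloss and frequency. split_demo is a hypothetical helper.
split_demo() {
  # $1 = field separator regex, $2 = input line
  printf '%s\n' "$2" | awk -F"$1" '{ printf "[%s] [%s] [%s]\n", $1, $2, $3 }'
}

split_demo '= *| ' 'John=Jean 1200'    # [John] [Jean] [1200]
split_demo '= *| ' 'John= Jean 1200'   # [John] [Jean] [1200]  (space after = is absorbed)
split_demo '[= ]'  'John=Jean 1200'    # [John] [Jean] [1200]
split_demo '[= ]'  'John= Jean 1200'   # [John] [] [Jean]      (stray space shifts the fields)
```

With '[= ]' a stray space after the = produces an empty second field, so that separator is only safe once the spaces have been cleaned out of the data.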
# 3  
Old 08-10-2014
Hello,
Many thanks for your solution. I tried it out, but it does not seem to work.
I removed all spaces before and after the equal to sign in the database.
I loaded the program and ran it. Since I was getting an error for the field separator you had suggested, I modified it as
Code:
FS="[= ]"

The program ran, but the output is the same as the input.
Did I goof up somewhere in changing the field separator? I normally use
Code:
FS=" "

with whatever the field separator is placed between the inverted commas.
Please help. The solution is frustratingly close.
Many thanks for your patience.
# 4  
Old 08-10-2014
Hi, did you specify the input file twice? That is essential, because the program then reads the file twice. Also, with the field separator, it is best to use single quotes, as in my example. I noticed you use FS= rather than -F. Where did you specify this in the code? What error were you getting with my suggested field separator?
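As a rough sketch of why the file has to be named twice: NR counts records across all inputs, while FNR restarts at 1 for each file, so NR==FNR is true only while the first copy is being read. The temporary file, the array name max and the max= output column below are only for illustration, and the +0 is added to force a numeric comparison.

```shell
# Two-pass idiom: the same file is given twice on the command line.
tmp=$(mktemp)
printf '%s\n' 'John=Jean 1200' 'John=Jehan 300' 'John=Jan 1100' 'John=Johann 22' > "$tmp"

result=$(awk -F'= *| ' '
  # Pass 1: record the highest frequency seen for each headword.
  NR==FNR { if ($3+0 > max[$1]+0) max[$1] = $3+0; next }
  # Pass 2: the maximum is now known for every line that is read.
  { printf "%s %s max=%s\n", $2, $3, max[$1] }
' "$tmp" "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

If the file is given only once, the second block never runs, because every line satisfies NR==FNR and is consumed by next.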

With your input files these are the results I am getting:

Code:
John=Jean 1200
John=Jean 300
John=Jan 1100
John=Jean 22

and
Code:
trimbak=ત્રિંબક 87
trimbak=ત્રિંબક 35
trimbak=ત્રિંબક 14
trimbakbhai=ત્રીંબકભાઈ 55
trimbakbhai=ત્રીંબકભાઈ 7
trimbakbhai=ત્રીંબકભાઈ 4
tripathi=ત્રિપાઠી 7
tripathi=ત્રિપાઠી 369
tripathi=ત્રિપાઠી 4
tripathi=ત્રિપાઠી 8
trivedi=ત્રિવેદી 28
trivedi=ત્રિવેદી 78
trivedi=ત્રિવેદી 8
trupti=તૃપ્તી 4
trupti=તૃપ્તી 13

What result are you getting?
If it still does not work, what is your OS and version?
# 5  
Old 08-10-2014
Yes, I did specify the file twice instead of once, as I deduced from the awk syntax, and piped the output to an out file.
Maybe this is because I am using DOS and Windows Vista, with Gawk32 to accommodate a 32-bit version of Windows.
I thought that maybe it was an OS issue, but awk runs seamlessly on all OSes (at least that's what I thought). Maybe this is why your syntax for the field separator did not work. I can see it works beautifully for you: the output is flawless.
Is there any solution to make the program run in my environment?
Once again many thanks for your patience and time. This is a learning experience for me for which I am grateful.
# 6  
Old 08-10-2014
awk should run seamlessly, but on Windows there are all sorts of quoting issues (which are not caused by awk but by the shell environment). Try putting the script in a file:

Code:
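BEGIN {
  FS="[ =]"            # split on "=" or a single space
}

# First pass (NR==FNR holds only while the first copy of the file is read):
# remember the highest frequency M and its gloss P for each headword.
NR==FNR {
  if($3>M[$1]) {
    M[$1]=$3
    P[$1]=$2
  }
  next
}

# Second pass: a frequency below 90% of the maximum takes the top gloss.
$3<0.9*M[$1] {
  $2=P[$1]
}

{
  print $1"="$2,$3
}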

And run it like this:
Code:
awk -f scriptfile file file
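For completeness, here is a hedged sketch of the whole chain on the John example: the merge step above, followed by a summing step for identical headword=gloss pairs (step 8 in post #1, which the original poster already has a script for). The summing code, the variable names and the temporary file are assumptions for illustration. It also assumes there are no spaces around the = and none inside a headword or gloss.

```shell
tmp=$(mktemp)
printf '%s\n' 'John=Jean 1200' 'John=Jehan 300' 'John=Jan 1100' 'John=Johann 22' > "$tmp"

# Merge step: same logic as the script above, with +0 forcing numeric comparison.
merged=$(awk -F'[ =]' '
  NR==FNR {
    if ($3+0 > M[$1]+0) { M[$1] = $3; P[$1] = $2 }   # pass 1: top gloss per headword
    next
  }
  $3+0 < 0.9 * M[$1] { $2 = P[$1] }                  # pass 2: replace weaker glosses
  { print $1 "=" $2, $3 }
' "$tmp" "$tmp")

# Summing step (illustrative): add up frequencies of identical pairs,
# keeping the order in which each pair first appears.
summed=$(printf '%s\n' "$merged" | awk '
  { if (!($1 in sum)) order[++n] = $1; sum[$1] += $2 }   # $1 is "headword=gloss"
  END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
')
rm -f "$tmp"
printf '%s\n' "$summed"
```

On this sample the merged list matches the expected output in post #1, and the summing step leaves John=Jean 1522 and John=Jan 1100.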

This User Gave Thanks to Scrutinizer For This Post:
# 7  
Old 08-10-2014
Many thanks for your patience and help. It ran beautifully, and fast too. I will test it out on a large file, but I am sure it will give perfect output.
Many thanks once more.
I will study the script and try to understand the logic of the field separator.