Merging strings which have deviation in frequency

08-10-2014

Dear all,
I need a little help. I am working on a frequency-driven database in which each entry has the following structure:

Code:

headword=gloss<space>Frequency

The data I am working with has dupes, i.e. the headword is repeated more than once, each time with a different gloss variant on the right-hand side and a different frequency, as in the pseudo example given below:

Code:

John=Jean 1200
John=Jehan 300
John=Jan 1100
John=Johann 22

I have written a Perl script which extracts all such instances. Assuming (I know that some may not like this approach) that the gloss with the highest frequency is probably the right candidate, I want to merge the entries with a lower frequency into the one with the highest frequency, with the caveat that if the difference in frequency is less than 10%, both are retained as they are. Thus, for the example above, the output would be as under:

Code:

John=Jean 1200
John=Jean 300
John=Jan 1100
John=Jean 22

John=Jean and John=Jan are not merged since their frequencies differ by less than 10%. The others are merged since they differ from the highest frequency by more than 10%.
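As a worked check of the 10% rule on the John sample, the sketch below prints each frequency as a percentage of the highest one (1200) and marks whether it falls below the 90% cut-off. It is only an illustration: the temporary file and the variable names are assumptions, not part of the original script.

```shell
# Sample data from the post, written to a temporary file (hypothetical helper).
sample=$(mktemp)
printf '%s\n' 'John=Jean 1200' 'John=Jehan 300' 'John=Jan 1100' 'John=Johann 22' > "$sample"

# Pass 1 finds the highest frequency; pass 2 prints each gloss with its
# share of that maximum and whether it is below 90% of it.
ratios=$(awk -F'[= ]' '
  NR==FNR { if ($3+0 > max) max = $3+0; next }
  { printf "%s %.1f%% %s\n", $2, 100*$3/max, ($3 < 0.9*max ? "merge" : "keep") }
' "$sample" "$sample")
printf '%s\n' "$ratios"
rm -f "$sample"
```

Jan comes out at about 91.7% of Jean, so the gap is under 10% and both are kept; Jehan (25%) and Johann (about 1.8%) are well below the cut-off and are merged.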

I have been able to identify all such dupes with a Perl script, and I also have a script which merges frequencies when both headword and gloss are identical. But in spite of all attempts I cannot solve the frequency problem, especially calculating the deviation range. Since I am still a newbie, I just cannot get the operation right.
Basically the steps are as under:

Code:

1. Identifying dupes [Already done]
2. Collecting all dupes, i.e. entries whose headword is identical, in one list
3. Checking frequency
4. Identifying the highest frequency in the list
5. Calculating the frequency deviation range in the list
6. If it is more than 10%, merging the lower frequency into the higher frequency, i.e. replacing the lower-frequency gloss with the higher-frequency gloss
7. If the deviation range is less than 10%, do not modify
8. Finally, merging the frequencies of all entries generated by step 6 whose headword and gloss are identical [Already done]

In case someone would like to tangle with live data, here is a small sample:

Code:

trimbak=ત્રિંબક 87
trimbak= ત્રીંબક 35
trimbak=ત્રીંબક 14
trimbakbhai= ત્રીંબકભાઈ 55
trimbakbhai=ત્રિંબકભાઈ 7
trimbakbhai=ત્રીબંકભાઇ 4
tripathi=ત્રિપાટી 7
tripathi=ત્રિપાઠી 369
tripathi=ત્રિપાથી 4
tripathi=ત્રીપાઠી 8
trivedi=ત્રિવેદિ 28
trivedi=ત્રિવેદી 78
trivedi=ત્રીવેદી 8
trupti=તૃપ્તિ 4
trupti=તૃપ્તી 13

I have checked on the forum but the scripts there do not answer what I need. Many thanks for all help.

gimley

Hi, try:

Code:

 awk -F'= *| ' 'NR==FNR{if($3>M[$1]){M[$1]=$3; P[$1]=$2} next} $3<0.9*M[$1]{$2=P[$1]}{print $1"="$2,$3}' file file

The input file is specified twice.
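Why twice: NR counts lines across all inputs, while FNR restarts at 1 for each file, so NR==FNR is only true while the first copy is being read. A minimal sketch that labels the pass in which each line is seen (the temporary file and variable names are assumptions for illustration):

```shell
# Two lines from the John sample in a temporary file (hypothetical helper).
demo=$(mktemp)
printf '%s\n' 'John=Jean 1200' 'John=Jan 1100' > "$demo"

# The first block only fires for the first copy; "next" skips the second block.
passes=$(awk 'NR==FNR { print "pass 1:", $0; next } { print "pass 2:", $0 }' "$demo" "$demo")
printf '%s\n' "$passes"
rm -f "$demo"
```

If the file is named only once, every line matches NR==FNR and hits next, so the second block never runs.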

The field separator is '= *| ' because in your sample there is sometimes a spurious space after the = sign. If that is not the case with your real data, you could use -F'[= ]'
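To see the difference between the two separators, here is a small sketch on a line with a stray space after the = sign. The test line is made up from the John sample, and the variable names are assumptions.

```shell
# A line with a stray space after "=" (made up for illustration).
line='John= Jean 1200'

# "= *| " treats "=" plus any following spaces as one separator: 3 fields.
loose=$(printf '%s\n' "$line" | awk -F'= *| ' '{ print NF ": [" $2 "] [" $3 "]" }')

# "[= ]" splits on every single "=" or space, so the stray space yields
# an empty second field and shifts the gloss into $3: 4 fields.
strict=$(printf '%s\n' "$line" | awk -F'[= ]' '{ print NF ": [" $2 "] [" $3 "]" }')

printf '%s\n%s\n' "$loose" "$strict"
```

With the second separator the frequency would end up in $4 instead of $3, so it only suits data without such spaces.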

Scrutinizer

Hello,
Many thanks for your solution. I tried it out, but it does not seem to work.
I removed all spaces before and after the equals sign in the database.
I loaded the program and ran it. Since I was getting an error for the field separator you had suggested, I modified it as

Code:

FS="[= ]"

The program ran, but the output is the same as the input.
Did I goof up somewhere in changing the field separator? I normally use

Code:

FS=" "

with whatever the field separator is placed between the inverted commas.
Please help. The solution is frustratingly close.
Many thanks for your patience.

gimley

Hi, did you specify the input file twice? That is essential, because the program then reads the file twice. Also, with the field separator it is best to use single quotes, as in my example. I noticed you use FS= rather than -F. Where did you specify this in the code? What error were you getting with my suggested field separator?

With your input files these are the results I am getting:

Code:

John=Jean 1200
John=Jean 300
John=Jan 1100
John=Jean 22

and

Code:

trimbak=ત્રિંબક 87
trimbak=ત્રિંબક 35
trimbak=ત્રિંબક 14
trimbakbhai=ત્રીંબકભાઈ 55
trimbakbhai=ત્રીંબકભાઈ 7
trimbakbhai=ત્રીંબકભાઈ 4
tripathi=ત્રિપાઠી 7
tripathi=ત્રિપાઠી 369
tripathi=ત્રિપાઠી 4
tripathi=ત્રિપાઠી 8
trivedi=ત્રિવેદી 28
trivedi=ત્રિવેદી 78
trivedi=ત્રિવેદી 8
trupti=તૃપ્તી 4
trupti=તૃપ્તી 13

What result are you getting?
If it still does not work, what is your OS and version?

Scrutinizer

Yes, I did specify the file twice instead of once, as I deduced from the awk syntax, and redirected the output to an out file.
Maybe this is because I am using DOS on Windows Vista, with Gawk32 to accommodate a 32-bit version of Windows.
I thought that maybe it was an OS issue, but awk runs seamlessly across all OSes (at least that's what I thought). Maybe this is why your syntax for the field separator did not work. I can see it works beautifully for you: the output is flawless.
Is there any solution to make the program run in my environment?
Once again many thanks for your patience and time. This is a learning experience for me for which I am grateful.

gimley

awk should run seamlessly, but on Windows there are all sorts of quoting issues (which are not caused by awk but by the shell environment). Try putting the script in a file:

Code:

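# Split fields on "=" or a single space:
# $1=headword, $2=gloss, $3=frequency
BEGIN {
  FS="[ =]"
}

# First pass (first copy of the file): remember, per headword,
# the highest frequency (M) and the gloss that carries it (P)
NR==FNR {
  if($3>M[$1]) {
    M[$1]=$3
    P[$1]=$2
  }
  next
}

# Second pass: a frequency below 90% of the maximum takes the top gloss
$3<0.9*M[$1] {
  $2=P[$1]
}

# Print every line, rebuilt as headword=gloss frequency
{
  print $1"="$2,$3
}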

And run it like

Code:

awk -f scriptfile file file
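If naming the file twice is inconvenient (for example when the data comes from a pipe), a single-pass variant is possible. The sketch below is an alternative technique, not the script above: it stores every line in arrays and applies the same 90% rule in an END block. The script file and all array names other than M and P are assumptions.

```shell
# Hypothetical single-pass variant written to a temporary script file.
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { FS = "[ =]" }
{
  # Remember each line and track the top frequency/gloss per headword.
  head[NR] = $1; gloss[NR] = $2; freq[NR] = $3 + 0
  if (freq[NR] > M[$1]) { M[$1] = freq[NR]; P[$1] = $2 }
}
END {
  # Replay the stored lines in input order, applying the 90% rule.
  for (i = 1; i <= NR; i++) {
    g = (freq[i] < 0.9 * M[head[i]]) ? P[head[i]] : gloss[i]
    print head[i] "=" g, freq[i]
  }
}
EOF

merged=$(printf '%s\n' 'John=Jean 1200' 'John=Jehan 300' 'John=Jan 1100' 'John=Johann 22' | awk -f "$script")
printf '%s\n' "$merged"
rm -f "$script"
```

The trade-off is memory: the whole input is held in arrays until END, whereas the two-pass version only keeps one maximum and one gloss per headword.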

Scrutinizer

Many thanks for your patience and help. It ran beautifully, and fast too. I will test it out on a large file, but I am sure it will give perfect output.
Many thanks once more.
I will study the script and try to understand the logic of the field separator.

gimley
