Post 302912573 by gimley on Sunday 10th of August 2014 12:12:51 AM
Merging strings which have deviation in frequency

Dear all,
I need a little help. I am working on a frequency-driven database whose entries have the following structure:

Code:
headword=gloss<space>Frequency

The data I am working with has dupes, i.e. the headword is repeated more than once, each time with a different gloss variant on the right-hand side and a different frequency, as in the pseudo-example below:
Code:
John=Jean 1200
John=Jehan 300
John=Jan 1100
John=Johann 22

I have written a Perl script which extracts all such instances. Assuming [I know that some may not like this approach] that the entry with the highest frequency is probably the right candidate, I want to merge the lower-frequency glosses into the highest-frequency one, with the caveat that if the difference in frequency is lower than 10%, both are retained as they are. Thus, for the example above, the output would be:
Code:
John=Jean 1200
John=Jean 300
John=Jan 1100
John=Jean 22

John=Jean and John=Jan are not merged, since their frequencies (1200 and 1100) differ by less than 10%. The others are merged, since their frequencies differ from the highest one by more than 10%.
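One possible reading of the 10% rule, sketched in Python rather than Perl: measure how far each frequency falls below the highest frequency in the group, as a fraction of that highest frequency. The function name `relative_deviation` and the choice of the highest frequency as the denominator are assumptions; the rule as stated above does not say what the percentage is taken of.

```python
def relative_deviation(freq: int, top: int) -> float:
    """Shortfall of freq below top, as a fraction of top (assumed definition)."""
    return (top - freq) / top


# Frequencies from the John example above.
freqs = {"Jean": 1200, "Jehan": 300, "Jan": 1100, "Johann": 22}
top = max(freqs.values())

for gloss, freq in freqs.items():
    dev = relative_deviation(freq, top)
    verdict = "keep" if dev < 0.10 else "merge"
    print(f"{gloss}: {dev:.1%} -> {verdict}")
```

Under this reading Jan falls about 8.3% short of Jean and is kept, while Jehan (75%) and Johann (about 98%) are merged.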

I have been able to identify all such dupes with a Perl script, and I also have a script which merges frequencies when both headword and gloss are identical. In spite of all my attempts, however, I cannot solve the frequency problem, especially calculating the deviation range. Since I am still a newbie, I just cannot get the operation right.
Basically, the steps are as follows:
Code:
1. Identifying dupes [Already done]
2. Collecting all dupes (entries whose headword is identical) in one list
3. Checking frequency
4. Identifying the highest frequency in the list
5. Calculating the frequency deviation range in the list
6. If more than 10%, merging the lower frequency into the higher frequency, i.e. replacing the lower-frequency gloss by the higher-frequency gloss
7. If the deviation range is less than 10%, do not modify
8. Finally, merging the frequencies of all entries produced by step 6 in which headword and gloss are identical [Already done]
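The steps above could be sketched roughly as follows. This is Python, not the Perl script mentioned in the post, and it rests on assumptions: the function names `merge_low_frequency` and `sum_identical` are made up for illustration, deviation is measured relative to the highest frequency in each headword group, a deviation of exactly 10% is treated as "merge", and frequencies are positive integers.

```python
from collections import defaultdict

THRESHOLD = 0.10  # assumed: fraction of the highest frequency in the group


def merge_low_frequency(lines, threshold=THRESHOLD):
    """Steps 2-7: per headword, replace glosses that deviate too far from the top."""
    groups = defaultdict(list)  # headword -> [(gloss, freq), ...] in input order
    for line in lines:
        headword, rest = line.split("=", 1)
        gloss, freq = rest.rsplit(None, 1)  # the frequency is the last field
        groups[headword].append((gloss.strip(), int(freq)))

    out = []
    for headword, entries in groups.items():
        # Step 4: the entry with the highest frequency (first one wins a tie).
        top_gloss, top_freq = max(entries, key=lambda e: e[1])
        for gloss, freq in entries:
            # Steps 5-7: at or above the threshold, take over the top gloss.
            if (top_freq - freq) / top_freq >= threshold:
                gloss = top_gloss
            out.append((headword, gloss, freq))
    return out


def sum_identical(entries):
    """Step 8: add up frequencies where headword and gloss are identical."""
    totals = defaultdict(int)
    for headword, gloss, freq in entries:
        totals[(headword, gloss)] += freq
    return [f"{h}={g} {n}" for (h, g), n in totals.items()]


sample = ["John=Jean 1200", "John=Jehan 300", "John=Jan 1100", "John=Johann 22"]
merged = merge_low_frequency(sample)
for h, g, n in merged:
    print(f"{h}={g} {n}")
print(sum_identical(merged))
```

On the John example this reproduces the desired output given earlier (Jean 1200, Jean 300, Jan 1100, Jean 22), and step 8 then collapses the three Jean lines into a single entry with frequency 1522.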

In case someone would like to tangle with live data, here is a small sample:
Code:
trimbak=ત્રિંબક 87
trimbak= ત્રીંબક 35
trimbak=ત્રીંબક 14
trimbakbhai= ત્રીંબકભાઈ 55
trimbakbhai=ત્રિંબકભાઈ 7
trimbakbhai=ત્રીબંકભાઇ 4
tripathi=ત્રિપાટી 7
tripathi=ત્રિપાઠી 369
tripathi=ત્રિપાથી 4
tripathi=ત્રીપાઠી 8
trivedi=ત્રિવેદિ 28
trivedi=ત્રિવેદી 78
trivedi=ત્રીવેદી 8
trupti=તૃપ્તિ 4
trupti=તૃપ્તી 13
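Two lines in the sample have a space right after the `=` sign. If that space is kept as part of the gloss, otherwise identical glosses would compare as different in steps 6 and 8. A minimal parsing sketch in Python that trims it, assuming the frequency is always the last whitespace-separated field (`parse_entry` is a hypothetical helper name):

```python
def parse_entry(line: str):
    """Split 'headword=gloss<space>frequency' into its three parts."""
    headword, rest = line.split("=", 1)
    gloss, freq = rest.rsplit(None, 1)  # split off the trailing frequency
    return headword.strip(), gloss.strip(), int(freq)


with_space = parse_entry("trimbak= ત્રીંબક 35")
without_space = parse_entry("trimbak=ત્રીંબક 14")

# After trimming, the two glosses can be compared directly.
print(with_space[1] == without_space[1])
```

Whether the stray space is meaningful in the real database is not stated in the post; if it is, the `strip()` calls should be dropped.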

I have checked on the forum but the scripts there do not answer what I need. Many thanks for all help.
 
