Assigning the same frequency to more than one word in a file
I have a file of names with the following structure
i.e. more than one name is assigned the same frequency. An example will make this clear:
I want to assign the same frequency to both names (or to all three names) to ensure that, statistically, both or all three names within a field retain their frequency.
The expected output would be
I am doing this field separation by means of an Excel macro, but since the database is huge, the process is long and tedious.
Would it be possible to do the same with the help of a PERL/AWK script? I have already written an awk tool that merges frequencies, which I could use here. As an example, all occurrences of
would thus have a merged frequency.
I work under the Windows OS; UNIX (sigh) is not my OS. No shell scripts, please.
Many thanks.
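For readers following along, a merge pass like the one the poster describes might look like the sketch below. The sample data, the tab-separated layout, and the summing behaviour are all assumptions, not the poster's actual tool; the shell pipeline is only for demonstration, since the awk program itself runs the same way under a Windows awk port such as gawk.

```shell
# Hypothetical sample: two NAME<TAB>FREQUENCY lines for GHOSH, one for BOSE.
printf 'GHOSH\t5\nGHOSH\t7\nBOSE\t3\n' |
awk -F'\t' '
    { sum[$1] += $2 }                           # accumulate frequencies per name
    END { for (n in sum) print n "\t" sum[n] }  # for-in order is unspecified
' | sort                                        # sort only to fix the order
# BOSE    3
# GHOSH   12
```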
The following awk script provides two lists (with an empty line between them). The first list provides:
lines for all NAMES on an input line, and the second list provides
for all FREQUENCY entries for each NAME entry found in the input.
I know you're doing this on Windows, but if someone else wants to try it on a Solaris/SunOS system, they would need to use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk. With your sample input, the output produced is:
PS: Note that the output produced here seems to match your input (but even after sorting is VERY different from the output you said you wanted). As an example, your input contains the lines:
but your output lines for GHOSH are:
instead of:
???
Last edited by Don Cragun; 09-05-2013 at 11:48 PM. Reason: Note differences between my results and expected output.
Many thanks. It worked fast and zipped through over 700,000 records in no time.
The only hassle:
when a word has more than one occurrence, and therefore more than one frequency, all the frequencies belonging to that word are stored on one line.
Example:
How do I get the script to store these on separate lines? My frequency-merge script accepts
and merges them.
If it is not too much of a hassle, could you please comment that code? I tried to modify the script, but it mangled the results.
Many thanks once more.
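One way to get each frequency onto its own line is to split the merged list back apart. This is only a sketch: it assumes the merged lines use a tab between the NAME and a space-separated frequency list, which may not match the actual layout produced by the script.

```shell
# Hypothetical merged input: NAME<TAB>freq1 freq2 ...
printf 'GHOSH\t5 7\nBOSE\t3\n' |
awk -F'\t' '
    {
        n = split($2, f, " ")      # break the frequency list into pieces
        for (i = 1; i <= n; i++)
            print $1 "\t" f[i]     # one NAME/FREQUENCY pair per line
    }
'
# GHOSH   5
# GHOSH   7
# BOSE    3
```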
I'm sorry I confused you.
Please look at the output again! You will see two sets of output. The 1st set only contains
exactly as you requested, but the output data matches your sample input instead of your sample output. The output provided by my script prints entries in the order they were found in the input file. (Your sample output seemed to be in a fairly random order and had FREQUENCY values for some NAMEs that were not present in your sample input.)
You said you had a second awk program that would give you a merged list of all frequencies associated with a NAME. The second part of the output produced by my awk script did that without needing a second script.
Looking at my script again:
If you don't want the 2nd part of the output, remove the code shown in red. That will leave you with:
which prints a line for each field except the last from every input line in the file. Each output line contains a name found in one of the first (NF - 1) fields ($i) of a line and the frequency found in the last field ($NF), separated by a tab character.
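Reduced to a self-contained sketch with made-up input (the real script and data were posted as code blocks not reproduced here), that per-field loop looks like this:

```shell
# Hypothetical input: several NAMEs per line, followed by one FREQUENCY field.
printf 'ANU GHOSH 5\nBOSE 3\n' |
awk '
    {
        for (i = 1; i < NF; i++)   # every field except the last ($NF)
            print $i "\t" $NF      # NAME<TAB>FREQUENCY
    }
'
# ANU     5
# GHOSH   5
# BOSE    3
```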
The stuff that was in red created an array (a[]) where the array index was a NAME and the value of a[NAME] is a list of the frequencies found in the input. The END clause in the awk script printed an empty line to separate the two parts of the output, followed by the elements of the array giving each NAME and the list of frequencies found for that NAME, in a random output order. (The frequencies for a given NAME in the output appear in the order in which the entries were found in the input file.)
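Putting both parts together as a commented sketch: the array name a[] follows the description above, but the input sample is invented, and the exact way the original script builds the frequency list may differ.

```shell
printf 'ANU GHOSH 5\nBOSE 3\nGHOSH 7\n' |
awk '
    {
        for (i = 1; i < NF; i++) {
            print $i "\t" $NF      # part 1: NAME<TAB>FREQUENCY per name field
            a[$i] = a[$i] " " $NF  # remember every frequency seen for this name
        }
    }
    END {
        print ""                   # empty line separating the two parts
        for (name in a)            # part 2: NAME and its frequency list
            print name "\t" substr(a[name], 2)   # drop the leading space
    }
'
```

Part 1 preserves input order; part 2 comes out in whatever order the for-in loop visits the array, so on this sample it would contain ANU with 5, BOSE with 3, and GHOSH with "5 7" in some order.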
Please let me know if this still is not clear.