Grepping a list of words from one file in a master database of homophones


 
# 1  
Old 12-11-2013

Hello,
I am sorry if the title is confusing. I need a script that takes each name from a source file, looks it up in a master database in which all the homophonic variants of a name are listed with a single shared index key, and stores every variant it finds in an output file. I need this because I am testing the accuracy of a homophone algorithm which I have written.

An example will make this clear. Let us assume that the source file has the following entries:
Code:
John
Mary

and the master file has the following:
Code:
Jon<Tab>2003
Jean<Tab>2003
John<Tab>2003
Johan<Tab>2003
Johann<Tab>2003

Mary<Tab>21978
Marie<Tab>21978
Mariam<Tab>21978
Marium<Tab>21978

Each group of indexed entries is separated by a blank line.

The output file would list, for each word from the source file, all the homophones found in the master file that share its index.
The source file has around 30,00+ entries. At present I have to open both files in a text editor, select a word in the source file, search for it in the master database, copy the matches to the clipboard and paste them into the output file. Since this is a long and tedious operation, I was wondering if there is a Perl or AWK script which could do the job.
My OS is Windows, so all the wonderful UNIX tools don't help.
Many thanks for your help.
# 2  
Old 12-11-2013
Please show us what format you want to use in the output file.
# 3  
Old 12-11-2013
So sorry, I should have specified the format of the output. The structure should be as follows:
Code:
keyword from source file followed by:
list of words from master along with index key

The desired output would be as follows:
Code:
John:
John<tab>word index
Jon<tab>word index
Johann<tab>word index
Jan<tab>word index

Many thanks for your interest and prompt response.
# 4  
Old 12-11-2013
For output like this:

Code:
John(2003) : Jon Jean John Johan Johann
Mary(21978) : Mary Marie Mariam Marium

try:
Code:
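awk 'FNR==NR {H[$1]=$2; L[$2]=L[$2]" "$1 ; next }  # first file: H[word]=key, L[key]=list of words
# second file: print each name, its key, and the words sharing that key
{ print $1"(" H[$1]") : "substr(L[H[$1]],2) } ' homophones names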

---------- Post updated at 12:57 PM ---------- Previous update was at 12:54 PM ----------

For your listed output try:

Code:
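awk 'FNR==NR {H[$1]=$2; L[$2]=L[$2]" "$1 ; next }  # first file: H[word]=key, L[key]=list of words
{ print $1":"                                      # second file: the name as a heading
  for(i=split(L[H[$1]], r);i;i--)                  # walk the words in its group, last to first
     print r[i]"\t"H[$1] } ' homophones names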
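
For anyone who wants to try the second format on the sample data from post #1, here is a self-contained sketch. It departs from the script above in ways that are my assumptions, not anything stated in the thread: it splits on tabs only (`-F'\t'`), strips a trailing carriage return in case the files were saved with Windows line endings, skips the blank separator lines, and prints each group in master-file order instead of reversed. The array names `key` and `group` and the temporary directory are only for illustration.

```shell
# Build the two sample files from post #1 in a scratch directory.
tmp=$(mktemp -d)
printf 'John\nMary\n' > "$tmp/names"
printf 'Jon\t2003\nJean\t2003\nJohn\t2003\nJohan\t2003\nJohann\t2003\n\n' > "$tmp/homophones"
printf 'Mary\t21978\nMarie\t21978\nMariam\t21978\nMarium\t21978\n' >> "$tmp/homophones"

out=$(awk -F'\t' '
  { sub(/\r$/, "") }              # assumption: input may carry Windows CR line endings
  FNR == NR {                     # first file: the master list
    if (NF < 2) next              # skip the blank lines between groups
    key[$1] = $2                  # word -> index key
    group[$2] = group[$2] "\n" $1 "\t" $2   # index key -> accumulated output lines
    next
  }
  {                               # second file: the names to look up
    print $1 ":"
    if ($1 in key)                # names missing from the master print only the heading
      print substr(group[key[$1]], 2)       # drop the leading newline
  }
' "$tmp/homophones" "$tmp/names")

printf '%s\n' "$out"
rm -r "$tmp"
```

Redirecting the final `printf` to a file gives the output file asked for in post #1. Splitting on tabs only means a name that itself contains a space would stay intact, which the default whitespace splitting would not guarantee.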

# 5  
Old 12-11-2013
Many thanks. Both worked and are pretty fast. The second fits into my scheme of things for further massaging the output data.