Splitting Concatenated Words in Input File with Words from a Master File


 
# 1  
02-23-2011

Hello,
I have a complex problem. I have a file in which words have been joined together:
Theboy ranslowly
I want to be able to correctly split the words using a lookup file in which all the words occur:
the
boy
ran
slowly
slow
put
child
ly
The lookup file is huge; it serves as the reference for correctly segmenting the input file, which contains run-on words. The input file could also be very large.
A run-on string may contain up to three or four words concatenated together.
I have 2 requirements:
1. Only the largest (longest) matching string should be used for splitting. Thus, given that slowly, slow and ly all occur in the lookup file, I do not want the split to be:
the boy ran slow ly
But rather
the boy ran slowly.

2. If a word is not found in the master list, all the other largest strings should still be output.
E.g., assuming that boy is not in the lookup file, I would still want the split to be:
The boy ran slowly
i.e. "boy" is flagged as residue and tagged as such if possible.
I have tried to write a program which does this (both in Perl and in AWK), but it just fails and spews out incorrect forms, especially when I try to meet condition 1.
I am still a tyro at Perl and AWK, since all my experience for the past 20 years has been in C. I am fascinated by AWK and Perl because of their speed and elegance.
Help would be most appreciated and gratefully acknowledged, as it would help me learn a new skill. Commented code would be a great learning experience, if someone has the patience to provide it, for me as well as for other learners like me.
Manythanks, (Many thanks)

GIMLEY
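
The two requirements above (longest match first, unknown text kept as residue) can be sketched as a greedy left-to-right scan. This is only an illustrative sketch under stated assumptions, not the approach used in the replies below: the function name split_runon, the square-bracket residue marker and the temporary file names are all hypothetical, and the output is lower-cased throughout. A greedy scan can still mis-split when an unknown word happens to contain a lookup word.

```shell
# Hypothetical helper: split_runon LOOKUP_FILE INPUT_FILE
split_runon() {
    awk '
    # Split one run-on token: at each position take the longest lookup word
    # that starts there; characters that start no word are collected as residue.
    function segment(s,    n, pos, len, cand, found, res, residue) {
        n = length(s); pos = 1; res = ""; residue = ""
        while (pos <= n) {
            len = n - pos + 1
            if (len > maxlen) len = maxlen      # no word is longer than maxlen
            found = 0
            for (; len >= 1; len--) {           # longest candidate first
                cand = substr(s, pos, len)
                if (cand in dict) { found = len; break }
            }
            if (found) {
                if (residue != "") {            # flush pending residue, tagged with [ ]
                    res = res (res == "" ? "" : " ") "[" residue "]"
                    residue = ""
                }
                res = res (res == "" ? "" : " ") substr(s, pos, found)
                pos += found
            } else {
                residue = residue substr(s, pos, 1)
                pos++
            }
        }
        if (residue != "") res = res (res == "" ? "" : " ") "[" residue "]"
        return res
    }
    # First file: load the lookup words and remember the longest length.
    NR == FNR {
        w = tolower($1)
        if (w != "") { dict[w] = 1; if (length(w) > maxlen) maxlen = length(w) }
        next
    }
    # Second file: segment every whitespace-separated token of every line.
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            seg = segment(tolower($i))
            out = (out == "" ? seg : out " " seg)
        }
        print out
    }' "$1" "$2"
}

# Sample data taken from the post above (the file names are assumptions).
demo_dir=$(mktemp -d)
printf '%s\n' the boy ran slowly slow put child ly > "$demo_dir/lookup"
printf '%s\n' 'Theboy ranslowly' 'theboyranthroughslowly' > "$demo_dir/raw"
split_runon "$demo_dir/lookup" "$demo_dir/raw"
```

Because the longest candidate is tried first at every position, the lookup file does not need to be sorted for this sketch, and every line of the input file is processed.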
# 2  
02-23-2011
Try this: first, sort your lookup file by word length, longest to shortest, like:
slowly
child
slow
put
the
boy
ran
ly

Then run it:
Code:
 awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw
The boy ran slowly


Last edited by yinyuemi; 02-23-2011 at 08:44 PM..
# 3  
02-23-2011
Hello,
Many thanks for the prompt reply. It does work, but there are two issues.
A small glitch: instead of using the largest string,
slowly
it takes slow and ly and breaks up the concatenated sentence as:
the boy ran slow ly.

Residual data at the end is identified correctly, but when the unknown word is in the middle, things seem to go wrong:

When I gave the string
theboyranthroughslowly
The output was:
the boy ran throughs low ly
Since low is not in the small lookup file, I am perplexed as to how it was generated.
Many thanks once more for the script; I hope these two bugs can be solved.

Gimley
This is precisely the problem I have not been able to solve, apart from the residue issue.
An add-on to the awk script to handle this would be of great help.
# 4  
02-23-2011
Hi Gimley,

I have modified the code above; please try it and see how it works.

Best,

Y
# 5  
02-23-2011
Hi Yinyuemi,
Many thanks for the timely help. The residue problem seems to be sorted out with the new code. However, the largest string issue still remains.
I used the code which you had posted (reproduced below):
Code:
awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw

And I still get
The boy ran through slow ly
for
theboyranthroughslowly

Sorry to hassle you, but the largest-string split is vital for the dictionary work I am doing.
Many thanks once again, and I hope to hear from you.
Best regards,
Gimley

Last edited by Franklin52; 02-24-2011 at 03:32 AM.. Reason: Please use code tags
# 6  
02-23-2011
It seems to work on my computer.
Code:
cat lookup
slowly
child
slow
put
the 
boy 
ran
ly
 
cat raw
theboyranthroughslowly
 
awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw
The boy ran through slowly

# 7  
02-23-2011
Sorry, I did not see the sort step and just blindly copied the code. I get the idea now: it grabs the largest string first.
Two questions:
1. Is there an awk command to handle the largest-to-smallest sort?
2. I gave three sentences, and it worked only on the first. How do I get the code to loop through the whole input file?
Sorry for such stupid questions, but I am still learning awk programming.
Many thanks, and apologies for not reading your post carefully.

Best regards,

Gimley
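
Both questions can be illustrated with a short sketch; the helper names and sample file names here are hypothetical. For the first question, plain POSIX awk has no built-in sort function, so one common approach is to let awk prefix each word with its length and hand the ordering to sort. For the second, in the script quoted in this thread the capitalisation and print sit in the END block, which runs only once after all input has been read; a rule without END runs for every input line, as the second helper shows.

```shell
# Hypothetical helper: print the words of a lookup file, longest first.
sort_by_length() {
    # Prefix each word with its length, sort numerically in reverse on that
    # prefix, then strip the prefix again.
    awk 'NF { print length($1), $1 }' "$1" | sort -k1,1nr | cut -d' ' -f2-
}

# Hypothetical helper: capitalise the first word of every input line.
capitalise_lines() {
    # This rule has no END pattern, so it runs once per input line.
    awk '{ $1 = toupper(substr($1, 1, 1)) substr($1, 2); print }' "$1"
}

# Sample data (the file names are assumptions).
sort_dir=$(mktemp -d)
printf '%s\n' the boy ran slowly slow put child ly > "$sort_dir/lookup"
printf '%s\n' 'the boy ran' 'the child ran slowly' 'put the boy' > "$sort_dir/sentences"

sort_by_length "$sort_dir/lookup"
capitalise_lines "$sort_dir/sentences"
```

The sorted list could be saved to a new file and passed to the awk script in place of the unsorted lookup file. Words of equal length may come out in any order, which does not matter for the longest-first requirement.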