Splitting Concatenated Words in Input File with Words from a Master File


 
# 8  
Old 02-23-2011
I know this is longer, but I feel it should be safer (no gsub calls):

Code:
awk 'NR==FNR{a[$1]; next}
 function lsr(c,p) {
    for(p=length(c);p;p--)
           if(substr(c,1,p) in a) break;
    if (p) return substr(c,1,p);
    return "";
 }
 {IGNORECASE=1;
  while(length) {
     s=lsr($0);
     while (!s) {
         printf substr($0,1,1);
         $0=substr($0,2);
         s=lsr($0);
         if (s) printf " ";
     }
     printf "%s ", s;
     $0=substr($0,length(s)+1)
  }
  printf "\n"; }' lookup raw


Last edited by Chubler_XL; 02-23-2011 at 10:03 PM.. Reason: Cleanup formatting
# 9  
Old 02-23-2011
It works beautifully. Many thanks to you and Y for your timely help.
I'll walk through the code, and in case I don't get something, I'll try and hassle the forum for an answer.
Many thanks once again,

Gimley
# 10  
Old 02-23-2011
1.
Code:
 awk '{a[$1]=length($1)}END{for(i in a) print a[i],i|"sort -nr"}' lookup |awk '{print $2}' >new_lookup

2.
Code:
awk '
NR == FNR { a[NR] = $1; b[$1] = 1; x = NR; next }
{
    IGNORECASE = 1
    for (j = 1; j <= x; j++)
        for (i = 1; i <= NF; i++)
            if (length($i) > length(a[j]) && !($i in b) && $i ~ a[j] && $i != a[j])
                gsub(a[j], " " a[j] " ", $0)
}
{ $1 = toupper(substr($1, 1, 1)) substr($1, 2); print }
' new_lookup raw

hopefully it works for you
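
To make step 1 a bit more concrete, here is a small sketch of what the length sort does to a lookup file. The scratch directory and the six sample words are my own assumptions for illustration (chosen with distinct lengths so the order is unambiguous); the pipeline itself is the one from step 1.

```shell
# Hypothetical scratch file with a few lookup words of distinct lengths.
tmp=$(mktemp -d)
printf '%s\n' slow ly slowball the quick slowly > "$tmp/lookup"

# Step 1 from the post: prefix each word with its length, sort numerically
# in reverse, then drop the length column again.
result=$(awk '{a[$1]=length($1)} END{for(i in a) print a[i],i|"sort -nr"}' "$tmp/lookup" |
         awk '{print $2}')

rm -rf "$tmp"
printf '%s\n' "$result"
```

The longest words come out first, so step 2 gets to try "slowball" before "slow". Words of equal length come out in whatever order sort leaves them.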
# 11  
Old 02-23-2011
Sorry for the hassle. It worked beautifully for the earlier strings, but then I tried the following:
LOOKUP
subramanian
raghava
rajendra
manian
prasad

INPUT
rajendraprasadsubramaniam
perisubramaniam
rajendraperisubramaniam

The program gave the first answer and then did not progress further. I had to press Ctrl-C to get back to the DOS prompt.
Any answer to that, please? Many thanks
Gimley
# 12  
Old 02-23-2011
Y, your revised solution works for me now, but it doesn't work for this:


lookup
Code:
slowball
slowly
play
child
quick
slow
not
put
the 
boy 
ran
ly
is

raw
Code:
theboyranthroughslowly
heistoslowtoplayslowball

---------- Post updated at 11:25 AM ---------- Previous update was at 11:20 AM ----------

OK, fixed it now. Change

Code:
while (!s) {

to
Code:
while (length && !s) {
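
For anyone following along, here is a rough sketch of why the extra test matters, run against Gimley's data from post #11. When the tail of a line matches nothing, the inner loop eats it one character at a time; once $0 is empty, lsr keeps returning "" and the old `while (!s)` never exits. The scratch directory, the explicit `length($0)`, the `printf "%s"` form and the trailing `sed` (only there to trim the trailing space) are my assumptions, not part of the posted script.

```shell
# Hypothetical scratch files holding the lookup and two input lines from post #11.
tmp=$(mktemp -d)
printf '%s\n' subramanian raghava rajendra manian prasad > "$tmp/lookup"
printf '%s\n' rajendraprasadsubramaniam perisubramaniam > "$tmp/raw"

result=$(awk 'NR==FNR{a[$1]; next}
 function lsr(c,p) {
    # longest prefix of c that is a lookup word, or "" if there is none
    for(p=length(c);p;p--)
        if(substr(c,1,p) in a) break;
    return p ? substr(c,1,p) : "";
 }
 {while(length($0)) {
     s=lsr($0);
     while (length($0) && !s) {   # the fix: stop once the line is used up
         printf "%s", substr($0,1,1);
         $0=substr($0,2);
         s=lsr($0);
         if (s) printf " ";
     }
     printf "%s ", s;
     $0=substr($0,length(s)+1)
  }
  printf "\n"; }' "$tmp/lookup" "$tmp/raw" | sed 's/ *$//')

rm -rf "$tmp"
printf '%s\n' "$result"
```

With this data it should print "rajendra prasad subramaniam" and "perisubramaniam". The input spells the name with a final m while the lookup has "subramanian", so that tail stays unsplit, but the run now terminates instead of hanging.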

# 13  
Old 02-23-2011
A small change: "gsub" becomes "sub", applied to $i, and "$0=$0" is added so that the record is re-split and NF is updated:
Code:
awk '
NR == FNR { a[NR] = $1; b[$1] = 1; x = NR; next }
{
    IGNORECASE = 1
    for (j = 1; j <= x; j++)
        for (i = 1; i <= NF; i++)
            if (length($i) > length(a[j]) && !($i in b) && $i ~ a[j] && $i != a[j]) {
                sub(a[j], " " a[j] " ", $i)
                $0 = $0
            }
}
{ $1 = toupper(substr($1, 1, 1)) substr($1, 2); print }
' lookup raw


Last edited by yinyuemi; 02-23-2011 at 10:04 PM..
# 14  
Old 02-23-2011
Y, did you try it against my posted test data? It's outputting "slow ly" again.

Gimley, this method is still pretty poor. I loaded an English dictionary into lookup (140K words) and ran a test against the ls manual with all spaces taken out. It was quick, but the result is pretty average:

Code:
SEE ALSO 
The full documentation for l si s maintained as aTe xi n fo manual . If 
thein fo and l sprog rams are properly installed at yours it e , the com ‐ 
man d

This test was useful because I found that IGNORECASE wasn't working properly (fix below):

Code:
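awk 'NR==FNR{a[$1]; next}            # first file: lookup words become array keys
 # lsr: longest prefix of c that is a lookup word, or "" if there is none
 function lsr(c,p) {
    for(p=length(c);p;p--)
           if(tolower(substr(c,1,p)) in a) break;
    if (p) return substr(c,1,p);
    return "";
 }
 {while(length) {
     s=lsr($0);
     while (!s && length) {          # no match: emit one character and retry
         printf "%s", substr($0,1,1);
         $0=substr($0,2);
         s=lsr($0);
         if (s) printf " ";
     }
     printf "%s ", s;
     $0=substr($0,length(s)+1)       # drop the matched word and carry on
  }
  printf "\n"; }' lookup raw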


Last edited by Chubler_XL; 02-23-2011 at 10:08 PM..
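
A quick way to see the IGNORECASE problem in isolation, as a minimal sketch: the `in` test is a plain array-subscript lookup, and gawk's IGNORECASE does not apply to subscripts (other awks treat IGNORECASE as an ordinary variable anyway). The word "the" and the mixed-case "The" are just sample values I picked.

```shell
result=$(awk 'BEGIN {
    IGNORECASE = 1          # affects regex and string comparisons in gawk, not subscripts
    a["the"]                # lookup entry stored in lower case
    print (("The" in a) ? "found" : "missing")            # subscript lookup stays case-sensitive
    print ((tolower("The") in a) ? "found" : "missing")   # lower-casing the candidate first works
}')
printf '%s\n' "$result"
```

This prints "missing" and then "found", which is why the fix lower-cases the candidate prefix before testing it. It still assumes the lookup file itself is in lower case.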