Splitting Concatenated Words With Largest Strings First


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Splitting Concatenated Words With Largest Strings First
# 1  
Old 04-05-2011
Splitting Concatenated Words With Largest Strings First

hello,
I had posted earlier help for a script for splitting concatenated words . The script was supposed to read words from a master file and split concatenated words in the slave/input file.
Thanks to the help I got, the following script which works very well was posted. It detects residues by placing a ! before the residual element.
However the script does not take the largest string for splitting which leads to problems.
An example will help:
given that the master file has
Code:
narayan 
narayana 
prakash
aprak
ash

In the case of narayanaprakash, I get:
Code:
narayan, aprak and ash

instead of
Code:
narayana prakash.

How do I get the script to produce the second instead of the first?

Many thanks for all the earlier help and hope this problem of largest string first can be resolved:
Code:
#Util to split names which are conjoined
NR==FNR{a[$1]; next}
function lsr(c,p) {
    for(p=length(c);p;p--)
        if(tolower(substr(c,1,p)) in a) break;
    if (p) return substr(c,1,p);
    return "";
}
{while(length) {
    s=lsr($0);
    if (!s) printf "!";
    while (!s && length) {
        printf substr($0,1,1);
        $0=substr($0,2);
        s=lsr($0);
        if (s) printf "! ";
    }
    printf "%s ", s;
    $0=substr($0,length(s)+1)
}
printf "\n"; }


Last edited by fpmurphy; 04-05-2011 at 10:56 PM.. Reason: Code tags please!
# 2  
Old 04-05-2011
It's a lot easier to read code and sample data if they're surrounded by code tags.
# 3  
Old 04-05-2011
Sorry am a newbie. Will try and post the tags next time.
# 4  
Old 04-06-2011
Seems to be working fine for me:

Code:
$ cat lookup
narayan 
narayana 
prakash
aprak
ash
 
$ cat raw
narayanaprakash
 
$ awk 'NR==FNR{a[$1]; next}
>  function lsr(c,p) {
>     for(p=length(c);p;p--)
>            if(tolower(substr(c,1,p)) in a) break;
>         if (p) return substr(c,1,p);
>         return "";
>  }
>  {while(length) {
>      s=lsr($0);
>          if(!s) printf "!";
>          while (!s && length) {
>            printf substr($0,1,1);
>            $0=substr($0,2);
>        s=lsr($0);
>            if (s) printf "! ";
>          }
>          printf "%s ", s;
>          $0=substr($0,length(s)+1)
>   }
>   printf "\n" }' lookup raw
narayana prakash

This User Gave Thanks to Chubler_XL For This Post:
# 5  
Old 04-07-2011
Hello,
It seems to be working fine. Can I test it thoroughly out and get back to you.
Many thanks.

---------- Post updated at 10:24 PM ---------- Previous update was at 11:09 AM ----------

Hello,
I tested the script and it works fine. I know I asked for the largest string extraction but this proves troublesome at times since the valid words within the master dictionary are eaten up and treated as residue. An example will make this clear. Imagine that the master has the following:
hassam
alikhan
ali
khan
khana

and the slave is made up of a single word to be split: hassamalikhan
Technically with the present script I get hassam alikhan !a
What I need (after checking carefully the data) is an output such as: hassam ali khana
I am sorry the goof up was mine. Instead of asking for a large string search, I should have asked for a valid string search.
Help in the script to get valid strings and split accordingly would be most appreciated.
Many thanks.
# 6  
Old 04-07-2011
lol, I remember mentioning this exact problem in this post nearly 2 months ago.

The best way to tackle this is to try all possible substrings and select solution with smallest number of residual characters. Let me try and throw something together.

---------- Post updated at 02:36 PM ---------- Previous update was at 01:52 PM ----------

OK have a solution but it's much slower as it has to try all possible combinations:

Code:
awk 'NR==FNR{a[$1]; next}
function clean(s) {
   split(cs(s),S,SUBSEP);
   return S[1]
}
function cs(s,i,p,r,b,bs,t) {
 b=9999;
 for(i=length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   t=cs(substr(s,i+1));
   split(t,V,SUBSEP);
   if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
  }
  return bs;
}
{ print clean($0) }' lookup raw

---------- Post updated at 03:12 PM ---------- Previous update was at 02:36 PM ----------

Couple of Performance improvement
- No need to check strings longer than longest word
- Skip if current mismatch is worse than best found so far

Code:
awk 'NR==FNR{a[$1]; m=m<length?length:m; next}
function clean(s) {
   split(cs(s),S,SUBSEP);
   return S[1]
}
function cs(s,i,p,r,b,bs,t) {
 b=9999;
 for(i=length(s)>m?m:length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1));
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
   }
  }
  return bs;
}
{ print clean($0) }'


Last edited by Chubler_XL; 04-07-2011 at 01:57 AM..
This User Gave Thanks to Chubler_XL For This Post:
# 7  
Old 04-07-2011
Many thanks.
Many thanks. It works pretty well. Sorry could not respond earlier. Am in review meetings the whole day and the only free time-slot is late night (now) or early morning (when I posted the message. Will be free this weekend. But will have tested the code by then.
Many thanks once again for all your hard efforts.

---------- Post updated at 11:37 AM ---------- Previous update was at 11:17 AM ----------

Have tested it on a dictionary (master of around 108527 sorted words) and gave it an input file of around (20 words. It is slow. The first split is spewed out fast but the succeeding ones just don't come out.
I know that it has to test a lot of possible combos and hence the slow speed. Any method of speeding up the script.
Many thanks once more for all your help.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Splitting strings based on delimiter

i have a snippet from server log delimited by forward slash. /a/b/c/d/filename i need to cut until last delimiter. So desired output should look like: /a/b/c/d can you please help? Thanks in advance. (7 Replies)
Discussion started by: alpha_1
7 Replies

2. UNIX for Dummies Questions & Answers

Splitting strings

I have a file that has two columns. I first column is an identifier and the second is a column of strings. I want to split the characters in the second column into substrings of length 5. So if the first line of the file has a string of length 10, the output should have the identifier repeated 2... (3 Replies)
Discussion started by: verse123
3 Replies

3. Shell Programming and Scripting

awk Splitting strings

Hi All, There is a file with a data. If the line is longer than 'n', we splitting the line on the parts and print them. Each of the parts is less than or equal 'n'. For example: n = 2; "ABCDEFGHIJK" -> length 11 Results: "AB" "CD" EF" GH" "IJ" "K" Code, but there are some errors.... (9 Replies)
Discussion started by: booyaka
9 Replies

4. Shell Programming and Scripting

Print only lines where fields concatenated match strings

Hello everyone, Maybe somebody could help me with an awk script. I have this input (field separator is comma ","): 547894982,M|N|J,U|Q|P,98,101,0,1,1 234900027,M|N|J,U|Q|P,98,101,0,1,1 234900023,M|N|J,U|Q|P,98,54,3,1,1 234900028,M|H|J,S|Q|P,98,101,0,1,1 234900030,M|N|J,U|F|P,98,101,0,1,1... (2 Replies)
Discussion started by: Ophiuchus
2 Replies

5. Shell Programming and Scripting

Splitting concatenated words in input file with words from the same file

Dear all, I am working with names and I have a large file of names in which some words are written together (upto 4 or 5) and their corresponding single forms are also present in the word-list. An example would make this clear annamarie mariechristine johnsmith johnjoseph smith john smith... (8 Replies)
Discussion started by: gimley
8 Replies

6. Shell Programming and Scripting

Splitting Concatenated Words in Input File with Words from a Master File

Hello, I have a complex problem. I have a file in which words have been joined together: Theboy ranslowly I want to be able to correctly split the words using a lookup file in which all the words occur: the boy ran slowly slow put child ly The lookup file which is meant for look up... (21 Replies)
Discussion started by: gimley
21 Replies

7. Shell Programming and Scripting

Awk splitting words into files problem

Hi, I am trying to split the words having the delimiter as colon ';' in to separate files using awk. Here's my code. echo "f1;f2;f3" | awk '/;/{c=sprintf("%02d",++i); close("out" c)} {print > "out" c}' echo "f1;f2;f3" | awk -v i=0 '/;/{close("out"i); i++; next} {print > "out"i}' But... (4 Replies)
Discussion started by: royalibrahim
4 Replies

8. Shell Programming and Scripting

splitting words from a string

Hi, I have a string like this in a file, I want to retrive the words separated by comma's in 3 variables. like How do i get that.plz advice (2 Replies)
Discussion started by: suresh_kb211
2 Replies

9. Programming

Splitting strings from file

Hi All I need help writing a Java program to split strings reading from a FILE and writing output into a FILE. e.g., My input is : International NNP Rockwell NNP Corp. NNP 's POS Tulsa NNP unit NN said VBDExpected output is: International I In Int Inte l al... (2 Replies)
Discussion started by: my_Perl
2 Replies

10. UNIX for Dummies Questions & Answers

splitting strings

Hi you, I have the following problem: I have a string like the followings: '166Mhz' or '128MB' or '300sec' or ... What I want to do is, I want to split the strings in a part with the numbers and a part with letters. Since the strings are not allway three digits and than text i couldn't do... (3 Replies)
Discussion started by: bensky
3 Replies
Login or Register to Ask a Question