Splitting Concatenated Words With Largest Strings First


 
# 8  
Old 04-07-2011
Here are some more performance tweaks (it can still be slow if there are lots of residual characters in the string):

Code:
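awk '# First file (lookup): store each word in a[] and remember the longest length in m.
NR==FNR{a[$1]; m=m<length?length:m; next}
# clean(s): return the best split of s, dropping the residue count.
function clean(s) {
   split(cs(s,9999),S,SUBSEP);
   return S[1]
}
# cs(s,b): look for a split of s with fewer than b residual (unmatched) characters.
# Returns "split text" SUBSEP residue_count. i,p,r,bs,t are locals.
function cs(s,b,i,p,r,bs,t) {
 # Try prefixes from the longest (capped at m) down to one character;
 # stop early once a split with zero residue has been found (b == 0).
 for(i=length(s)>m?m:length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;            # unknown prefix: all i characters count as residue
   if(r<b) {
     t=cs(substr(s,i+1),b-r);    # split the remainder with the leftover budget
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
   }
  }
  if (bs == "") return s SUBSEP length(s);   # no split beat the budget: all of s is residue
  return bs;
}
# Second file (raw): print the split of every line.
{ print clean($0) }' lookup raw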


Last edited by Chubler_XL; 04-07-2011 at 10:26 PM..
# 9  
Old 04-07-2011
Hello,
Many thanks for the tweak but all the awk programs which I have signal an error in execution:

Code:
C:\Users\XP-HOME\Desktop>gawk -f splitternew.gk en.dic slave
gawk: splitternew.gk:7:  for(i=length(s)>m?m:length(s);b&&amp;i;i--) {
gawk: splitternew.gk:7:                                          ^ parse error

The ; is consistently treated as a parse error.
I have been testing the earlier script quite intensively and have found that the splitter slows down only when
a. the strings are long and
b. an unknown residue is located in the string to be parsed.
Many thanks for all the help.

Last edited by Franklin52; 04-08-2011 at 03:39 AM.. Reason: Please use code tags
# 10  
Old 04-07-2011
Quote:
Originally Posted by gimley
Hello,
Many thanks for the tweak but all the awk programs which I have signal an error in execution:
Sorry, the forum is expanding & to &amp; for some reason. I fixed the original post by putting spaces around the &&.
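
If a saved copy of the script has already been mangled this way, one possible repair is to decode the entities before running it. This is only a sketch, assuming the sole damage is & turned into &amp; (possibly several times over); fix_amp is a hypothetical helper name, not something from the original post.

```shell
# Hypothetical helper: collapse "&amp;", "&amp;amp;", ... back to a plain "&".
# The sed loop (:a ... ta) repeats the substitution until nothing changes.
fix_amp() {
  sed -e ':a' -e 's/&amp;/\&/g' -e 'ta'
}

# Example: repair the broken line from the error message above.
printf '%s\n' 'for(i=length(s)>m?m:length(s);b&&amp;i;i--) {' | fix_amp
```

To repair a whole file, something like fix_amp < splitternew.gk > fixed.gk should do, provided the script never needs a literal &amp; of its own.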
# 11  
Old 04-08-2011
Wow, it works and is blazing fast. Many thanks. Will get back to you in case of a bug, but I doubt if there is any.

---------- Post updated at 10:03 PM ---------- Previous update was at 08:36 PM ----------

Hello,
I tested the script. Here is the speed: 22,000 words split in around half a minute, and the result is pretty accurate.
Code:
C:\Users\XP-HOME\Desktop>time
The current time is:  8:29:27.99
C:\Users\XP-HOME\Desktop>gawk -f splitternew.gk en.lng hyd 1>telu.txt
C:\Users\XP-HOME\Desktop>time
The current time is:  8:29:55.85
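
The two timestamps above are a little under 28 seconds apart. A small sketch of the subtraction, assuming both times fall on the same day and use the H:MM:SS.hh layout shown; elapsed is a hypothetical helper name.

```shell
# Hypothetical helper: seconds between two H:MM:SS.hh timestamps on the same day.
elapsed() {
  awk -v start="$1" -v end="$2" 'BEGIN {
    split(start, s, ":"); split(end, e, ":")
    # Convert both times to seconds since midnight and subtract.
    printf "%.2f\n", (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
  }'
}

elapsed 8:29:27.99 8:29:55.85
```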

Many thanks. One last request: I tried flagging the residues with a !> but there is no marker for this. I would appreciate it if the residues were flagged.
Many thanks once more,

Last edited by Franklin52; 04-08-2011 at 03:40 AM.. Reason: Please use code tags
# 12  
Old 04-08-2011
Flagging residues:

Code:
awk 'NR==FNR{a[$1]; m=m<length?length:m; next}
function clean(s) {
   split(cs(s,9999),S,SUBSEP);
   return S[1]
}
function cs(s,b,i,p,r,bs,t) {
 for(i=length(s)>m?m:length(s);b&&i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   gsub("[-0-9=,():_?\\.]", "",p);
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1),b-r);
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs= (r?"!":"") substr(s,1,i) (r?"!":"") " "V[1] SUBSEP b }
   }
  }
  if (bs == "") return (s?"!"s"!":"") SUBSEP length(s);
  return bs;
}
{ print clean($0) }' lookup raw
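
A quick way to see the flags is to run the script on a tiny input. The sketch below wraps the script from this post in a shell function; split_flagged, the temporary files and the sample words are made up for illustration, and length is written out as length($0), which should behave the same.

```shell
# Hypothetical wrapper around the flagging script. Usage: split_flagged LOOKUP RAW
split_flagged() {
  awk 'NR==FNR{a[$1]; m=m<length($0)?length($0):m; next}
function clean(s) {
   split(cs(s,9999),S,SUBSEP);
   return S[1]
}
function cs(s,b,i,p,r,bs,t) {
 for(i=length(s)>m?m:length(s);b&&i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   gsub("[-0-9=,():_?\\.]", "",p);
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1),b-r);
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs= (r?"!":"") substr(s,1,i) (r?"!":"") " "V[1] SUBSEP b }
   }
  }
  if (bs == "") return (s?"!"s"!":"") SUBSEP length(s);
  return bs;
}
{ print clean($0) }' "$1" "$2"
}

# Sample data (made up): two known words and one raw string with a stray "x".
tmp=$(mktemp -d)
printf '%s\n' the boy > "$tmp/lookup"
printf '%s\n' thexboy > "$tmp/raw"
split_flagged "$tmp/lookup" "$tmp/raw"
rm -r "$tmp"
```

With this input the stray x should come out wrapped in ! marks between the two known words. Note that the script leaves a trailing space at the end of each output line.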

# 13  
Old 04-08-2011
Hello,
Many thanks. Sorry for the delay, but I was testing the script extensively. The script works just fine and the output is perfect, especially when the residue is flagged with a !.
Speedwise it is really blazing fast and can handle all types of long strings.
Many thanks once more for all your help.
# 14  
Old 04-08-2011
This problem can be ambiguous. For example, if

you have some valid patterns like:
Code:
ana
anak
ali
kali

And you must parse

Code:
anakali

you will not be able to determine whether it should be

Code:
ana!kali!

or
Code:
anak!ali!

...and in both cases the matching is exact (0 residue)
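
To check whether a given string is ambiguous in this sense, one could list every exact segmentation instead of only the first one found. This is a sketch and not the script from the earlier posts; all_splits, dict and walk are hypothetical names, and residues are not handled at all.

```shell
# Hypothetical helper: print every exact split of each raw line into lookup words.
# Usage: all_splits LOOKUP RAW
all_splits() {
  awk 'NR==FNR { dict[$1]; next }
  # walk(s, done): s is the part still to split, done holds the words chosen so far.
  function walk(s, done,    i, w) {
    if (s == "") { print substr(done, 2); return }   # drop the leading space
    for (i = length(s); i >= 1; i--) {                # longest prefix first
      w = substr(s, 1, i)
      if (w in dict) walk(substr(s, i + 1), done " " w)
    }
  }
  { walk($0, "") }' "$1" "$2"
}

# The example from this post.
tmp=$(mktemp -d)
printf '%s\n' ana anak ali kali > "$tmp/lookup"
printf '%s\n' anakali > "$tmp/raw"
all_splits "$tmp/lookup" "$tmp/raw"
rm -r "$tmp"
```

More than one output line for a raw string means the split is ambiguous. Because prefixes are tried longest first, the first line printed corresponds to the largest-string-first choice.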

So inconsistencies may theoretically occur, however the chosen parsing rules