Sponsored Content
Top Forums Shell Programming and Scripting Splitting Concatenated Words With Largest Strings First Post 302511480 by Chubler_XL on Thursday 7th of April 2011 01:12:54 AM
Old 04-07-2011
lol, I remember mentioning this exact problem in this post nearly 2 months ago.

The best way to tackle this is to try all possible substrings and select solution with smallest number of residual characters. Let me try and throw something together.

---------- Post updated at 02:36 PM ---------- Previous update was at 01:52 PM ----------

OK have a solution but it's much slower as it has to try all possible combinations:

Code:
awk 'NR==FNR{a[$1]; next}
function clean(s) {
   split(cs(s),S,SUBSEP);
   return S[1]
}
function cs(s,i,p,r,b,bs,t) {
 b=9999;
 for(i=length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   t=cs(substr(s,i+1));
   split(t,V,SUBSEP);
   if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
  }
  return bs;
}
{ print clean($0) }' lookup raw

---------- Post updated at 03:12 PM ---------- Previous update was at 02:36 PM ----------

Couple of Performance improvement
- No need to check strings longer than longest word
- Skip if current mismatch is worse than best found so far

Code:
awk 'NR==FNR{a[$1]; m=m<length?length:m; next}
function clean(s) {
   split(cs(s),S,SUBSEP);
   return S[1]
}
function cs(s,i,p,r,b,bs,t) {
 b=9999;
 for(i=length(s)>m?m:length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1));
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
   }
  }
  return bs;
}
{ print clean($0) }'


Last edited by Chubler_XL; 04-07-2011 at 01:57 AM..
This User Gave Thanks to Chubler_XL For This Post:
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

splitting strings

Hi you, I have the following problem: I have a string like the followings: '166Mhz' or '128MB' or '300sec' or ... What I want to do is, I want to split the strings in a part with the numbers and a part with letters. Since the strings are not allway three digits and than text i couldn't do... (3 Replies)
Discussion started by: bensky
3 Replies

2. Programming

Splitting strings from file

Hi All I need help writing a Java program to split strings reading from a FILE and writing output into a FILE. e.g., My input is : International NNP Rockwell NNP Corp. NNP 's POS Tulsa NNP unit NN said VBDExpected output is: International I In Int Inte l al... (2 Replies)
Discussion started by: my_Perl
2 Replies

3. Shell Programming and Scripting

splitting words from a string

Hi, I have a string like this in a file, I want to retrive the words separated by comma's in 3 variables. like How do i get that.plz advice (2 Replies)
Discussion started by: suresh_kb211
2 Replies

4. Shell Programming and Scripting

Awk splitting words into files problem

Hi, I am trying to split the words having the delimiter as colon ';' in to separate files using awk. Here's my code. echo "f1;f2;f3" | awk '/;/{c=sprintf("%02d",++i); close("out" c)} {print > "out" c}' echo "f1;f2;f3" | awk -v i=0 '/;/{close("out"i); i++; next} {print > "out"i}' But... (4 Replies)
Discussion started by: royalibrahim
4 Replies

5. Shell Programming and Scripting

Splitting Concatenated Words in Input File with Words from a Master File

Hello, I have a complex problem. I have a file in which words have been joined together: Theboy ranslowly I want to be able to correctly split the words using a lookup file in which all the words occur: the boy ran slowly slow put child ly The lookup file which is meant for look up... (21 Replies)
Discussion started by: gimley
21 Replies

6. Shell Programming and Scripting

Splitting concatenated words in input file with words from the same file

Dear all, I am working with names and I have a large file of names in which some words are written together (upto 4 or 5) and their corresponding single forms are also present in the word-list. An example would make this clear annamarie mariechristine johnsmith johnjoseph smith john smith... (8 Replies)
Discussion started by: gimley
8 Replies

7. Shell Programming and Scripting

Print only lines where fields concatenated match strings

Hello everyone, Maybe somebody could help me with an awk script. I have this input (field separator is comma ","): 547894982,M|N|J,U|Q|P,98,101,0,1,1 234900027,M|N|J,U|Q|P,98,101,0,1,1 234900023,M|N|J,U|Q|P,98,54,3,1,1 234900028,M|H|J,S|Q|P,98,101,0,1,1 234900030,M|N|J,U|F|P,98,101,0,1,1... (2 Replies)
Discussion started by: Ophiuchus
2 Replies

8. Shell Programming and Scripting

awk Splitting strings

Hi All, There is a file with a data. If the line is longer than 'n', we splitting the line on the parts and print them. Each of the parts is less than or equal 'n'. For example: n = 2; "ABCDEFGHIJK" -> length 11 Results: "AB" "CD" EF" GH" "IJ" "K" Code, but there are some errors.... (9 Replies)
Discussion started by: booyaka
9 Replies

9. UNIX for Dummies Questions & Answers

Splitting strings

I have a file that has two columns. I first column is an identifier and the second is a column of strings. I want to split the characters in the second column into substrings of length 5. So if the first line of the file has a string of length 10, the output should have the identifier repeated 2... (3 Replies)
Discussion started by: verse123
3 Replies

10. UNIX for Dummies Questions & Answers

Splitting strings based on delimiter

i have a snippet from server log delimited by forward slash. /a/b/c/d/filename i need to cut until last delimiter. So desired output should look like: /a/b/c/d can you please help? Thanks in advance. (7 Replies)
Discussion started by: alpha_1
7 Replies
textutil::split(3tcl)				    Text and string utilities, macro processing 			     textutil::split(3tcl)

__________________________________________________________________________________________________________________________________________________

NAME
textutil::split - Procedures to split texts SYNOPSIS
package require Tcl 8.2 package require textutil::split ?0.7? ::textutil::split::splitn string ?len? ::textutil::split::splitx string ?regexp? _________________________________________________________________ DESCRIPTION
The package textutil::split provides commands that split strings by size and arbitrary regular expressions. The complete set of procedures is described below. ::textutil::split::splitn string ?len? This command splits the given string into chunks of len characters and returns a list containing these chunks. The argument len defaults to 1 if none is specified. A negative length is not allowed and will cause the command to throw an error. Providing an empty string as input is allowed, the command will then return an empty list. If the length of the string is not an entire multiple of the chunk length, then the last chunk in the generated list will be shorter than len. ::textutil::split::splitx string ?regexp? This command splits the string and return a list. The string is split according to the regular expression regexp instead of a simple list of chars. Note that if you parentheses are added into the regexp, the parentheses part of separator will be added into the result list as additional element. If the string is empty the result is the empty list, like for split. If regexp is empty the string is split at every character, like split does. The regular expression regexp defaults to "[\t \r\n]+". BUGS, IDEAS, FEEDBACK This document, and the package it describes, will undoubtedly contain bugs and other problems. Please report such in the category textutil of the Tcllib SF Trackers [http://sourceforge.net/tracker/?group_id=12883]. Please also report any ideas for enhancements you may have for either package and/or documentation. SEE ALSO
regexp(3tcl), split(3tcl), string(3tcl) KEYWORDS
regular expression, split, string CATEGORY
Text processing textutil 0.7 textutil::split(3tcl)
All times are GMT -4. The time now is 10:57 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy