Splitting concatenated words in input file with words from the same file


 
# 1  
Old 05-02-2012

Dear all,
I am working with names, and I have a large file of names in which some words are written together (up to 4 or 5 in a row). Their corresponding single forms are also present in the same word list.
An example should make this clear:
Code:
annamarie
mariechristine
johnsmith
johnjosephsmith
john
smith
anna
marie
mary
christine

The program should split the concatenated words in the list, using the single forms that are present in the same list. Thus (input word followed by the desired output):
Code:
annamarie anna marie
mariechristine marie christine
johnsmith john smith
johnjosephsmith

In the last case, since joseph is missing from the list, the program could tag the missing element and show the word as
Code:
john !joseph! smith

The script would prove especially helpful for separating words in languages such as German, which have a large number of compound words.
Could the awk script posted on this site (thanks to yinyuemi), which I am posting below, be modified to work within the same file instead of referring to an external dictionary? It does something similar, but takes its words from an external dictionary. I have tried to modify it, but it just does not work.
Code:
awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw

Any help given would be gratefully acknowledged.

Last edited by Scrutinizer; 05-03-2012 at 01:55 AM.. Reason: code tags instead of quote tags
# 2  
Old 05-03-2012
How about this:

Code:
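awk '
# Pass 1 (first infile): store every word in order (a) and as a lookup key (b)
NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
# Pass 2 (second infile): put spaces around each known word found inside a longer one
NR>FNR{
  IGNORECASE=1   # gawk extension; other awks treat it as an ordinary variable
  for(j=1;j<=x;j++){
      for(i=1;i<=NF;i++) {
          if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
             gsub(a[j]," "a[j]" ",$0)
      }
  }
  # Print known pieces unchanged and tag unknown residue as !...!
  for(i=1;i<=NF;i++)
      printf "%s%s", (i>1?" ":""), (($i in b)?$i:"!"$i"!")
  print ""
}' infile infile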
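
In case it is unclear why infile is listed twice: awk reads the file once per mention, and NR==FNR holds only during the first reading, so the first pass can load the word list and the second pass can do the splitting. A minimal sketch of that idiom, where demo.txt and passes.txt are made-up file names used only for this illustration:

```shell
# demo.txt is a hypothetical two-word file standing in for the real word list
printf 'john\nsmith\n' > demo.txt

# Pass 1 (NR==FNR): count the lines. Pass 2: print each line with the total.
awk 'NR==FNR { n++; next } { print FNR " of " n ": " $0 }' demo.txt demo.txt > passes.txt
cat passes.txt
```

This should print "1 of 2: john" and "2 of 2: smith": the total is only known in the second pass because the first pass has already read the whole file.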

# 3  
Old 05-03-2012
Many thanks. I copied the script and ran it on the file which I had proposed as a sample, but I got no results.
Have I done something wrong? I am on Windows and maybe this is the cause, but awk/gawk should run in any environment.
It is tantalising to see a solution and not be able to use it.
Many thanks once more for your kind help.
# 4  
Old 05-03-2012
Make sure you copy the solution exactly as it appears (including the file name on the end of the line twice):

Output:
Code:
anna marie
marie christine
john smith
john !joseph! smith
john
smith
anna
marie
mary
christine

# 5  
Old 05-03-2012
Many thanks for taking the trouble to help me out. I copied the program as it is, following the instructions you had given:
Quote:
Make sure you copy solution exactly as it appears (including the file name on the end of the line twice)
When I retain the closing single quote ('), I get the following error message:

Code:
gawk: singlesplit.awk:13: }' infile infile
gawk: singlesplit.awk:13:  ^ Invalid char ''' in expression

When I remove the single quote ('), I get no response: no output appears on the screen.
Is there a problem in how I copied the code? This is what I copied and ran:

Code:
NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{
  IGNORECASE=1;
  for(j=1;j<=x;j++){
      for(i=1;i<=NF;i++) {
          if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
             gsub(a[j]," "a[j]" ",$0)
          }
      }
      for(i=1;i<=NF;i++)
         printf (i>1?" ":"") (($i in b)?$i:"!"$i"!")
         print ""
}' infile infile

Sorry to hassle you like this, but I am really desperate to get the solution.
Many thanks once again for your patience and kindness.
# 6  
Old 05-03-2012
OK, I think I know what is going on: you are putting the awk code in a file and then calling it with the awk -f progfile option.

Remove ' infile infile from the end of your singlesplit.awk program file and call awk like this:

Code:
awk -f singlesplit.awk infile infile
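
For anyone following along, here is a sketch of the whole sequence under that assumption: the program file holds only the awk code, with no surrounding quotes and no file names, and the file names are given on the command line. The sample data is the list from post #1 with johnjosephsmith written as one word; outfile is a made-up name for the result.

```shell
cd "$(mktemp -d)" || exit 1     # scratch directory, so no existing files are overwritten

# Sample word list from post #1
cat > infile <<'EOF'
annamarie
mariechristine
johnsmith
johnjosephsmith
john
smith
anna
marie
mary
christine
EOF

# Program file: awk code only, no quotes around it and no file names after it
cat > singlesplit.awk <<'EOF'
NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{
  for(j=1;j<=x;j++){
      for(i=1;i<=NF;i++) {
          if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
             gsub(a[j]," "a[j]" ",$0)
      }
  }
  for(i=1;i<=NF;i++)
      printf "%s%s", (i>1?" ":""), (($i in b)?$i:"!"$i"!")
  print ""
}
EOF

# The file name is given twice: once for the loading pass, once for the splitting pass
awk -f singlesplit.awk infile infile > outfile
cat outfile
```

The fourth line of the result should read john !joseph! smith, as in the output shown in post #4.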

# 7  
Old 05-04-2012
Many thanks. You made my day. The script works. I should have thought of removing the infile infile and giving the file names at the command prompt.
Many thanks once again for all your kind help and your patience.

---------- Post updated at 10:05 PM ---------- Previous update was at 07:42 PM ----------

Sorry to sound ungrateful. The script works, but my file is around 300 thousand words and the script is very slow.
Is there any means of speeding it up, such as an array or some similar device? Many thanks for all your help, and sorry to pester you like this.
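
One possible direction, offered as a sketch rather than a tested replacement: the script from post #2 tries every word of the file against every line, so the work grows roughly with the square of the file size. Keeping the words as keys of an awk array and looking up substrings of each word avoids that pass over the whole list. The program below does this with a greedy longest-match scan from the left. The names fastsplit.awk, split_word, dict and fastout are made up for illustration, matching is case-sensitive, and the results can differ from post #2 on inputs where the greedy choice picks a different split.

```shell
cd "$(mktemp -d)" || exit 1     # scratch directory, so no existing files are overwritten

# Sample word list from post #1
printf '%s\n' annamarie mariechristine johnsmith johnjosephsmith \
              john smith anna marie mary christine > infile

cat > fastsplit.awk <<'EOF'
# Pass 1: remember every word of the file as an array key
NR==FNR { dict[$1]=1; next }

# Pass 2: split each field with array lookups instead of scanning the whole list
{
  line = ""
  for (f=1; f<=NF; f++)
    line = line (f>1 ? " " : "") split_word($f)
  print line
}

# Greedy left-to-right split of w into dictionary words; residue becomes !...!
function split_word(w,    n, p, l, out, unk, hits, piece) {
  n = length(w); p = 1; out = ""; unk = ""; hits = 0
  while (p <= n) {
    # Longest dictionary word starting at position p (the whole word is skipped)
    for (l = n - p + 1; l >= 1; l--) {
      if (l == n) continue
      piece = substr(w, p, l)
      if (piece in dict) break
    }
    if (l >= 1) {                 # found a known piece
      if (unk != "") { out = out " !" unk "!"; unk = "" }
      out = out " " piece; hits++; p += l
    } else {                      # no match here: collect the character as residue
      unk = unk substr(w, p, 1); p++
    }
  }
  if (hits == 0) return w         # nothing recognised: leave the word unchanged
  if (unk != "") out = out " !" unk "!"
  return substr(out, 2)           # drop the leading space
}
EOF

awk -f fastsplit.awk infile infile > fastout
cat fastout
```

On the ten-word sample this should give the same lines as the output in post #4. Each word now costs a number of lookups that depends only on its own length, not on the size of the file, which is where the saving on a 300 thousand word list would come from.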