Merging words split into characters with awk


 
# 1  
Old 02-12-2011

I have OCR output in which some words are split into single characters separated by blank spaces,
and I want the same text with these words written correctly.

Example:
This is a text w i t h some s p l i t e d W o r d s .

The regular expression for matching split words could be something like this (I'm not too worried about that part):
Code:
grep -E "([A-Z])?( [a-z]){2,100} [.,]?"

My question is:
Once I've matched the string, how can I delete these annoying blank spaces?
I tried the awk gsub and gensub functions, but I'm not very good with them.

This is as far as I got:
Code:
awk '{ S=gensub(" ([a-z]) ([a-z]) ([a-z]) ", " \\1\\2\\3 ", "g", $0); print S}'

That is not sufficient at all: the number of split characters varies (it is not always 3!)

What is the right way to do this? Any help? Thanks
# 2  
Old 02-12-2011
This may get you started:
Code:
awk '
{
  while(match($0,/ [^ ] [^ ]( [^ ])+ /)>0) {
    x = substr($0, RSTART, RLENGTH)
    gsub(/ /, "", x)
    $0 = substr($0,1,RSTART) x substr($0,RSTART+RLENGTH-1)
  }
  print
}'
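
A possible variation on the loop above, offered as a sketch rather than a definitive fix: the pattern needs a space on both sides of a run, so a run that touches the start or end of the line is either missed or leaves a stray space before the final punctuation. Padding the line with one space on each side before the loop, and stripping the padding afterwards, is one way around that. The function name merge_split and the variables line and run are made up for this illustration.

```shell
# merge_split: hypothetical helper, reads text on stdin and writes the merged text.
merge_split() {
  awk '
  {
    line = " " $0 " "                 # pad so runs at either end of the line can match
    # A run is three or more single non-space characters, each preceded by a space.
    while (match(line, / [^ ] [^ ]( [^ ])+ /) > 0) {
      run = substr(line, RSTART, RLENGTH)
      gsub(/ /, "", run)              # drop every space inside the run
      # keep the space before the run and the space after it
      line = substr(line, 1, RSTART) run substr(line, RSTART + RLENGTH - 1)
    }
    print substr(line, 2, length(line) - 2)   # strip the padding again
  }'
}

echo "This is a text w i t h some s p l i t e d W o r d s ." | merge_split
```

On the example line this prints "This is a text with some splitedWords." Like the original loop, it still joins two adjacent split words into one, and it leaves runs of fewer than three characters alone.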

This User Gave Thanks to binlib For This Post:
# 3  
Old 02-13-2011
dokamo,

Working from your input example, the best solution I have so far is split into 4 sed parts for easier understanding; you can try the "echo" followed by one sed command at a time to see what each one does.

The problem is when a split word is followed by another split word: in that case both words appear joined in the output.

If this is close to what you want, you only need to combine the 4 sed parts into a single sed command.

Code:
echo " This is a text w i t h some s p l i t e d W o r d s ." | 
sed 's/\([a-z][a-z]?*\)\( \)/\1|/g' | 
sed 's/\([a-z]\)\( \)\([a-z][a-z]\)/\1|\3/g' | 
sed 's/ //g' | 
sed 's/|/ /g'
This is a text with some splitedWords.
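
For reference, the four parts can be chained in one sed call with -e, since sed applies the expressions to each line in order. This is only a sketch of that combination: the expressions are copied unchanged from the pipeline above, and the function name join_split_sed is made up for the example. Note that any | already present in the input would be turned into a space by the last expression.

```shell
# join_split_sed: hypothetical wrapper around the four substitutions, reads stdin.
#   1. mark the space after any two adjacent lowercase letters with |
#   2. mark the space between a single letter and the start of a normal word with |
#   3. delete every remaining space (these sit inside split words)
#   4. turn the | markers back into spaces
join_split_sed() {
  sed -e 's/\([a-z][a-z]?*\)\( \)/\1|/g' \
      -e 's/\([a-z]\)\( \)\([a-z][a-z]\)/\1|\3/g' \
      -e 's/ //g' \
      -e 's/|/ /g'
}

echo " This is a text w i t h some s p l i t e d W o r d s ." | join_split_sed
```

This prints the same line as the four-stage pipeline: "This is a text with some splitedWords."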

Hope it helps.

Regards
This User Gave Thanks to cgkmal For This Post:
# 4  
Old 02-13-2011
Code:
echo "This is a text w i t h some s p l i t e d W o r d s ." |
awk '{a[NR]=$1;b[NR]=length($1)}
 
 END{
       for(i=1;i<=NR;i++) 
       {
         if(b[i]>1) {printf "%s ", a[i]} 
         else if (b[i]==1 && a[i]~/[aA]/ && b[i-1]>1 && b[i+1]>1) {printf "%s ", a[i]}  
         else if (b[i]==1 && b[i-1]>1 && b[i+1]==1) {printf " %s", a[i]} 
         else if (b[i]==1 && b[i-1]==1 && b[i+1]>1){printf "%s ", a[i]} 
         else {printf "%s", a[i]}
        }
      }' RS=" " |
tr -s " "
This is a text with some splitedWords.

This User Gave Thanks to yinyuemi For This Post:
# 5  
Old 02-13-2011
Oh, yes!
I will adjust the script to match as many cases as possible and then post it here.
Thank you all!
# 6  
Old 03-07-2011
Ok, this script matches my case:

Code:
awk '
BEGIN{
	q="\047";   # q holds a single-quote character (octal 047)
	o=""
}
{
	for(i=1;i<=NF;i++)
	{
		f=$i;fl=length($i)
		nf=$(i+1);nfl=length($(i+1))
		if (nfl==0) {o=o f}
		else if(fl>1) {o=o f" "} 
		else if ((fl==1) && (nfl>1)) {o=o f" "}
		else if ((f~/[[:lower:]]/) && (nf~/[[:upper:]]/ || nf~/[[:digit:]]/ || nf~/[\(]/) ) {o=o f" "} 
		else if ((f~/[[:lower:]]/) && (nf~/[[:digit:]]/ ) ) {o=o f" "} 
		else if ((f~/[[:digit:]]/) && (nf~/[[:alpha:]]/)) {o=o f" "}
		else if ((f~/[[:punct:]]/) && (f!~/[\(\-]/ && f!=q && nf!~/[[:punct:]]/)) {o=o f" "}
		else {o=o f}
	} 
	o=o RS
	printf "%s", o > "output.txt"
	o=""
}
END{

}' input.txt

Of course, this can't keep two consecutive split words apart when both are lower case (or both upper case).
Fortunately, my OCR text is full of punctuation and capitalization, so the result was good enough for my purposes.