OCR text that needs cleaning


 
# 1  
Old 09-22-2016
OCR text that needs cleaning

Hi,

I have OCR'ed text that needs cleaning.
Each line contains a part-of-speech (POS) marker, for example
adj. OR s. f. OR s. m. etc.
I need all text before the POS to be uppercase,
but all text within parentheses to be lowercase.
Text after (and including) the POS should remain as is.

filename: munge
Code:
fuiASSO, FIEIASSO (b.), fuluasso (a. l.), foulhasso (for.), (b. lat. folzîacia), s. f. grosse feuille,
FUMFULHUT  (l.), felhut (g.), FOULhuolhut, (it.) FOGLIUTO, adj. Feuillu, ue, v. uiaru, pampous,
FUIEMT, fuiret  (rh.), fulheiret, ramoner  (l.), fulhoret (rouerg.), s. m. Feuilleret, petit rabot qui sert faire des feuillures.
FULmjnacioun, FULMINACIEN  (m.), fulminacieu  (l.),  (rom. lat. fulminatzo, cat. fulminaciô, esp. fulminacion, it. fwlminasione), s. f. Fulmination, v. trounado.
FULMINANT, ANTO  (port. fulminante), adj. Fulminant, ante, v. trounant. R. fulmana.

I have uppercased everything before the POS with

Code:
sed -r -i -f doup.sed munge

doup.sed
Code:
s/ n. de l. /^ n. de l. /
s/ s. m. /^ s. m. /
s/ s. f. /^ s. f. /
s/ adj. /^ adj. /
s/ n. p. /^ n. p. /
s/ v. n. /^ v. n. /
s/ v. a. /^ v. a. /
s/ adv. /^ adv. /
s/^(.*)\^/\U\1\E/

and tried to lowercase between the parentheses with

Code:
sed -r -i 's/\((.*)\)/\L&/g' munge

but this keeps the uppercasing only up to the first opening parenthesis and lowercases everything from there up to the POS, like:

Code:
FUIASSO, FIEIASSO (b.), fuluasso (a. l.), foulhasso (for.), (b. lat. folzîacia), s. f. grosse feuille,
etc
etc

Any GNU sed 4.2.2 or GAWK 4.1.3 solutions, please.
Thanks in advance.


Moderator's Comments:
Please use CODE tags as required by forum rules!

Last edited by RudiC; 09-22-2016 at 04:54 AM. Reason: Added CODE tags.
# 2  
Old 09-22-2016
Hello safran,

Could you please try the following and let me know if it helps?
Code:
awk '{match($0,/.*s\. f\./);if(substr($0,RSTART,RLENGTH)){print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)};match($0,/.*s\. m\./);if(substr($0,RSTART,RLENGTH)){print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)};match($0,/.*adj\./);if(substr($0,RSTART,RLENGTH)){print toupper(substr($0,RSTART,RLENGTH-4)) substr($0,RLENGTH-3)};}'  Input_file
OR a non-one-liner form of the above solution:
awk '{
      # "s. f." and "s. m." are 5 characters long, "adj." is 4:
      # uppercase what precedes the POS, then print the rest starting at the POS.
      match($0,/.*s\. f\./);
      if(substr($0,RSTART,RLENGTH)) {
          print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)
      };
      match($0,/.*s\. m\./);
      if(substr($0,RSTART,RLENGTH)) {
          print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)
      };
      match($0,/.*adj\./);
      if(substr($0,RSTART,RLENGTH)) {
          print toupper(substr($0,RSTART,RLENGTH-4)) substr($0,RLENGTH-3)
      };
     }
    '  Input_file

In case you need everything up to and including the POS in upper case, the following may help:
Code:
awk '{match($0,/.*s\. f\.|.*adj\.|.*s\. m\./);print toupper(substr($0,RSTART,RLENGTH)) substr($0,RLENGTH+1)}'  Input_file

NOTE: I am trying to do it with a function, will post when able to do so.

Thanks,
R. Singh
# 3  
Old 09-22-2016
OCR text that needs cleaning - reply

Hi,

Thanks for the quick response, but your AWK one-liners just uppercase everything before the POS.
I'm already doing this uppercasing when I run doup.sed.
The code I'm stuck on is the lowercasing of everything within the parentheses before the POS.

Thanks
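
For an awk route that covers both steps in one pass, here is a minimal sketch. It assumes the POS of interest is the leftmost marker on the line and that it is followed by a period and a space. The function name `clean` and the short POS alternation are illustrative only, not something posted in the thread.

```shell
# Hypothetical helper: uppercase the text before the POS, then put every
# parenthesised group in that part back into lowercase.
clean() {
  awk '
  {
    # Locate the leftmost " POS. " marker; extend the alternation as needed.
    if (match($0, / (n\. de l|n\. p|s\. [mf]|ad[jv]|v\. [na])\. /)) {
      head = toupper(substr($0, 1, RSTART - 1))   # text before the POS
      tail = substr($0, RSTART)                   # POS and everything after it
      out = ""
      # Lowercase each (...) group in the head, left to right.
      while (match(head, /\([^)]*\)/)) {
        out = out substr(head, 1, RSTART - 1) tolower(substr(head, RSTART, RLENGTH))
        head = substr(head, RSTART + RLENGTH)
      }
      print out head tail
    } else {
      print   # no POS found: leave the line untouched
    }
  }' "$@"
}
```

It could be run as `clean munge > munge.clean`, or fed from a pipe. Unlike a greedy sed rule, it splits at the first POS on the line, which may or may not be what the data needs.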
# 4  
Old 09-22-2016
Try
Code:
s/^(.*) (n. de l|s. m|s. f|adj|n. p|v. n|v. a|adv)/\U\1\E \2/
s/\([^)]*\)/\L&/g

for doup.sed to result in

Code:
FUIASSO, FIEIASSO (b.), FULUASSO (a. l.), FOULHASSO (for.), (b. lat. folzîacia), s. f. grosse feuille,
FUMFULHUT  (l.), FELHUT (g.), FOULHUOLHUT, (it.) FOGLIUTO, adj. Feuillu, ue, v. uiaru, pampous,
FUIEMT, FUIRET  (rh.), FULHEIRET, RAMONER  (l.), FULHORET (rouerg.), s. m. Feuilleret, petit rabot qui sert faire des feuillures.
FULMJNACIOUN, FULMINACIEN  (m.), FULMINACIEU  (l.),  (rom. lat. fulminatzo, cat. fulminaciô, esp. fulminacion, it. fwlminasione), s. f. Fulmination, v. trounado.
FULMINANT, ANTO  (port. fulminante), adj. Fulminant, ante, v. trounant. R. fulmana.
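
The reason `\([^)]*\)` behaves differently from the earlier `\((.*)\)` is greediness: `.*` runs from the first `(` to the last `)` on the line, while `[^)]*` stops at the nearest `)`. A small sketch on a made-up sample line (the variable `line` is only for illustration), assuming GNU sed:

```shell
line='FOO (B.), BAR (A. L.), s. f. grosse'

# Greedy: one match spanning the first "(" through the last ")".
printf '%s\n' "$line" | sed -r 's/\((.*)\)/\L&/g'
# prints: FOO (b.), bar (a. l.), s. f. grosse

# Negated class: one match per parenthesised group.
printf '%s\n' "$line" | sed -r 's/\([^)]*\)/\L&/g'
# prints: FOO (b.), BAR (a. l.), s. f. grosse
```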

# 5  
Old 09-22-2016
OCR text that needs cleaning - reply

Thank you RudiC, that works fine.
# 6  
Old 09-22-2016
You could still try improving the regex, e.g. like this:
Code:
s/^(.*) (n\. (de l|p)|s\. [mf]|ad[jv]|v\. [na])/\U\1\E \2/

This is interesting especially when the list of POS gets longer and longer.
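
When folding a long alternation into a compact one, it is easy to drop an entry by accident. A quick sanity check could look like the sketch below; the `^...$` anchors, the variable `compact` and the function `check_pos` are additions for the check only and are not part of the sed script.

```shell
# Compact alternation from above, anchored so that each POS must match as a whole.
compact='^(n\. (de l|p)|s\. [mf]|ad[jv]|v\. [na])$'

# Hypothetical helper: test every POS from the long list against the compact form.
check_pos() {
  for pos in 'n. de l' 's. m' 's. f' 'adj' 'n. p' 'v. n' 'v. a' 'adv'; do
    if printf '%s\n' "$pos" | grep -Eq "$compact"; then
      echo "ok: $pos"
    else
      echo "MISSING: $pos"
    fi
  done
}

check_pos
```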
# 7  
Old 09-22-2016
OCR

Hi RudiC,
I prefer to keep the POS list as
Code:
s/ n. de l. /^ n. de l. /
s/ s. m. /^ s. m. /
s/ s. f. /^ s. f. /
s/ adj. /^ adj. /
s/ n. p. /^ n. p. /
s/ v. n. /^ v. n. /
s/ v. a. /^ v. a. /
s/ adv. /^ adv. /
s/^(.*)\^/\U\1\E/

adding to it as need be, and use your second sed line
Code:
s/\([^)]*\)/\L&/g

to do the lowercasing.

There will be more items to add to the POS list. As in English, words can have different parts of speech; for example,
in English, 'fast' can be a verb, a noun, an adjective and an adverb.

I've already seen examples such as
s. m. pl. and s. f. pl. (plural forms of masculine and feminine nouns),
but I don't think there will be more than 20 to 25 categories.

Thanks again for your help
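
One thing to watch when the marker list grows: if both a ` s. m. ` rule and a ` s. m. pl. ` rule are present, both fire on a plural line and insert two `^` markers, and the final substitution removes only the last one, leaving a stray `^` in the output. A minimal sketch of one way around this, assuming GNU sed: branch with `t` as soon as one rule has matched. The function name `doup`, the labels `up` and `low`, and the shortened POS list are illustrative only.

```shell
# Hypothetical variant of doup.sed: at most one ^ marker is inserted per line.
doup() {
  sed -r '
    # Mark the POS; "t up" jumps ahead as soon as one rule has matched.
    s/ s\. m\. pl\. /^&/
    t up
    s/ s\. f\. pl\. /^&/
    t up
    s/ s\. m\. /^&/
    t up
    s/ s\. f\. /^&/
    t up
    s/ adj\. /^&/
    t up
    # No POS on this line: skip the uppercasing step.
    b low
    :up
    # Uppercase everything before the marker and drop the marker.
    s/^(.*)\^/\U\1\E/
    :low
    # Lowercase each parenthesised group.
    s/\([^)]*\)/\L&/g
  ' "$@"
}
```

For example, `printf '%s\n' 'foo, bar (L.), s. m. pl. Baz' | doup` gives `FOO, BAR (l.), s. m. pl. Baz`. The dots are escaped here so that `.` matches only a literal period.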