Full Discussion: OCR text that needs cleaning
Post 302981979 by RavinderSingh13 on Thursday 22nd of September 2016 04:31:24 AM
Hello safran,

Could you please try the following and let me know if it helps.
Code:
awk '{if(match($0,/.*s\. f\./)){print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RSTART+RLENGTH-5)};if(match($0,/.*s\. m\./)){print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RSTART+RLENGTH-5)};if(match($0,/.*adj\./)){print toupper(substr($0,RSTART,RLENGTH-4)) substr($0,RSTART+RLENGTH-4)}}'  Input_file
Or, the same solution in multi-line form:
awk '{
      # "s. f." is 5 characters long: upper-case what precedes it, keep the rest as is.
      if (match($0, /.*s\. f\./)) {
          print toupper(substr($0, RSTART, RLENGTH-5)) substr($0, RSTART+RLENGTH-5)
      }
      # "s. m." is also 5 characters long.
      if (match($0, /.*s\. m\./)) {
          print toupper(substr($0, RSTART, RLENGTH-5)) substr($0, RSTART+RLENGTH-5)
      }
      # "adj." is 4 characters long.
      if (match($0, /.*adj\./)) {
          print toupper(substr($0, RSTART, RLENGTH-4)) substr($0, RSTART+RLENGTH-4)
      }
     }'  Input_file
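
To see how match() sets RSTART and RLENGTH, and how a tag length of 5 splits a line into the part to upper-case and the part to keep, here is a small sketch. The sample entry is made up for illustration (the original input file is not shown in this post), and the function name show_match is only an illustrative choice.

```shell
# Print what match() reports for the tag "s. f." and the two pieces of the line.
show_match() {
    awk '{
        if (match($0, /.*s\. f\./)) {
            # The match starts at RSTART and is RLENGTH characters long;
            # its last 5 characters are the tag "s. f." itself.
            head = substr($0, RSTART, RLENGTH - 5)
            rest = substr($0, RSTART + RLENGTH - 5)
            printf "RSTART=%d RLENGTH=%d head=[%s] rest=[%s]\n", RSTART, RLENGTH, head, rest
        }
    }'
}

# Hypothetical sample line, not taken from the original input.
printf '%s\n' 'abaco s. f. counting frame' | show_match
```

For that sample line this should print RSTART=1 RLENGTH=11 head=[abaco ] rest=[s. f. counting frame], so the head is what gets passed to toupper() and the rest starts at the tag.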

If you need everything up to and including the POS (part-of-speech) tag in upper case, then the following may help.
Code:
awk '{match($0,/.*s\. f\.|.*adj\.|.*s\. m\./);print toupper(substr($0,RSTART,RLENGTH)) substr($0,RLENGTH+1)}'  Input_file
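
A quick way to check the behaviour of this variant is to feed it a couple of lines by hand. The sketch below wraps the same awk program in a shell function (the name upper_through_tag and both sample lines are made up for illustration) and shows that a line without any tag passes through unchanged.

```shell
# Upper-case everything through the tag; lines without a tag are printed as they are.
upper_through_tag() {
    awk '{
        match($0, /.*s\. f\.|.*adj\.|.*s\. m\./)
        # On no match RSTART is 0 and RLENGTH is -1, so the first substr() is
        # empty and the second one returns the whole line.
        print toupper(substr($0, RSTART, RLENGTH)) substr($0, RLENGTH + 1)
    }'
}

# Hypothetical sample lines, not taken from the original input.
printf '%s\n' 'abaco s. f. counting frame' 'no tag on this line' | upper_through_tag
```

With these two lines the first should come out as "ABACO S. F. counting frame" and the second unchanged.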

NOTE: I am trying to do this with a function and will post it when I can.
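
One possible shape for such a function, as a sketch only and not the version promised above: the awk function upper_before, the shell wrapper upper_headword, and the sample lines are all hypothetical names and data. It assumes the same three tags and, like the first solution, prints a line once per tag found and skips lines with no tag.

```shell
upper_headword() {
    awk '
    # Upper-case the text before the tag matched by re; len is the tag length.
    # Prints the line and returns 1 when the tag is found, returns 0 otherwise.
    function upper_before(line, re, len) {
        if (match(line, ".*" re)) {
            print toupper(substr(line, RSTART, RLENGTH - len)) substr(line, RSTART + RLENGTH - len)
            return 1
        }
        return 0
    }
    {
        upper_before($0, "s\\. f\\.", 5)
        upper_before($0, "s\\. m\\.", 5)
        upper_before($0, "adj\\.", 4)
    }'
}

# Hypothetical sample lines, not taken from the original input.
printf '%s\n' 'abaco s. f. counting frame' 'alto adj. tall' 'no tag here' | upper_headword
```

Adding another tag then only needs one more upper_before call with its pattern and length.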

Thanks,
R. Singh
 
