Regular expression for finding OCR mistakes.


 
# 1  
Old 05-17-2012

I have a large plain-text file created by OCR software. Inevitably, some words have been misrecognised. I've been trying to write grep or sed (or similar) regular expressions to find them, but haven't quite managed to get it right. Here's what I'm trying to achieve:

Output all lines containing a word that begins with, or contains, a digit or other non-alphabetic character. E.g. th1s, mi|k, !nert, etc.

Output all lines containing a word that ends with a digit or a non-alphanumeric character that is not a common punctuation symbol such as '.' or ','. E.g. Cra6, Chemica(, etc.

If possible it would be great to have the line numbers printed as well, but not essential at all.

Can you gurus help please? Thanks.
# 2  
Old 05-17-2012
Code:
$ cat data

This line contains a 1 but is not a mistake
The small brown fox jumped over the lazy dog.
This line contains a 1 but is a m1stake
How are you today?
mi|k
That's fine;  this isn't.
!nert
Hey hey hey!
cra6
chemica(

$ cat ocr.awk

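{
        # P is set to 1 as soon as one suspect word is found on this line
        P=0
        for(N=1; (!P) && (N<=NF); N++)
        {
                # Ignore words that are pure numbers
                if($N ~ /^[0-9]*$/) continue;
                # Flag words with a character other than a-zA-Z' anywhere
                # except the last position (the trailing . needs a following char)
                if($N ~ /[^a-zA-Z']./) P=1;
                # Flag words whose last character is not a-zA-Z or one of .,;?!
                if($N ~ /[^a-zA-Z.,;?!]$/) P=1;
        }

        # Prefix the line with its line number and a tab
        $0=NR"\t"$0;
} P     # print the (modified) line only if it was flagged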

$ awk -f ocr.awk data

3       This line contains a 1 but is a m1stake
5       mi|k
7       !nert
9       cra6
10      chemica(

$
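
If the list of flagged lines is long, it may help to see only the suspect words. The following is a minimal sketch and not part of the original answer: it reuses the same two tests as ocr.awk but prints each offending word with its line number. The function name `ocr_words` and the variable `q` are made up for illustration; `q` just carries the apostrophe into awk so it does not clash with the shell's single quotes.

```shell
# Hypothetical helper: print "line-number<TAB>word" for every suspect word.
ocr_words() {
    awk -v q="'" '
    BEGIN {
        # Same tests as ocr.awk, built as strings so the apostrophe can be passed in.
        inside = "[^a-zA-Z" q "]."      # bad character anywhere except the last position
        ending = "[^a-zA-Z.,;?!]$"      # last character is not a letter or .,;?!
    }
    {
        for (i = 1; i <= NF; i++) {
            # Skip words made up only of digits.
            if ($i ~ /^[0-9]+$/) continue
            if ($i ~ inside || $i ~ ending)
                print FNR "\t" $i
        }
    }' "$@"
}
```

Running `ocr_words data` on the sample file above should list m1stake, mi|k, !nert, cra6 and chemica( against lines 3, 5, 7, 9 and 10. Unlike ocr.awk, it does not stop at the first hit, so a line with two bad words is reported twice.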

These 2 Users Gave Thanks to Corona688 For This Post:
# 3  
Old 05-17-2012
Thank you so much, Corona, I really appreciate it. That works brilliantly, with a few modifications for things I hadn't mentioned, but those are just minor details. Now I've got to plough through the results. Oh well, that's a few hours' work instead of reading the whole thing. Many, many thanks, that's saved me hours. Cheers.