Post 302642511 by gencon on Thursday 17th of May 2012 02:03:47 PM
Regular expression for finding OCR mistakes.

I have a large plain-text file produced by OCR software. Inevitably, some words have been misrecognized. I've been trying to write regular expressions for grep, sed, etc. to find them, but haven't quite managed to get it right. Here's what I'm trying to achieve:

Output all lines containing a word that begins with, or contains, a digit or a non-alphanumeric character. E.g. th1s, mi|k, !nert, etc.

Output all lines containing a word that ends with a digit or a non-alphanumeric character that is not a common punctuation symbol such as '.' or ','. E.g. Cra6, Chemica(, etc.

If possible, it would be great to have the line numbers printed as well, but that's not essential.

Can you gurus help please? Thanks.
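
One possible starting point, offered as a sketch rather than a definitive answer: treat a "word" as a run of non-whitespace characters and let grep -n print the line numbers. The function name ocr_suspects, the variable names, and the two sets of "accepted" punctuation are assumptions made for illustration; they are not specified in the post and will likely need tuning for a real text.

```sh
#!/bin/sh
# Sketch only: names and punctuation sets below are illustrative assumptions.
# A "word" is taken to be a run of non-whitespace characters.

# Rule 1: a character that is not a letter, not whitespace, and not one of the
# accepted opening/internal marks ( " ' - directly followed by a letter.
# Intended to catch words such as th1s, mi|k, !nert.
inside='[^[:alpha:][:space:]("'\''-][[:alpha:]]'

# Rule 2: a letter followed by a character that is not a letter, not
# whitespace, and not one of the accepted closing marks . , ; : ! ? ) " ' -
# with that character ending the word (whitespace or end of line follows).
# Intended to catch words such as Cra6, Chemica(.
at_end='[[:alpha:]][^[:alpha:][:space:].,;:!?)"'\''-]([[:space:]]|$)'

# Print each matching line prefixed with its line number (grep -n).
# Reads the named files, or standard input when no file is given.
ocr_suspects() {
    grep -nE -e "$inside" -e "$at_end" "$@"
}
```

Calling ocr_suspects with a file name (or piping text into it) prints each suspect line as number:line. Expect false positives: abbreviations such as e.g. and ordinals such as 3rd also match rule 1, so the accepted-punctuation sets may need extending for a given text.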
 

TextBuffer(3I)              InterViews Reference Manual              TextBuffer(3I)

NAME
       TextBuffer - operations on unstructured text

SYNOPSIS
       #include <InterViews/textbuffer.h>

DESCRIPTION
       TextBuffer defines common editing, searching, and text movement operations on a buffer of unstructured text. Text positions are specified by an index into the buffer and logically refer to positions between characters. For example, the position referred to by the index 0 is before the first character in the text. Indices can be compared for equality or ordering, but they should not be used to directly access the buffer because TextBuffer might rearrange the text to improve the efficiency of some operations.

PUBLIC OPERATIONS
       TextBuffer(char* buffer, int length, int size)
       ~TextBuffer()
              Create or destroy an instance of TextBuffer. All operations on the text contained in buffer should be performed through TextBuffer functions. The text is assumed to be of length length, and the total available buffer size is size.

       int Search(Regexp* regexp, int index, int range, int stop)
       int ForwardSearch(Regexp* regexp, int index)
       int BackwardSearch(Regexp* regexp, int index)
              Search for a match with the regular expression regexp, beginning at position index. Search searches the part of the buffer specified by range and stop and returns the index of the beginning of the matched text. Positive values of range specify forward searches, and negative values specify backward searches. In either case, the matched text will not extend beyond the position given by stop. ForwardSearch searches for matches from index to the end of the text and returns the index of the end of the match. BackwardSearch searches from index to the start of the text and returns the index of the beginning of the match. All three functions return a negative number if there was no match.

       int Match(Regexp* regexp, int index, int stop)
       boolean ForwardMatch(Regexp* regexp, int index)
       boolean BackwardMatch(Regexp* regexp, int index)
              Attempt to match the regular expression regexp at the position index. Match returns the length of the matching string, or a negative number if there was no match. Matching will not succeed beyond the position given by stop. ForwardMatch looks for a match that begins at index. BackwardMatch looks for a match that ends at index.

       int Insert(int index, const char* string, int count)
       int Delete(int index, int count)
       int Copy(int index, char* buffer, int count)
              Edit the text in the buffer. Insert inserts count characters from string at the position index. It returns the actual number of characters inserted, which might be less than count if there is insufficient space in the buffer. Delete deletes count characters from the buffer. A positive count deletes characters after index, and a negative count deletes characters before index. Delete returns the actual number of characters deleted, which might be less than count if index is near the beginning or the end of the text. Copy copies count characters into buffer. A positive count copies characters after index, and a negative count copies characters before index. Copy returns the actual number of characters copied. It is the caller's responsibility to ensure that buffer contains sufficient space for the copied text.

       int Height()
       int Width()
       int Length()
              Return information about the text. Height returns the number of lines in the text, Width returns the number of characters in the longest line, and Length returns the total number of characters.

       const char* Text()
       const char* Text(int index)
       const char* Text(int index1, int index2)
       char Char(int index)
              Access the contents of the text. Char returns the character immediately following index. The three Text calls return pointers to character strings representing the text. They make various guarantees about the format of the returned string. With no parameters, Text returns a pointer to a string that contains the entire text of the buffer. With a single parameter, the string contains at least the text from index to the end of the line. With two parameters, the returned string contains at least the text between index1 and index2. In any case, the returned string should be considered temporary and its contents subject to change. To maximize efficiency, you should prefer the more restricted forms of Text.

       int LineIndex(int line)
       int LinesBetween(int index1, int index2)
       int LineNumber(int index)
       int LineOffset(int index)
              Map between text indices and line and offset positions. LineIndex returns the index of the beginning of line line. LineNumber returns the number of the line that contains index. LineOffset returns the offset of index from the beginning of its containing line. LinesBetween returns the difference between the numbers of the lines containing index1 and index2; a return value of zero indicates that index1 and index2 are on the same line, and a positive value indicates that the line containing index2 is after the line containing index1. Lines are numbered starting from zero.

       int PreviousCharacter(int index)
       int NextCharacter(int index)
              Return the index immediately preceding or following index. The returned value is never before the beginning or after the end of the text.

       boolean IsBeginningOfText(int index)
       int BeginningOfText()
       boolean IsEndOfText(int index)
       int EndOfText()
              Return the index of the beginning or end of the text, or query whether index is at the beginning or end of the text.

       boolean IsBeginningOfLine(int index)
       int BeginningOfLine(int index)
       int BeginningOfNextLine(int index)
       boolean IsEndOfLine(int index)
       int EndOfLine(int index)
       int EndOfPreviousLine(int index)
              Return information about the line structure of the text around index. BeginningOfLine returns the index of the beginning of the line containing index. BeginningOfNextLine returns the index of the beginning of the next line that begins after index. EndOfLine returns the index of the end of the line containing index. EndOfPreviousLine returns the index of the end of the last line that ends before index. The beginning of a line is logically immediately after a newline character, and the end of a line is logically immediately before a newline character. The beginning and end of the text are considered to be the beginning and end of the first and last lines, respectively.

       boolean IsBeginningOfWord(int index)
       int BeginningOfWord(int index)
       int BeginningOfNextWord(int index)
       boolean IsEndOfWord(int index)
       int EndOfWord(int index)
       int EndOfPreviousWord(int index)
              Return information about the word structure of the text around index. BeginningOfWord returns the index of the beginning of the word containing index. BeginningOfNextWord returns the index of the beginning of the next word that begins after index. EndOfWord returns the index of the end of the word that contains index. EndOfPreviousWord returns the index of the end of the last word that ends before index. A word is defined as a sequence of alphanumeric characters. The beginning and end of the text are considered to be the beginning and end of the first and last words, respectively.

SEE ALSO
       Regexp(3I)

InterViews                         23 May 1989                       TextBuffer(3I)