Full Discussion: OCR text that needs cleaning
Post 302981986 by RudiC on Thursday 22nd of September 2016 05:14:21 AM
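Try the following as doup.sed, run with GNU sed in extended-regex mode (-E or -r). The unescaped ( | ) grouping needs extended regexes, and \U, \L and \E are GNU case-conversion extensions:
Code:
# Uppercase everything up to the last grammar tag on the line (s. m, s. f, adj, ...).
s/^(.*) (n. de l|s. m|s. f|adj|n. p|v. n|v. a|adv)/\U\1\E \2/
# Put the contents of every parenthesised group back into lowercase.
s/\([^)]*\)/\L&/g

That results in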

Code:
FUIASSO, FIEIASSO (b.), FULUASSO (a. l.), FOULHASSO (for.), (b. lat. folzîacia), s. f. grosse feuille,
FUMFULHUT  (l.), FELHUT (g.), FOULHUOLHUT, (it.) FOGLIUTO, adj. Feuillu, ue, v. uiaru, pampous,
FUIEMT, FUIRET  (rh.), FULHEIRET, RAMONER  (l.), FULHORET (rouerg.), s. m. Feuilleret, petit rabot qui sert faire des feuillures.
FULMJNACIOUN, FULMINACIEN  (m.), FULMINACIEU  (l.),  (rom. lat. fulminatzo, cat. fulminaciô, esp. fulminacion, it. fwlminasione), s. f. Fulmination, v. trounado.
FULMINANT, ANTO  (port. fulminante), adj. Fulminant, ante, v. trounant. R. fulmana.
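
For anyone who wants to try the two substitutions on a single line first, here is a minimal sketch. It assumes GNU sed is installed as sed. The temporary script file and the mixed-case input line are made up for illustration and are not taken from the thread's data.

```shell
#!/bin/sh
# Assumption: "sed" is GNU sed; \U, \L and \E are GNU extensions.
# Write the two commands from the post to a temporary script file.
script=$(mktemp)
cat > "$script" <<'EOF'
s/^(.*) (n. de l|s. m|s. f|adj|n. p|v. n|v. a|adv)/\U\1\E \2/
s/\([^)]*\)/\L&/g
EOF

# Hypothetical input line (not from the thread): headwords in mixed case,
# parenthesised note partly in capitals.
input='Fulminant, anto (PORT. fulminante), adj. Fulminant, ante, v. trounant. R. fulmana.'

# -E switches on extended regexes so ( | ) need no backslashes.
result=$(printf '%s\n' "$input" | sed -E -f "$script")
rm -f "$script"

printf '%s\n' "$result"
```

The first command uppercases everything before " adj", and the second restores "(port. fulminante)" to lowercase, so the headwords end up in capitals while the parenthesised note and the definition are left in their usual case.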

HOCR(1) 							   User Commands							   HOCR(1)

NAME
       hocr - Hebrew OCR utility

DESCRIPTION
       Usage: hocr [OPTION...] - Hebrew OCR utility

       Help Options:
         -?, --help                     Show help options
         --help-all                     Show all help options
         --help-file                    Show file options
         --help-image-proccesing        Show image processing options
         --help-segmentation            Show segmentation options
         --help-debug                   Show debug options

       File options
         -O, --images-out-path=PATH     use PATH for output images
         -u, --data-out=FILE            use FILE as output data file name
         -C, --save-copy                save a copy of original image
         -b, --save-bw                  save processed bw image
         -B, --save-bw-exit             save processed bw image and exit
         -l, --save-layout              save layout image
         -L, --save-layout-exit         save layout image and exit
         -f, --save-fonts               save fonts
         -F, --save-fonts-exit          save fonts images and exit

       Image processing options
         -T, --thresholding-type=NUM    thresholding type, 0 normal, 1 none, 2 fine
         -t, --threshold=NUM            use NUM as threshold value, 1..100
         -a, --adaptive-threshold=NUM   use NUM as adaptive threshold value, 1..100
         -s, --scale=SCALE              scale input image by SCALE 1..9, 0 auto
         -S, --no-auto-scale            do not auto scale image
         -q, --rotate=DEG               rotate image clockwise in deg.
         -Q, --no-auto-rotate           do not auto rotate image

       Segmentation options
         -c, --colums setup=NUM         columns setup: 1..#columns, 0 auto, 255 free
         -x, --slicing=NUM              use NUM as font slicing threshold, 1..250
         -X, --slicing-width=NUM        use NUM as font slicing width, 50..250
         -w, --font-spacing=NUM         font spacing: tight ..-1, 0, 1.. spaced

       Debug options
         -g, --draw-grid                draw grid on output images
         -d, --debug                    print debugging information while running
         -D, --debug-extra              print extra debugging information
         -y, --font-filter=NUM          debug a font filter, use filter NUM
         -Y, --font-filter-list         print a list of debug a font filters
         -j, --font-num-out             print font numbers in output text

       Application Options:
         -i, --image-in=FILE            use FILE as input image file name
         -o, --text-out=FILE            use FILE as output text file name
         -h, --html-out                 output text in html format
         -N, --no-gtk                   do not use gtk for file input and output
         -z, --font=NUM                 use font NUM
         -n, --no-nikud                 do not recognize nikud
         -v, --version                  print version information and exit

       libhocr-0.10.5-i686-pc-linux-gnu-12022008 http://hocr.berlios.de

       Copyright (C) 2005-2008 Yaacov Zamir <kzamir@walla.co.il>

       This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

       This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

       You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

SEE ALSO
       gocr(1), ocrad(1), unpaper(1)

hocr - Hebrew OCR utility 					   February 2008							   HOCR(1)