PDF2TXT(1) PDFMiner Manual PDF2TXT(1)
NAME
pdf2txt - extracts text contents of PDF files
SYNOPSIS
pdf2txt [option...] file...
DESCRIPTION
pdf2txt extracts text contents from a PDF file. It extracts all the text that is to be rendered programmatically, i.e. text represented as
ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the
corresponding locations, font names, font sizes, and writing direction (horizontal or vertical) for each text portion. You need to provide a
password for protected PDF documents whose access is restricted. You cannot extract any text from a PDF document that does not have
extraction permission.
OPTIONS
-o file
Specifies the output file name. The default is to print the extracted contents to standard output in text format.
-p pageno[,pageno,...]
Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from
all the pages.
-c codec
Specifies the output codec.
-t type
Specifies the output format. The following formats are currently supported:
text
Text format. This is the default.
html
HTML format. It is not recommended.
xml
XML format. It provides the most information.
tag
"Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams
rather than inferring its text locations. Tags used here are defined in the PDF Reference, Sixth Edition[1] (Section 10.7, "Tagged PDF").
-D writing-mode
Specifies the writing mode of text outputs:
lr-tb
Left-to-right, top-to-bottom.
tb-rl
Top-to-bottom, right-to-left.
auto
Determine writing mode automatically.
-M char-margin, -L line-margin, -W word-margin
These are the parameters used for layout analysis. In an actual PDF file, a run of text might be split into several chunks,
depending on the authoring software. Therefore, text extraction needs to splice text chunks. Two text chunks whose distance is
smaller than the char-margin are considered continuous and are grouped into one. Also, two lines whose distance is smaller than
the line-margin are grouped as a text box, which is a rectangular area that contains a "cluster" of text portions. Furthermore,
blank characters (spaces) may need to be inserted if the distance between two words is greater than the word-margin, as a blank
between words might not be represented as a space, but indicated by the positioning of each word.
Each value is specified not as an actual length, but as a proportion of the length to the size of each character in question. The
default values are char-margin = 1.0, line-margin = 0.3, and word-margin = 0.2, respectively.
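To make the proportional margins concrete, here is a minimal sketch of the arithmetic. It assumes the simplest reading of the description above, in which a gap is compared against the margin multiplied by the character size. The helper name within_margin and the sample sizes are hypothetical, and the actual layout code in PDFMiner may measure distances differently.

```shell
# within_margin GAP CHAR_SIZE MARGIN
# Prints "yes" when GAP is smaller than MARGIN * CHAR_SIZE, otherwise "no".
# Hypothetical helper for illustration only; it is not part of pdf2txt.
within_margin() {
    awk -v gap="$1" -v size="$2" -v margin="$3" \
        'BEGIN { if (gap + 0 < margin * size) print "yes"; else print "no" }'
}

# Two chunks 4 units apart, 10-unit characters, char-margin 1.0:
# the threshold is 1.0 * 10 = 10, so the chunks would be joined.
within_margin 4 10 1.0

# Two words 3 units apart, 10-unit characters, word-margin 0.2:
# the threshold is 0.2 * 10 = 2, so the gap exceeds the margin
# and a space would be inserted between the words.
within_margin 3 10 0.2
```

Under this reading, raising a margin makes the corresponding grouping more permissive, because larger gaps still fall below the threshold.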
-n
Suppress layout analysis.
-A
Force layout analysis for all the text strings, including text contained in figures.
-V
Enable detection of vertical writing.
-s scale
Specifies the output scale. This option can be used in HTML format only.
-m n
Specifies the maximum number of pages to extract. By default, all the pages in a document are extracted.
-P password
Provides the user password to access PDF contents.
-d
Increase the debug level.
EXAMPLES
Extract text as an HTML file whose filename is output.html:
$ pdf2txt -o output.html samples/naacl06-shinyama.pdf
Extract a Japanese HTML file in vertical writing:
$ pdf2txt -c euc-jp -D tb-rl -o output.html samples/jo.pdf
Extract text from an encrypted PDF file:
$ pdf2txt -P mypassword -o output.txt secret.pdf
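The options can also be combined. The sketch below wraps pdf2txt in a hypothetical shell function, extract_pages, that writes selected pages of one PDF file in XML format. The function name, its argument order, and the file names in the usage comment are assumptions made for illustration; only the -p, -t, and -o options documented above are used.

```shell
# extract_pages PAGES INPUT OUTPUT
# Hypothetical wrapper: extract the comma-separated PAGES of INPUT
# to OUTPUT in XML format, using only documented pdf2txt options.
extract_pages() {
    if [ "$#" -ne 3 ]; then
        echo "usage: extract_pages pages input.pdf output.xml" >&2
        return 2
    fi
    pdf2txt -p "$1" -t xml -o "$3" "$2"
}

# Example call (requires pdf2txt to be installed):
#   extract_pages 1,3 samples/naacl06-shinyama.pdf pages.xml
```

Quoting each argument keeps file names that contain spaces intact when they are passed to pdf2txt.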
SEE ALSO
dumppdf(1)
AUTHORS
Jakub Wilk <jwilk@debian.org>
Wrote this manual page for the Debian system.
Yusuke Shinyama <yusuke@cs.nyu.edu>
Author of PDFMiner and its original HTML documentation.
NOTES
1. PDF Reference, Sixth Edition
http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
pdf2txt 08/24/2011 PDF2TXT(1)