UNIX for Beginners Questions & Answers: Post 303045990 by mrAibo on Tuesday 21st of April 2020 06:09:11 AM
Problem extracting PDFs from huge files

Hello Unix experts,

We have a problem.
We have some binary files of about 25 GB each. These files contain many (millions of) embedded PDF files.
How can we extract them from such huge files? For small files I managed it with this command:
Code:
awk -v FS="(%PDF-1.4|%%EOF)" '{print $2}' FILE > OUTPUTDIR

So each PDF begins with %PDF-1.? and ends with %%EOF,
but the command doesn't work on files this big. We need another way to extract them.

Thanks in advance!
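
One possible direction, sketched in Python rather than awk because awk is line-oriented and not reliably binary-safe: read the file in fixed-size chunks and cut out every span that runs from a %PDF- header to the next %%EOF, so memory use stays bounded no matter how big the input is. The function name extract_pdfs, the chunk size, and the numbered output file names are assumptions made for illustration. The sketch also takes the rule above literally (first %%EOF after the header ends the PDF); PDFs saved with incremental updates can contain more than one %%EOF, so such files would come out truncated.

```python
import os

HEADER = b"%PDF-"    # start marker, as described in the post
TRAILER = b"%%EOF"   # end marker, as described in the post


def extract_pdfs(src_path, out_dir, chunk_size=16 * 1024 * 1024):
    """Stream src_path and write each %PDF- ... %%EOF span to its own file.

    Hypothetical helper: returns the number of PDFs started.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    out = None    # open output file while inside a PDF, otherwise None
    tail = b""    # unprocessed bytes carried over to the next chunk
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            pos = 0
            while True:
                if out is None:
                    start = buf.find(HEADER, pos)
                    if start < 0:
                        # Keep the last few bytes: a header may straddle chunks.
                        pos = max(pos, len(buf) - (len(HEADER) - 1))
                        break
                    count += 1
                    name = os.path.join(out_dir, f"{count:08d}.pdf")
                    out = open(name, "wb")
                    pos = start
                else:
                    end = buf.find(TRAILER, pos)
                    if end < 0:
                        # Flush everything except a possible partial trailer.
                        safe = max(pos, len(buf) - (len(TRAILER) - 1))
                        out.write(buf[pos:safe])
                        pos = safe
                        break
                    end += len(TRAILER)
                    out.write(buf[pos:end])
                    out.close()
                    out = None
                    pos = end
            tail = buf[pos:]
    if out is not None:
        # Input ended inside a PDF: keep what was found (it is incomplete).
        out.write(tail)
        out.close()
    return count
```

It would be called as extract_pdfs("FILE", "OUTPUTDIR"). With millions of results, splitting the output across subdirectories may be worth considering, since very large single directories are slow on many file systems.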
 
