PDF Script to extract PDF Links MOD in Need


 
# 1  
Old 05-19-2014

Here is a script that extracts all PDF links from a single page. Any ideas on how to make it read a list of pages instead of a single page, and extract all the PDF links from each of them?

Code:
#!/bin/bash

# NAME:         pdflinkextractor
# AUTHOR:       Glutanimate (http://askubuntu.com/users/81372/), 2013
# LICENSE:      GNU GPL v2
# DEPENDENCIES: wget lynx
# DESCRIPTION:  extracts PDF links from websites and dumps them to the stdout and as a textfile
#               only works for links pointing to files with the ".pdf" extension
#
# USAGE:        pdflinkextractor "www.website.com"

WEBSITE="$1"

echo "Getting link list..."

lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt

# OPTIONAL
#
# DOWNLOAD PDF FILES
#
#echo "Downloading..."    
#wget -P pdflinkextractor_files/ -i pdflinks.txt

# 2  
Old 05-20-2014
It depends on whether lynx accepts several pages at once. I think it should accept multiple pages as a list, but check its manual to be sure.

In the script, your page is the "$WEBSITE" variable.
For multiple pages, you could call it like this:

Code:
pdflinkextractor "www.website.com" "www.anotherpage.com"

Inside the script,

Code:
WEBSITE="$@"

In case it doesn't accept multiple pages, loop over them one at a time:

Code:
# Loop over every URL given on the command line; "$@" keeps each argument intact
for PAGE in "$@"
do
 lynx -cache=0 -dump -listonly "$PAGE" | grep ".*\.pdf$" | awk '{print $2}' | tee -a pdflinks.txt
done
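
The filtering half of that pipeline can be tried without touching the network. What follows is only a minimal sketch, assuming the lynx -listonly output has the usual "   1. http://..." shape. The function names filter_pdf_links and sample_listing and the sample URLs are made up for illustration and are not part of the original script.

```shell
#!/bin/bash

# filter_pdf_links: hypothetical helper, not part of the original script.
# Reads a lynx -dump -listonly style listing on stdin and prints only the
# URLs (second field) of lines that end in ".pdf".
filter_pdf_links() {
    grep '\.pdf$' | awk '{print $2}'
}

# Sample listing standing in for real lynx output (the URLs are invented).
sample_listing() {
    printf '%s\n' \
        'References' \
        '' \
        '   1. http://www.website.com/index.html' \
        '   2. http://www.website.com/docs/manual.pdf' \
        '   3. http://www.website.com/docs/report.pdf'
}

# Only the two .pdf entries should come out the other end.
sample_listing | filter_pdf_links
```

In the real loop, the lynx command would take the place of sample_listing and the result would still be piped to tee -a pdflinks.txt.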


If your page list is long, I would put the pages in a file instead:

Code:
$ cat mypages.txt
www.website.com
www.anotherpage.com
www.anotherpage2.com
www.anotherpage3.com

And use it like this:

Code:
pdflinkextractor mypages.txt

Inside the script:

Code:
PAGEFILE="$1"
# Read one page per line; IFS= and -r keep each line exactly as written
while IFS= read -r PAGE
do
 lynx -cache=0 -dump -listonly "$PAGE" | grep ".*\.pdf$" | awk '{print $2}' | tee -a pdflinks.txt
done < "$PAGEFILE"
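
A hand-edited page file tends to pick up blank lines, comment lines, or Windows line endings, and each of those would be handed to lynx as if it were a page. As a rough sketch only, the list could be cleaned before the loop reads it. The helper name clean_page_list, the "#" comment convention, and the demo file contents are assumptions for illustration, not something the original script defines.

```shell
#!/bin/bash

# clean_page_list: hypothetical helper, not part of the original script.
# Prints the usable entries of a page-list file: strips carriage returns
# (files saved on Windows), then drops blank lines and lines starting with "#".
clean_page_list() {
    tr -d '\r' < "$1" | grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#'
}

# Demo with a throwaway file instead of a real mypages.txt.
demo_file="$(mktemp)"
printf 'www.website.com\r\n\n# skipped for now\nwww.anotherpage.com\n' > "$demo_file"

# The echo stands in for the lynx | grep | awk | tee pipeline.
clean_page_list "$demo_file" | while IFS= read -r PAGE
do
    echo "would fetch: $PAGE"
done

rm -f "$demo_file"
```

Because tee -a appends, the same link can land in pdflinks.txt more than once over repeated runs; running sort -u pdflinks.txt afterwards is one way to list each link once.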


Last edited by clx; 05-20-2014 at 03:14 PM.. Reason: typos