Can someone explain what this does step by step? I found this script on Stack Overflow and want to customize it for personal use, to download jpg images from a website (a rough sketch of the script is below).
1. I think the first line downloads the pages of the domain with curl, but what does the '#1.html' mean?
2. In .*jpg, why is the * placed after the '.'? And what is that line trying to do? I tried adapting it to a different website, but I get the error grep: *.html: No such file or directory even though the first command downloads the html files just fine.
3. I think the third command just orders the results, and then wget goes to each jpg's URL and downloads the jpgs.
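For reference, the script being discussed was roughly of this shape (reconstructed from the description above, so the example.com URL, the page range, and the exact grep pattern are assumptions, not the original values):

curl -o '#1.html' 'http://example.com/page[1-10].html'    # step 1: fetch the pages
grep -oh 'http://[^"]*\.jpg' *.html > urls.txt             # step 2: pull the jpg URLs out of the html
sort -u urls.txt -o urls.txt                               # step 3: order (and de-duplicate) the URLs...
wget -i urls.txt                                           # ...then wget downloads each jpg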
Moderator's Comments:
Code tags for code, please.
1) That's curl's output-name globbing: the #1 in '#1.html' is replaced with the number of the page in question.
2) Because it's a regex, not a glob. In a regex, * means "zero or more of the previous character", and . means "any character". So .*jpg means "any string ending in jpg".
3) Yes, it sorts them to download in order. Possibly not very well, since it's just a random pile of URLs, but order doesn't matter too much here anyway.
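A quick way to see the glob-versus-regex distinction from 2) (the sample line is made up):

# As a shell glob, *.html expands to matching *filenames* in the current
# directory before grep even runs. As a regex inside grep, . means "any
# character" and * means "zero or more of the previous one", so a pattern
# ending in jpg matches text, not filenames:
echo 'see http://example.com/pics/cat.jpg here' | grep -o 'http://.*\.jpg'
# prints: http://example.com/pics/cat.jpg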
I am having the most trouble with step two, I think.
I'm assuming -oh is the options -o and -h combined?
Step 1 downloads the files fine, but they're stored in my directory as 1.html *dot*, 2.html *dot*, etc., with a dot right after (not a period). I'm not sure if that's the problem, but Step 2 doesn't seem to be able to find any .html files, and so Step 3 fails because there is no urls.txt. What could be the problem?
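One way to check (this assumes the step-2 failure is simply the *.html glob not matching anything, which is what grep's "No such file or directory" message usually means):

# Show the downloaded file names with non-printing characters made
# visible; a stray trailing character would explain why *.html matches
# nothing.
ls -b

# If each name really does carry something after ".html", rename them
# back to plain N.html so the rest of the pipeline can find them:
for f in *.html?*; do
    [ -e "$f" ] || continue          # glob matched nothing; skip
    mv -- "$f" "${f%%.html*}.html"
done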
I think I might have found a different problem, actually.
Running this in the terminal works fine, but when I put it in my bash script and run it, I get awaiting response... 404 Not Found. For some reason a %0D gets appended to the end of the jpg URL, which I'm guessing makes wget go to the wrong URL.
I've been trying a different approach from my earlier one since I couldn't get that working. What could be the problem now, so that I can automate the downloading?
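For what it's worth, %0D is a URL-encoded carriage return, which usually means the script or the URL list was saved with DOS (CRLF) line endings. A hedged fix, assuming the URLs come from a file called urls.txt (the file name is a guess):

# Strip carriage returns before handing the URLs to wget.
tr -d '\r' < urls.txt > urls.unix.txt
wget -i urls.unix.txt

# Or, if dos2unix is installed, convert the URL list (or the script
# itself) in place:
dos2unix urls.txt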
Hi,
I would like to download a file from a https website. I don't have the file name as it changes every day.
I am using the following command:
wget --no-check-certificate -r -np --user=ABC --password=DEF -O temp.txt https://<website/directory>
I am getting the following error in my... (9 Replies)
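A hedged alternative for a directory whose file name changes daily (not tested against that site; the *.txt accept pattern is an assumption): with -r, the -O temp.txt option funnels every fetched document into a single file, so it may work better to let wget keep the server's own file names and filter by pattern instead:

wget --no-check-certificate -r -np -nd -A '*.txt' \
     --user=ABC --password=DEF \
     'https://<website/directory>/'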
Hi,
Is it possible to download a file using Wget or some other command from a Windows machine?
Say I want to download something from an https server to C:\ABC\abc.xls
Any ideas,
Thanks. (4 Replies)
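It is possible, assuming a Windows build of wget (Cygwin, MSYS/Git Bash, or GnuWin32) is installed on that machine; a minimal sketch, with the server URL made up:

# Run from a Cygwin/MSYS shell on the Windows box; -O names the local
# output file.
wget --no-check-certificate -O 'C:\ABC\abc.xls' 'https://server.example/reports/abc.xls'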
This is the file structure:
DESKTOP/Root of Photo Folders/Folder1qweqwasdfsd/*jpg
DESKTOP/Root of Photo Folders/Folder2asdasdasd/*jpg
DESKTOP/Root of Photo Folders/Folder3asdadfhgasdf/*jpg
DESKTOP/Root of Photo Folders/Folder4qwetwdfsdfg/*jpg
DESKTOP/Root of Photo... (4 Replies)
Hi guys!
I created a database using MySQL in bash. Now I would like to download weather info from the data (temp, date and time)... and just store this in the database, to display it every 3 hours or so...
I have tried to get the website using wget and now don't exactly know how to go from here... (0 Replies)
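A very rough sketch of the wget-to-MySQL step (the URL, the parsing pattern, and the database, table, and column names are all assumptions, since the post doesn't give them):

# Fetch the page, pull out a temperature value, and insert it with a
# timestamp; a cron entry such as  0 */3 * * *  would rerun this every
# three hours.
wget -q -O weather.html 'http://example.com/weather'
temp=$(grep -o 'Temperature: *[0-9]*' weather.html | grep -o '[0-9]*$')
mysql -u dbuser -p'dbpass' weatherdb \
      -e "INSERT INTO readings (temp, taken_at) VALUES ($temp, NOW());"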
Dear colleagues,
One of my friends has a problem with C code. While compiling a C program it displays a message like
"array type has incomplete element type". Can anybody provide a solution for it?
Jaganadh.G (1 Reply)
Using Lynx, when I try to download a .rar, it confirms that I want to download and it has it as an application/rar file.
However, split archives that end in .r## (.r00, .r01 ...) are not recognized as application/rar files, and Lynx reads the file like a .txt or .html.
How can I fix this?
Thanks! (2 Replies)
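One thing worth trying (hedged: the config location varies by install, the exact SUFFIX syntax should be checked against the comments in your own lynx.cfg, and it won't help if the server is explicitly sending a text Content-Type for those files): Lynx maps file suffixes to MIME types with SUFFIX lines in lynx.cfg, so adding entries for the split-archive extensions may get them treated as application/rar too. A loop can generate the .r00 through .r99 lines:

# Generate SUFFIX mappings for .r00 .. .r99 and append them to the
# lynx.cfg your Lynx actually reads (see the LYNX_CFG environment
# variable or the -cfg= option); /path/to/lynx.cfg is a placeholder.
for i in $(seq -w 0 99); do
    printf 'SUFFIX:.r%s:application/rar\n' "$i"
done >> /path/to/lynx.cfg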
Not sure if this is the right place to be posting this. If not, let me know where it fits. I am running RedHat Linux 8.0. I've recently acquired a Sony Mavica digital camera. In short, this thing is awesome (uses cd-rw!). I have been taking high quality images, but I now have the need to try... (2 Replies)