Multi html download. | Unix Linux Forums | Shell Programming and Scripting

#1  11-30-2012, hoo (Registered User)

Hello,

I have a URL list. It is very large, and I want to download the URLs concurrently.
aria2c is a very good tool for this (or concurrent curl commands), but the downloads overwhelm my server with I/O and the load gets very high. I want all the downloaded HTML pages (the pages are very small) saved into a single text file. Is that possible? Thank you very much.

Aria2c command:

Code:
aria2c -iurl.txt -j30

url.txt

Code:
http://www.domain.com/f34gf345g.html
http://www.domain.com/jyjk678.html
....
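
A concurrent curl run over the same list might look roughly like this (a sketch only; the xargs -P level simply mirrors the -j30 above, and curl -O saves each page under its remote file name, which is part of what creates the heavy disk I/O):

Code:
# Fetch every URL in url.txt with up to 30 parallel curl processes,
# one output file per page (similar to what aria2c -j30 does).
xargs -P 30 -n 1 curl -s -O < url.txt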

#2  11-30-2012, Yoda (Forum Advisor)

Code:
while read URL
do
    wget "$URL" >> download.txt # Downloading URL using wget & appending it to file: download.txt
done < urls_list.dat            # Reading from a file: urls_list.dat which has list of URLs

#3  11-30-2012, hoo (Registered User)

Thanks, but that does not download concurrently. It is very slow for a huge URL list.
#4  11-30-2012, Yoda (Forum Advisor)

This downloads 50 URLs at a time; you can adjust the batch size to suit your requirement:

Code:
seq=1
while read URL
do
   wget "$URL" >> download_${seq}.txt & 
   seq=$( expr $seq + 1 )
   mod=$( expr $seq % 50 )
   if [ $mod -eq 0 ]
   then
         wait   
   fi
done < urls_list.dat
wait
cat download_*.txt > consolidated.txt

#5  11-30-2012, Corona688 (Forum Staff)

Quote:
Originally Posted by bipinajith:
Code:
while read URL
do
    wget "$URL" >> download.txt # Downloading URL using wget & appending it to file: download.txt
done < urls_list.dat            # Reading from a file: urls_list.dat which has list of URLs

Good use of while read. You can redirect the entire loop instead of reopening download.txt 1000 times though:
Code:
while read line
do
        wget -q -O - "$line"            # page HTML goes to stdout, captured once by the loop's redirection
done < urls_list.dat > download.txt

wget also has some features that make a loop unnecessary, though.

wget can read a list of URLs from a file with -i. The -nv option is also useful: it still prints a line for each completed file without all the verbose output wget normally produces.


Code:
wget -nv -i urls_list.dat > download.txt

This should be much faster than calling wget 1000 times, since wget can re-use the same connection when it keeps fetching from the same site. Concurrency may not be necessary (and may not be desirable in many cases: how fast is your connection?), but if it is, I'd split the list into parts and use wget -i on those parts, as sketched below.
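
A minimal sketch of that split-and-fetch idea (the chunk size and file prefix here are just placeholders):

Code:
split -l 200 urls_list.dat chunk_   # 200 URLs per chunk; pick a size that gives a sensible number of parts
for f in chunk_*
do
        wget -nv -i "$f" &          # one wget per chunk, all running in parallel
done
wait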
#6  11-30-2012, hoo (Registered User)

Thanks, that is very fast, but each page downloads to the hard disk as a separate file, which puts a very high load on the server. I want to download them, but into a single file only.



Quote:
Originally Posted by bipinajith:
This downloads 50 URLs at a time; you can adjust the batch size to suit your requirement:

Code:
seq=1
while read URL
do
   wget "$URL" >> download_${seq}.txt & 
   seq=$( expr $seq + 1 )
   mod=$( expr $seq % 50 )
   if [ $mod -eq 0 ]
   then
         wait   
   fi
done < urls_list.dat
wait
cat download_*.txt > consolidated.txt

#7  11-30-2012, Corona688 (Forum Staff)

Since they're in the background, they have to be saved to independent files. It'd be almost impossible to guarantee the order of the output if they weren't.

I'd try splitting the file into many chunks for wget -i to handle independently. This will allow them to be concurrent without such an overwhelming number of files.


Code:
#!/bin/sh

# Number of parallel wget processes, 10 by default
MAXPROC=${2:-10}
# Count lines in the URL list first
LINES=$(wc -l < "$1")
# Divide lines by processes, rounding up so we get at most MAXPROC chunks
LINES=$(( (LINES + MAXPROC - 1) / MAXPROC ))

# Split the list into chunks xaa, xab, ...
split -l "$LINES" < "$1"

# Loop over xaa, xab, ...
for FILE in x*
do
        # Download one set of files from $FILE into $FILE.out in the background
        wget -nv -i "$FILE" -O - > "$FILE.out" 2> "$FILE.err" &
done

wait    # Wait for all processes to finish

# Assemble files in order
cat x*.out
cat x*.err >&2
# Remove temporary files
rm x*

Use it like
Code:
./multiget.sh filelist 5 2> errlog > output

for 5 simultaneous downloads.