help to parallelize work on thousands of files


 
# 1  
Old 07-11-2010
help to parallelize work on thousands of files

I need to find a smarter way to process about 60,000 files in a single directory.

Every night a script runs on each file, generating an output file in another directory; this used to take 5 hours, but as the data grows it now takes 7 hours.

The files are of different sizes, but there are 16 cores on the box, so I want to run at least 10 parallel processes (the report-generating script is not very CPU-intensive).

I can manually split the output of "ls -1" into 10 lists, then run a foreach over every file in the background. That brings the run down to 2 hours, but it isn't the smartest way, because the list with the largest files (some over a gig) always takes the longest while the list with small files finishes first.

One way of solving the problem would be to list the files in order of size and put every 10th file into a separate list.

Another way could be to process the files one after the other, but never keep more than 10 running at a time?

Finally, I was also thinking of keeping a zipped-up tarball. gtar, or tar piped through gzip, takes over 12 hours to run! It would be good to be able to create 10 smaller tarballs in a shorter time.
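
Something like this is what I have in mind for the tarballs (a rough sketch, assuming GNU tar's -T option; the list files /tmp/list1 ... /tmp/list10 and the /backup destination are placeholders):

Code:
#!/bin/sh
# Sketch: build 10 smaller tarballs in parallel, one per list file
# (each list file holds one filename per line)
for n in 1 2 3 4 5 6 7 8 9 10
do
  tar -cf - -T "/tmp/list$n" | gzip > "/backup/files$n.tar.gz" &
done
wait    # block until all 10 tarballs are finished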

thanks!
-VH
# 2  
Old 07-11-2010
Hi VH

Could you give us a sample layout of your script? Specifically, how is it given its input file: through command-line arguments, or directly?

Guru.
# 3  
Old 07-11-2010
You could use something like this to split files into 10 lists with even size distribution:
Code:
#!/bin/ksh
# Round-robin by size: ls -lS sorts largest first, so list 1 gets the
# files ranked 1,11,21,..., list 2 gets those ranked 2,12,22,..., etc.
# Each list then ends up with a similar total size.
for i in {1..10}
do
  # Skip the first $i lines (the "total" header plus the files already
  # claimed by earlier lists), then keep every 10th line;
  # $9 is the filename field of "ls -l" output.
  list[$i]=$(ls -lS | awk "NR>$i" | awk 'NR%10==1{print $9}')
done
echo "${list[1]}"   # print the contents of the first list (just to show how to get at the filenames)
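
To actually run the ten lists in parallel, each one can then be handed to its own background loop (a sketch, with "command" standing in for the real per-file command):

Code:
for i in {1..10}
do
  # ${list[$i]} is deliberately left unquoted so it splits into
  # individual filenames (assumes no spaces in the names)
  ( for f in ${list[$i]}; do command "$f"; done ) &
done
wait    # wait for all ten workers to finish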

# 4  
Old 07-11-2010
Something like this?

Code:
par=10
for i in *                                        # For every file in the directory
do 
  while [ $(ps |grep -c "[c]ommand") -ge $par ]   # wait until a slot is free if there are 10 or more processes 
  do   
     sleep 1
  done
  command "$i" &                                  # run "command" on file "$i" in background
done
wait                                              # wait for last background processes to finish

Replace "command" with the command you are actually using.

This creates a crude set of 10 slots in which the commands can run in parallel (a single-queue, 10-server model).
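
If GNU xargs is available, the same throttling can be had without counting processes in ps (a sketch; -P keeps up to 10 invocations running at once and -n 1 passes one file per invocation; assumes no newlines in the filenames):

Code:
printf '%s\n' * | xargs -n 1 -P 10 command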

Last edited by Scrutinizer; 07-11-2010 at 11:29 AM..
# 5  
Old 07-11-2010
Here is a script to list the files and split them:

Code:
#!/bin/sh
dest=/my/destination/directory
# sort numerically on the size column; $9 is the filename field
# (NF>=9 skips the "total" header line)
ls -l | sort -n -k5,5 | awk 'NF>=9 {print $9}' > /tmp/all
cd /tmp ; split -l 10000 /tmp/all

Then I can go to /tmp and run
Code:
for file in `cat /tmp/xaa`
do
  /usr/local/bin/genrep.pl "$file" /my/destination/directory
done

and the same for /tmp/xab, /tmp/xac, etc.
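
Rather than running each chunk by hand, all of the split lists could be launched together (a sketch, assuming the x?? chunk files produced by split above):

Code:
#!/bin/sh
dest=/my/destination/directory
for list in /tmp/x??                 # one background worker per chunk file
do
  while read file
  do
    /usr/local/bin/genrep.pl "$file" "$dest"
  done < "$list" &
done
wait                                 # wait for every chunk to finish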
# 6  
Old 07-11-2010
So I think it would become:
Code:
#!/bin/sh
dest=/my/destination/directory
par=10
for i in *                                        # For every file in the directory
do
  while [ $(ps |grep -c "[g]enrep") -ge $par ]    # wait until a slot is free if there are $par or more processes
  do
     sleep 1
  done
  /usr/local/bin/genrep.pl "$i" "$dest" &         # run genrep.pl on file "$i" in background
done
wait                                              # wait for last background processes to finish
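
A small refinement, if bash is available: the shell's own job table can stand in for the ps|grep test, so unrelated genrep processes on the box are not counted (a sketch of a drop-in replacement for the while loop above):

Code:
  while [ $(jobs -r | wc -l) -ge $par ]           # count only this script's running jobs
  do
     sleep 1
  done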

# 7  
Old 07-11-2010
Code:
#!/bin/bash
for i in {1..10}
do
  list[$i]=$(ls -lS | awk "NR>$i" | awk 'NR%10==1{print $8}')   # $8 is the filename field in this ls output
done
for i in {1..10}
do
  for file in ${list[$i]}; do genrep.pl "$file" ../test2; done &
done
# (no wait here, so the script exits while the workers keep running)

The above is working for me now. Thanks
The only drawback is that when I stop the script, I need another script to kill the background processes, but that's OK.
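
For what it's worth, a trap at the top of the script could take care of that without a second script (a sketch, for bash; kill 0 signals every process in the script's process group):

Code:
#!/bin/bash
trap 'kill 0' EXIT   # on exit or Ctrl-C, take the background workers down too
# ... start the background loops as above ...
wait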

---------- Post updated at 01:00 AM ---------- Previous update was at 12:59 AM ----------

LOL! Now I have to try Scrutinizer's post and compare results!

---------- Post updated at 01:32 AM ---------- Previous update was at 01:00 AM ----------

Thanks All,

Initially Scrutinizer's script was slower, but I got rid of the sleep and it runs like a charm. The advantage is that cancelling the script is easy.

Cheers!
- VH