Hello,
I have a bunch of jobs (cp, cat, or ln -s) on big files (~10 GB each):
One command per line!
I can send them to the background by adding & at the end of each line, but then the server (24 cores) gets overloaded with my ~100 jobs all at once.
Or, I can run the bash script so that the jobs run one by one, but then the cores my admin assigned me (normally 8 to 16) are not fully used.
The challenge is that the file names on each line are so random that it is hard to script them into a nice format for loops or regexes.
I was thinking there may be some option for this, but I am not sure of the exact syntax.
How can I use parallel in my case to spread the jobs across the available cores and speed up the work?
Thanks!
The script example looks very basic; it is hard to believe it cannot be compacted in several ways. Maybe show us the real script, so we have a better idea of what you are doing.
Thoughts so far
If you're on a multi-user system it may not be such a good idea to parallelize I/O-heavy processes, because you may completely eat up your machine's available I/O and cause extreme load on your server. This could slow the whole server down badly. I would suggest you keep the I/O tasks sequential.
The "ln" commands may be executed sequentially too, because they are fast: creating a few thousand symlinks/hardlinks takes hardly noticeable time (about 1000 links per second on my 10-year-old desktop machine).
As far as I can see from your small snippet, no CPU power is required here at all, only raw I/O, so you will not profit from executing your script in parallel.
Quote:
If you're on a multi-user system it may not be such a good idea to parallelize I/O-heavy processes, because you may completely eat up your machine's available I/O and cause extreme load on your server. This could slow the whole server down badly.
This is exactly what happened when I used plain bash and sent everything to the background at once, and I received a warning from the admin for causing trouble on the server! So I want to restrict the number of jobs to less than the server's core count.
At this moment I do not care too much about efficiency yet, although the cat/cp jobs do eat a lot of I/O capacity.
Yes, the ln -s part is not the big deal. The real challenge is the cat and cp parts, where the big files involved make the processes slow; that's why I need parallel.
My real code is pretty much the same as the example; here are the first several rows of the portion with cat:
I was thinking the option would be straightforward; my impression is that parallel is meant for jobs that share a similar pattern in their scripts/options. I went to the GNU website and other parallel tutorials, but could not spot the part covering this case.
Also, I find this type of work quite common: I process hundreds of samples, which takes at least a couple of hours when done one by one, so I used to let the run go overnight. That is not good when I want the results right away, which should be achievable with 16~20 cores under parallel if the scaling is proportional (16~20x).
Thanks again if there is an option for my situation that I may have missed.
---------- Post updated at 06:51 PM ---------- Previous update was at 05:47 PM ----------
I did an experiment and found the simple answer for my example. Here is my test:
The order of the echoed strings is what I expected!
Using parallel took 1m56.053s, whereas bash-only took 8m45.042s, since bash runs the sequential sum of the processes.
I would appreciate any insight or comments if I missed anything!
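For reference, the test was of roughly this shape (a sketch with shortened durations; sleep stands in for the real I/O-bound commands):

```shell
# Four dummy jobs of known duration, one command per line:
printf 'sleep %s\n' 1 2 3 4 > jobs.txt
# bash runs them one by one: about 10s total (1+2+3+4)
time bash jobs.txt
# parallel runs all four at once: about 4s total (the longest single job)
time parallel -j 4 < jobs.txt
```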
So I want to restrict the jobs less than the maximum cores of the server.
You have a basic misunderstanding here. The number of cores is mostly irrelevant. Where it still matters is in limiting the huge number of processes you want (or rather don't want) to spawn: capping the job count at the core count does help there. And yes, spawning some 100 processes at once puts a considerable load on the server.
What matters far more is not CPU cores but I/O capacity. If you spawn multiple excessively I/O-hungry tasks, that will hurt server performance much more than any use of CPU power (it may even bring the server down), because this machine has more than 50 times the CPU power it needs.
Furthermore, an I/O issue is that hard drives in particular do not get faster if you spawn multiple I/O-stressing tasks. On the contrary: parallelizing slows things down further. If you let one process have the full I/O power of the hard disk (array), it can read/write continuously at maximum speed (you get maximum throughput if you only read or only write sequentially in one process from/to one disk (array); "copy" or "cat" reads and writes simultaneously). If you spawn e.g. 12 parallel I/O-stressing tasks, the head of the hard disk has to do a lot of seeking: continuously jumping between the locations of all 12 processes' data files. That is VERY costly in performance and greatly INCREASES the runtime of your script.
----
Your test case script does not produce any surprising or relevant result, because "sleep" does not create any I/O, which is what your real script does.
---
Thanks for posting the real names in your script, because this sheds a little more light on your case.
The question arising is: what do you want to achieve by concatenating .gz files? In my experience you render the data in the resulting files unusable, because no tool known to me can separate such files back into individual .gz archives.
---
It seems best if you go back another couple of steps and tell us what you are trying to achieve with this whole process. What do you want to do, and what is your current plan for achieving your goal?
I have a hunch that your current method is like carrying a bicycle, and that you would get to your goal a lot faster if you started riding it.
You have a basic misunderstanding here. The number of cores is mostly irrelevant. ...... What matters far more is not CPU cores but I/O capacity.
Thank you for pointing this out, as I was not aware of it originally.
Quote:
Furthermore, an I/O issue is that hard drives in particular do not get faster if you spawn multiple I/O-stressing tasks. ...... If you spawn e.g. 12 parallel I/O-stressing tasks, the head of the hard disk has to do a lot of seeking: continuously jumping between the locations of all 12 processes' data files. That is VERY costly in performance and greatly INCREASES the runtime of your script. ...... Your test case script does not produce any surprising or relevant result, because "sleep" does not create any I/O, which is what your real script does.
I think I understand the points better now with this explanation.
Quote:
The question arising is: What do you want to achieve by concatenating .gz files?
Hard to explain briefly here. Those files are genomic data from different collections of the same material at different times, used to get gene expression abundance by "sequence" (a string, in computer terms; the abundance is a count, compare grep "ATCG" file.rice | wc). They are concatenated to get the summed total abundance.
Quote:
It seems to be the best, if you go back another couple of steps and tell us what you are trying to achieve in this whole process. What do you want to do and what's your current plan to achieve your goal?
1) I was looking for the correct syntax to feed jobs from a bash script to parallel, since I thought parallel would speed up the processes, which it seems is not quite right.
2) The reason for 1) is that hundreds of files (~10 GB average size) are spread across different places but belong to a single project. They must be combined to give a simpler organization, like a usual biological experiment design.
3)
Quote:
... that you would get to your goal a lot faster if you started riding it.
Of course I want to ride the bicycle instead of carrying it, and my ultimate goal is to learn how to ride it, or even drive a car, since you guys are driving race cars here!
Thank you again!
Moderator's Comments:
Please use quote tags, not icode tags, when quoting other posts!
Last edited by RudiC; 06-01-2016 at 01:06 PM..
Reason: Changed icode tags to quote tags
Quote:
Hard to explain briefly here. Those files are genomic data from different collections of the same material at different times, used to get gene expression abundance by "sequence" (a string, in computer terms; the abundance is a count, compare grep "ATCG" file.rice | wc). They are concatenated to get the summed total abundance.
That's roughly what I meant by going back some steps, but the situation is still far from a clear picture of what you are trying to do.
I understand for sure that you have many GB of strings/sequences.
Regarding this command:
I'm sorry to say, but this command and all those like it are nonsense. What you do here is, as I said, concatenate compressed files in a way that cannot be recovered.
Maybe the following is what you are trying to achieve (and this needs CPU power!):
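Something along these lines, perhaps (a sketch with made-up stand-in files, not your actual data):

```shell
# Two stand-in compressed input files:
printf 'AAAA\n' | gzip > part1.gz
printf 'CCCC\n' | gzip > part2.gz
# Decompress, concatenate the plain data, recompress as ONE clean archive.
# The compression step is what costs CPU power:
zcat part1.gz part2.gz | gzip > combined.gz
zcat combined.gz    # prints the merged plain text: AAAA, then CCCC
```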
I guess you want to combine different sequence sets for different projects. If the sequence sets are static, i.e. they do not change (a strict requirement), you may be able to just use the data set you need, dynamically assigned to a project.
Example for dynamic assignment of static data to different projects
The following data/sequence sets are available (directory structure):
..and maybe you have a project structure too, which uses specific files from the data set (the arrows mean the files are symbolic links to the real data files):
So if you want the data for project_01, you can just produce the data on standard output (which should be fed into your processing, whatever that is) on the fly with this command:
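A sketch of the layout and the command (every name here is invented):

```shell
# data_sets/ holds each real file exactly once; a project links to the subset it needs:
mkdir -p data_sets project_01
printf 'AAAA\n' > data_sets/set_a.dat   # stand-ins for the real data files
printf 'CCCC\n' > data_sets/set_c.dat
ln -sf ../data_sets/set_a.dat project_01/
ln -sf ../data_sets/set_c.dat project_01/
# Produce project_01's combined data on stdout, on the fly -- no copies made:
cat project_01/*.dat
```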
The basic idea is: don't copy data around if it is not really needed. Have each part exactly once, where it is needed, and keep it there as it is.
--
To further reduce I/O you can compress the data files. If you have xz available, use it! It compresses far better than gzip. It needs more CPU power too, but if compression speed matters, it parallelizes quite well in your scenario. And you don't even have to decompress your files permanently: as replacements for cat there are zcat (gzip) and xzcat (xz), which decompress to stdout when you need the data.
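A sketch with gzip/zcat; the same pattern applies to xz/xzcat (the data file is invented):

```shell
printf 'ATCGATCG\n' > set_a.dat      # stand-in data file
gzip -f set_a.dat                    # keep it stored as set_a.dat.gz
# No permanent decompression needed -- stream to stdout only at use time:
zcat set_a.dat.gz | grep -c 'ATCG'   # prints 1 (one matching line)
```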