Hello,
I have a bunch of jobs (cp, cat, or ln -s) on big files (~10 GB each):
One command per line!
I can send them to the background by adding & at the end of each line, but then the server (24 cores) gets overloaded with my ~100 jobs all at once.
Or, I can run the bash script so that the jobs run one by one, but then the cores my admin assigned me (normally 8 to 16) are not fully used.
The challenge is that the file names on each line are so random that it is hard to script them into a nice format for loops or regexes.
I was thinking there may be some option for this, but I am not sure of the exact syntax.
How can I use parallel in my case to spread the jobs across the available cores and speed up the work?
Thanks!
The script example looks very basic; it is hard to believe it cannot be compacted in several ways. Maybe show us the real script, so we have a better idea of what you are doing.
Thoughts so far
If you're on a multi-user system it may not be such a good idea to parallelize I/O-heavy processes, because you may completely eat up your machine's available I/O and cause extreme load on your server. This could slow the whole server down badly. I would suggest you keep the I/O tasks sequential.
The "ln" commands may be executed sequentially too, because they are fast: creating a few thousand symlinks/hardlinks takes hardly noticeable time (about 1000 links per second on my 10-year-old desktop machine).
As far as I can see from your small snippet, no CPU power is required here at all, only raw I/O, so you will not profit from executing your script in parallel.
Quote:
If you're on a multi-user system it may not be such a good idea to parallelize I/O-heavy processes, because you may completely eat up your machine's available I/O and cause extreme load on your server. This could slow the whole server down badly.
This is exactly what happened when I used plain bash and sent everything to the background at once, and I received a warning from the admin for causing trouble on the server! So I want to restrict the number of jobs to less than the server's core count.
At this moment I do not care too much about efficiency yet, although the cat/cp jobs do eat a lot of I/O capacity.
Yes, the ln -s part is not the big deal. The real challenge is the cat and cp parts, where the big files involved make the processes slow; that's why I need parallel.
My real code is pretty much the same as the example; here are the first several rows of the portion with cat:
I was thinking the option would be straightforward; my impression is that parallel is meant for jobs that share a similar pattern in their scripts/options. I went to the GNU website and other parallel tutorials, but could not spot the part covering this case.
Also, I find this type of work quite common: I process hundreds of samples, which takes at least a couple of hours when done one by one, so I used to let the run go overnight. That is not good when I want the results right away, which should be achievable with 16~20 cores under parallel if the scaling is proportional (16~20x).
Thanks again if there is an option for my situation that I may have missed.
---------- Post updated at 06:51 PM ---------- Previous update was at 05:47 PM ----------
I did an experiment and found the simple answer for my example. Here is my test:
The order of the echoed strings is what I expected!
Using parallel took 1m56.053s, whereas bash-only took 8m45.042s, since bash runs the sequential sum of the processes.
I would appreciate any insight or comments if I missed anything!
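For reference, the test was of roughly this shape (a sketch with shortened durations; sleep stands in for the real I/O-bound commands):

```shell
# Four dummy jobs of known duration, one command per line:
printf 'sleep %s\n' 1 2 3 4 > jobs.txt
# bash runs them one by one: about 10s total (1+2+3+4)
time bash jobs.txt
# parallel runs all four at once: about 4s total (the longest single job)
time parallel -j 4 < jobs.txt
```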
So I want to restrict the jobs less than the maximum cores of the server.
You have a basic misunderstanding here. The number of cores is mostly irrelevant. Where it still matters is in limiting the huge number of processes you want (or rather don't want) to spawn: capping the job count at the core count does help there. And yes, spawning some 100 processes at once puts a considerable load on the server.
What matters far more is not CPU cores but I/O capacity. If you spawn multiple excessively I/O-hungry tasks, that will hurt server performance much more than any use of CPU power (it may even bring the server down), because this machine has more than 50 times the CPU power it needs.
Furthermore, an I/O issue is that hard drives in particular do not get faster if you spawn multiple I/O-stressing tasks. On the contrary: parallelizing slows things down further. If you let one process have the full I/O power of the hard disk (array), it can read/write continuously at maximum speed (you get maximum throughput if you only read or only write sequentially in one process from/to one disk (array); "copy" or "cat" reads and writes simultaneously). If you spawn e.g. 12 parallel I/O-stressing tasks, the head of the hard disk has to do a lot of seeking: continuously jumping between the locations of all 12 processes' data files. That is VERY costly in performance and greatly INCREASES the runtime of your script.
----
Your test case script does not produce any surprising or relevant result, because "sleep" does not create any I/O, which is what your real script does.
---
Thanks for posting the real names in your script, because this sheds a little more light on your case.
The question arising is: what do you want to achieve by concatenating .gz files? In my experience you render the data in the resulting files unusable, because no tool known to me can separate such files back into individual .gz archives.
---
It seems best if you go back another couple of steps and tell us what you are trying to achieve with this whole process. What do you want to do, and what is your current plan for achieving your goal?
I have a hunch that your current method is like carrying a bicycle, and that you would get to your goal a lot faster if you started riding it.
You have a basic misunderstanding here. The number of cores is mostly irrelevant. ...... What matters far more is not CPU cores but I/O capacity.
Thank you for pointing this out, as I was not aware of it originally.
Quote:
Furthermore, an I/O issue is that hard drives in particular do not get faster if you spawn multiple I/O-stressing tasks. ...... If you spawn e.g. 12 parallel I/O-stressing tasks, the head of the hard disk has to do a lot of seeking: continuously jumping between the locations of all 12 processes' data files. That is VERY costly in performance and greatly INCREASES the runtime of your script. ...... Your test case script does not produce any surprising or relevant result, because "sleep" does not create any I/O, which is what your real script does.
I think I understand the points better now with this explanation.
Quote:
The question arising is: What do you want to achieve by concatenating .gz files?
Hard to explain briefly here. Those files are genomic data from different collections of the same material at different times, used to get gene expression abundance by "sequence" (a string, in computer terms; the abundance is a count, compare grep "ATCG" file.rice | wc). They are concatenated to get the summed total abundance.
Quote:
It seems to be the best, if you go back another couple of steps and tell us what you are trying to achieve in this whole process. What do you want to do and what's your current plan to achieve your goal?
1) I was looking for the correct syntax to feed jobs from a bash script to parallel, since I thought parallel would speed up the processes, which it seems is not quite right.
2) The reason for 1) is that hundreds of files (~10 GB average size) are spread across different places but belong to a single project. They must be combined to give a simpler organization, like a usual biological experiment design.
3)
Quote:
... that you would get to your goal a lot faster if you started riding it.
Of course I want to ride the bicycle instead of carrying it, and my ultimate goal is to learn how to ride it, or even drive a car, since you guys are driving race cars here!
Thank you again!
Moderator's Comments:
Please use quote tags, not icode tags, when quoting other posts!
Last edited by RudiC; 06-01-2016 at 01:06 PM..
Reason: Changed icode tags to quote tags
Quote:
Hard to explain briefly here. Those files are genomic data from different collections of the same material at different times, used to get gene expression abundance by "sequence" (a string, in computer terms; the abundance is a count, compare grep "ATCG" file.rice | wc). They are concatenated to get the summed total abundance.
That's roughly what I meant by going back some steps, but the situation is still far from a clear picture of what you are trying to do.
I understand for sure that you have many GB of strings/sequences.
Regarding this command:
I'm sorry to say, but this command and all those like it are nonsense. What you do here is, as I said, concatenate compressed files in a way that cannot be recovered.
Maybe the following is what you are trying to achieve (and this needs CPU power!):
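Something along these lines, perhaps (a sketch with made-up stand-in files, not your actual data):

```shell
# Two stand-in compressed input files:
printf 'AAAA\n' | gzip > part1.gz
printf 'CCCC\n' | gzip > part2.gz
# Decompress, concatenate the plain data, recompress as ONE clean archive.
# The compression step is what costs CPU power:
zcat part1.gz part2.gz | gzip > combined.gz
zcat combined.gz    # prints the merged plain text: AAAA, then CCCC
```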
I guess you want to combine different sequence sets for different projects. If the sequence sets are static, i.e. they do not change (a strict requirement), you may be able to just use the data set you need, dynamically assigned to a project.
Example for dynamic assignment of static data to different projects
The following data/sequence sets are available (directory structure):
..and maybe you have a project structure too, which uses specific files from the data set (the arrows mean the files are symbolic links to the real data files):
So if you want the data for project_01, you can just produce the data on standard output (which should be fed into your processing, whatever that is) on the fly with this command:
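A sketch of the layout and the command (every name here is invented):

```shell
# data_sets/ holds each real file exactly once; a project links to the subset it needs:
mkdir -p data_sets project_01
printf 'AAAA\n' > data_sets/set_a.dat   # stand-ins for the real data files
printf 'CCCC\n' > data_sets/set_c.dat
ln -sf ../data_sets/set_a.dat project_01/
ln -sf ../data_sets/set_c.dat project_01/
# Produce project_01's combined data on stdout, on the fly -- no copies made:
cat project_01/*.dat
```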
The basic idea is: don't copy data around if it is not really needed. Have each part exactly once, where it is needed, and keep it there as it is.
--
To further reduce I/O you can compress the data files. If you have xz available, use it! It compresses far better than gzip. It needs more CPU power too, but if compression speed matters, it parallelizes quite well in your scenario. And you don't even have to decompress your files permanently: as replacements for cat there are zcat (gzip) and xzcat (xz), which decompress to stdout when you need the data.
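A sketch with gzip/zcat; the same pattern applies to xz/xzcat (the data file is invented):

```shell
printf 'ATCGATCG\n' > set_a.dat      # stand-in data file
gzip -f set_a.dat                    # keep it stored as set_a.dat.gz
# No permanent decompression needed -- stream to stdout only at use time:
zcat set_a.dat.gz | grep -c 'ATCG'   # prints 1 (one matching line)
```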