Parallelize bash commands/jobs
Post 302974629 by stomp, 06-01-2016 04:33 PM
Quote:
Hard to explain briefly here. Those files are genomic data from different collections of the same material, taken at different times, used to get the gene-expression abundance per "sequence" (a string, in computer terms); the abundance is a count, comparable to the output of grep "ATCG" file.rice | wc. They are concatenated to get the summed, total abundance.
That's roughly what I meant by going back a few steps, but I still don't have a clear picture of what you are trying to do.

What I do understand for sure is that you have many gigabytes of strings/sequences.

Regarding this command

Code:
cat EMS01a_Early_Rice_*_R1.fq.gz EMS01b_Early_Rice_*_R1.fq.gz > ../Early_Rice_1_R1.fq.gz

I'm sorry to say, but this command and all others like it are problematic. What you are doing here is, as I said, concatenating already-compressed files; the result is a multi-member archive that, depending on the tools reading it, may not be handled the way you expect.

Maybe the following is what you are trying to achieve (and note that this needs CPU power!):
Code:
zcat EMS01a_Early_Rice_*_R1.fq.gz EMS01b_Early_Rice_*_R1.fq.gz | gzip -9 > ../Early_Rice_1_R1.fq.gz
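
If the re-compression itself becomes the CPU bottleneck, that step can be parallelized. A minimal sketch, assuming pigz (a parallel drop-in replacement for gzip) is installed; the file names are just the placeholders from above:

Code:
# same pipeline, but compress on all available cores instead of a single gzip process
zcat EMS01a_Early_Rice_*_R1.fq.gz EMS01b_Early_Rice_*_R1.fq.gz | pigz -9 > ../Early_Rice_1_R1.fq.gz

Several such merge-and-recompress pipelines can also be started as background jobs (append & to each and finish with a single wait), so that independent input sets are processed in parallel.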

I guess you would like to combine different sequence sets for different projects. If the sequence sets are static - that is, if they never change (a strict requirement) - you may be able to simply assign the data sets you need to a project dynamically.

Example for dynamic assignment of static data to different projects

The following data/sequence-sets are available(directory structure):

Code:
data_sets
|---EMS01
|   EMS01a.txt
|   EMS01b.txt
|   EMS01c.txt
\---EMS02
    EMS02a.txt
    EMS02b.txt
    EMS02c.txt

...and maybe you have a project structure too, which uses specific files from those data sets (the arrows mean the files are symbolic links to the real data files):
Code:
projects
|---project_01_data
|   a -> ../../data_sets/EMS01/EMS01a.txt
|   b -> ../../data_sets/EMS01/EMS01c.txt
|   c -> ../../data_sets/EMS02/EMS02c.txt
\---project_02_data
    a -> ../../data_sets/EMS01/EMS01a.txt
    b -> ../../data_sets/EMS02/EMS02b.txt
    c -> ../../data_sets/EMS02/EMS02c.txt
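
Such a layout can be created with plain symbolic links. A rough sketch for project_01, assuming the directory names from above and that projects/ and data_sets/ are siblings:

Code:
mkdir -p projects/project_01_data
# relative link targets are resolved from the directory containing the link
ln -s ../../data_sets/EMS01/EMS01a.txt projects/project_01_data/a
ln -s ../../data_sets/EMS01/EMS01c.txt projects/project_01_data/b
ln -s ../../data_sets/EMS02/EMS02c.txt projects/project_01_data/c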

So if you want the data for project_01, you can just produce it on standard output on the fly (and feed it into your processing, whatever that is) with this command:

Code:
cat projects/project_01_data/*  | research_program
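
And since this thread is about parallelizing: once each project's input can be assembled on the fly like this, independent projects can be processed concurrently as background jobs. A small sketch; research_program is still just a placeholder for whatever your real pipeline is:

Code:
# one background job per project; results go to separate output files
for proj in projects/project_*_data; do
    cat "$proj"/* | research_program > "result_$(basename "$proj").out" &
done
wait    # block until all project runs have finished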

The basic idea is: don't copy data around unless it is really needed. Keep each part in one place and leave it there, unchanged, for as long as you need it.

--

To further reduce I/O you can compress the data files. If you have xz available, use it! It compresses far better than gzip. It also needs more CPU power, but if compression speed matters, that work can be parallelized quite well in your scenario. And you don't even have to keep decompressed copies of your files: as replacements for cat there are zcat (gzip) and xzcat (xz), which decompress to stdout when you need the data.
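
A minimal sketch of what that could look like, assuming a reasonably recent xz (the -T/--threads option needs xz 5.2 or newer) and the directory layout from above:

Code:
# compress the raw data files, one xz thread per CPU core
# (xz renames them to *.txt.xz, so the project links should point at the .xz names)
xz -T0 -6 data_sets/EMS01/*.txt data_sets/EMS02/*.txt

# later, stream the compressed data straight into the processing step
# without writing decompressed copies back to disk
xzcat projects/project_01_data/* | research_program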

Compressing can save a lot of I/O!