Speed up extraction of tar.bz2 files using bash

    #1  
Old Unix and Linux 06-12-2017
cmccabe cmccabe is offline
Registered User
 
Join Date: Nov 2013
Last Activity: 22 September 2017, 1:22 PM EDT
Location: Chicago
Posts: 1,178
Thanks: 707
Thanked 15 Times in 14 Posts

The bash script below will untar each tar.bz2 archive in the directory, then remove the tar.bz2.

Each of the tar.bz2 archives ranges from 40-75GB and currently takes ~2 hours to extract. Is there a way to speed up the extraction process?

I am using a Xeon processor with 12 cores. Thank you.


Code:
for i in /home/cmccabe/Desktop/NGS/API/*.tar.bz2; do
    tar -xvjf "$i" -C /home/cmccabe/Desktop/NGS/API
    rm "$i"
done

    #2  
Old Unix and Linux 06-12-2017
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 22 September 2017, 5:42 PM EDT
Location: Saskatchewan
Posts: 22,417
Thanks: 1,126
Thanked 4,235 Times in 3,915 Posts
At 10 megabytes per second it sounds like there's some room for improvement. But you can't go too crazy or you'll just slow your disk down to uselessness.
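
For reference, the 10 megabytes per second figure follows from the numbers in the question: roughly 75GB in about 2 hours. A minimal sketch of that arithmetic, assuming 1GB = 1024MB (the function name throughput_mb_s is made up for illustration):

```shell
#!/bin/bash
# Rough average throughput in MB/s for a given size (GB) and duration (hours).
# Shell arithmetic is integer-only, so the result is rounded down.
throughput_mb_s() {
    local size_gb="$1" hours="$2"
    echo $(( size_gb * 1024 / (hours * 3600) ))
}

throughput_mb_s 75 2   # prints 10
throughput_mb_s 40 2   # prints 5
```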

The script below requires the bash shell, mostly for the ability to do wait "$ONEPARTICULARTHREAD" instead of a bare wait, which waits for everything.


Code:
#!/bin/bash

maxproc=2 # Max number of threads.  Suggest 2, or 3 at most
i=0

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES="$#"

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge $maxproc ]  ; then wait "$1" ; shift; fi
done
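
The throttling in the loop above hinges on wait accepting a PID. A small self-contained sketch of the difference between waiting for one child and waiting for all of them, using sleep as a stand-in for tar (the function name demo_wait and the sleep durations are illustrative only):

```shell
#!/bin/bash
# Start two background jobs, wait for just the fast one, and show that the
# slow one is still alive at that point.
demo_wait() {
    sleep 3 &
    local slow="$!"
    sleep 1 &
    local fast="$!"

    wait "$fast"                      # returns once the 1-second job exits
    if kill -0 "$slow" 2>/dev/null; then
        echo "slow job still running"
    fi

    wait                              # no argument: block until every child exits
    echo "all jobs finished"
}

demo_wait
```

Waiting on the oldest PID rather than on everything is what lets the loop start a new tar as soon as one slot frees up.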

    #3  
Old Unix and Linux 06-12-2017
drl drl is offline Forum Advisor  
Registered Voter
 
Join Date: Apr 2007
Last Activity: 23 September 2017, 6:28 PM EDT
Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 2,169
Thanks: 220
Thanked 400 Times in 345 Posts
Hi.

I think parallel can help you:

Code:
NAME
       parallel - build and execute shell command lines from standard input in
       parallel
...
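
As a rough sketch of how that might apply to this thread, the following assumes GNU parallel (not the different parallel shipped in moreutils) and runs a limited number of tar extractions at once. The function name extract_with_parallel and the default of 2 jobs are assumptions for illustration:

```shell
#!/bin/bash
# Extract every .tar.bz2 in a directory, at most "jobs" at a time.
# Assumes GNU parallel is installed and the directory holds at least one archive.
extract_with_parallel() {
    local dir="$1" jobs="${2:-2}"
    (
        # Work inside the directory so tar needs no -C option.
        cd "$dir" || exit 1
        # {} is replaced by each archive name given after :::
        parallel -j "$jobs" tar -xjf {} ::: ./*.tar.bz2
    )
}

# Example: extract_with_parallel /home/cmccabe/Desktop/NGS/API 2
```

Keeping the job count low follows the same reasoning as the earlier reply: too many extractions at once may just make the disk the bottleneck.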

Some details on parallel:

Code:
parallel        build and execute shell command lines from standard in... (man)
Path    : /usr/bin/parallel
Version : 20130922
Length  : 6224 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Repo    : Debian 8.7 (jessie) 
Home    : https://www.gnu.org/software/parallel/ (pm)
Modules : (for perl codes)
 IPC::Open3     1.16
 POSIX  1.38_03
 Symbol 1.07
 CGI::File::Temp        4.09
 File::Path     2.09
 Getopt::Long   2.42
 strict 1.08
 strict 1.08
 FileHandle     2.02

Some help at:

Code:
       You can also watch the intro video for a quick introduction:
       http://tinyogg.com/watch/TORaR/ http://tinyogg.com/watch/hfxKj/ and
       http://tinyogg.com/watch/YQuXd/ or
       http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Best wishes ... cheers, drl
    #4  
Old Unix and Linux 06-13-2017
rbatte1 rbatte1 is offline Forum Staff  
Root armed
 
Join Date: Jun 2007
Last Activity: 15 September 2017, 11:35 AM EDT
Location: Lancashire, UK
Posts: 3,256
Thanks: 1,389
Thanked 630 Times in 569 Posts
Can I just check that these files are local and not NFS mounted? If they are remote, then you are dependent on the network too, along with a dollop of memory to work on the file. You will also lose any caching that could help you.

If the files are local, then ignore me.
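
One possible way to check, sketched under the assumption that GNU coreutils stat is available (the function name fs_type is illustrative): ask for the type of the filesystem that holds the directory. On an NFS mount this is expected to report something starting with nfs.

```shell
#!/bin/bash
# Print the type of the filesystem that contains the given path.
# -f queries the filesystem rather than the file; %T is its type name.
fs_type() {
    stat -f -c %T "$1"
}

dir="${1:-.}"
echo "$dir is on filesystem type: $(fs_type "$dir")"
```

Running df -T on the same directory should show the same information in its Type column.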



Robin
    #5  
Old Unix and Linux 06-13-2017
Corona688 Corona688 is offline Forum Staff  
Yes, if the files are NFS mounted, both my attempt and "parallel" will hurt, not help!
    #6  
Old Unix and Linux 06-13-2017
cmccabe cmccabe is offline
The tar.bz2 archives are local. I tried parallel using:


Code:
pbzip2 -dvc folder.tar.bz2 | tar x
pbzip2 -v -d -k -m10500 folder.tar.bz2 | tar x

Those did execute but were really no faster. The second command uses the max allowed 20MB to decompress.
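
A note on those two commands, offered as a hedged sketch rather than a diagnosis: without -c, pbzip2 normally writes the decompressed data to a file instead of standard output, so the tar in the second command probably has nothing to read. Also, as far as the pbzip2 documentation describes it, only archives that were compressed by pbzip2 itself can be decompressed on several cores; an archive made by plain bzip2 is handled by a single thread, which could explain seeing no speedup. The function name extract_pbzip2 and the default of 4 processors are assumptions:

```shell
#!/bin/bash
# Decompress to stdout with pbzip2 and let tar unpack the stream.
#   -d  decompress
#   -c  write to stdout (needed for the pipe)
#   -p  number of processors to use
extract_pbzip2() {
    local archive="$1" dest="$2" procs="${3:-4}"
    pbzip2 -dc -p"$procs" "$archive" | tar -x -C "$dest"
}

# Example: extract_pbzip2 folder.tar.bz2 /home/cmccabe/Desktop/NGS/API 12
# With GNU tar, a similar effect should be possible via:
#   tar -I pbzip2 -xf folder.tar.bz2 -C /home/cmccabe/Desktop/NGS/API
```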

This code is extremely fast but seems to extract partial files within each tar.bz2.

Contents of folder.tar.bz2


Code:
file1.bam - 20GB
file2.bam - 25GB
file3.bam - 19GB
file1.vcf - 10MB
file2.vcf - 8MB
file3.vcf - 10MB
file1.bam.bai - 1MB
file2.bam.bai - 1MB
file3.bam.bai - 1MB


Code:
#!/bin/bash

maxproc=2 # Max number of threads.  Suggest 2, or 3 at most
i=0

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES="$#"

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge $maxproc ]  ; then wait "$1" ; shift; fi
done

After the code executes:


Code:
file1.bam - 15MB
file3.vcf - 1MB
file1.bam.bai - 1MB

There may not always be 9 files in each folder, but the file types will always be .bam, .vcf, and .bam.bai. Thank you.
    #7  
Old Unix and Linux 06-13-2017
Corona688 Corona688 is offline Forum Staff  
The command drl suggested was not pbzip2 but in fact parallel. Instead of extracting multiple partial files from one tar, you can get several tars extracting at once.

Which is what my code is for, actually.

I neglected one line at the end. It shouldn't have mattered, but if the code did manage to quit while the children were running, it's possible it killed them instead of waiting. So:


Code:
#!/bin/bash

maxproc=2 # Max number of threads.  Suggest 2, or 3 at most
i=0

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES="$#"

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge $maxproc ]  ; then wait "$1" ; shift; fi
done

wait
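
The loop in the original question also deleted each archive after extracting it. A hedged sketch of how that could be folded into the same PID-queue approach, removing an archive only if tar reported success (the function name extract_all and its two arguments are illustrative, not from the posts above):

```shell
#!/bin/bash
# Extract every .tar.bz2 in a directory, at most "maxproc" at a time,
# and delete each archive only after its own extraction succeeded.
extract_all() {
    local dir="$1" maxproc="${2:-2}" f
    set --                                # reuse $1 $2 ... as a queue of child PIDs
    for f in "$dir"/*.tar.bz2; do
        [ -e "$f" ] || continue           # the glob matched nothing
        ( tar -xjf "$f" -C "$dir" && rm -- "$f" ) &
        set -- "$@" "$!"                  # append the new child's PID
        if [ "$#" -ge "$maxproc" ]; then
            wait "$1"                     # block on the oldest child
            shift
        fi
    done
    wait                                  # let the remaining children finish
}

# Example: extract_all /home/cmccabe/Desktop/NGS/API 2
```

Running tar and rm together in a subshell keeps the pair as one background job, so an archive whose extraction failed is left in place for another attempt.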
