Speed up extraction of tar.bz2 files using bash

Shell Programming and Scripting

06-13-2017 - Original Discussion by cmccabe

The tar.bz2 archives are on a local disk. I tried parallel decompression with pbzip2:

pbzip2 -dvc folder.tar.bz2 | tar x
pbzip2 -v -d -k -m10500 folder.tar.bz2 | tar x

Both commands ran, but neither was really any faster. The second command uses the max allowed 20MB to decompress.
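
For reference, here is a minimal sketch of the streaming form of the first command, wrapped in a function. The names pick_bz2_tool and extract_streamed are made up for illustration, and the fallback to plain bzip2 when pbzip2 is not installed is an assumption, not part of the original attempt. Note that pbzip2 only writes to the pipe when -c is given. As far as I know, it also only decompresses in parallel when the archive was itself created by pbzip2, which may be why no speedup shows up here.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not from the original post.

# Print the decompressor to use: pbzip2 if installed, otherwise plain bzip2.
pick_bz2_tool() {
    if command -v pbzip2 >/dev/null 2>&1; then
        echo pbzip2
    else
        echo bzip2
    fi
}

# extract_streamed ARCHIVE DESTDIR
# Decompress ARCHIVE to stdout (-d -c) and let tar unpack the stream into DESTDIR.
# The body runs in a subshell so that pipefail does not leak into the caller.
extract_streamed() (
    set -o pipefail                 # report a failure on either side of the pipe
    archive=$1
    dest=$2
    tool=$(pick_bz2_tool)
    mkdir -p "$dest" || exit 1
    "$tool" -dc "$archive" | tar -x -C "$dest"
)

# Example call (paths are placeholders):
#   extract_streamed folder.tar.bz2 /tmp/extracted
```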

The script below is extremely fast, but it seems to extract only partial files from each tar.bz2.

Contents of folder.tar.bz2:

file1.bam - 20GB
file2.bam - 25GB
file3.bam - 19GB
file1.vcf - 10MB
file2.vcf - 8MB
file3.vcf - 10MB
file1.bam.bai - 1MB
file2.bam.bai - 1MB
file3.bam.bai - 1MB


maxproc=2 # Max number of parallel tar processes.  Suggest 2, or 3 at most

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES=$#

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        # Extract in the background; $! below is the PID of this tar
        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge "$maxproc" ]  ; then wait "$1" ; shift; fi
done

# Wait for the extractions that are still running in the background
wait

After the code executes:

file1.bam - 15MB
file3.vcf - 1MB
file1.bam.bai - 1MB

There may not always be 9 files in each archive, but the file types will always be .bam, .vcf, and .bam.bai. Thank you.
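
Since the symptom is truncated files, one way to check an extraction is to let tar compare the archive against what is on disk. This is a minimal sketch, assuming GNU tar, whose --compare (-d) mode reports members that differ or are missing and exits non-zero when it finds any. The function name verify_extracted is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: verify_extracted is a hypothetical helper name.

# verify_extracted ARCHIVE DESTDIR
# Re-reads ARCHIVE and compares every member with the files under DESTDIR.
# Returns 0 when everything matches, non-zero when a member differs or is missing.
# Assumes GNU tar, which detects the bzip2 compression by itself when reading.
verify_extracted() {
    local archive=$1 dest=$2
    tar --compare -f "$archive" -C "$dest"
}

# Example call over the directory used in the post:
#   for f in /home/cmccabe/Desktop/NGS/API/*.tar.bz2; do
#       verify_extracted "$f" /home/cmccabe/Desktop/NGS/API || echo "incomplete: $f"
#   done
```

The comparison has to decompress the whole archive again, so on 20GB .bam files it takes about as long as a single-threaded extraction. It is meant as a one-off check, not something to run on every pass.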