Speed up extraction of tar.bz2 files using bash


 
# 1  
Old 06-12-2017

The bash below extracts each tar.bz2 archive in the directory, then removes the archive.

Each tar.bz2 archive is 40-75GB and currently takes ~2 hours to extract. Is there a way to speed up the extraction?

I am using a Xeon processor with 12 cores. Thank you.

Code:
for i in /home/cmccabe/Desktop/NGS/API/*.tar.bz2; do
    # Remove the archive only if extraction succeeded
    tar -xvjf "$i" -C /home/cmccabe/Desktop/NGS/API && rm -- "$i"
done

# 2  
Old 06-12-2017
At 10 megabytes per second it sounds like there's some room for improvement. But you can't go too crazy or you'll just slow your disk down to uselessness.
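The 10 MB/s figure follows from the sizes and time given in post #1. Here is a minimal sketch of the arithmetic, assuming 1 GB = 1024 MB and measuring against the compressed archive size. throughput_mb_s is a hypothetical helper name, not something from the thread.

```shell
#!/bin/bash
# Average throughput in whole MB/s for an archive of <size_gb> GB
# extracted in <hours> hours (integer division, so the result is rounded down).
throughput_mb_s() {
    local size_gb=$1 hours=$2
    echo $(( size_gb * 1024 / (hours * 3600) ))
}

throughput_mb_s 40 2   # smallest archive: prints 5
throughput_mb_s 75 2   # largest archive:  prints 10
```

Either way the rate is well below what a local disk can usually sustain, which suggests the single bzip2 process is the bottleneck rather than the disk.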

This requires the bash shell, mostly for the ability to wait on one particular process with wait "$pid" instead of a bare wait, which waits for everything.

Code:
#!/bin/bash

maxproc=2 # Max number of threads.  Suggest 2, or 3 at most
i=0

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES="$#"

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge $maxproc ]  ; then wait "$1" ; shift; fi
done
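
The difference between waiting on one PID and waiting on everything is what keeps the script above from launching all archives at once. A small sketch of that behaviour, with two sleep commands standing in for two tar processes (the variable names are illustrative only):

```shell
#!/bin/bash
# Two background jobs of different length stand in for two extractions.
sleep 0.2 & fast=$!
sleep 1   & slow=$!

# wait with a PID returns as soon as that one child exits.
wait "$fast"
if kill -0 "$slow" 2>/dev/null; then
    state_after_wait_pid=running
else
    state_after_wait_pid=finished
fi

# wait with no argument blocks until every child has exited.
wait
if kill -0 "$slow" 2>/dev/null; then
    state_after_wait_all=running
else
    state_after_wait_all=finished
fi

echo "slow job after 'wait \$fast': $state_after_wait_pid"
echo "slow job after bare 'wait':  $state_after_wait_all"
```

kill -0 sends no signal; it only reports whether the process still exists, which makes it a convenient probe here.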

# 3  
Old 06-12-2017
Hi.

I think parallel can help you:
Code:
NAME
       parallel - build and execute shell command lines from standard input in
       parallel
...

Some details on parallel:
Code:
parallel        build and execute shell command lines from standard in... (man)
Path    : /usr/bin/parallel
Version : 20130922
Length  : 6224 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Repo    : Debian 8.7 (jessie) 
Home    : https://www.gnu.org/software/parallel/ (pm)
Modules : (for perl codes)
 IPC::Open3     1.16
 POSIX  1.38_03
 Symbol 1.07
 CGI::File::Temp        4.09
 File::Path     2.09
 Getopt::Long   2.42
 strict 1.08
 strict 1.08
 FileHandle     2.02
 POSIX  1.38_03

Some help at:
Code:
       You can also watch the intro video for a quick introduction:
       http://tinyogg.com/watch/TORaR/ http://tinyogg.com/watch/hfxKj/ and
       http://tinyogg.com/watch/YQuXd/ or
       http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Best wishes ... cheers, drl
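
For this thread's use case, a GNU parallel invocation might look like the sketch below. It assumes GNU parallel is the parallel on the PATH (moreutils ships a different program with the same name and a different syntax). extract_with_parallel is a hypothetical wrapper name, and the default of 2 jobs simply mirrors the maxproc suggestion in post #2.

```shell
#!/bin/bash
# Sketch: extract every .tar.bz2 in a directory, a few archives at a time.
extract_with_parallel() {
    local dir=$1 jobs=${2:-2}

    # Refuse to run unless the GNU implementation is available.
    if ! parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        echo "GNU parallel not found" >&2
        return 127
    fi

    # -j limits the number of concurrent jobs; {} is replaced by each
    # archive path listed after the ::: separator.
    parallel -j "$jobs" tar -xjf {} -C "$dir" ::: "$dir"/*.tar.bz2
}

# Example (path taken from post #1):
# extract_with_parallel /home/cmccabe/Desktop/NGS/API 2
```

This runs several tar processes side by side; each individual archive is still decompressed by a single bzip2 process.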
# 4  
Old 06-13-2017
Can I just check that these files are local and not NFS mounted? If they are remote, then you are dependent on the network too, along with a dollop of memory to work on the file. You will also lose any caching that could help you.

If the files are local, then ignore me.
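
One rough way to check is to look at the source column of df for the directory. The sketch below relies on the assumption that NFS mounts show up there as host:/export. It is a heuristic only, and is_remote_source and fs_source are hypothetical helper names.

```shell
#!/bin/bash
# Succeeds when a df source string looks like an NFS export (host:/path).
is_remote_source() {
    case $1 in
        *:/*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Prints the filesystem source backing a path.
# -P forces the one-line-per-filesystem POSIX output format.
fs_source() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

dir=${1:-.}
if is_remote_source "$(fs_source "$dir")"; then
    echo "$dir looks network-mounted"
else
    echo "$dir looks local"
fi
```

Other network filesystems use different source formats, so a "looks local" answer from this check is not conclusive.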



Robin
# 5  
Old 06-13-2017
Yes, if the files are NFS mounted, my attempt or "parallel" will both hurt, not help!
# 6  
Old 06-13-2017
The tar.bz2 archives are local. I tried parallel using:

Code:
pbzip2 -dvc folder.tar.bz2 | tar x
pbzip2 -v -d -k -m10500 folder.tar.bz2 | tar x

Those did execute but were really no faster. The second command uses the max allowed 20MB to decompress.
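
Two details may explain the lack of speed-up. The second command omits -c, so pbzip2 writes folder.tar to disk and sends nothing down the pipe to tar. Also, pbzip2 only decompresses in parallel when the archive was itself created by pbzip2; a file made with plain bzip2 is handled by a single thread. The sketch below streams through whichever decompressor is installed. It prefers lbzip2 on the assumption that it can split ordinary bzip2 files across cores, which should be checked against its documentation. pick_bunzip and extract_bz2 are hypothetical helper names.

```shell
#!/bin/bash
# Print the first available bzip2-compatible decompressor.
pick_bunzip() {
    local tool
    for tool in lbzip2 pbzip2 bzip2; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Decompress to stdout (-d -c) and let tar unpack the stream into <dest>.
extract_bz2() {
    local archive=$1 dest=$2 tool
    tool=$(pick_bunzip) || { echo "no bzip2-compatible tool found" >&2; return 1; }
    "$tool" -dc "$archive" | tar -xf - -C "$dest"
}

# Example:
# extract_bz2 folder.tar.bz2 /home/cmccabe/Desktop/NGS/API
```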

The code below (from post #2) is extremely fast but seems to extract only partial files from each tar.bz2.

Contents of folder.tar.bz2

Code:
file1.bam - 20GB
file2.bam - 25GB
file3.bam - 19GB
file1.vcf - 10MB
file2.vcf - 8MB
file3.vcf - 10MB
file1.bam.bai - 1MB
file2.bam.bai - 1MB
file3.bam.bai - 1MB

Code:
#!/bin/bash

maxproc=2 # Max number of threads.  Suggest 2, or 3 at most
i=0

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES="$#"

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge $maxproc ]  ; then wait "$1" ; shift; fi
done

After the code executes:

Code:
file1.bam - 15MB
file3.vcf - 1MB
file1.bam.bai - 1MB

There may not always be 9 files in each archive, but the file types will always be .bam, .vcf, and .bam.bai. Thank you.
# 7  
Old 06-13-2017
The command drl suggested was not pbzip2 but parallel. Instead of extracting multiple partial files from one tar, you can have several tars extracting at once.

Which is what my code is for, actually.

I neglected one line at the end. It shouldn't have mattered, but if the script did manage to quit while the children were still running, it's possible it killed them instead of waiting. So:

Code:
#!/bin/bash

maxproc=2 # Max number of threads.  Suggest 2, or 3 at most
i=0

# Count files
set -- /home/cmccabe/Desktop/NGS/API/*.tar.bz2
FILES="$#"

# Blank $1 $2 ...
set --

let i=1
for FILE in /home/cmccabe/Desktop/NGS/API/*.tar.bz2
do
        printf "(%2d/%2d)\tProcessing %s\n" "$i" "$FILES" "$FILE"
        let i=i+1

        tar -xvjf "$FILE" -C /home/cmccabe/Desktop/NGS/API >/dev/null &

        # Turn $1=pida $2=pidb $3=pidc $4=pidd, into
        #      $1=pida $2=pidb $3=pidc $4=pidd $5=pide
        set -- "$@" $!

        # Shift removes $1 and moves the rest down, so you get
        # $1=pidb $2=pidc $3=pidd $4=pide
        # $# is the number of arguments.
        if [ "$#" -ge $maxproc ]  ; then wait "$1" ; shift; fi
done

wait
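
The script above extracts but never deletes the archives, which post #1 also asked for. A possible variant, offered as a sketch rather than a tested replacement, removes each archive only when its tar exits successfully. extract_and_remove and extract_all are hypothetical function names, and a bash array replaces the positional-parameter trick for tracking PIDs.

```shell
#!/bin/bash
# Extract one archive into <dest>; delete it only if tar succeeded.
extract_and_remove() {
    tar -xjf "$1" -C "$2" && rm -- "$1"
}

# Extract every .tar.bz2 in <dir>, starting a new job only after the
# oldest one has finished once <maxproc> jobs have been launched.
extract_all() {
    local dir=$1 maxproc=${2:-2} file
    local pids=()

    for file in "$dir"/*.tar.bz2; do
        [ -e "$file" ] || continue          # glob matched nothing
        extract_and_remove "$file" "$dir" &
        pids+=("$!")

        if [ "${#pids[@]}" -ge "$maxproc" ]; then
            wait "${pids[0]}"               # block on the oldest job
            pids=("${pids[@]:1}")           # drop it from the list
        fi
    done

    wait                                    # let the remaining jobs finish
}

# Example (path taken from post #1):
# extract_all /home/cmccabe/Desktop/NGS/API 2
```

Because the rm is chained with &&, a failed or interrupted extraction leaves its archive in place for another attempt.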
