How to make an awk command faster for a large amount of data?


 
# 15  
Old 10-04-2018
Worst case, you're adding tons more overhead to files which need processing anyway. There's a lot more to be gained by quitting early than by doing extra work.

12 hours is surprisingly slow, though. awk is definitely slower than gzip, but it can process 50 MB per second on one of my more ancient systems. Assuming an 8:1 compression ratio for text, you're getting closer to 20 MB/s. Given that, I'm suspicious that you really are hitting disk bandwidth limits.
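One rough way to see where the time actually goes (data.gz and the trivial awk body are just placeholders) is to time the decompression alone and then the full pipeline; the difference between the two is awk's share:
Code:
# Decompression alone: how fast can the data be produced?
time gzip -dc data.gz > /dev/null

# Decompression plus a stand-in awk stage.
time gzip -dc data.gz | awk '{ n++ } END { print n }' > /dev/null

If the second command is only a little slower than the first, awk isn't the bottleneck.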

Are you using an SSD or a spinning disk? A spinning disk will be hit particularly hard if it has to read and write simultaneously. Its bandwidth will be more than halved. And this is a worst-case situation, where your data is so large that cache is simply no help at all. And is your disk physically attached or a NAS, NFS share, USB disk, or some other such thing? The protocol overhead of these can be horrendous in practice.
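If you're not sure whether the disk is the limit, something like this can help; a sketch, assuming the sysstat and pv tools are installed (common, but not universal), with data.gz as a placeholder:
Code:
# Watch per-device utilization and throughput while the job runs.
iostat -xm 2

# Or put pv in front of the pipeline to see the read rate directly.
pv data.gz | gzip -dc | awk '{ n++ } END { print n }'

High device utilization at a low MB/s figure points at the disk (or the protocol in front of it); low utilization points back at the CPU side.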

If you're not hitting disk bandwidth limits though, multiprocessing should be a big gain.
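A rough sketch of that with GNU xargs, one .gz file per job (the trivial awk body and the .out names are placeholders; GNU Parallel would express the same thing if it's installed):
Code:
# Run up to 4 gzip|awk pipelines at once, one per input file.
find . -maxdepth 1 -name '*.gz' -print0 |
  xargs -0 -P 4 -I {} sh -c 'gzip -dc "$1" | awk "{ n++ } END { print n }" > "$1.out"' _ {}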

Last edited by Corona688; 10-04-2018 at 04:37 PM..
# 16  
Old 10-04-2018
I am not hitting the disk limit. I checked with iotop and htop; it's using less than 10 MB/s most of the time, and only reaches about 60 MB/s at the beginning, I guess. It's a spinning SATA disk. I will write to another hard drive so I stop reading and writing on the same drive. Besides that, I'll take a look at multiprocessing too.

By the way, it's 217 minutes, not 12 hours. But there is decompression overhead that shouldn't be there; I'll fix that.
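For example, something along these lines only decompresses each file once (a rough sketch, assuming the per-file work is just printing the file name and its first and last line):
Code:
# One decompression per file: awk prints line 1 and remembers the
# most recent line so END can print the last one.
for f in ./*.gz; do
  printf '%s\n' "$f"
  gzip -dc "$f" | awk 'NR == 1 { print } { last = $0 } END { print last }'
done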

Last edited by brenoasrm; 10-04-2018 at 06:33 PM..
# 17  
Old 10-04-2018
Hi, here is a Perl script that outputs the first and last line of each file; it may be faster, but I'm not sure...
Code:
#!/usr/bin/env perl

use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $line;
for my $input ( glob "./*.gz" )
{
    print $input."\n";
    my $gz = IO::Uncompress::Gunzip->new($input)
        or die "gunzip failed on $input: $GunzipError\n";
    my $mm = $gz->getHeaderInfo();   # ISIZE is the uncompressed size gzip records
    print $gz->getline();            # first line
    # Jump close to the end; the seek is emulated, so the stream is
    # still read and decompressed up to that point behind the scenes.
    $gz->seek( $mm->{ISIZE} - 80, 0 );   # here, max size of a line is 80 bytes
    $line = $gz->getline() while ( !eof($gz) );
    print $line;                     # last line
    close($gz);
}

Example:
Code:
$ perl unco.pl 
./bbrr.gz
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"40";"Male";89;91;89;"179";"75.5";935863
./brain_size.csv.gz
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"40";"Male";89;91;89;"179";"75.5";935863

Regards.
# 18  
Old 10-04-2018
Quote:
Originally Posted by disedorgue
Hi, here is a Perl script that outputs the first and last line of each file; it may be faster, but I'm not sure...

... ... ...

Regards.
It might, or might not, be faster depending on what hardware you're using, what operating system you're using, what version of perl you're using, and what other tools you're using as a comparison. But, note that even though this is only printing the first and last lines of the compressed file, it can't avoid reading the entire compressed file and uncompressing all of the compressed data to be able to determine the contents of the last line in the file. (Decompression can't start at random places in the file; it must start at the beginning and progress byte by byte from there.)
# 19  
Old 10-05-2018
Quote:
Originally Posted by Don Cragun
It might, or might not, be faster depending on what hardware you're using, what operating system you're using, what version of perl you're using, and what other tools you're using as a comparison. But, note that even though this is only printing the first and last lines of the compressed file, it can't avoid reading the entire compressed file and uncompressing all of the compressed data to be able to determine the contents of the last line in the file. (Decompression can't start at random places in the file; it must start at the beginning and progress byte by byte from there.)
I know all that, but there may be a significant gain because we create only one process instead of 4 * n of them (several per file).
What I'm not sure about is whether Perl emulates the seek just by reading through a memory buffer.
# 20  
Old 10-05-2018
Quote:
Originally Posted by disedorgue
I know all that, but there may be a significant gain because we create only one process instead of 4 * n of them (several per file).
How does that help?
Quote:
What I'm not sure about is whether Perl emulates the seek just by reading through a memory buffer.
Perl is not magic; it has to decompress the file to seek inside it, too.

This is because gzip doesn't keep a "main dictionary" of symbols anywhere; it works out its symbols as it passes through the file. It can't skip ahead without losing track of what the data means. Most stream compressors are like this.
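A rough way to see that in practice (big.gz stands for any large gzip file): both commands below have to decompress the entire stream; the second merely throws away everything but the last 80 bytes.
Code:
# Full decompression, output discarded.
time gzip -dc big.gz > /dev/null

# Keeping only the tail still costs a full decompression.
time gzip -dc big.gz | tail -c 80 > /dev/null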

Last edited by Corona688; 10-05-2018 at 12:52 PM..
# 21  
Old 10-05-2018
I told myself that the seek emulation might only have to work through a memory buffer of at most 64 KB, but after trying it on a big file (2 GB), I saw that the Perl script is slower than zcat and tail:

50 seconds for zcat | tail -1
1 minute 15 seconds for the Perl script...

I'm disappointed...

Regards.