I have a script that processes a fair amount of data -- say, 25-50 megs per run. I'd like ideas on speeding it up. The code is actually just a preprocessor -- I'm using another language to do the heavy lifting. But as it happens, the preprocessing takes much more time than the final processing so I'm optimizing this rather than that.
Here's the code. The basic idea is that, for each line of input (redirected to stdin), the program checks to see if the sequence number is in $mult and, if so, prints a line asking the other program to validate that sequence:
Code:
#!/usr/bin/perl -w
open(MULT, "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
my $terminator = $/;
undef $/;
$mult = <MULT>;
$/ = $terminator;
# Print application-specific code -- snipped for brevity
$total = 0;
while(<>) {
    if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
        $nm = $1;
        $seq = $2;
        if ($mult =~ /$nm/) { # Replace this line?
            print "go(\"$nm\", [$seq]);\n";
            $total++;
        }
    } else {
        print "print(\"Error reading line: $_\");\n";
    }
}
# Print application-specific code -- snipped for brevity
The file mult.txt is a short file of about a thousand lines, each of which is guaranteed to contain at most (exactly?) one string of the form A\d\d\d\d\d\d; the rest of the line is irrelevant here.
My thought for optimizing this: make an array of the \d\d\d\d\d\d values, sort, and do a binary search rather than a regular expression at the spot marked "Replace this line?". But I'm not sure how to go about that, or even if that's the 'right' optimization. Thoughts?
Also, any suggestions on making better idiomatic use of Perl would be appreciated. I'm not at all accustomed to the language.
I am ASSUMING your mult.txt is something like this:
Code:
A123456
A654321
A024689
A987654
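If the lines carry more than the bare ID (the question says the rest of each line is irrelevant), the hash could be keyed on the extracted A-number instead of the whole line. A minimal sketch of that idea, not part of the examples below: load_ids is a hypothetical helper name, and the sample text is made up and read through an in-memory filehandle in place of mult.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original scripts): build the lookup hash
# from lines that may carry extra text around the A-number.
sub load_ids {
    my ($fh) = @_;
    my %ids;
    while (my $line = <$fh>) {
        # Take the first A-number on the line; lines without one are skipped.
        $ids{$1} = 1 if $line =~ /(A\d{6})/;
    }
    return \%ids;
}

# Made-up sample standing in for mult.txt, read via an in-memory filehandle.
my $sample = "A123456 some description\n\nnotes: A654321, checked\nno id here\n";
open(my $fh, "<", \$sample) or die "Can't open in-memory sample: $!";
my $ids = load_ids($fh);
close($fh);

print join(" ", sort keys %$ids), "\n";   # prints: A123456 A654321
```

With a real file, the same sub would be called on the handle opened for mult.txt, and the main loop would test exists $ids->{$1}.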
Example 1, As close to your original as possible without waste.
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $total;
my $multfile;
my %multhash;
my @Atmp; # for debugging & education purposes
open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
    chomp;
    next if /^$/; # skip blank lines
    $multhash{ $_ } = $_; # add to hash, using element as the key & data
}
close($multfile);
@Atmp = (keys %multhash); # for debugging & education purposes
print "@Atmp\n"; # for debugging & education purposes
$total = 0;
while(<>) {
    if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
        if (exists $multhash{ $1 }) {
            print "go(\"$1\", [$2]);\n";
            $total++;
        }
    } else {
        print "print(\"Error reading line: $_\");\n";
    }
}
print "Total=$total\n";
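Why the hash helps: the original matches a regex against the whole slurped file for every input line, which is a scan over roughly a thousand lines of text each time, while exists on a hash is a single lookup. A rough way to see the difference on your own machine is the core Benchmark module. This is only a sketch with made-up IDs (A000000 to A000999) standing in for mult.txt; scan_count and hash_count are hypothetical names, and the timings will vary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up data standing in for mult.txt: 1000 IDs, A000000 .. A000999.
my @ids      = map { sprintf "A%06d", $_ } 0 .. 999;
my $mult     = join "\n", @ids;          # slurped-file approach
my %multhash = map { $_ => 1 } @ids;     # hash approach

# One ID near the end of the list and one that is absent.
my @probes = ("A000998", "A999999");

# Count probes found by scanning the slurped text with a regex.
sub scan_count {
    my $n = 0;
    for my $nm (@probes) { $n++ if $mult =~ /$nm/ }
    return $n;
}

# Count probes found by hash lookup.
sub hash_count {
    my $n = 0;
    for my $nm (@probes) { $n++ if exists $multhash{$nm} }
    return $n;
}

# Both approaches should agree before timing them.
die "lookup mismatch" unless scan_count() == hash_count();

# Run each for at least 1 CPU second and print a comparison table.
cmpthese(-1, { regex_scan => \&scan_count, hash_lookup => \&hash_count });
```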
Example 2, A bit cleaner
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $total;
my $multfile;
my %multhash;
open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
    chomp;
    next if /^$/; # skip blank lines
    $multhash{ $_ } = $_; # add to hash, using element as the key & data
}
close($multfile);
$total = 0;
while(<>) {
    if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/ && exists $multhash{ $1 }) {
        print "go(\"$1\", [$2]);\n";
        $total++;
    }
}
print "Total=$total\n";
Example 3, Lean and mean with the need for speed!
NOTE: The regex changes!
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $multfile;
my %multhash;
open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
    chomp;
    next if /^$/; # skip blank lines
    $multhash{ $_ } = $_; # add to hash, using element as the key & data
}
close($multfile);
while(<>) {
    # Skip lines that do not match, so a stale $1/$2 from an earlier line is never reused
    next unless m/(A\d{6}) ,((?:-?\d+,)*-?\d+),/; # \d{6} and a non-capturing group; $2 is still the whole sequence
    print "go(\"$1\", [$2]);\n" if exists $multhash{ $1 };
}
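On the binary search idea from the question: it would work, but in Perl a hash lookup is usually simpler and typically at least as fast, which is why the examples above use one. For comparison only, here is a minimal sketch of the sorted-array version. in_sorted is a hypothetical helper, and it relies on the IDs being fixed-width so that plain string comparison orders them correctly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $want is in the sorted array
# referenced by $sorted, 0 otherwise (classic binary search).
sub in_sorted {
    my ($sorted, $want) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $want cmp $sorted->[$mid];
        return 1 if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return 0;
}

# Sample IDs from the assumed mult.txt above, sorted as strings.
my @sorted = sort qw(A123456 A654321 A024689 A987654);

for my $id (qw(A024689 A111111)) {
    printf "%s %s\n", $id, in_sorted(\@sorted, $id) ? "found" : "missing";
}
# prints:
# A024689 found
# A111111 missing
```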
Hope this gets things going a bit faster for ya!
-Enjoy
fh : )_~
Last edited by Festus Hagen; 09-05-2009 at 12:12 AM..
Reason: regex change Example 3 / Modify #2 for dirty data