Speed up this script!

09-02-2009

I have a script that processes a fair amount of data -- say, 25-50 megs per run. I'd like ideas on speeding it up. The code is actually just a preprocessor -- I'm using another language to do the heavy lifting. But as it happens, the preprocessing takes much more time than the final processing so I'm optimizing this rather than that.

Here's the code. The basic idea is that, for each line of input (redirected to stdin), the program checks to see if the sequence number is in $mult and, if so, prints a line asking the other program to validate that sequence:

Code:

#!/usr/bin/perl -w

open(MULT, "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
my $terminator = $/;
undef $/;
$mult = <MULT>;
$/ = $terminator;

# Print application-specific code -- snipped for brevity

$total = 0;
while(<>) {
	if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
		$nm = $1;
		$seq = $2;
		if ($mult =~ /$nm/) { # Replace this line?
			print "go(\"$nm\", [$seq]);\n";
			$total++;
		}
	} else {
		print "print(\"Error reading line: $_\");\n";
	}
}

# Print application-specific code -- snipped for brevity

The file mult.txt is a short file of about a thousand lines, each of which is guaranteed to contain at most (exactly?) one string of the form A\d\d\d\d\d\d; the rest of the line is irrelevant here.

My thought for optimizing this: make an array of the \d\d\d\d\d\d values, sort, and do a binary search rather than a regular expression at the spot marked "Replace this line?". But I'm not sure how to go about that, or even if that's the 'right' optimization. Thoughts?
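
For reference, the binary-search idea described above might look like the following minimal sketch. The helper name in_sorted and the sample numbers are made up for illustration, and the array is assumed to hold only the numeric part of each A-number. A hash lookup is usually the more idiomatic Perl choice for a pure membership test, so treat this as one option rather than the recommended one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if $target occurs in the numerically
# sorted array referenced by $sorted, otherwise 0.
sub in_sorted {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#{$sorted});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($sorted->[$mid] < $target) { $lo = $mid + 1 }
        elsif ($sorted->[$mid] > $target) { $hi = $mid - 1 }
        else                              { return 1 }
    }
    return 0;
}

# Made-up sample values standing in for the \d\d\d\d\d\d parts of mult.txt.
my @ids = sort { $a <=> $b } (1055, 5, 203, 10);

printf "%s\n", in_sorted(\@ids, 203) ? "found" : "missing";
printf "%s\n", in_sorted(\@ids, 204) ? "found" : "missing";
```

Each lookup takes about log2(1000), roughly 10, comparisons instead of a scan over the whole list.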

Also, any suggestions on making better idiomatic use of Perl would be appreciated. I'm not at all accustomed to the language.

CRGreathouse

09-03-2009

Create a hash of arrays - each array being one line of your mult.txt file.

You are searching about 1000 entries with a regex. A regex match is a linear scan, so each line of stdin costs roughly 500 comparisons on average when the sequence is present, and a scan of the entire list when it is not.

Here is Programming Perl's take on what you want to do:
Hashes of Arrays (Programming Perl)
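
A minimal sketch of that suggestion, under some assumptions: the sample lines are made up (the thread never shows the real mult.txt), the helper name build_index is hypothetical, and each line is split on whitespace only to give the array something to hold. If all you need is a yes/no answer, storing 1 as the value would do just as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a hash of arrays from a list of lines.
# Key = the A-number found on the line, value = the line split into fields.
sub build_index {
    my @lines = @_;
    my %index;
    for my $line (@lines) {
        next unless $line =~ /(A\d{6})/;    # skip lines without an ID
        $index{$1} = [ split ' ', $line ];
    }
    return %index;
}

# Made-up stand-ins for lines of mult.txt.
my %mult = build_index(
    "A000005 rest of line is ignored",
    "some text before A000010 and after",
    "no sequence number on this line",
);

# Membership is now a single hash probe instead of a scan of the file.
for my $id (qw(A000010 A999999)) {
    printf "%s: %s\n", $id, (exists $mult{$id} ? "listed" : "not listed");
}
```

In the original script, the test at the line marked "Replace this line?" would then become something like `if (exists $mult{$nm})`.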

jim mcnamara

09-03-2009

OK, I'll try that.

CRGreathouse

09-04-2009

Not much of a speed thing, however.

Code:

my $terminator = $/;
undef $/;
$mult = <MULT>;
$/ = $terminator;

According to 'man perlvar' this is a no-no.
The proper method is to localize $/ with local($/) inside the smallest possible block, i.e.:

Code:

{  # Begin localization block
  local($/);
  $mult = <MULT>;
}  # End localization block
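
To see the effect, here is a small self-contained sketch. It reads from an in-memory string instead of mult.txt so it can be run anywhere; the sample text and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample text; an in-memory filehandle stands in for mult.txt.
my $text = "line one\nline two\nline three\n";
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!";

my $slurped;
{
    local $/;            # $/ is undef inside this block only
    $slurped = <$fh>;    # so one read returns the whole file
}
close($fh);

# Outside the block, $/ has its previous value again without any manual restore.
my $restored = (defined($/) && $/ eq "\n") ? "yes" : "no";
print "read ", length($slurped), " characters; separator restored: $restored\n";
```

The point of the block is that the old value comes back automatically, even if the code inside it dies or returns early.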

Hash it!

For a simple hash example, check out a recent thread of mine, "Delete block of text in one file based on list in another file". It's simple, so hopefully easy to understand, and it is similar to your needs.

Also, for better assistance, a snippet of 'mult.txt' and a snippet of your data would be very helpful.

-Enjoy
fh : )_~

---------- Post updated at 06:23 PM ---------- Previous update was at 12:14 AM ----------

Thought I would tweak this a bit for ya!

I am new to Perl; my first line of Perl was just over a week ago (08/26/2009).
Any comments are very welcome!

3 examples depending on what you really want/need!

[edit]
NOTE:
After some thought I felt it better to modify Example 2 for cases of dirty data...
[/edit]

I am ASSUMING your data looks something like:

Code:

A123456 ,789,543,MoreData
A654320 ,789,543,MoreData
A024689 ,789,543,MoreData

I am ASSUMING your mult.txt is something like this:

Code:

Example 1, As close to your original as possible without waste.

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $total;
my $multfile;
my %multhash;

my @Atmp;                  # for debugging & education purposes

open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
  chomp;
  next if /^$/;            # skip blank lines
  $multhash{ $_ } = $_;    # add to hash, using element as the key & data
}
close($multfile);

@Atmp = (keys %multhash);  # for debugging & education purposes
print "@Atmp\n";           # for debugging & education purposes

$total = 0;
while(<>) {
  if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
    if (exists $multhash{ $1 }) {
      print "go(\"$1\", [$2]);\n";
      $total++;
    }
  } else {
    print "print(\"Error reading line: $_\");\n";
  }
}
print "Total=$total\n";

Example 2, A bit cleaner

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $total;
my $multfile;
my %multhash;

open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
  chomp;
  next if /^$/;            # skip blank lines
  $multhash{ $_ } = $_;    # add to hash, using element as the key & data
}
close($multfile);

$total = 0;
while(<>) {
  if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/ && exists $multhash{ $1 }) {
    print "go(\"$1\", [$2]);\n";
    $total++;
  }
}
print "Total=$total\n";

Example 3, Lean and mean with the need for speed!
NOTE: The regex changes!

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $multfile;
my %multhash;

open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
  chomp;
  next if /^$/;            # skip blank lines
  $multhash{ $_ } = $_;    # add to hash, using element as the key & data
}
close($multfile);

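while (<>) {
  # Only use $1 and $2 after a successful match; otherwise they are undefined or left over from an earlier match.
  if (m/(A\d{6}) ,(\d+,\d+)/ && exists $multhash{ $1 }) {
    print "go(\"$1\", [$2]);\n";
  }
}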

Hope this gets things going a bit faster for ya!
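
If you want to measure the difference rather than take it on faith, the core Benchmark module can compare the two lookups. This is only a rough sketch: the 1000 IDs are generated, not taken from any real mult.txt, and the names by_regex, by_hash and %is_mult are made up for the example. No timings are quoted here because they depend on the machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up data: 1000 generated IDs standing in for mult.txt.
my @ids     = map { sprintf "A%06d", $_ * 7 } 1 .. 1000;
my $mult    = join "\n", @ids;          # one big string, as in the original script
my %is_mult = map { $_ => 1 } @ids;     # hash keyed by ID

# The two lookup strategies being compared.
sub by_regex { my ($id) = @_; return $mult =~ /$id/ ? 1 : 0 }
sub by_hash  { my ($id) = @_; return exists $is_mult{$id} ? 1 : 0 }

# Three IDs that are present (start, middle, end) and one that is not.
my @probes = ("A000007", "A003500", "A007000", "A999999");

# Run each strategy for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { by_regex($_) for @probes },
    hash  => sub { by_hash($_)  for @probes },
});
```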

-Enjoy
fh : )_~

Last edited by Festus Hagen; 09-05-2009 at 12:12 AM.. Reason: regex change Example 3 / Modify #2 for dirty data

Festus Hagen
